Author: Erin LeDell
Contact: erin@h2o.ai
This tutorial gives a quick introduction to H2O's R API. The goal is to introduce H2O's capabilities from R through a complete example.
Most of the functionality of R's data.frame uses exactly the same syntax for an H2OFrame, so if you are comfortable with R, data frame manipulation will come naturally to you in H2O. The modeling syntax in the H2O R API may also remind you of other machine learning packages in R.
References: H2O R API documentation, the H2O Documentation landing page and H2O general documentation.
This tutorial assumes you have R installed. The h2o R package has a few dependencies, which can be installed from CRAN. The required packages (which also have their own dependencies) can be installed in R as follows:
pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
for (pkg in pkgs) {
  # Install only the packages that are not already present
  if (!(pkg %in% rownames(installed.packages()))) {
    install.packages(pkg)
  }
}
Once the dependencies are installed, you can install H2O. We will use the latest stable version of the h2o R package, which at the time of writing is H2O v3.8.0.4 (aka "Tukey-4"). The latest stable version can be installed using the commands on the H2O R Installation page.
After the R package is installed, we can start up an H2O cluster. In an R terminal, we load the h2o package and start up an H2O cluster as follows:
library(h2o)
# Start an H2O Cluster on your local machine
h2o.init(nthreads = -1) #nthreads = -1 uses all cores on your machine
Connection successful!
R is connected to the H2O cluster:
    H2O cluster uptime:         1 seconds 764 milliseconds
    H2O cluster version:        3.10.0.3
    H2O cluster version age:    9 days
    H2O cluster name:           H2O_started_from_R_laurend_syo488
    H2O cluster total nodes:    1
    H2O cluster total memory:   3.56 GB
    H2O cluster total cores:    8
    H2O cluster allowed cores:  8
    H2O cluster healthy:        TRUE
    H2O Connection ip:          localhost
    H2O Connection port:        54321
    H2O Connection proxy:       NA
    R Version:                  R version 3.3.1 (2016-06-21)
If you already have an H2O cluster running that you'd like to connect to (for example, in a multi-node Hadoop environment), then you can specify the IP and port of that cluster as follows:
# This will not actually do anything since it's a fake IP address
# h2o.init(ip="123.45.67.89", port=54321)
The following code downloads a copy of the EEG Eye State dataset. All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.
We can import the data directly into H2O using the h2o.importFile function in the R API. The import path can be a URL, a local path, a path to an HDFS file, or a file on Amazon S3.
#csv_url <- "http://www.stat.berkeley.edu/~ledell/data/eeg_eyestate_splits.csv"
csv_url <- "https://h2o-public-test-data.s3.amazonaws.com/smalldata/eeg/eeg_eyestate_splits.csv"
data <- h2o.importFile(csv_url)
|======================================================================| 100%
Once we have loaded the data, let's take a quick look. First the dimension of the frame:
dim(data)
Now let's take a look at the top of the frame:
head(data)
| | AF3 | F7 | F3 | FC5 | T7 | P7 | O1 | O2 | P8 | T8 | FC6 | F4 | F8 | AF4 | eyeDetection | split |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 4329.23 | 4009.23 | 4289.23 | 4148.21 | 4350.26 | 4586.15 | 4096.92 | 4641.03 | 4222.05 | 4238.46 | 4211.28 | 4280.51 | 4635.9 | 4393.85 | 0 | valid |
2 | 4324.62 | 4004.62 | 4293.85 | 4148.72 | 4342.05 | 4586.67 | 4097.44 | 4638.97 | 4210.77 | 4226.67 | 4207.69 | 4279.49 | 4632.82 | 4384.1 | 0 | test |
3 | 4327.69 | 4006.67 | 4295.38 | 4156.41 | 4336.92 | 4583.59 | 4096.92 | 4630.26 | 4207.69 | 4222.05 | 4206.67 | 4282.05 | 4628.72 | 4389.23 | 0 | train |
4 | 4328.72 | 4011.79 | 4296.41 | 4155.9 | 4343.59 | 4582.56 | 4097.44 | 4630.77 | 4217.44 | 4235.38 | 4210.77 | 4287.69 | 4632.31 | 4396.41 | 0 | train |
5 | 4326.15 | 4011.79 | 4292.31 | 4151.28 | 4347.69 | 4586.67 | 4095.9 | 4627.69 | 4210.77 | 4244.1 | 4212.82 | 4288.21 | 4632.82 | 4398.46 | 0 | train |
6 | 4321.03 | 4004.62 | 4284.1 | 4153.33 | 4345.64 | 4587.18 | 4093.33 | 4616.92 | 4202.56 | 4232.82 | 4209.74 | 4281.03 | 4628.21 | 4389.74 | 0 | train |
The first 14 columns are numeric values that represent EEG measurements from the headset. The "eyeDetection" column is the response. There is an additional column called "split" that was added (by me) in order to specify partitions of the data (so we can easily benchmark against other tools outside of H2O using the same splits). I randomly divided the dataset into three partitions: train (60%), valid (20%) and test (20%), and marked which split each row belongs to in the "split" column.
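For readers who want to build a similar partition column for their own data, here is a minimal base R sketch. It is not the procedure that was actually used to create this file; the seed, the variable names and the exact-size shuffling approach are assumptions made for illustration. The row count is the one reported for this dataset further down in the tutorial.

```r
# Hypothetical sketch: assign rows to train/valid/test with exact 60/20/20 sizes.
set.seed(1)                      # arbitrary seed (assumption)
n <- 14980                       # number of rows in the EEG dataset

# round() guards against floating-point error before the counts are used
sizes <- round(c(train = 0.6, valid = 0.2, test = 0.2) * n)

# Repeat each label the required number of times, then shuffle the labels
split <- sample(rep(names(sizes), times = sizes))

table(split)
```

Shuffling a fixed vector of labels gives exact partition sizes, whereas drawing each label independently with probabilities 0.6/0.2/0.2 would only give approximately those proportions.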
Let's take a look at the column names. The data contains the 14 EEG measurement columns, plus the response and the split indicator.
names(data)
To select a subset of the columns to look at, typical R data.frame indexing applies:
columns <- c('AF3', 'eyeDetection', 'split')
head(data[columns])
| | AF3 | eyeDetection | split |
---|---|---|---|
1 | 4329.23 | 0 | valid |
2 | 4324.62 | 0 | test |
3 | 4327.69 | 0 | train |
4 | 4328.72 | 0 | train |
5 | 4326.15 | 0 | train |
6 | 4321.03 | 0 | train |
Now let's select a single column, for example the response column, and look at the data more closely:
y <- 'eyeDetection'
data[y]
  eyeDetection
1            0
2            0
3            0
4            0
5            0
6            0
[14980 rows x 1 column]
It looks like a binary response, but let's validate that assumption:
h2o.unique(data[y])
  C1
1  0
2  1
[2 rows x 1 column]
If you don't specify the column types when you import the file, H2O makes a guess at what your column types are. If there are 0's and 1's in a column, H2O will parse that as numeric by default.
Therefore, we should convert the response column to a more efficient "factor" representation (called "enum" in Java). In this case it is a categorical variable with two levels, 0 and 1. If the only column in my data that is categorical is the response, I typically don't bother specifying the column type during the parse, and instead use this one-liner to convert it afterwards:
data[y] <- as.factor(data[y])
Now we can check that there are two levels in our response column:
h2o.nlevels(data[y])
We can query the categorical "levels" as well ('0' and '1' stand for "eye open" and "eye closed") to see what they are:
h2o.levels(data[y])
We may want to check if there are any missing values, so let's look for NAs in our dataset. For all the supervised H2O algorithms, H2O will handle missing values automatically, so it's not a problem if we are missing certain feature values. However, it is always a good idea to check to make sure that you are not missing any of the training labels.
To figure out which, if any, values are missing, we can use the h2o.nacnt (NA count) method on any H2OFrame (or column). The columns in an H2O Frame are also H2O Frames themselves, so all the methods that apply to an H2OFrame also apply to a single column.
h2o.nacnt(data[y])
Great, no missing labels. :-)
Out of curiosity, let's see if there is any missing data in any of the columns of this frame:
h2o.nacnt(data)
Each column returns a zero, so there are no missing values in any of the columns.
The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called an "imbalance" problem, where one of the classes has far fewer training examples than the other class. Let's take a look at the distribution:
h2o.table(data[y])
  eyeDetection Count
1            0  8257
2            1  6723
[2 rows x 2 columns]
Ok, the data is not exactly evenly distributed between the two classes -- there are more 0's than 1's in the dataset. However, this level of imbalance shouldn't be much of an issue for the machine learning algos. (We will revisit this later in the modeling section below).
Let's calculate the percentage that each class represents:
n <- nrow(data)  # Total number of rows in the full dataset
h2o.table(data[y])['Count']/n
      Count
1 0.5512016
2 0.4487984
[2 rows x 1 column]
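As a quick sanity check, the same proportions can be reproduced in plain R from the counts printed above. This is just an illustrative sketch; the variable names are my own and nothing here touches the H2O cluster.

```r
# Class counts as reported by h2o.table(data[y]) above
counts <- c("0" = 8257, "1" = 6723)

# Fraction of rows that fall into each class
props <- counts / sum(counts)
print(round(props, 7))
```

Because an H2OFrame lives in the cluster rather than in R's memory, small summaries like this are often convenient to copy into ordinary R vectors for further arithmetic.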
So far we have explored the original dataset (all rows). For the machine learning portion of this tutorial, we will break the dataset into three parts: a training set, validation set and a test set.
If you want H2O to do the splitting for you, you can use the h2o.splitFrame function. However, we have explicit splits that we want (for reproducibility reasons), so we can just subset the Frame to get the partitions we want.
Subset the data H2O Frame on the "split" column:
train <- data[data['split']=="train",]
nrow(train)
valid <- data[data['split']=="valid",]
nrow(valid)
test <- data[data['split']=="test",]
nrow(test)
We will do a quick demo of the H2O software using a Gradient Boosting Machine (GBM). The goal of this problem is to train a model to predict eye state (open vs closed) from EEG data.
In the steps above, we have already created the training set and validation set, so the next step is to specify the predictor set and response variable.
As with any machine learning algorithm, we need to specify the response and predictor columns in the training set.
The x argument should be a vector of predictor names in the training frame, and y specifies the response column. We have already set y <- 'eyeDetection' above, but we still need to specify x.
names(train)
x <- setdiff(names(train), c("eyeDetection", "split"))  # Remove the response and split columns (15th and 16th)
x
Now that we have specified x and y, we can train the GBM model using a few non-default model parameters. Since we are predicting a binary response, we set distribution = "bernoulli".
model <- h2o.gbm(x = x, y = y,
                 training_frame = train,
                 validation_frame = valid,
                 distribution = "bernoulli",
                 ntrees = 100,
                 max_depth = 4,
                 learn_rate = 0.1)
|======================================================================| 100%
The type of results shown when you print a model depends, among other things, on the data you specify (training_frame only, training_frame and validation_frame, or training_frame and nfolds).
Below, we see a GBM Model Summary, as well as training and validation metrics since we supplied a validation_frame. Since this is a binary classification task, we are shown the relevant performance metrics, which include MSE, R^2, LogLoss, AUC and Gini. Also, we are shown a Confusion Matrix, where the threshold for classification is chosen automatically (by H2O) as the threshold which maximizes the F1 score.
The scoring history is also printed, which shows the performance metrics over some increment such as "number of trees" in the case of GBM and RF.
Lastly, for tree-based methods (GBM and RF), we also print variable importance.
print(model)
Model Details:
==============
H2OBinomialModel: gbm
Model ID:  GBM_model_R_1456125581863_170
Model Summary:
  number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves
1             100               23828         4         4    4.00000         12
  max_leaves mean_leaves
1         16    15.17000

H2OBinomialMetrics: gbm
** Reported on training data. **
MSE:  0.1076065
R^2:  0.5657448
LogLoss:  0.3600893
AUC:  0.9464642
Gini:  0.8929284

Confusion Matrix for F1-optimal threshold:
          0    1    Error        Rate
0      4281  635 0.129170   =635/4916
1       537 3535 0.131876   =537/4072
Totals 4818 4170 0.130396  =1172/8988

Maximum Metrics: Maximum metrics at their respective thresholds
                      metric threshold    value idx
1                     max f1  0.450886 0.857802 206
2                     max f2  0.316901 0.899723 262
3               max f0point5  0.582904 0.882212 158
4               max accuracy  0.463161 0.870939 202
5              max precision  0.990029 1.000000   0
6                 max recall  0.062219 1.000000 381
7            max specificity  0.990029 1.000000   0
8           max absolute_MCC  0.463161 0.739650 202
9 max min_per_class_accuracy  0.448664 0.868999 207

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

H2OBinomialMetrics: gbm
** Reported on validation data. **
MSE:  0.1200838
R^2:  0.5156133
LogLoss:  0.3894633
AUC:  0.9238635
Gini:  0.8477271

Confusion Matrix for F1-optimal threshold:
          0    1    Error       Rate
0      1328  307 0.187768  =307/1635
1       176 1185 0.129317  =176/1361
Totals 1504 1492 0.161215  =483/2996

Maximum Metrics: Maximum metrics at their respective thresholds
                      metric threshold    value idx
1                     max f1  0.425963 0.830705 227
2                     max f2  0.329543 0.887175 268
3               max f0point5  0.606576 0.850985 156
4               max accuracy  0.482265 0.846796 206
5              max precision  0.980397 1.000000   0
6                 max recall  0.084627 1.000000 374
7            max specificity  0.980397 1.000000   0
8           max absolute_MCC  0.482265 0.690786 206
9 max min_per_class_accuracy  0.458183 0.839089 215

Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
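To make the printed numbers a little less opaque, here is a small base R sketch that recomputes two of them by hand from the training output above: the F1 score at the chosen threshold, from the confusion matrix, and the Gini coefficient, from the AUC. It assumes the usual reading of the confusion matrix (rows are actual classes, columns are predicted classes) and the standard relation Gini = 2 * AUC - 1; the variable names are my own.

```r
# Training confusion matrix entries, copied from the printed output
tp <- 3535  # actual 1, predicted 1
fp <- 635   # actual 0, predicted 1
fn <- 537   # actual 1, predicted 0

# Precision, recall and their harmonic mean (F1)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Gini coefficient as a linear rescaling of the training AUC
auc  <- 0.9464642
gini <- 2 * auc - 1

print(round(c(precision = precision, recall = recall, f1 = f1, gini = gini), 6))
```

Both values should agree, to within rounding, with the "max f1" row and the Gini line reported for the training data.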
Once a model has been trained, you can also use it to make predictions on a test set. In the case above, we just ran the model once, so our validation set (passed as validation_frame) could have also served as a "test set." We technically have already created test set predictions and evaluated test set performance.
However, when performing model selection over a variety of model parameters, it is common for users to train a variety of models (using different parameters) using the training set, train, and a validation set, valid. Once the user selects the best model (based on validation set performance), the true test of model performance is performed by making a final set of predictions on the held-out (never been used before) test set, test.
You can use the h2o.performance function to compute performance metrics on a new dataset. The results are stored in an object of class "H2OBinomialMetrics".
perf <- h2o.performance(model = model, newdata = test)
class(perf)
Individual model performance metrics can be extracted using functions like h2o.r2, h2o.auc and h2o.mse. In the case of binary classification, we may be most interested in evaluating test set Area Under the ROC Curve (AUC).
h2o.r2(perf)
h2o.auc(perf)
h2o.mse(perf)
To perform k-fold cross-validation, you use the same code as above, but you specify nfolds as an integer greater than 1, or add a "fold_column" to your H2O Frame which indicates a fold ID for each row.
Unless you have a specific reason to manually assign the observations to folds, you will find it easiest to simply use the nfolds argument.
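If you do need manual fold assignment, the fold IDs are just an integer label per row. The base R sketch below shows one way such labels might be generated; the seed and variable names are assumptions, and the row count is the training set size shown in the model output above. The resulting vector would still need to be added to the H2O Frame as a column, whose name is then passed through the fold_column argument.

```r
# Hypothetical sketch: build balanced fold IDs 1..k for n rows
set.seed(42)   # arbitrary seed, for reproducibility
n <- 8988      # number of rows in the training set
k <- 5         # number of folds

# Cycle through 1..k until n labels exist, then shuffle their order
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold receives either floor(n / k) or ceiling(n / k) rows
table(fold_id)
```

A common reason to go this route is grouped data, where all rows from the same subject or session should land in the same fold; in that case the labels would be derived from the grouping variable rather than drawn at random.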
When performing cross-validation, you can still pass a validation_frame, but you can also choose to use the original dataset that contains all the rows. In the example below, we cross-validate on the training frame, train, and still pass valid as the validation_frame.
cvmodel <- h2o.gbm(x = x, y = y,
                   training_frame = train,
                   validation_frame = valid,
                   distribution = "bernoulli",
                   ntrees = 100,
                   max_depth = 4,
                   learn_rate = 0.1,
                   nfolds = 5)
|======================================================================| 100%
This time around, we will simply pull the training and cross-validation metrics out of the model. To do so, you use the h2o.auc function again, and you can specify train or xval as TRUE to get the corresponding metric.
print(h2o.auc(cvmodel, train = TRUE))
print(h2o.auc(cvmodel, xval = TRUE))
[1] 0.9464642
[1] 0.9218678
One way of evaluating models with different parameters is to perform a grid search over a set of parameter values. For example, in GBM, here are three model parameters that may be useful to search over:
- ntrees: Number of trees
- max_depth: Maximum depth of a tree
- learn_rate: Learning rate in the GBM
We will define a grid as follows:
ntrees_opt <- c(5,50,100)
max_depth_opt <- c(2,3,5)
learn_rate_opt <- c(0.1,0.2)
hyper_params = list('ntrees' = ntrees_opt,
'max_depth' = max_depth_opt,
'learn_rate' = learn_rate_opt)
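Before launching a grid, it can be useful to know how many models an exhaustive search over these values would train. The base R sketch below enumerates the combinations with expand.grid; it restates the same parameter vectors so that it runs on its own, and it assumes every combination is tried.

```r
# Same candidate values as in the grid definition above
ntrees_opt     <- c(5, 50, 100)
max_depth_opt  <- c(2, 3, 5)
learn_rate_opt <- c(0.1, 0.2)
hyper_params <- list(ntrees = ntrees_opt,
                     max_depth = max_depth_opt,
                     learn_rate = learn_rate_opt)

# One row per parameter combination in a full Cartesian search
combos <- expand.grid(hyper_params)
nrow(combos)
head(combos)
```

With 3 x 3 x 2 values this comes to 18 combinations. The count grows multiplicatively with each added parameter, which is worth keeping in mind when each model takes a while to train.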
The h2o.grid function can be used to train an "H2OGrid" object for any of the H2O algorithms (specified by the "algorithm" argument).
gs <- h2o.grid(algorithm = "gbm",
               grid_id = "eeg_demo_gbm_grid",
               hyper_params = hyper_params,
               x = x, y = y,
               training_frame = train,
               validation_frame = valid)
|======================================================================| 100%
print(gs)
H2O Grid Details
================
Grid ID: eeg_demo_gbm_grid
Used hyper parameters:
  -  ntrees
  -  max_depth
  -  learn_rate
Number of models: 18
Number of failed models: 0

Hyper-Parameter Search Summary: ordered by increasing logloss
   ntrees max_depth learn_rate                  model_ids           logloss
1     100         5        0.2 eeg_demo_gbm_grid_model_17  0.24919767209732
2      50         5        0.2 eeg_demo_gbm_grid_model_16 0.321319350389403
3     100         5        0.1  eeg_demo_gbm_grid_model_8 0.325041939824682
4     100         3        0.2 eeg_demo_gbm_grid_model_14 0.398168927969941
5      50         5        0.1  eeg_demo_gbm_grid_model_7 0.402409215186705
6      50         3        0.2 eeg_demo_gbm_grid_model_13 0.455260965151754
7     100         3        0.1  eeg_demo_gbm_grid_model_5 0.463893147947061
8      50         3        0.1  eeg_demo_gbm_grid_model_4  0.51734929422505
9     100         2        0.2 eeg_demo_gbm_grid_model_11 0.530497456235128
10      5         5        0.2 eeg_demo_gbm_grid_model_15 0.548389974989351
11     50         2        0.2 eeg_demo_gbm_grid_model_10 0.561668599565429
12    100         2        0.1  eeg_demo_gbm_grid_model_2 0.564235794490373
13     50         2        0.1  eeg_demo_gbm_grid_model_1 0.594214675563477
14      5         5        0.1  eeg_demo_gbm_grid_model_6 0.600327168524549
15      5         3        0.2 eeg_demo_gbm_grid_model_12 0.610367851324487
16      5         3        0.1  eeg_demo_gbm_grid_model_3 0.642100038024138
17      5         2        0.2  eeg_demo_gbm_grid_model_9 0.647268487315379
18      5         2        0.1  eeg_demo_gbm_grid_model_0 0.663560995637836
By default, grids of models will return the grid results sorted by (increasing) logloss on the validation set. However, if we are interested in sorting on another model performance metric, we can do that using the h2o.getGrid function as follows:
# print out the auc for all of the models
auc_table <- h2o.getGrid(grid_id = "eeg_demo_gbm_grid", sort_by = "auc", decreasing = TRUE)
print(auc_table)
H2O Grid Details
================
Grid ID: eeg_demo_gbm_grid
Used hyper parameters:
  -  ntrees
  -  max_depth
  -  learn_rate
Number of models: 18
Number of failed models: 0

Hyper-Parameter Search Summary: ordered by decreasing auc
   ntrees max_depth learn_rate                  model_ids               auc
1     100         5        0.2 eeg_demo_gbm_grid_model_17 0.967771493797284
2      50         5        0.2 eeg_demo_gbm_grid_model_16 0.949609591795923
3     100         5        0.1  eeg_demo_gbm_grid_model_8  0.94941792664595
4      50         5        0.1  eeg_demo_gbm_grid_model_7 0.922075196552274
5     100         3        0.2 eeg_demo_gbm_grid_model_14 0.913785959685157
6      50         3        0.2 eeg_demo_gbm_grid_model_13 0.887706691652792
7     100         3        0.1  eeg_demo_gbm_grid_model_5 0.884064379717198
8       5         5        0.2 eeg_demo_gbm_grid_model_15 0.851187402678818
9      50         3        0.1  eeg_demo_gbm_grid_model_4 0.848921799270639
10      5         5        0.1  eeg_demo_gbm_grid_model_6 0.825662907513139
11    100         2        0.2 eeg_demo_gbm_grid_model_11 0.812030639460551
12     50         2        0.2 eeg_demo_gbm_grid_model_10 0.785379521713437
13    100         2        0.1  eeg_demo_gbm_grid_model_2  0.78299280750123
14      5         3        0.2 eeg_demo_gbm_grid_model_12 0.774673686150002
15     50         2        0.1  eeg_demo_gbm_grid_model_1 0.754834657912535
16      5         3        0.1  eeg_demo_gbm_grid_model_3 0.749285131682721
17      5         2        0.2  eeg_demo_gbm_grid_model_9 0.692702793188135
18      5         2        0.1  eeg_demo_gbm_grid_model_0 0.676144542037133
The "best" model in terms of validation set AUC is listed first in auc_table.
best_model <- h2o.getModel(auc_table@model_ids[[1]])
h2o.auc(best_model, valid = TRUE) #Validation AUC for best model
The last thing we may want to do is generate predictions on the test set using the "best" model, and evaluate the test set AUC.
best_perf <- h2o.performance(model = best_model, newdata = test)
h2o.auc(best_perf)
The test set AUC is approximately 0.97. Not bad!!