In statistics and machine learning, ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models (Wikipedia, 2015). This notebook demonstrates an easy way to carry out ensemble learning with H2O models using h2oEnsemble.
We give our users the ability to build, compare and stack different H2O, MXNet, TensorFlow and Caffe models quickly and easily using the H2O platform.
We need three R packages for this demo: h2o, h2oEnsemble and mlbench.
# Load R Packages
suppressPackageStartupMessages(library(h2o))
suppressPackageStartupMessages(library(mlbench)) # for Boston Housing Data
# Install h2oEnsemble from GitHub if needed
# Reference: https://github.com/h2oai/h2o-3/tree/master/h2o-r/ensemble
if (!require(h2oEnsemble)) {
install.packages("https://h2o-release.s3.amazonaws.com/h2o-ensemble/R/h2oEnsemble_0.1.8.tar.gz", repos = NULL)
}
suppressPackageStartupMessages(library(h2oEnsemble)) # for model stacking
The dataset used in this demo is Boston Housing from mlbench; it contains housing values in suburbs of Boston.
Reference: UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Housing)
Source: This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
Creator: Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978.
Type: Regression
Dimensions: 506 instances, 13 features and 1 numeric target.
13 Features: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b and lstat (in the mlbench version, chas is a factor and the rest are numeric).
Target: medv, the median value of owner-occupied homes in $1000s.
# Import data
data(BostonHousing)
head(BostonHousing)
dim(BostonHousing)
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00632 | 18 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
0.02731 | 0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
0.02729 | 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
0.03237 | 0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
0.06905 | 0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
0.02985 | 0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
We want to evaluate the predictive performance on a holdout dataset. The following code splits the Boston Housing data randomly into a training set (400 rows) and a test set (the remaining 106 rows):
# Split data
set.seed(1234)
row_train <- sample(1:nrow(BostonHousing), 400)
train <- BostonHousing[row_train,]
test <- BostonHousing[-row_train,]
# Training data - quick summary
dim(train)
head(train)
summary(train)
  | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58 | 0.01432 | 100 | 1.32 | 0 | 0.411 | 6.816 | 40.5 | 8.3248 | 5 | 256 | 15.1 | 392.90 | 3.95 | 31.6 |
315 | 0.36920 | 0 | 9.90 | 0 | 0.544 | 6.567 | 87.3 | 3.6023 | 4 | 304 | 18.4 | 395.69 | 9.28 | 23.8 |
308 | 0.04932 | 33 | 2.18 | 0 | 0.472 | 6.849 | 70.3 | 3.1827 | 7 | 222 | 18.4 | 396.90 | 7.53 | 28.2 |
314 | 0.26938 | 0 | 9.90 | 0 | 0.544 | 6.266 | 82.8 | 3.2628 | 4 | 304 | 18.4 | 393.39 | 7.90 | 21.6 |
433 | 6.44405 | 0 | 18.10 | 0 | 0.584 | 6.425 | 74.8 | 2.2004 | 24 | 666 | 20.2 | 97.95 | 12.03 | 16.1 |
321 | 0.16760 | 0 | 7.38 | 0 | 0.493 | 6.426 | 52.3 | 4.5404 | 5 | 287 | 19.6 | 396.90 | 7.20 | 23.8 |
      crim                zn             indus       chas         nox
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:370   Min.   :0.3850
 1st Qu.: 0.07782   1st Qu.:  0.00   1st Qu.: 5.13   1: 30   1st Qu.:0.4520
 Median : 0.24751   Median :  0.00   Median : 8.56           Median :0.5380
 Mean   : 3.33351   Mean   : 12.01   Mean   :10.98           Mean   :0.5549
 3rd Qu.: 3.48946   3rd Qu.: 18.50   3rd Qu.:18.10           3rd Qu.:0.6258
 Max.   :73.53410   Max.   :100.00   Max.   :27.74           Max.   :0.8710
       rm             age              dis              rad
 Min.   :3.561   Min.   :  6.20   Min.   : 1.130   Min.   : 1.00
 1st Qu.:5.883   1st Qu.: 47.08   1st Qu.: 2.103   1st Qu.: 4.00
 Median :6.205   Median : 77.75   Median : 3.239   Median : 5.00
 Mean   :6.273   Mean   : 69.25   Mean   : 3.824   Mean   : 9.44
 3rd Qu.:6.626   3rd Qu.: 94.03   3rd Qu.: 5.234   3rd Qu.:24.00
 Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00
      tax           ptratio            b              lstat
 Min.   :187.0   Min.   :12.60   Min.   :  2.52   Min.   : 1.73
 1st Qu.:279.0   1st Qu.:17.40   1st Qu.:376.46   1st Qu.: 7.17
 Median :330.0   Median :19.10   Median :391.99   Median :11.25
 Mean   :404.8   Mean   :18.52   Mean   :359.94   Mean   :12.61
 3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.54   3rd Qu.:16.43
 Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97
      medv
 Min.   : 5.00
 1st Qu.:17.27
 Median :21.15
 Mean   :22.51
 3rd Qu.:24.85
 Max.   :50.00
# Test data - quick summary
dim(test)
head(test)
summary(test)
  | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
10 | 0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.10 | 18.9 |
13 | 0.09378 | 12.5 | 7.87 | 0 | 0.524 | 5.889 | 39.0 | 5.4509 | 5 | 311 | 15.2 | 390.50 | 15.71 | 21.7 |
18 | 0.78420 | 0.0 | 8.14 | 0 | 0.538 | 5.990 | 81.7 | 4.2579 | 4 | 307 | 21.0 | 386.75 | 14.67 | 17.5 |
24 | 0.98843 | 0.0 | 8.14 | 0 | 0.538 | 5.813 | 100.0 | 4.0952 | 4 | 307 | 21.0 | 394.54 | 19.88 | 14.5 |
28 | 0.95577 | 0.0 | 8.14 | 0 | 0.538 | 6.047 | 88.8 | 4.4534 | 4 | 307 | 21.0 | 306.38 | 17.28 | 14.8 |
      crim                zn              indus        chas         nox
 Min.   : 0.00906   Min.   : 0.000   Min.   : 0.740   0:101   Min.   :0.4000
 1st Qu.: 0.09535   1st Qu.: 0.000   1st Qu.: 5.945   1:  5   1st Qu.:0.4480
 Median : 0.30770   Median : 0.000   Median :10.300           Median :0.5350
 Mean   : 4.67018   Mean   : 8.929   Mean   :11.720           Mean   :0.5540
 3rd Qu.: 4.86247   3rd Qu.: 0.000   3rd Qu.:18.100           3rd Qu.:0.6128
 Max.   :88.97620   Max.   :95.000   Max.   :27.740           Max.   :0.8710
       rm             age              dis             rad
 Min.   :4.926   Min.   :  2.90   Min.   :1.202   Min.   : 1.000
 1st Qu.:5.910   1st Qu.: 37.98   1st Qu.:2.084   1st Qu.: 4.000
 Median :6.231   Median : 76.35   Median :3.117   Median : 5.000
 Mean   :6.330   Mean   : 66.01   Mean   :3.686   Mean   : 9.962
 3rd Qu.:6.562   3rd Qu.: 94.35   3rd Qu.:4.906   3rd Qu.:24.000
 Max.   :8.398   Max.   :100.00   Max.   :9.188   Max.   :24.000
      tax           ptratio            b              lstat
 Min.   :193.0   Min.   :13.00   Min.   :  0.32   Min.   : 2.960
 1st Qu.:287.5   1st Qu.:16.60   1st Qu.:368.61   1st Qu.: 6.758
 Median :367.5   Median :18.40   Median :389.75   Median :11.690
 Mean   :421.3   Mean   :18.23   Mean   :344.37   Mean   :12.806
 3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:395.49   3rd Qu.:17.407
 Max.   :711.0   Max.   :21.20   Max.   :396.90   Max.   :30.810
      medv
 Min.   : 5.00
 1st Qu.:15.72
 Median :21.45
 Mean   :22.61
 3rd Qu.:26.57
 Max.   :50.00
We are now ready to train regression models using different algorithms in H2O.
Note 1: Although the three algorithms used in this example are different, the core parameters are consistent (see below). This allows H2O users to get quick and easy access to different existing (and future) algorithms with a very shallow learning curve. The core parameters are:

- x = features
- y = target
- training_frame = h_train
Note 2: For model stacking, we need to generate holdout predictions from cross-validation. The parameters required for model stacking are:

- nfolds = 5
- fold_assignment = 'Modulo'
- keep_cross_validation_predictions = TRUE
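Before the R data frames can be converted with as.h2o(), an H2O cluster must be running. If one was not started earlier in the notebook, a minimal sketch for launching a local cluster looks like this (the nthreads setting is an illustrative choice, not something prescribed by the original demo):
# Start (or connect to) a local H2O cluster if one is not already running
# nthreads = -1 uses all available CPU cores (illustrative setting)
h2o.init(nthreads = -1)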
# Convert R data frames into H2O data frames
h_train <- as.h2o(train)
h_test <- as.h2o(test)
# Regression - define features (x) and target (y)
target <- "medv"
features <- setdiff(colnames(train), target)
print(features)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age" [8] "dis" "rad" "tax" "ptratio" "b" "lstat"
For more information, enter ?h2o.gbm in R to look at the full list of parameters.
# Train a H2O GBM model
model_gbm <- h2o.gbm(x = features, y = target,
training_frame = h_train,
model_id = "h2o_gbm",
learn_rate = 0.1,
learn_rate_annealing = 0.99,
sample_rate = 0.8,
col_sample_rate = 0.8,
nfolds = 5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
ntrees = 100)
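Before moving on to the next algorithm, it can be helpful to check a base learner's cross-validated error. A minimal sketch using h2o.performance, where xval = TRUE reports the metrics computed from the 5-fold cross-validation requested above:
# Inspect the GBM's cross-validated performance (optional)
h2o.performance(model_gbm, xval = TRUE)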
For more information, enter ?h2o.randomForest in R to look at the full list of parameters.
# Train a H2O DRF model
model_drf <- h2o.randomForest(x = features, y = target,
training_frame = h_train,
model_id = "h2o_drf",
nfolds = 5,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE,
ntrees = 100)
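The stacking step below also uses a third model, model_dw (reported as h2o_deepwater in the results), whose training cell is not reproduced in this section. A minimal sketch of how such a model might be trained, assuming a Deep Water-enabled H2O build and illustrative parameter values rather than the exact settings behind the reported numbers:
# Train a H2O Deep Water model (requires a Deep Water-enabled H2O build)
# Note: epochs below is an illustrative assumption, not necessarily the
# value used to produce the results shown later
model_dw <- h2o.deepwater(x = features, y = target,
                          training_frame = h_train,
                          model_id = "h2o_deepwater",
                          epochs = 20,
                          nfolds = 5,
                          fold_assignment = "Modulo",
                          keep_cross_validation_predictions = TRUE)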
Now that we have three different models, we are ready to carry out model stacking.
# Create a list to include all the models for stacking
models <- list(model_dw, model_gbm, model_drf)
# Define a metalearner (one of the H2O supervised machine learning algorithms)
metalearner <- "h2o.glm.wrapper"
# Use h2o.stack() to carry out metalearning
stack <- h2o.stack(models = models,
response_frame = h_train$medv,
metalearner = metalearner)
[1] "Metalearning"
# Finally, we evaluate the predictive performance of the ensemble as well as the individual models.
h2o.ensemble_performance(stack, newdata = h_test)
Base learner performance, sorted by specified metric:
         learner      MSE
1  h2o_deepwater 8.377644
2        h2o_gbm 8.106541
3        h2o_drf 7.443517


H2O Ensemble Performance on <newdata>:
----------------
Family: gaussian

Ensemble performance (MSE): 5.80436983051916
# Use the ensemble to make predictions
yhat_test <- predict(stack, h_test)
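In the h2oEnsemble versions this demo is based on, predict() on a stacked ensemble returns a list holding the ensemble predictions (pred) alongside the base learner predictions (basepred); the field names below are taken from that documentation, so check str(yhat_test) if your version differs.
# Pull the ensemble predictions back into an R data frame
# (the pred / basepred field names are assumptions from the h2oEnsemble docs)
str(yhat_test, max.level = 1)
head(as.data.frame(yhat_test$pred))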