Author: Spencer Aiello
Contact: spencer@h2oai.com
This tutorial is a quick introduction to H2O's Python API. Its goal is to introduce H2O's capabilities from Python through a complete example. To help readers who are accustomed to scikit-learn and Pandas, the demo includes specific callouts for differences between H2O and those packages; this is intended to help anyone who needs to do machine learning on really big data make the transition. It is not meant to be a tutorial on machine learning or algorithms.
Detailed documentation about H2O and its Python API is available at http://docs.h2o.ai.
The following code creates two CSV files from the Boston Housing dataset, which is built into scikit-learn, and writes them to the local directory.
import pandas as pd
import numpy
from numpy.random import choice
from sklearn.datasets import load_boston
from h2o.estimators.random_forest import H2ORandomForestEstimator
import h2o
h2o.init()
No instance found at ip and port: localhost:54321. Trying to start local jar...
JVM stdout: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpGp_6cd/h2o_me_started_from_python.out
JVM stderr: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpCRsAhn/h2o_me_started_from_python.err
Using ice_root: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpnsOsd7
Java Version: java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
Starting H2O JVM and connecting: .......... Connection successful!
H2O cluster uptime: | 958 milliseconds |
H2O cluster version: | 3.7.0.99999 |
H2O cluster name: | H2O_started_from_python |
H2O cluster total nodes: | 1 |
H2O cluster total memory: | 3.56 GB |
H2O cluster total cores: | 8 |
H2O cluster allowed cores: | 8 |
H2O cluster healthy: | True |
H2O Connection ip: | 127.0.0.1 |
H2O Connection port: | 54321 |
# transfer the boston data from pandas to H2O
boston_data = load_boston()
X = pd.DataFrame(data=boston_data.data, columns=boston_data.feature_names)
X["Median_value"] = boston_data.target
X = h2o.H2OFrame.from_python(X.to_dict("list"))
Parse Progress: [##################################################] 100%
# select 10% for validation
r = X.runif(seed=123456789)
train = X[r < 0.9,:]
valid = X[r >= 0.9,:]
h2o.export_file(train, "Boston_housing_train.csv", force=True)
h2o.export_file(valid, "Boston_housing_test.csv", force=True)
Export File Progress: [##################################################] 100% Export File Progress: [##################################################] 100%
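The split above works by drawing one uniform random number per row and comparing it with a threshold. The following is a minimal plain-Python sketch of that idea, not H2O's implementation; the helper name uniform_split is hypothetical, and the list of 506 integers only stands in for the rows of the Boston data.

```python
import random


def uniform_split(rows, train_fraction=0.9, seed=123456789):
    """Split rows by drawing one uniform number in [0, 1) per row."""
    rng = random.Random(seed)
    draws = [rng.random() for _ in rows]
    # Rows whose draw falls below the threshold go to the training set.
    train = [row for row, u in zip(rows, draws) if u < train_fraction]
    valid = [row for row, u in zip(rows, draws) if u >= train_fraction]
    return train, valid


rows = list(range(506))  # stand-in for the 506 rows of the Boston data
train_rows, valid_rows = uniform_split(rows)
print(len(train_rows), len(valid_rows))
```

Because each row is assigned independently, the split is only approximately 90/10; the exact sizes depend on the seed.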
Enable inline plotting in the Jupyter Notebook
%matplotlib inline
import matplotlib.pyplot as plt
Read the CSV data into H2O. This loads the data into H2O's column-compressed, in-memory, key-value store.
fr = h2o.import_file("Boston_housing_train.csv")
Parse Progress: [##################################################] 100%
View the top of the H2O frame.
fr.head()
CRIM | ZN | B | LSTAT | Median_value | AGE | TAX | RAD | CHAS | NOX | RM | INDUS | PTRATIO | DIS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.00632 | 18 | 396.9 | 4.98 | 24 | 65.2 | 296 | 1 | 0 | 0.538 | 6.575 | 2.31 | 15.3 | 4.09 |
0.02729 | 0 | 392.83 | 4.03 | 34.7 | 61.1 | 242 | 2 | 0 | 0.469 | 7.185 | 7.07 | 17.8 | 4.9671 |
0.03237 | 0 | 394.63 | 2.94 | 33.4 | 45.8 | 222 | 3 | 0 | 0.458 | 6.998 | 2.18 | 18.7 | 6.0622 |
0.06905 | 0 | 396.9 | 5.33 | 36.2 | 54.2 | 222 | 3 | 0 | 0.458 | 7.147 | 2.18 | 18.7 | 6.0622 |
0.02985 | 0 | 394.12 | 5.21 | 28.7 | 58.7 | 222 | 3 | 0 | 0.458 | 6.43 | 2.18 | 18.7 | 6.0622 |
0.08829 | 12.5 | 395.6 | 12.43 | 22.9 | 66.6 | 311 | 5 | 0 | 0.524 | 6.012 | 7.87 | 15.2 | 5.5605 |
0.14455 | 12.5 | 396.9 | 19.15 | 27.1 | 96.1 | 311 | 5 | 0 | 0.524 | 6.172 | 7.87 | 15.2 | 5.9505 |
0.21124 | 12.5 | 386.63 | 29.93 | 16.5 | 100 | 311 | 5 | 0 | 0.524 | 5.631 | 7.87 | 15.2 | 6.0821 |
0.17004 | 12.5 | 386.71 | 17.1 | 18.9 | 85.9 | 311 | 5 | 0 | 0.524 | 6.004 | 7.87 | 15.2 | 6.5921 |
0.22489 | 12.5 | 392.52 | 20.45 | 15 | 94.3 | 311 | 5 | 0 | 0.524 | 6.377 | 7.87 | 15.2 | 6.3467 |
View the bottom of the H2O Frame
fr.tail()
CRIM | ZN | B | LSTAT | Median_value | AGE | TAX | RAD | CHAS | NOX | RM | INDUS | PTRATIO | DIS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.2896 | 0 | 396.9 | 21.14 | 19.7 | 72.9 | 391 | 6 | 0 | 0.585 | 5.39 | 9.69 | 19.2 | 2.7986 |
0.26838 | 0 | 396.9 | 14.1 | 18.3 | 70.6 | 391 | 6 | 0 | 0.585 | 5.794 | 9.69 | 19.2 | 2.8927 |
0.23912 | 0 | 396.9 | 12.92 | 21.2 | 65.3 | 391 | 6 | 0 | 0.585 | 6.019 | 9.69 | 19.2 | 2.4091 |
0.17783 | 0 | 395.77 | 15.1 | 17.5 | 73.5 | 391 | 6 | 0 | 0.585 | 5.569 | 9.69 | 19.2 | 2.3999 |
0.22438 | 0 | 396.9 | 14.33 | 16.8 | 79.7 | 391 | 6 | 0 | 0.585 | 6.027 | 9.69 | 19.2 | 2.4982 |
0.06263 | 0 | 391.99 | 9.67 | 22.4 | 69.1 | 273 | 1 | 0 | 0.573 | 6.593 | 11.93 | 21 | 2.4786 |
0.04527 | 0 | 396.9 | 9.08 | 20.6 | 76.7 | 273 | 1 | 0 | 0.573 | 6.12 | 11.93 | 21 | 2.2875 |
0.06076 | 0 | 396.9 | 5.64 | 23.9 | 91 | 273 | 1 | 0 | 0.573 | 6.976 | 11.93 | 21 | 2.1675 |
0.10959 | 0 | 393.45 | 6.48 | 22 | 89.3 | 273 | 1 | 0 | 0.573 | 6.794 | 11.93 | 21 | 2.3889 |
0.04741 | 0 | 396.9 | 7.88 | 11.9 | 80.8 | 273 | 1 | 0 | 0.573 | 6.03 | 11.93 | 21 | 2.505 |
Select a column
fr["VAR_NAME"]
fr["CRIM"].head() # Tab completes
CRIM |
---|
0.00632 |
0.02729 |
0.03237 |
0.06905 |
0.02985 |
0.08829 |
0.14455 |
0.21124 |
0.17004 |
0.22489 |
Select a few columns
columns = ["CRIM", "RM", "RAD"]
fr[columns].head()
CRIM | RM | RAD |
---|---|---|
0.00632 | 6.575 | 1 |
0.02729 | 7.185 | 2 |
0.03237 | 6.998 | 3 |
0.06905 | 7.147 | 3 |
0.02985 | 6.43 | 3 |
0.08829 | 6.012 | 5 |
0.14455 | 6.172 | 5 |
0.21124 | 5.631 | 5 |
0.17004 | 6.004 | 5 |
0.22489 | 6.377 | 5 |
Select a subset of rows
Unlike in Pandas, columns may be identified by index or column name. Therefore, when subsetting by rows, you must also pass the column selection.
fr[2:7,:] # explicitly select all columns with :
CRIM | ZN | B | LSTAT | Median_value | AGE | TAX | RAD | CHAS | NOX | RM | INDUS | PTRATIO | DIS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.03237 | 0 | 394.63 | 2.94 | 33.4 | 45.8 | 222 | 3 | 0 | 0.458 | 6.998 | 2.18 | 18.7 | 6.0622 |
0.06905 | 0 | 396.9 | 5.33 | 36.2 | 54.2 | 222 | 3 | 0 | 0.458 | 7.147 | 2.18 | 18.7 | 6.0622 |
0.02985 | 0 | 394.12 | 5.21 | 28.7 | 58.7 | 222 | 3 | 0 | 0.458 | 6.43 | 2.18 | 18.7 | 6.0622 |
0.08829 | 12.5 | 395.6 | 12.43 | 22.9 | 66.6 | 311 | 5 | 0 | 0.524 | 6.012 | 7.87 | 15.2 | 5.5605 |
0.14455 | 12.5 | 396.9 | 19.15 | 27.1 | 96.1 | 311 | 5 | 0 | 0.524 | 6.172 | 7.87 | 15.2 | 5.9505 |
Key attributes:
* columns, names, col_names
* len, shape, dim, nrow, ncol
* types
Note:
Since the data is not in local Python memory, there is no "values" attribute. If you want to pull all of the data into local Python memory, do so explicitly by calling h2o.export_file and then reading the exported file from disk.
# The columns attribute is exactly like Pandas
print("Columns:", fr.columns, "\n")
print("Columns:", fr.names, "\n")
print("Columns:", fr.col_names, "\n")
# There are a number of attributes to get at the shape
print("length:", str( len(fr) ), "\n")
print("shape:", fr.shape, "\n")
print("dim:", fr.dim, "\n")
print("nrow:", fr.nrow, "\n")
print("ncol:", fr.ncol, "\n")
# Use the "types" attribute to list the column types
print("types:", fr.types, "\n")
Columns: [u'CRIM', u'ZN', u'B', u'LSTAT', u'Median_value', u'AGE', u'TAX', u'RAD', u'CHAS', u'NOX', u'RM', u'INDUS', u'PTRATIO', u'DIS']
Columns: [u'CRIM', u'ZN', u'B', u'LSTAT', u'Median_value', u'AGE', u'TAX', u'RAD', u'CHAS', u'NOX', u'RM', u'INDUS', u'PTRATIO', u'DIS']
Columns: [u'CRIM', u'ZN', u'B', u'LSTAT', u'Median_value', u'AGE', u'TAX', u'RAD', u'CHAS', u'NOX', u'RM', u'INDUS', u'PTRATIO', u'DIS']
length: 462
shape: (462, 14)
dim: [462, 14]
nrow: 462
ncol: 14
types: {u'CRIM': u'real', u'ZN': u'real', u'B': u'real', u'LSTAT': u'real', u'TAX': u'int', u'AGE': u'real', u'Median_value': u'real', u'RAD': u'int', u'CHAS': u'int', u'NOX': u'real', u'RM': u'real', u'INDUS': u'real', u'PTRATIO': u'real', u'DIS': u'real'}
Select rows based on value
fr.shape
(462, 14)
Boolean masks can be used to subselect rows based on a criterion.
mask = fr["CRIM"]>1
fr[mask,:].shape
(155, 14)
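For readers new to boolean masks, here is a small plain-Python sketch of what the mask expression above computes: one True/False value per row, which is then used to keep or drop that row. The rows are illustrative; the two with CRIM greater than 1 are made-up values, not taken from the dataset.

```python
# Toy rows; the CRIM values above 1 are hypothetical, for illustration only.
rows = [
    {"CRIM": 0.00632, "RM": 6.575},
    {"CRIM": 1.5, "RM": 6.1},
    {"CRIM": 0.02729, "RM": 7.185},
    {"CRIM": 3.2, "RM": 5.9},
]

# The mask holds one boolean per row, like fr["CRIM"] > 1 in the H2O code.
mask = [row["CRIM"] > 1 for row in rows]

# Keeping only the rows where the mask is True mirrors fr[mask, :].
selected = [row for row, keep in zip(rows, mask) if keep]
print(mask, len(selected))
```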
Get summary statistics of the data and additional data distribution information.
fr.describe()
Rows:462 Cols:14 Chunk compression summary:
chunk_type | chunk_name | count | count_percentage | size | size_percentage |
CBS | Bits | 1 | 7.1428576 | 128 B | 0.4 |
C1N | 1-Byte Integers (w/o NAs) | 1 | 7.1428576 | 530 B | 1.6260661 |
C2 | 2-Byte Integers | 1 | 7.1428576 | 992 B | 3.043505 |
C2S | 2-Byte Fractions | 1 | 7.1428576 | 1008 B | 3.0925937 |
CUD | Unique Reals | 4 | 28.57143 | 7.2 KB | 22.5563 |
C8D | 64-bit Reals | 6 | 42.857143 | 22.1 KB | 69.288826 |
Frame distribution summary:
size | number_of_rows | number_of_chunks_per_column | number_of_chunks | |
172.16.2.40:54321 | 31.8 KB | 462.0 | 1.0 | 14.0 |
mean | 31.8 KB | 462.0 | 1.0 | 14.0 |
min | 31.8 KB | 462.0 | 1.0 | 14.0 |
max | 31.8 KB | 462.0 | 1.0 | 14.0 |
stddev | 0 B | 0.0 | 0.0 | 0.0 |
total | 31.8 KB | 462.0 | 1.0 | 14.0 |
CRIM | ZN | B | LSTAT | Median_value | AGE | TAX | RAD | CHAS | NOX | RM | INDUS | PTRATIO | DIS | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
type | real | real | real | real | real | real | int | int | int | real | real | real | real | real |
mins | 0.00632 | 0.0 | 0.32 | 1.73 | 5.0 | 6.0 | 187.0 | 1.0 | 0.0 | 0.385 | 3.561 | 0.46 | 12.6 | 1.1296 |
mean | 3.56810333333 | 11.0865800866 | 357.205194805 | 12.6678787879 | 22.6028138528 | 68.911038961 | 407.554112554 | 9.44155844156 | 0.0627705627706 | 0.555068831169 | 6.28235930736 | 11.1380519481 | 18.45995671 | 3.79056731602 |
maxs | 88.9762 | 100.0 | 396.9 | 37.97 | 50.0 | 100.0 | 711.0 | 24.0 | 1.0 | 0.871 | 8.78 | 27.74 | 22.0 | 12.1265 |
sigma | 8.68268014543 | 23.2086423052 | 90.7500779002 | 7.1419482934 | 9.21258527358 | 27.9631409743 | 167.460295078 | 8.64357146773 | 0.242812755044 | 0.115349440715 | 0.707139172922 | 6.85982058776 | 2.16522966932 | 2.11032018051 |
zeros | 0 | 343 | 0 | 0 | 0 | 0 | 0 | 0 | 433 | 0 | 0 | 0 | 0 | 0 |
missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0.00632 | 18.0 | 396.9 | 4.98 | 24.0 | 65.2 | 296.0 | 1.0 | 0.0 | 0.538 | 6.575 | 2.31 | 15.3 | 4.09 |
1 | 0.02729 | 0.0 | 392.83 | 4.03 | 34.7 | 61.1 | 242.0 | 2.0 | 0.0 | 0.469 | 7.185 | 7.07 | 17.8 | 4.9671 |
2 | 0.03237 | 0.0 | 394.63 | 2.94 | 33.4 | 45.8 | 222.0 | 3.0 | 0.0 | 0.458 | 6.998 | 2.18 | 18.7 | 6.0622 |
3 | 0.06905 | 0.0 | 396.9 | 5.33 | 36.2 | 54.2 | 222.0 | 3.0 | 0.0 | 0.458 | 7.147 | 2.18 | 18.7 | 6.0622 |
4 | 0.02985 | 0.0 | 394.12 | 5.21 | 28.7 | 58.7 | 222.0 | 3.0 | 0.0 | 0.458 | 6.43 | 2.18 | 18.7 | 6.0622 |
5 | 0.08829 | 12.5 | 395.6 | 12.43 | 22.9 | 66.6 | 311.0 | 5.0 | 0.0 | 0.524 | 6.012 | 7.87 | 15.2 | 5.5605 |
6 | 0.14455 | 12.5 | 396.9 | 19.15 | 27.1 | 96.1 | 311.0 | 5.0 | 0.0 | 0.524 | 6.172 | 7.87 | 15.2 | 5.9505 |
7 | 0.21124 | 12.5 | 386.63 | 29.93 | 16.5 | 100.0 | 311.0 | 5.0 | 0.0 | 0.524 | 5.631 | 7.87 | 15.2 | 6.0821 |
8 | 0.17004 | 12.5 | 386.71 | 17.1 | 18.9 | 85.9 | 311.0 | 5.0 | 0.0 | 0.524 | 6.004 | 7.87 | 15.2 | 6.5921 |
9 | 0.22489 | 12.5 | 392.52 | 20.45 | 15.0 | 94.3 | 311.0 | 5.0 | 0.0 | 0.524 | 6.377 | 7.87 | 15.2 | 6.3467 |
Set up the predictor and response column names
With H2O algorithms, it is easier to reference predictor and response columns by name in a single frame (i.e., don't split the data into X and y).
x = fr.names[:]
y="Median_value"
x.remove(y)
H2O is a machine learning library built in Java with interfaces in Python, R, Scala, and JavaScript. It is open source and well-documented.
Unlike scikit-learn, H2O allows for categorical and missing data.
The basic workflow is as follows:
# Define and fit first 400 points
model = H2ORandomForestEstimator(seed=42)
model.train(x=x, y=y, training_frame=fr[:400,:])
drf Model Build Progress: [##################################################] 100%
model.predict(fr[400:fr.nrow,:]) # Predict the rest
predict |
---|
12.736 |
10.1 |
10.048 |
12.742 |
10.498 |
14.902 |
17.218 |
15.148 |
14.738 |
16.491 |
The performance of the model can be checked using the holdout dataset
perf = model.model_performance(fr[400:fr.nrow,:])
perf.r2() # get the r2 on the holdout data
perf.mse() # get the mse on the holdout data
perf # display the performance object
ModelMetricsRegression: drf ** Reported on test data. ** MSE: 13.4756382476 R^2: 0.405996106866 Mean Residual Deviance: 13.4756382476
Instead of taking the first 400 observations for training, we can use H2O to create a random train/test split of the data.
r = fr.runif(seed=12345) # build random uniform column over [0,1]
train= fr[r<0.75,:] # perform a 75-25 split
test = fr[r>=0.75,:]
model = H2ORandomForestEstimator(seed=42)
model.train(x=x, y=y, training_frame=train, validation_frame=test)
perf = model.model_performance(test)
perf.r2()
drf Model Build Progress: [##################################################] 100%
0.8530416308371256
The R^2 value jumped substantially. This is because the original data is not shuffled, so the first 400 rows are not representative of the remaining rows.
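To make the effect of row order concrete, here is a small plain-Python sketch. It uses the standard definition R^2 = 1 - SS_res / SS_tot and a deliberately trivial "model" that always predicts the training mean. The targets are made-up ordered values, not the Boston data, and r2_score here is a local helper, not H2O's h2o_r2_score.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_baseline_r2(train, holdout):
    """Score a trivial model that always predicts the training mean."""
    prediction = sum(train) / len(train)
    return r2_score(holdout, [prediction] * len(holdout))


targets = [float(v) for v in range(20)]  # toy targets that are sorted

# Ordered split: train on the first 15 rows, hold out the last 5.
r2_ordered = mean_baseline_r2(targets[:15], targets[15:])

# Interleaved split: hold out every 4th row, so both parts cover the range.
holdout = targets[3::4]
train = [t for i, t in enumerate(targets) if i % 4 != 3]
r2_interleaved = mean_baseline_r2(train, holdout)

print(r2_ordered, r2_interleaved)
```

With sorted data, the holdout taken from the tail looks nothing like the training rows, so even the same trivial model scores far worse than on a holdout that spans the whole range.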
H2O's machine learning algorithms take an optional parameter nfolds to specify the number of cross-validation folds to build. H2O's cross-validation uses an internal weight vector to build the folds in an efficient manner (instead of physically building the splits).
In conjunction with the nfolds parameter, a user may specify how observations are assigned to each fold with the fold_assignment parameter, which can be set to one of:
* AUTO: Perform random assignment
* Random: Each row has an equal (1/nfolds) chance of being in any fold.
* Modulo: Observations are assigned to folds based on the row index modulo nfolds
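The two explicit assignment schemes can be illustrated in plain Python. This is only a conceptual sketch under the assumption that Modulo uses the row index; the function names are hypothetical and H2O's internal implementation may differ.

```python
import random


def modulo_folds(n_rows, nfolds):
    """Deterministic assignment: row i goes to fold i % nfolds."""
    return [i % nfolds for i in range(n_rows)]


def random_folds(n_rows, nfolds, seed=42):
    """Random assignment: each row picks a fold uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]


print(modulo_folds(7, 3))
print(random_folds(7, 3))
```

Modulo assignment gives fold sizes that differ by at most one row, while random assignment gives sizes that are only equal on average.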
model = H2ORandomForestEstimator(nfolds=10) # build a 10-fold cross-validated model
model.train(x=x, y=y, training_frame=fr)
drf Model Build Progress: [##################################################] 100%
scores = numpy.array([m.r2() for m in model.xvals]) # iterate over the xval models using the xvals attribute
print("Expected R^2: %.2f +/- %.2f \n" % (scores.mean(), scores.std()*1.96))
print("Scores:", scores.round(2))
Expected R^2: 0.86 +/- 0.03 Scores: [ 0.83 0.87 0.84 0.87 0.86 0.88 0.86 0.85 0.87 0.87]
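The "+/-" figure printed above is 1.96 times the standard deviation of the fold scores. As a quick check, the same numbers can be reproduced from the printed (rounded) scores with the standard library; statistics.pstdev is used because numpy's std defaults to the population standard deviation.

```python
import statistics

# Fold scores copied from the output above (already rounded to 2 decimals).
scores = [0.83, 0.87, 0.84, 0.87, 0.86, 0.88, 0.86, 0.85, 0.87, 0.87]

mean_score = statistics.mean(scores)
half_width = 1.96 * statistics.pstdev(scores)  # population std, like numpy

print("Expected R^2: %.2f +/- %.2f" % (mean_score, half_width))
```

If the fold scores were roughly normally distributed, about 95% of them would fall within this band around the mean; with only ten folds this is a rough guide rather than a formal interval.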
However, you can still make use of cross_val_score from scikit-learn.
from sklearn.cross_validation import cross_val_score
from h2o.cross_validation import H2OKFold
from h2o.model.regression import h2o_r2_score
from sklearn.metrics.scorer import make_scorer
You still must use H2O to make the folds. Currently, there is no H2OStratifiedKFold. Additionally, the H2ORandomForestEstimator is similar to the scikit-learn RandomForestRegressor object with its own train method.
model = H2ORandomForestEstimator(seed=42)
scorer = make_scorer(h2o_r2_score) # make h2o_r2_score into a scikit_learn scorer
custom_cv = H2OKFold(fr, n_folds=10, seed=42) # make a cv
scores = cross_val_score(model, fr[x], fr[y], scoring=scorer, cv=custom_cv)
print("Expected R^2: %.2f +/- %.2f \n" % (scores.mean(), scores.std()*1.96))
print("Scores:", scores.round(2))
drf Model Build Progress: [##################################################] 100% drf Model Build Progress: [##################################################] 100% drf Model Build Progress: [##################################################] 100% drf Model Build Progress: [##################################################] 100% drf Model Build Progress: [##################################################] 100% drf Model Build Progress: [##################################################] 100% drf Model Build Progress: [##################################################] 100% drf Model Build Progress: [##################################################] 100% drf Model Build Progress: [##################################################] 100% drf Model Build Progress: [##################################################] 100% Expected R^2: 0.87 +/- 0.10 Scores: [ 0.84 0.88 0.87 0.93 0.82 0.76 0.85 0.92 0.91 0.89]
There isn't much difference in the R^2 value since the fold strategy is exactly the same. However, there was a major difference in terms of computation time and memory usage.
Since the progress bar printout gets annoying, let's disable it.
h2o.__PROGRESS_BAR__=False
h2o.no_progress()
Grid search in H2O is still under active development and will be available very soon. However, it is possible to make use of scikit-learn's grid search infrastructure (with some performance penalties).
from sklearn import __version__
sklearn_version = __version__
print(sklearn_version)
0.16.1
If you have 0.16.1, then your system can't handle complex randomized grid searches (it works in every other version of sklearn, including the soon to be released 0.16.2 and the older versions).
The steps to perform a randomized grid search:
All the steps will be repeated from above.
Because 0.16.1 is installed, we use scipy to define specific distributions
ADVANCED TIP:
Turn off reference counting before spawning jobs in parallel (n_jobs=-1, or n_jobs > 1), and turn it back on after the parallel job completes.
If you don't want to run jobs in parallel, don't turn off the reference counting.
Pattern is:
>>> h2o.turn_off_ref_cnts()
>>> .... parallel job ....
>>> h2o.turn_on_ref_cnts()
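One way to make sure the "turn back on" step always runs, even if the parallel job raises an exception, is to wrap the pattern in a context manager. This is a generic sketch: toggled_off is a hypothetical helper, and the two lambdas stand in for the turn-off and turn-on functions named above.

```python
from contextlib import contextmanager


@contextmanager
def toggled_off(turn_off, turn_on):
    """Call turn_off on entry and guarantee turn_on on exit."""
    turn_off()
    try:
        yield
    finally:
        # Runs whether the body finished normally or raised.
        turn_on()


events = []
with toggled_off(lambda: events.append("off"), lambda: events.append("on")):
    events.append("parallel job")  # placeholder for the real parallel work

print(events)
```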
%%time
from sklearn.grid_search import RandomizedSearchCV # Import grid search
from scipy.stats import randint, uniform
model = H2ORandomForestEstimator(seed=42) # Define model
params = {"ntrees": randint(20,30),
"max_depth": randint(1,10),
"min_rows": randint(1,10), # scikit's min_samples_leaf
"mtries": randint(2,fr[x].shape[1]),} # Specify parameters to test
scorer = make_scorer(h2o_r2_score) # make h2o_r2_score into a scikit_learn scorer
custom_cv = H2OKFold(fr, n_folds=5, seed=42) # make a cv
random_search = RandomizedSearchCV(model, params,
n_iter=10,
scoring=scorer,
cv=custom_cv,
random_state=42,
n_jobs=1) # Define grid search object
random_search.fit(fr[x], fr[y])
print("Best R^2:", random_search.best_score_, "\n")
print("Best params:", random_search.best_params_)
Best R^2: 0.834051920102
Best params: {'mtries': 3, 'ntrees': 36, 'min_rows': 1, 'max_depth': 6}
CPU times: user 1min 11s, sys: 2.6 s, total: 1min 13s
Wall time: 5min 48s
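Conceptually, a randomized search draws n_iter parameter settings from the given distributions, scores each one, and keeps the best. The plain-Python sketch below illustrates that loop; it is not scikit-learn's implementation. The names sample_candidates and toy_score are hypothetical, and toy_score is a made-up stand-in for a mean cross-validated R^2. Note that scipy's randint(low, high) excludes high, which random.randrange mirrors.

```python
import random


def sample_candidates(space, n_iter, seed=42):
    """Draw n_iter settings; each range is half-open [low, high)."""
    rng = random.Random(seed)
    return [
        {name: rng.randrange(low, high) for name, (low, high) in space.items()}
        for _ in range(n_iter)
    ]


def toy_score(params):
    """Hypothetical stand-in for a mean cross-validated R^2."""
    return -abs(params["max_depth"] - 6) - 0.01 * abs(params["ntrees"] - 25)


space = {"ntrees": (20, 30), "max_depth": (1, 10), "min_rows": (1, 10)}
candidates = sample_candidates(space, n_iter=10)
best = max(candidates, key=toy_score)  # keep the highest-scoring setting
print(best)
```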
We might be tempted to think that we just had a large improvement; however we must be cautious. The function below creates a more detailed report.
def report_grid_score_detail(random_search, charts=True):
    """Input fit grid search estimator. Returns df of scores with details"""
    df_list = []
    for line in random_search.grid_scores_:
        results_dict = dict(line.parameters)
        results_dict["score"] = line.mean_validation_score
        results_dict["std"] = line.cv_validation_scores.std()*1.96
        df_list.append(results_dict)
    result_df = pd.DataFrame(df_list)
    result_df = result_df.sort("score", ascending=False)
    if charts:
        for col in get_numeric(result_df):
            if col not in ["score", "std"]:
                plt.scatter(result_df[col], result_df.score)
                plt.title(col)
                plt.show()
        for col in list(result_df.columns[result_df.dtypes == "object"]):
            cat_plot = result_df.score.groupby(result_df[col]).mean()[0]
            cat_plot.sort()
            cat_plot.plot(kind="barh", xlim=(.5, None), figsize=(7, cat_plot.shape[0]/2))
            plt.show()
    return result_df
def get_numeric(X):
    """Return list of numeric dtypes variables"""
    return X.dtypes[X.dtypes.apply(lambda x: str(x).startswith(("float", "int", "bool")))].index.tolist()
report_grid_score_detail(random_search).head()
max_depth | min_rows | mtries | ntrees | score | std | |
---|---|---|---|---|---|---|
24 | 6 | 1 | 3 | 36 | 0.834052 | 0.140556 |
27 | 9 | 5 | 5 | 47 | 0.823304 | 0.163869 |
17 | 7 | 3 | 7 | 25 | 0.822285 | 0.162661 |
22 | 6 | 3 | 3 | 40 | 0.821114 | 0.134162 |
1 | 6 | 3 | 3 | 45 | 0.820230 | 0.160923 |
Based on the grid search report, we can narrow the parameters to search and rerun the analysis. The parameters below were chosen after a few runs:
%%time
params = {"ntrees": randint(30,35),
"max_depth": randint(5,8),
"mtries": randint(4,6),}
custom_cv = H2OKFold(fr, n_folds=5, seed=42) # In small datasets, the fold size can have a big
# impact on the std of the resulting scores. More
random_search = RandomizedSearchCV(model, params, # folds --> Less examples per fold --> higher
n_iter=5, # variation per sample
scoring=scorer,
cv=custom_cv,
random_state=43,
n_jobs=1)
random_search.fit(fr[x], fr[y])
print("Best R^2:", random_search.best_score_, "\n")
print("Best params:", random_search.best_params_)
report_grid_score_detail(random_search)
Best R^2: 0.847411248634 Best params: {'mtries': 9, 'ntrees': 34, 'max_depth': 6}
CPU times: user 12.1 s, sys: 419 ms, total: 12.5 s Wall time: 24.5 s
Rule of machine learning: Don't use your testing data to inform your training data. Unfortunately, this happens all the time when preparing a dataset for the final model. But on smaller datasets, you must be especially careful.
At the moment, there are no classes for managing data transformations. On the one hand, this requires the user to tote around some extra state, but on the other, it allows the user to be more explicit about transforming H2OFrames.
Basic steps:
First, let's normalize the data using the means and standard deviations of the training data. Then let's perform a principal component analysis on the training data and select the top 5 components. Finally, let's use these components to reduce the train and test design matrices.
from h2o.transforms.preprocessing import H2OScaler
from h2o.transforms.decomposition import H2OPCA
y_train = train.pop("Median_value")
y_test = test.pop("Median_value")
norm = H2OScaler()
norm.fit(train)
X_train_norm = norm.transform(train)
X_test_norm = norm.transform(test)
print(X_test_norm.shape)
X_test_norm
(122, 13)
CRIM | ZN | B | LSTAT | AGE | TAX | RAD | CHAS | NOX | RM | INDUS | PTRATIO | DIS |
---|---|---|---|---|---|---|---|---|---|---|---|---|
-24.7362 | -246.063 | 3478.06 | -52.4412 | -413.735 | -30080.2 | -51.9628 | -0.0148904 | -0.011189 | 0.630504 | -63.7738 | 0.684709 | 4.8237 |
-23.6596 | 36.577 | 2566.04 | 126.336 | 883.755 | -15288.4 | -34.9094 | -0.0148904 | -0.00359264 | -0.476113 | -23.6871 | -7.11246 | 4.86543 |
-24.3696 | 36.577 | 3478.06 | 5.2618 | 399.321 | -15288.4 | -34.9094 | -0.0148904 | -0.00359264 | -0.200189 | -23.6871 | -7.11246 | 5.16867 |
-24.549 | 36.577 | 2909.71 | 22.9942 | -844.343 | -15288.4 | -34.9094 | -0.0148904 | -0.00359264 | -0.287784 | -23.6871 | -7.11246 | 3.54176 |
-20.4906 | -246.063 | 3478.06 | -31.1478 | -198.431 | -15953.2 | -43.4361 | -0.0148904 | -0.0019813 | -0.243987 | -21.7849 | 5.80856 | 1.98279 |
-15.927 | -246.063 | 3478.06 | 44.869 | 648.62 | -15953.2 | -43.4361 | -0.0148904 | -0.0019813 | -0.103105 | -21.7849 | 5.80856 | 0.450669 |
-19.5782 | -246.063 | 3249.84 | 27.282 | 716.611 | -15953.2 | -43.4361 | -0.0148904 | -0.0019813 | -0.262235 | -21.7849 | 5.80856 | 1.3371 |
-19.4061 | -246.063 | 2682.37 | 1.84613 | 725.11 | -15953.2 | -43.4361 | -0.0148904 | -0.0019813 | 0.154571 | -21.7849 | 5.80856 | 1.45265 |
-13.047 | -246.063 | -9717.44 | 56.6422 | 795.933 | -15953.2 | -43.4361 | -0.0148904 | -0.0019813 | -0.136683 | -21.7849 | 5.80856 | -0.00460617 |
-23.9336 | -246.063 | 3169.91 | -17.5578 | -1093.64 | -20606.8 | -34.9094 | -0.0148904 | -0.00647003 | -0.231577 | -37.1433 | 1.79859 | 0.178888 |
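The key point of the normalization step is that the statistics are learned from the training data only and then reused unchanged on the test data. The plain-Python sketch below shows that fit-then-transform pattern with the usual standardization formula (subtract the mean, divide by the standard deviation). SimpleScaler is a hypothetical class for illustration and is not a description of how H2OScaler computes its output.

```python
import statistics


class SimpleScaler:
    """Learn per-column mean and std in fit; reuse them in transform."""

    def fit(self, columns):
        self.means = {name: statistics.mean(v) for name, v in columns.items()}
        self.stds = {name: statistics.pstdev(v) for name, v in columns.items()}
        return self

    def transform(self, columns):
        scaled = {}
        for name, values in columns.items():
            std = self.stds[name] if self.stds[name] > 0 else 1.0  # avoid /0
            scaled[name] = [(v - self.means[name]) / std for v in values]
        return scaled


train_cols = {"RM": [6.0, 7.0, 8.0]}  # toy values, not the real data
test_cols = {"RM": [7.0, 9.0]}

scaler = SimpleScaler().fit(train_cols)     # statistics come from train only
train_scaled = scaler.transform(train_cols)
test_scaled = scaler.transform(test_cols)   # test reuses the train statistics
print(test_scaled)
```

Fitting the scaler on the test data as well would leak information about the test set into the preprocessing.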
Then, we can apply PCA and keep the top 5 components. A user warning is expected here.
pca = H2OPCA(k=5)
pca.fit(X_train_norm)
X_train_norm_pca = pca.transform(X_train_norm)
X_test_norm_pca = pca.transform(X_test_norm)
/Library/Python/2.7/site-packages/h2o/transforms/decomposition.py:61: UserWarning: `fit` is not recommended outside of the sklearn framework. Use `train` instead. return super(H2OPCA, self).fit(X)
# prop of variance explained by top 5 components?
print(X_test_norm_pca.shape)
X_test_norm_pca[:5]
(122, 5)
PC1 | PC2 | PC3 | PC4 | PC5 |
---|---|---|---|---|
-30275.3 | -625.604 | 190.44 | 369.211 | 12.3892 |
-15481 | 465.946 | 1014.83 | -446.059 | 38.3513 |
-15611.4 | 1372.84 | 586.132 | -231.06 | -0.387572 |
-15551.9 | 817.878 | -530.932 | 323.576 | 21.2814 |
-16276.8 | 1286.24 | 185.615 | 287.705 | -4.8135 |
-16264.8 | 1280.6 | 945.465 | -89.9208 | 17.0235 |
-16233 | 1054.06 | 1003.81 | -119.775 | 5.92804 |
-16156.1 | 491.806 | 1005.79 | -122.634 | -4.7376 |
-14477.2 | -11794 | 962.697 | -136.698 | -1.80573 |
-20857.9 | 356.531 | -552.38 | 682.064 | 21.0776 |
model = H2ORandomForestEstimator(seed=42)
model.train(x=X_train_norm_pca.names, y=y_train.names, training_frame=X_train_norm_pca.cbind(y_train))
y_hat = model.predict(X_test_norm_pca)
h2o_r2_score(y_test,y_hat)
0.5344823408872756
Although this is MUCH simpler than keeping track of all of these transformations manually, it gets to be somewhat of a burden when you want to chain together multiple transformers.
"Transformers unite!"
If your raw data is a mess and you have to perform several transformations before using it, use a pipeline to keep things simple.
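As a rough mental model, a pipeline fits each transformer in order, passes the transformed data to the next step, and fits the final estimator on the result; prediction replays the same transforms before calling the estimator. The sketch below is a toy version of that idea in plain Python. MiniPipeline, Center, and LineModel are hypothetical classes, not scikit-learn's or H2O's implementations.

```python
class Center:
    """Transformer: subtract the mean learned from the training column."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class LineModel:
    """Estimator: least-squares line for a single feature."""

    def fit(self, X, y):
        x_mean, y_mean = sum(X) / len(X), sum(y) / len(y)
        sxx = sum((x - x_mean) ** 2 for x in X)
        sxy = sum((x - x_mean) * (t - y_mean) for x, t in zip(X, y))
        self.slope_ = sxy / sxx
        self.intercept_ = y_mean - self.slope_ * x_mean
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


class MiniPipeline:
    """Chain transformers, ending with an estimator."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y):
        for _, step in self.steps[:-1]:
            X = step.fit(X).transform(X)   # fit on train, then pass data on
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, step in self.steps[:-1]:
            X = step.transform(X)          # reuse what was learned in fit
        return self.steps[-1][1].predict(X)


toy_pipe = MiniPipeline([("center", Center()), ("line", LineModel())])
toy_pipe.fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(toy_pipe.predict([4.0]))
```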
Steps:
from h2o.transforms.preprocessing import H2OScaler
from h2o.transforms.decomposition import H2OPCA
from sklearn.pipeline import Pipeline # Import Pipeline <other imports not shown>
model = H2ORandomForestEstimator(seed=42)
pipe = Pipeline([("standardize", H2OScaler()), # Define pipeline as a series of steps
("pca", H2OPCA(k=5)),
("rf", model)]) # Notice the last step is an estimator
pipe.fit(train, y_train) # Fit training data
y_hat = pipe.predict(test) # Predict testing data (due to last step being an estimator)
h2o_r2_score(y_test, y_hat) # Notice the final score is close to the one obtained manually above
0.4977863701713895
This is so much easier!!!
But, wait a second, we did worse after applying these transformations! We might wonder how different hyperparameters for the transformations impact the final score.
"Yo dawg, I heard you like models, so I put models in your models to model models."
Steps:
pipe = Pipeline([("standardize", H2OScaler()),
("pca", H2OPCA()),
("rf", H2ORandomForestEstimator(seed=42))])
params = {"standardize__center": [True, False], # Parameters to test
"standardize__scale": [True, False],
"pca__k": randint(2, 6),
"rf__ntrees": randint(10,20),
"rf__max_depth": randint(4,10),
"rf__min_rows": randint(5,10), }
# "rf__mtries": randint(1,4),} # gridding over mtries is
# problematic with pca grid over
# k above
from sklearn.grid_search import RandomizedSearchCV
from h2o.cross_validation import H2OKFold
from h2o.model.regression import h2o_r2_score
from sklearn.metrics.scorer import make_scorer
custom_cv = H2OKFold(fr, n_folds=5, seed=42)
random_search = RandomizedSearchCV(pipe, params,
n_iter=5,
scoring=make_scorer(h2o_r2_score),
cv=custom_cv,
random_state=42,
n_jobs=1)
random_search.fit(fr[x],fr[y])
results = report_grid_score_detail(random_search)
results.head()
pca__k | rf__max_depth | rf__min_rows | rf__ntrees | score | standardize__center | standardize__scale | std | |
---|---|---|---|---|---|---|---|---|
18 | 5 | 8 | 5 | 69 | 0.376312 | False | True | 0.087629 |
10 | 5 | 8 | 6 | 51 | 0.370140 | False | True | 0.095910 |
8 | 5 | 9 | 8 | 69 | 0.363435 | False | False | 0.103074 |
17 | 5 | 6 | 5 | 71 | 0.360376 | False | True | 0.109905 |
9 | 3 | 9 | 6 | 52 | 0.358997 | False | True | 0.118261 |
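The parameter names in the search above follow scikit-learn's step__parameter convention: the part before the double underscore names the pipeline step, and the part after it names that step's parameter. The sketch below shows how such keys can be grouped by step; route_params is a hypothetical helper, not part of scikit-learn, and the values are arbitrary examples.

```python
def route_params(params):
    """Group 'step__param' keys into one dict of parameters per step."""
    routed = {}
    for key, value in params.items():
        step, _, name = key.partition("__")  # split at the first "__"
        routed.setdefault(step, {})[name] = value
    return routed


example = {
    "standardize__center": True,
    "pca__k": 3,
    "rf__ntrees": 15,
    "rf__max_depth": 6,
}
print(route_params(example))
```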
Currently Under Development (drop-in scikit-learn pieces):
* Richer set of transforms (only PCA and Scale are implemented)
* Richer set of estimators (only RandomForest is available)
* Full H2O Grid Search
It is useful to save constructed models to disk and reload them between H2O sessions. Here's how:
best_estimator = random_search.best_estimator_ # fetch the pipeline from the grid search
h2o_model = h2o.get_model(best_estimator._final_estimator._id) # fetch the model from the pipeline
save_path = h2o.save_model(h2o_model, path=".", force=True)
print(save_path)
/Users/ludirehak/h2o-3/DRF_model_python_1446227466951_6362
# assumes new session
my_model = h2o.load_model(path=save_path)
my_model.predict(X_test_norm_pca)
predict |
---|
21.1277 |
17.681 |
21.0579 |
22.335 |
22.0658 |
21.6363 |
22.0799 |
21.4214 |
21.6735 |
22.1625 |