This notebook is part of my Python data science curriculum.
There are two major sets of documentation relevant to H2O:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/welcome.html
http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/intro.html
import pandas as pd
import altair as alt
alt.renderers.enable('notebook')
RendererRegistry.enable('notebook')
import h2o
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_111"; OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2~bpo8+1-b14); OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
  Starting server from /opt/pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpz21aee3a
  JVM stdout: /tmp/tmpz21aee3a/h2o_terran_started_from_python.out
  JVM stderr: /tmp/tmpz21aee3a/h2o_terran_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321... successful.
H2O cluster uptime: | 01 secs |
H2O cluster timezone: | America/New_York |
H2O data parsing timezone: | UTC |
H2O cluster version: | 3.22.0.2 |
H2O cluster version age: | 21 days, 13 hours and 13 minutes |
H2O cluster name: | H2O_from_python_terran_lnzuev |
H2O cluster total nodes: | 1 |
H2O cluster free memory: | 10.50 Gb |
H2O cluster total cores: | 24 |
H2O cluster allowed cores: | 24 |
H2O cluster status: | accepting new members, healthy |
H2O connection url: | http://127.0.0.1:54321 |
H2O connection proxy: | None |
H2O internal security: | False |
H2O API Extensions: | XGBoost, Algos, AutoML, Core V3, Core V4 |
Python version: | 3.6.5 final |
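The defaults grabbed all 24 cores and 10.5 GB here. If you want to cap that, h2o.init takes nthreads and max_mem_size arguments; the specific values below are just for illustration:

# Limit the local H2O JVM. Both arguments are documented h2o.init options;
# the values here are illustrative, not recommendations.
h2o.init(max_mem_size='4G', nthreads=4)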
At first I thought you could only load from disk:
from plotnine.data import diamonds
diamonds.to_csv('/tmp/diamonds.csv')
h2o_diamonds = h2o.import_file('/tmp/diamonds.csv')
Parse progress: |█████████████████████████████████████████████████████████| 100%
But in fact you can transfer data directly from Python: the key is the h2o.H2OFrame constructor. See references:
http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/data.html#loading-data-from-a-python-object
http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
# The destination_frame argument is optional, but if you don't use it, you get a horrible hex name.
h2o_diamonds2 = h2o.H2OFrame(diamonds, destination_frame='diamonds2')
Parse progress: |█████████████████████████████████████████████████████████| 100%
h2o.ls()
  | key
---|---
0 | diamonds.hex
1 | diamonds2
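Any key shown by h2o.ls() can be re-attached to a Python handle: h2o.get_frame does this for frames (and h2o.get_model for models). A minimal sketch:

# Recover a handle to a frame that already lives on the cluster.
same_frame = h2o.get_frame('diamonds2')
same_frame.dim  # should be [53940, 10]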
h2o_diamonds2
carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|
0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.2 | 4.23 | 2.63 |
0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
0.24 | Very Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 |
0.24 | Very Good | I | VVS1 | 62.3 | 57 | 336 | 3.95 | 3.98 | 2.47 |
0.26 | Very Good | H | SI1 | 61.9 | 55 | 337 | 4.07 | 4.11 | 2.53 |
0.22 | Fair | E | VS2 | 65.1 | 61 | 337 | 3.87 | 3.78 | 2.49 |
0.23 | Very Good | H | VS1 | 59.4 | 61 | 338 | 4 | 4.05 | 2.39 |
Let's fit a very simple model:
h2o_lm = h2o.estimators.H2OGeneralizedLinearEstimator(family='gaussian')
h2o_lm.train(x=['carat','cut','color','clarity'],y='price',training_frame=h2o_diamonds2)
glm Model Build progress: |███████████████████████████████████████████████| 100%
h2o_lm
Model Details
=============
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_model_python_1544750361877_1

ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 10733609.79576042
RMSE: 3276.2188259883405
MAE: 2430.585433525635
RMSLE: 0.9615249847776893
R^2: 0.3255806286420655
Mean Residual Deviance: 10733609.79576042
Null degrees of freedom: 53939
Residual degrees of freedom: 53919
Null deviance: 858473135517.3629
Residual deviance: 578970912383.317
AIC: 1026347.841393112

Scoring History:

timestamp | duration | iterations | negative_log_likelihood | objective
---|---|---|---|---
2018-12-13 20:19:30 | 0.000 sec | 0 | 858473135517.3977051 | 15915334.3625769
This is not right at all. R^2 should be about 0.91 for this model, not 0.33!
Aha, it appears the default model is regularized. The docs don't state this explicitly, but the presence of the alpha and lambda arguments implies it.
# Note the lambda_=0 to turn off regularization
h2o_lm = h2o.estimators.H2OGeneralizedLinearEstimator(family='gaussian',lambda_=0)
h2o_lm.train(x=['carat','cut','color','clarity'],y='price',training_frame=h2o_diamonds2)
h2o_lm
glm Model Build progress: |███████████████████████████████████████████████| 100%

Model Details
=============
H2OGeneralizedLinearEstimator : Generalized Linear Modeling
Model Key: GLM_model_python_1544750361877_2

ModelMetricsRegressionGLM: glm
** Reported on train data. **

MSE: 1337834.1891373708
RMSE: 1156.6478241614302
MAE: 803.6533760003574
RMSLE: NaN
R^2: 0.9159405540179452
Mean Residual Deviance: 1337834.1891373708
Null degrees of freedom: 53939
Residual degrees of freedom: 53921
Null deviance: 858473135517.3629
Residual deviance: 72162776162.06978
AIC: 914023.0749361591

Scoring History:

timestamp | duration | iterations | negative_log_likelihood | objective
---|---|---|---|---
2018-12-13 20:19:31 | 0.000 sec | 0 | 858473135517.3977051 | 15915334.3625769
That's more like it!
h2o_lm.coef()
{'Intercept': -7362.802156302129,
 'clarity.IF': 5419.646844614482,
 'clarity.SI1': 3573.6879873533635,
 'clarity.SI2': 2625.949986564772,
 'clarity.VS1': 4534.878969577319,
 'clarity.VS2': 4217.829101987275,
 'clarity.VVS1': 5072.027644985769,
 'clarity.VVS2': 4967.199410006727,
 'color.E': -211.68248136947267,
 'color.F': -303.3100325817252,
 'color.G': -506.1995360406308,
 'color.H': -978.697664842068,
 'color.I': -1440.301901907411,
 'color.J': -2325.2223602461086,
 'cut.Good': 655.7674482639526,
 'cut.Ideal': 998.254438325935,
 'cut.Premium': 869.3959030779079,
 'cut.Very Good': 848.7168776837254,
 'carat': 8886.128882503353}
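That dict is hard to scan. Dropping it into a pandas Series sorts it nicely (plain pandas, nothing H2O-specific):

# View the GLM coefficients sorted by value.
coefs = pd.Series(h2o_lm.coef()).sort_values()
print(coefs)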
h2o.ls()
  | key
---|---
0 | GLM_model_python_1544750361877_1
1 | GLM_model_python_1544750361877_2
2 | diamonds.hex
3 | diamonds2
4 | modelmetrics_GLM_model_python_1544750361877_1@...
5 | modelmetrics_GLM_model_python_1544750361877_2@...
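Models and metrics accumulate as cluster keys. If you want to tidy up, h2o.remove deletes a key; a small sketch, with the key name copied from the listing above:

# Delete the frame imported from disk, since diamonds2 duplicates it.
h2o.remove('diamonds.hex')
h2o.ls()  # 'diamonds.hex' should no longer appear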
h2o_rf = h2o.estimators.random_forest.H2ORandomForestEstimator()
h2o_rf.train(x=['carat','cut','color','clarity'],y='price',training_frame=h2o_diamonds2)
h2o_rf
drf Model Build progress: |███████████████████████████████████████████████| 100%

Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: DRF_model_python_1544750361877_3

ModelMetricsRegression: drf
** Reported on train data. **

MSE: 2079779.0992632748
RMSE: 1442.143924600896
MAE: 966.3844774489775
RMSLE: 0.5464101031548934
Mean Residual Deviance: 2079779.0992632748

Scoring History:
timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance
---|---|---|---|---|---
2018-12-13 20:19:31 | 0.099 sec | 0.0 | nan | nan | nan | |
2018-12-13 20:19:32 | 0.599 sec | 1.0 | 1572.8323757 | 936.4122636 | 2473801.6821759 | |
2018-12-13 20:19:32 | 0.750 sec | 2.0 | 1511.6251593 | 980.5250812 | 2285010.6223487 | |
2018-12-13 20:19:32 | 0.861 sec | 3.0 | 1689.9269547 | 1047.6371095 | 2855853.1122827 | |
2018-12-13 20:19:32 | 1.000 sec | 4.0 | 1492.0456369 | 896.8322222 | 2226200.1827169 | |
--- | --- | --- | --- | --- | --- | --- |
2018-12-13 20:19:34 | 2.717 sec | 46.0 | 1440.3415218 | 966.2892714 | 2074583.6993443 | |
2018-12-13 20:19:34 | 2.741 sec | 47.0 | 1462.3473237 | 981.7449913 | 2138459.6951667 | |
2018-12-13 20:19:34 | 2.766 sec | 48.0 | 1465.3928902 | 983.4891060 | 2147376.3225565 | |
2018-12-13 20:19:34 | 2.796 sec | 49.0 | 1447.4702824 | 972.0903774 | 2095170.2184324 | |
2018-12-13 20:19:34 | 2.825 sec | 50.0 | 1442.1439246 | 966.3844774 | 2079779.0992633 |
See the whole table with table.as_data_frame()

Variable Importances:
variable | relative_importance | scaled_importance | percentage
---|---|---|---
carat | 18256318431232.0000000 | 1.0 | 0.9027340 |
clarity | 931802841088.0000000 | 0.0510400 | 0.0460756 |
color | 737371488256.0000000 | 0.0403899 | 0.0364614 |
cut | 297871540224.0000000 | 0.0163161 | 0.0147291 |
h2o_rf.varimp()
[('carat', 18256318431232.0, 1.0, 0.9027339941905617),
 ('clarity', 931802841088.0, 0.05104001908149883, 0.04607556028900392),
 ('color', 737371488256.0, 0.04038993354731048, 0.03646136603625495),
 ('cut', 297871540224.0, 0.01631607935334959, 0.014729079484179432)]
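The list-of-tuples form is awkward; varimp also takes a use_pandas flag that returns the same table as a DataFrame:

# Same variable importances as above, but as a pandas DataFrame.
h2o_rf.varimp(use_pandas=True)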
We will split the data into training and test, use cross-validation on the training data to tune the hyperparameters, and then evaluate on the test data. This is a standard workflow for high-variance ML models.
First, split the data:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/splitting-datasets.html
diamonds_split = h2o_diamonds2.split_frame(ratios=[0.75], destination_frames=['diamonds_train','diamonds_test'])
diamonds_split[0].dim
[40444, 10]
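Note that split_frame samples randomly, so the split is only approximately 75/25 and will differ across runs. Passing the documented seed argument makes it reproducible (the value 42 is arbitrary):

# Reproducible split: same call as above plus a fixed seed.
diamonds_split = h2o_diamonds2.split_frame(
    ratios=[0.75],
    destination_frames=['diamonds_train', 'diamonds_test'],
    seed=42)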
Then fit a model with cross-validation by specifying nfolds:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/cross-validation.html
h2o_gb = h2o.estimators.gbm.H2OGradientBoostingEstimator(nfolds=5)
h2o_gb.train(x=['carat','cut','color','clarity'],y='price',training_frame=diamonds_split[0])
gbm Model Build progress: |███████████████████████████████████████████████| 100%
h2o_gb
Model Details
=============
H2OGradientBoostingEstimator : Gradient Boosting Machine
Model Key: GBM_model_python_1544750361877_4

ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 291289.0591014654
RMSE: 539.7120149685992
MAE: 291.15811039439154
RMSLE: 0.11403758077295047
Mean Residual Deviance: 291289.0591014654

ModelMetricsRegression: gbm
** Reported on cross-validation data. **

MSE: 318952.6407328581
RMSE: 564.7589226677682
MAE: 300.514499690523
RMSLE: 0.11638301817429214
Mean Residual Deviance: 318952.6407328581

Cross-Validation Metrics Summary:
  | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid
---|---|---|---|---|---|---|---
mae | 300.5128 | 3.677444 | 306.593 | 302.6331 | 299.6562 | 291.09317 | 302.58844 |
mean_residual_deviance | 318975.66 | 9070.769 | 337070.2 | 326859.25 | 310532.75 | 300005.06 | 320411.0 |
mse | 318975.66 | 9070.769 | 337070.2 | 326859.25 | 310532.75 | 300005.06 | 320411.0 |
r2 | 0.9800078 | 0.0004752 | 0.9794658 | 0.9795043 | 0.9807221 | 0.9809305 | 0.9794166 |
residual_deviance | 318975.66 | 9070.769 | 337070.2 | 326859.25 | 310532.75 | 300005.06 | 320411.0 |
rmse | 564.6648 | 8.041395 | 580.57745 | 571.71606 | 557.25464 | 547.7272 | 566.0486 |
rmsle | 0.1163747 | 0.0008166 | 0.1151984 | 0.1153711 | 0.1163005 | 0.1165711 | 0.1184322 |
Scoring History:
timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance
---|---|---|---|---|---
2018-12-13 20:19:43 | 6.711 sec | 0.0 | 3994.3964930 | 3033.6648252 | 15955203.3430623 | |
2018-12-13 20:19:43 | 6.769 sec | 1.0 | 3619.8152636 | 2739.5361668 | 13103062.5423542 | |
2018-12-13 20:19:43 | 6.827 sec | 2.0 | 3283.5998197 | 2476.0971627 | 10782027.7762236 | |
2018-12-13 20:19:43 | 6.884 sec | 3.0 | 2983.6510957 | 2238.4966037 | 8902173.8607495 | |
2018-12-13 20:19:43 | 6.928 sec | 4.0 | 2715.2776594 | 2026.2305075 | 7372732.7677644 | |
--- | --- | --- | --- | --- | --- | --- |
2018-12-13 20:19:44 | 7.437 sec | 46.0 | 547.1342616 | 296.8247327 | 299355.9001715 | |
2018-12-13 20:19:44 | 7.449 sec | 47.0 | 544.7315694 | 295.1905176 | 296732.4826661 | |
2018-12-13 20:19:44 | 7.460 sec | 48.0 | 543.2680914 | 293.7298975 | 295140.2191057 | |
2018-12-13 20:19:44 | 7.470 sec | 49.0 | 541.3042959 | 292.4713330 | 293010.3407167 | |
2018-12-13 20:19:44 | 7.479 sec | 50.0 | 539.7120150 | 291.1581104 | 291289.0591015 |
See the whole table with table.as_data_frame()

Variable Importances:
variable | relative_importance | scaled_importance | percentage
---|---|---|---
carat | 2981761384448.0000000 | 1.0 | 0.8942768 |
clarity | 226965176320.0000000 | 0.0761178 | 0.0680704 |
color | 115138985984.0000000 | 0.0386144 | 0.0345320 |
cut | 10405691392.0000000 | 0.0034898 | 0.0031208 |
That didn't print well; let's try this instead:
h2o_gb.cross_validation_metrics_summary().as_data_frame()
  |   | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid
---|---|---|---|---|---|---|---|---
0 | mae | 300.5128 | 3.677444 | 306.593 | 302.6331 | 299.6562 | 291.09317 | 302.58844 |
1 | mean_residual_deviance | 318975.66 | 9070.769 | 337070.2 | 326859.25 | 310532.75 | 300005.06 | 320411.0 |
2 | mse | 318975.66 | 9070.769 | 337070.2 | 326859.25 | 310532.75 | 300005.06 | 320411.0 |
3 | r2 | 0.9800078 | 4.7522815E-4 | 0.9794658 | 0.9795043 | 0.9807221 | 0.98093045 | 0.97941655 |
4 | residual_deviance | 318975.66 | 9070.769 | 337070.2 | 326859.25 | 310532.75 | 300005.06 | 320411.0 |
5 | rmse | 564.6648 | 8.041395 | 580.57745 | 571.71606 | 557.25464 | 547.7272 | 566.0486 |
6 | rmsle | 0.11637469 | 8.165836E-4 | 0.11519845 | 0.115371145 | 0.11630051 | 0.116571106 | 0.11843221 |
results = pd.DataFrame()
for lr in [0.02, 0.05, 0.1, 0.2, 0.5]:
    for ntrees in [5, 50, 500]:
        h2o_gb = h2o.estimators.gbm.H2OGradientBoostingEstimator(nfolds=5, learn_rate=lr, ntrees=ntrees)
        h2o_gb.train(x=['carat','cut','color','clarity','x','y','z','depth','table'],
                     y='price', training_frame=diamonds_split[0])
        tmp = h2o_gb.cross_validation_metrics_summary().as_data_frame()
        tmp['lr'] = lr
        tmp['ntrees'] = ntrees
        results = pd.concat([results, tmp])
gbm Model Build progress: |███████████████████████████████████████████████| 100%  (repeated for each of the 15 models)
tmp = results[lambda x: x.iloc[:, 0] == 'rmse'].copy()
# The summary values come back as strings, so cast before sorting
tmp['mean'] = tmp['mean'].astype('double')
tmp.sort_values('mean').head()
  |   | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid | lr | ntrees
---|---|---|---|---|---|---|---|---|---|---
5 | rmse | 531.83060 | 10.007129 | 540.29706 | 535.80707 | 516.2322 | 551.65186 | 515.16504 | 0.10 | 500 |
5 | rmse | 531.89950 | 7.004916 | 533.2855 | 548.17737 | 533.898 | 518.42834 | 525.70825 | 0.05 | 500 |
5 | rmse | 540.66590 | 3.400645 | 546.65967 | 545.441 | 540.56244 | 535.2106 | 535.45575 | 0.02 | 500 |
5 | rmse | 546.05880 | 11.197579 | 538.4264 | 552.35156 | 547.7864 | 521.8549 | 569.87463 | 0.20 | 500 |
5 | rmse | 550.65186 | 14.800012 | 583.1655 | 560.83276 | 547.0235 | 542.37396 | 519.86346 | 0.20 | 50 |
c = alt.Chart(tmp[tmp.iloc[:, 0] == 'rmse'])
c.mark_point().encode(x='lr:Q', color='ntrees:N', y='mean').interactive()
# I didn't end up using this because it was too dense to see on the chart
#tmp['sd']=tmp['sd'].astype('double')
#tmp['low']=tmp['mean'] - 2*tmp.sd
#tmp['high']=tmp['mean'] + 2*tmp.sd
h2o_gb_best = h2o.estimators.gbm.H2OGradientBoostingEstimator(learn_rate=0.05,ntrees=500)
h2o_gb_best.train(
x=['carat','cut','color','clarity','x','y','z','depth','table'],y='price',
training_frame=diamonds_split[0],validation_frame=diamonds_split[1])
h2o_gb_best
gbm Model Build progress: |███████████████████████████████████████████████| 100%

Model Details
=============
H2OGradientBoostingEstimator : Gradient Boosting Machine
Model Key: GBM_model_python_1544750361877_20

ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 199491.27996210408
RMSE: 446.64446706760407
MAE: 245.20373751829243
RMSLE: 0.09238790883975073
Mean Residual Deviance: 199491.27996210408

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 282715.38209518313
RMSE: 531.7098664640173
MAE: 272.9169159379671
RMSLE: 0.10024026821592041
Mean Residual Deviance: 282715.38209518313

Scoring History:
timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | validation_rmse | validation_mae | validation_deviance
---|---|---|---|---|---|---|---|---
2018-12-13 20:22:13 | 0.012 sec | 0.0 | 3994.3964930 | 3033.6648252 | 15955203.3430623 | 3974.4125294 | 3030.7853958 | 15795954.9541093 | |
2018-12-13 20:22:13 | 0.046 sec | 1.0 | 3807.5820000 | 2886.4265493 | 14497680.6866169 | 3788.0669152 | 2883.1127535 | 14349450.9542388 | |
2018-12-13 20:22:13 | 0.080 sec | 2.0 | 3630.4535985 | 2746.4540468 | 13180193.3309136 | 3611.2379138 | 2742.5377061 | 13041039.2699111 | |
2018-12-13 20:22:13 | 0.114 sec | 3.0 | 3461.9263567 | 2613.9418380 | 11984934.0991557 | 3442.9996808 | 2609.8667122 | 11854246.8019348 | |
2018-12-13 20:22:13 | 0.148 sec | 4.0 | 3302.6116250 | 2488.2785488 | 10907243.5457967 | 3283.7466748 | 2484.0030167 | 10782992.2244687 | |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
2018-12-13 20:22:17 | 3.911 sec | 136.0 | 509.3181942 | 274.0300863 | 259405.0229416 | 548.3043298 | 285.9402525 | 300637.6381091 | |
2018-12-13 20:22:17 | 3.937 sec | 137.0 | 508.9517086 | 273.7642643 | 259031.8416455 | 548.1878262 | 285.7466014 | 300509.8927846 | |
2018-12-13 20:22:17 | 3.967 sec | 138.0 | 508.5194376 | 273.4900699 | 258592.0184162 | 548.0704907 | 285.5630409 | 300381.2628092 | |
2018-12-13 20:22:17 | 3.995 sec | 139.0 | 507.9360878 | 273.2626132 | 257999.0692650 | 547.7861315 | 285.3970975 | 300069.6458349 | |
2018-12-13 20:22:21 | 7.678 sec | 500.0 | 446.6444671 | 245.2037375 | 199491.2799621 | 531.7098665 | 272.9169159 | 282715.3820952 |
See the whole table with table.as_data_frame()

Variable Importances:
variable | relative_importance | scaled_importance | percentage
---|---|---|---
y | 3249070145536.0000000 | 1.0 | 0.4971292 |
carat | 2369863811072.0000000 | 0.7293976 | 0.3626048 |
clarity | 426141483008.0000000 | 0.1311580 | 0.0652025 |
color | 226389131264.0000000 | 0.0696781 | 0.0346390 |
x | 138247340032.0000000 | 0.0425498 | 0.0211528 |
z | 106740916224.0000000 | 0.0328528 | 0.0163321 |
cut | 10334444544.0000000 | 0.0031807 | 0.0015812 |
depth | 6291180544.0000000 | 0.0019363 | 0.0009626 |
table | 2587166464.0000000 | 0.0007963 | 0.0003959 |
Alternatively, you could use H2OGridSearch instead of a manual loop:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html#grid-search-example-in-python
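I haven't run it in this notebook, but based on that page the manual loop above would look roughly like this (a sketch; the grid values mirror my loop):

from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Same hyperparameter grid as the manual loop, handed to H2O to manage.
gb_grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(nfolds=5),
    hyper_params={'learn_rate': [0.02, 0.05, 0.1, 0.2, 0.5],
                  'ntrees': [5, 50, 500]})
gb_grid.train(x=['carat','cut','color','clarity','x','y','z','depth','table'],
              y='price', training_frame=diamonds_split[0])

# Models sorted by cross-validated RMSE, best first.
gb_grid.get_grid(sort_by='rmse', decreasing=False)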
We can also run the models and get the results back in Pandas objects:
gb_predictions=h2o_gb_best.predict(diamonds_split[1])
lm_predictions=h2o_lm.predict(diamonds_split[1])
gbm prediction progress: |████████████████████████████████████████████████| 100%
glm prediction progress: |████████████████████████████████████████████████| 100%
pred = diamonds_split[1].as_data_frame().assign(
    model='gb', pred=gb_predictions.as_data_frame()).append(
        diamonds_split[1].as_data_frame().assign(
            model='lm', pred=lm_predictions.as_data_frame())).sample(1000)
alt.Chart(pred).mark_circle(opacity=0.25).encode(
x='price',y='pred',color='model'
).interactive()
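One loose end from the plan at the top: the tuned model was scored on the test split via validation_frame, but you can also ask for held-out metrics directly with model_performance (a documented method on H2O models):

# Final held-out evaluation of the tuned GBM on the test split.
h2o_gb_best.model_performance(diamonds_split[1])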