In [1]:

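# This is a demo of H2O's random forest (DRF) function
# It imports a data set and splits it into training and validation frames
# Then, it trains a binomial random forest with and without class balancing
# and compares both models on a separate test set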
import h2o

In [2]:

h2o.init()

H2O cluster uptime:	5 minutes 14 seconds 128 milliseconds
H2O cluster version:	3.1.0.99999
H2O cluster name:	ece
H2O cluster total nodes:	1
H2O cluster total memory:	4.44 GB
H2O cluster total cores:	8
H2O cluster allowed cores:	8
H2O cluster healthy:	True
H2O Connection ip:	127.0.0.1
H2O Connection port:	54321

In [3]:

air = h2o.upload_file(path=h2o.locate("smalldata/airlines/AirlinesTrain.csv.zip"))

Parse Progress: [##################################################] 100%
Uploaded py6f514d4e-23da-4051-9994-ddb299009665 into cluster with 24421 rows and 12 cols

In [4]:

r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]
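
The split above draws one uniform random number per row and uses it as a mask, so roughly 80% of the rows land in the training frame and the rest in the validation frame. The following is a minimal plain-Python sketch of the same idea that does not use H2O; the helper name `split_by_uniform`, the seed, and the stand-in rows are assumptions made for illustration only.

```python
import random


def split_by_uniform(rows, train_fraction=0.8, seed=12):
    """Split rows into (train, valid) using one uniform draw per row."""
    rng = random.Random(seed)
    # One draw in [0, 1) per row, analogous to the uniform column used above.
    draws = [rng.random() for _ in rows]
    train = [row for row, r in zip(rows, draws) if r < train_fraction]
    valid = [row for row, r in zip(rows, draws) if r >= train_fraction]
    return train, valid


rows = list(range(24421))  # stand-in for the uploaded rows (hypothetical data)
train, valid = split_by_uniform(rows)
print(len(train), len(valid))
```

Because the assignment is random, the two parts are only approximately 80% and 20% of the data, and the exact sizes change with the seed.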

In [5]:

myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"

In [6]:

rf_no_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                              validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=False)
rf_no_bal.show()

drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Distributed RF
Model Key:  DRFModel__81f49ff2c23a04a37e910bcc58fb4215

Model Summary:

	number_of_trees	model_size_in_bytes	min_depth	max_depth	mean_depth	min_leaves	max_leaves	mean_leaves
	10.0	149878.0	20.0	20.0	20.0	879.0	1140.0	1023.3


ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.228111120262
R^2: 0.080357095838
LogLoss: 0.841536443761
AUC: 0.686450906072
Gini: 0.372901812144

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.286623619698:

	NO	YES	Error	Rate
NO	1569.0	7236.0	0.8218	(7236.0/8805.0)
YES	651.0	9877.0	0.0618	(651.0/10528.0)
Total	2220.0	17113.0	0.8836	(0.8836/19333.0)

Maximum Metrics:

metric	threshold	value	idx
f1	0.286623619698	0.714663000615	325.0
f2	8.00260088661e-05	0.856701114818	399.0
f0point5	0.618131936925	0.67159241288	173.0
accuracy	0.510547936844	0.641028293591	224.0
precision	0.932283611338	0.819430814524	23.0
absolute_MCC	0.620255349514	0.282106612071	172.0
min_per_class_accuracy	0.574981335833	0.636493161094	194.0
tns	1.0	8708.0	0.0
fns	1.0	10192.0	0.0
fps	8.00260088661e-05	8805.0	399.0
tps	8.00260088661e-05	10528.0	399.0
tnr	1.0	0.988983532084	0.0
fnr	1.0	0.968085106383	0.0
fpr	8.00260088661e-05	1.0	399.0
tpr	8.00260088661e-05	1.0	399.0

ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.215203503359
R^2: 0.128107873875
LogLoss: 0.626040757587
AUC: 0.710222541095
Gini: 0.42044508219

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.351204755018:

	NO	YES	Error	Rate
NO	557.0	1608.0	0.7427	(1608.0/2165.0)
YES	216.0	2504.0	0.0794	(216.0/2720.0)
Total	773.0	4112.0	0.8221	(0.8221/4885.0)

Maximum Metrics:

metric	threshold	value	idx
f1	0.351204755018	0.733021077283	322.0
f2	0.1770355165	0.863265826009	385.0
f0point5	0.642643034239	0.696284032377	169.0
accuracy	0.515855323573	0.659160696008	238.0
precision	0.957237901508	0.95	10.0
absolute_MCC	0.642643034239	0.316924653321	169.0
min_per_class_accuracy	0.574910362384	0.655889145497	205.0
tns	0.998260494554	2164.0	0.0
fns	0.998260494554	2717.0	0.0
fps	0.104244194428	2165.0	399.0
tps	0.104244194428	2720.0	399.0
tnr	0.998260494554	0.999538106236	0.0
fnr	0.998260494554	0.998897058824	0.0
fpr	0.104244194428	1.0	399.0
tpr	0.104244194428	1.0	399.0

Scoring History:

timestamp	duration	number_of_trees	training_MSE	training_logloss	training_AUC	training_classification_error	validation_MSE	validation_logloss	validation_AUC	validation_classification_error
2015-05-22 13:26:25	0.334 sec	1.0	0.257064819084	1.92745589155	0.645868499693	0.428229328074	0.400629273877	2.39647969232	0.655043642168	0.411463664278
2015-05-22 13:26:25	0.440 sec	2.0	0.253175703449	1.99065654625	0.653143456319	0.420079146593	0.363333128699	1.40768739382	0.67876443418	0.386489252815
2015-05-22 13:26:25	0.552 sec	3.0	0.250745679307	1.78042186773	0.653824627008	0.416884366836	0.329983010029	1.05156077229	0.692085654123	0.38792221085
2015-05-22 13:26:25	0.636 sec	4.0	0.244652116515	1.55450013636	0.663639012346	0.417550274223	0.300502917089	0.87597034314	0.700898315446	0.373797338792
2015-05-22 13:26:26	0.727 sec	5.0	0.240277797376	1.37589402537	0.669836126819	0.403455748175	0.276923048874	0.785013264077	0.698065904768	0.37011258956
2015-05-22 13:26:26	0.823 sec	6.0	0.23645204618	1.17572966227	0.675103136318	0.404289161913	0.255862667171	0.726255343312	0.701924840375	0.378915046059
2015-05-22 13:26:26	0.923 sec	7.0	0.23381891121	1.05318282497	0.677629857572	0.407652843095	0.239395686254	0.684070706264	0.704263771906	0.362128966223
2015-05-22 13:26:26	1.023 sec	8.0	0.231583272153	0.975679020094	0.680780516942	0.391148954063	0.226867426377	0.653676735705	0.707648332428	0.367656090072
2015-05-22 13:26:26	1.129 sec	9.0	0.22954871378	0.892928560467	0.684517249362	0.407442102524	0.219018616188	0.635320181722	0.708369107458	0.381166837257
2015-05-22 13:26:26	1.242 sec	10.0	0.228111120262	0.841536443761	0.686450906072	0.407955309574	0.215203503359	0.626040757587	0.710222541095	0.373387922211

Variable Importances:

variable	relative_importance	scaled_importance	percentage
Origin	5074.82666016	1.0	0.332755556342
fDayofMonth	3977.50952148	0.783772488766	0.260804650545
Dest	2911.67407227	0.573748477978	0.19091799399
UniqueCarrier	1222.79492188	0.240953042096	0.080178463575
fDayOfWeek	1001.97784424	0.197440801694	0.0656995238122
Distance	992.55279541	0.195583585781	0.0650815248979
fMonth	69.5790481567	0.0137106255674	0.00456228683847
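
In the confusion matrices above, the Error value in the Total row (for example 0.8836 on the training data) equals the sum of the two per-class error rates, so it does not appear to be an overall misclassification rate. The sketch below recomputes the overall rate from the training counts under the usual definition, (FP + FN) / total, with YES treated as the positive class, and also checks the relation Gini = 2 * AUC - 1. The function names are hypothetical and not part of H2O.

```python
def overall_error(tn, fp, fn, tp):
    """Fraction of all rows that are misclassified."""
    return (fp + fn) / (tn + fp + fn + tp)


def gini_from_auc(auc):
    """Gini coefficient derived from AUC as 2 * AUC - 1."""
    return 2.0 * auc - 1.0


# Counts copied from the training confusion matrix above
# (rows are actual classes, columns are predicted classes).
train_error = overall_error(tn=1569, fp=7236, fn=651, tp=9877)
train_gini = gini_from_auc(0.686450906072)
print(round(train_error, 6), round(train_gini, 6))
```

The recomputed error agrees with the `training_classification_error` in the last Scoring History row, and the recomputed Gini agrees with the value reported next to the AUC.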

In [7]:

rf_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                           validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=True)
rf_bal.show()

drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel :  Distributed RF
Model Key:  DRFModel__924d4015f4c523d250c26449b16945f4

Model Summary:

	number_of_trees	model_size_in_bytes	min_depth	max_depth	mean_depth	min_leaves	max_leaves	mean_leaves
	10.0	161279.0	20.0	20.0	20.0	1027.0	1201.0	1095.7


ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.227050947395
R^2: 0.084631242813
LogLoss: 0.78142590824
AUC: 0.704638202454
Gini: 0.409276404907

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.401939962958:

	NO	YES	Error	Rate
NO	3873.0	6648.0	0.6319	(6648.0/10521.0)
YES	1460.0	9070.0	0.1387	(1460.0/10530.0)
Total	5333.0	15718.0	0.7706	(0.7706/21051.0)

Maximum Metrics:

metric	threshold	value	idx
f1	0.401939962958	0.691100274307	264.0
f2	0.0	0.833452058698	399.0
f0point5	0.605764139792	0.653672952435	170.0
accuracy	0.586745397134	0.652368058525	180.0
precision	0.978060912989	0.836501901141	5.0
absolute_MCC	0.586745397134	0.304762140786	180.0
min_per_class_accuracy	0.582231845092	0.650603554795	182.0
tns	1.0	10452.0	0.0
fns	1.0	10199.0	0.0
fps	0.0	10521.0	399.0
tps	0.0	10530.0	399.0
tnr	1.0	0.993441688052	0.0
fnr	1.0	0.968566001899	0.0
fpr	0.0	1.0	399.0
tpr	0.0	1.0	399.0

ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.216562495458
R^2: 0.122601948124
LogLoss: 0.623880060288
AUC: 0.708433551827
Gini: 0.416867103654

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.405045186133:

	NO	YES	Error	Rate
NO	731.0	1434.0	0.6624	(1434.0/2165.0)
YES	295.0	2425.0	0.1085	(295.0/2720.0)
Total	1026.0	3859.0	0.7709	(0.7709/4885.0)

Maximum Metrics:

metric	threshold	value	idx
f1	0.405045186133	0.737194102447	289.0
f2	0.135270716497	0.863416804373	386.0
f0point5	0.600421656558	0.690548294549	192.0
accuracy	0.505934494591	0.657932446264	243.0
precision	0.996011784004	1.0	0.0
absolute_MCC	0.620419824493	0.301529439421	180.0
min_per_class_accuracy	0.591067527012	0.648161764706	197.0
tns	0.996011784004	2165.0	0.0
fns	0.996011784004	2719.0	0.0
fps	0.0505929587793	2165.0	399.0
tps	0.0684239245551	2720.0	398.0
tnr	0.996011784004	1.0	0.0
fnr	0.996011784004	0.999632352941	0.0
fpr	0.0505929587793	1.0	399.0
tpr	0.0684239245551	1.0	398.0

Scoring History:

timestamp	duration	number_of_trees	training_MSE	training_logloss	training_AUC	training_classification_error	validation_MSE	validation_logloss	validation_AUC	validation_classification_error
2015-05-22 13:26:27	0.104 sec	1.0	0.255954985197	2.08023432225	0.662981914014	0.461290738117	0.405418554538	2.54355159204	0.645635358647	0.412282497441
2015-05-22 13:26:27	0.155 sec	2.0	0.25468472297	1.86352558991	0.661525255155	0.423780968913	0.370926635584	1.420635165	0.676132573699	0.414738996929
2015-05-22 13:26:27	0.233 sec	3.0	0.249598393724	1.57169101175	0.668207041164	0.432992295061	0.338066318176	1.03915856891	0.691084601277	0.367860798362
2015-05-22 13:26:27	0.328 sec	4.0	0.244778275953	1.39647444431	0.674642688382	0.430043687689	0.309120172949	0.890395753219	0.697113758321	0.380757420676
2015-05-22 13:26:27	0.424 sec	5.0	0.240958193613	1.29019435107	0.681163088793	0.407940914567	0.284701124531	0.812070685409	0.696648128651	0.371136131013
2015-05-22 13:26:27	0.525 sec	6.0	0.236190410816	1.10655931453	0.688713356476	0.403375314861	0.262138591924	0.736384904024	0.6989946169	0.367656090072
2015-05-22 13:26:27	0.632 sec	7.0	0.233016124122	0.970814521363	0.693609936492	0.394177426481	0.244223116357	0.689816810281	0.702110956392	0.368884339816
2015-05-22 13:26:27	0.745 sec	8.0	0.230513359481	0.901192589552	0.698361749513	0.385131547188	0.230494394698	0.657428244182	0.705417827062	0.362128966223
2015-05-22 13:26:27	0.866 sec	9.0	0.22875793945	0.84306688264	0.701440488461	0.382718409482	0.221628858232	0.636504280496	0.705721114658	0.372569089048
2015-05-22 13:26:27	0.989 sec	10.0	0.227050947395	0.78142590824	0.704638202454	0.385159849888	0.216562495458	0.623880060288	0.708433551827	0.353940634596

Variable Importances:

variable	relative_importance	scaled_importance	percentage
Origin	5626.28417969	1.0	0.327272142551
fDayofMonth	4230.62353516	0.751939184023	0.246088747823
Dest	3676.58178711	0.653465354698	0.213861006715
UniqueCarrier	1316.92211914	0.234066050893	0.0766032979742
fDayOfWeek	1163.80688477	0.206851777763	0.0676968244989
Distance	1089.97692871	0.193729448051	0.063402251539
fMonth	87.2591629028	0.0155091993429	0.00507572889822
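
The `scaled_importance` and `percentage` columns in the Variable Importances tables appear to be derived from `relative_importance`: the first divides by the largest value, the second by the sum of all values. A small sketch that reproduces both columns from the balanced model's table, assuming that interpretation; `scale_importances` is a hypothetical helper, not an H2O function.

```python
def scale_importances(relative):
    """Map each variable to (scaled_importance, percentage)."""
    top = max(relative.values())
    total = sum(relative.values())
    return {name: (value / top, value / total) for name, value in relative.items()}


# relative_importance values copied from the balanced model's table above
relative = {
    "Origin": 5626.28417969,
    "fDayofMonth": 4230.62353516,
    "Dest": 3676.58178711,
    "UniqueCarrier": 1316.92211914,
    "fDayOfWeek": 1163.80688477,
    "Distance": 1089.97692871,
    "fMonth": 87.2591629028,
}

derived = scale_importances(relative)
for name, (scaled, share) in derived.items():
    print(name, round(scaled, 4), round(share, 4))
```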

In [8]:

air_test = h2o.import_frame(path=h2o.locate("smalldata/airlines/AirlinesTest.csv.zip"))

Parse Progress: [##################################################] 100%
Imported  /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTest.csv.zip . Parsed 2,691 rows and 12 cols

In [9]:

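def model(model_object, test):
    # Predict on the test frame and display the first rows of the predictions
    pred = model_object.predict(test)
    pred.head()
    # Compute performance metrics on the test frame and display them
    perf = model_object.model_performance(test)
    perf.show()
    # Print the confusion matrix, then the [threshold, value] pairs for
    # maximum precision and maximum accuracy, and finally the AUC
    print(perf.confusion_matrix())
    print(perf.precision())
    print(perf.accuracy())
    print(perf.auc())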

In [10]:

print("\n\nWITHOUT CLASS BALANCING\n")
model(rf_no_bal, air_test)


WITHOUT CLASS BALANCING

First 10 rows and first 3 columns:

Row ID	predict	NO	YES
1	YES	0.2999211110174656	0.7000788889825345
2	YES	0.3735275126993656	0.6264724873006344
3	YES	0.22238414585590363	0.7776158541440964
4	YES	0.3962472975254059	0.6037527024745941
5	YES	0.6098413661122322	0.39015863388776784
6	YES	0.4950307622551918	0.5049692377448082
7	NO	0.6746981769800187	0.32530182301998134
8	YES	0.48598509430885317	0.5140149056911468
9	NO	0.6735334724187851	0.32646652758121486
10	NO	0.7184682190418243	0.28153178095817566

ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.209662417539
R^2: 0.153945177679
LogLoss: 0.618837191629
AUC: 0.731046158615
Gini: 0.462092317229

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.403470018009:

	NO	YES	Error	Rate
NO	408.0	809.0	0.6647	(809.0/1217.0)
YES	129.0	1345.0	0.0875	(129.0/1474.0)
Total	537.0	2154.0	0.7522	(0.7522/2691.0)

Maximum Metrics:

metric	threshold	value	idx
f1	0.403470018009	0.741455347299	293.0
f2	0.131483560801	0.858774178513	397.0
f0point5	0.577017590124	0.706999149901	203.0
accuracy	0.545709063964	0.678558156819	219.0
precision	0.949203286087	0.970588235294	13.0
absolute_MCC	0.545709063964	0.348789541022	219.0
min_per_class_accuracy	0.579830584209	0.672998643148	201.0
tns	1.0	1216.0	0.0
fns	1.0	1474.0	0.0
fps	0.119230582317	1217.0	399.0
tps	0.131483560801	1474.0	397.0
tnr	1.0	0.999178307313	0.0
fnr	1.0	1.0	0.0
fpr	0.119230582317	1.0	399.0
tpr	0.131483560801	1.0	397.0

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.403470018009:

	NO	YES	Error	Rate
NO	408.0	809.0	0.6647	(809.0/1217.0)
YES	129.0	1345.0	0.0875	(129.0/1474.0)
Total	537.0	2154.0	0.7522	(0.7522/2691.0)

[[0.9492032860871406, 0.9705882352941176]]
[[0.5457090639642307, 0.6785581568190264]]
0.731046158615

In [11]:

print("\n\nWITH CLASS BALANCING\n")
model(rf_bal, air_test)


WITH CLASS BALANCING

First 10 rows and first 3 columns:

Row ID	predict	NO	YES
1	YES	0.25423263730284795	0.7457673626971522
2	YES	0.3061814057479045	0.6938185942520956
3	YES	0.29582113197078996	0.7041788680292099
4	YES	0.24460687396796132	0.7553931260320388
5	YES	0.5550336349109918	0.44496636508900816
6	YES	0.5633564660627113	0.4366435339372887
7	NO	0.6514019680463551	0.3485980319536449
8	YES	0.41344391039693884	0.5865560896030612
9	NO	0.7010735205237005	0.2989264794762995
10	NO	0.5986760058318191	0.40132399416818093

ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.215031073082
R^2: 0.13228093778
LogLoss: 0.623177900509
AUC: 0.716619152687
Gini: 0.433238305373

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.431652753871:

	NO	YES	Error	Rate
NO	456.0	761.0	0.6253	(761.0/1217.0)
YES	172.0	1302.0	0.1167	(172.0/1474.0)
Total	628.0	2063.0	0.742	(0.742/2691.0)

Maximum Metrics:

metric	threshold	value	idx
f1	0.431652753871	0.736217133164	279.0
f2	0.168528514213	0.859950859951	380.0
f0point5	0.604570530568	0.697193500739	189.0
accuracy	0.566351616261	0.669267930137	209.0
precision	0.997537422127	1.0	0.0
absolute_MCC	0.566351616261	0.330678623344	209.0
min_per_class_accuracy	0.593757553889	0.661462612983	195.0
tns	0.997537422127	1217.0	0.0
fns	0.997537422127	1473.0	0.0
fps	0.0562118133373	1217.0	399.0
tps	0.115491940237	1474.0	394.0
tnr	0.997537422127	1.0	0.0
fnr	0.997537422127	0.999321573948	0.0
fpr	0.0562118133373	1.0	399.0
tpr	0.115491940237	1.0	394.0

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.431652753871:

	NO	YES	Error	Rate
NO	456.0	761.0	0.6253	(761.0/1217.0)
YES	172.0	1302.0	0.1167	(172.0/1474.0)
Total	628.0	2063.0	0.742	(0.742/2691.0)

[[0.997537422127463, 1.0]]
[[0.5663516162614709, 0.6692679301374953]]
0.716619152687
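
The max-F1 values reported for the two test runs can be recovered from the printed confusion matrices. The sketch below uses the standard precision, recall, and F1 definitions with YES as the positive class; the function name `f1_score` is an illustrative choice, and the counts are copied from the test-set confusion matrices above.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Without class balancing: YES row is (FN=129, TP=1345), NO row gives FP=809.
f1_no_bal = f1_score(tp=1345, fp=809, fn=129)
# With class balancing: YES row is (FN=172, TP=1302), NO row gives FP=761.
f1_bal = f1_score(tp=1302, fp=761, fn=172)
print(round(f1_no_bal, 4), round(f1_bal, 4))
```

On this test set the model trained without class balancing has the slightly higher F1 and AUC (0.731 versus 0.717), so balancing did not improve these metrics in this particular run.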