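# This is a demo of H2O's random forest (DRF) function
# It imports a data set, parses it, and splits it into training and validation frames
# Then, it trains a binomial random forest with and without class balancing and compares both on a test set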
import h2o
h2o.init()
H2O cluster uptime: | 5 minutes 14 seconds 128 milliseconds |
H2O cluster version: | 3.1.0.99999 |
H2O cluster name: | ece |
H2O cluster total nodes: | 1 |
H2O cluster total memory: | 4.44 GB |
H2O cluster total cores: | 8 |
H2O cluster allowed cores: | 8 |
H2O cluster healthy: | True |
H2O Connection ip: | 127.0.0.1 |
H2O Connection port: | 54321 |
air = h2o.upload_file(path=h2o.locate("smalldata/airlines/AirlinesTrain.csv.zip"))
Parse Progress: [##################################################] 100%
Uploaded py6f514d4e-23da-4051-9994-ddb299009665 into cluster with 24421 rows and 12 cols
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]
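The three lines above draw one uniform random number per row and use it as a mask, so roughly 80% of the rows land in air_train and the rest in air_valid. A minimal plain-Python sketch of the same idea, assuming ordinary lists stand in for H2O frames (the function name split_by_uniform and its seed argument are hypothetical and only make the sketch repeatable):

```python
import random

def split_by_uniform(rows, train_fraction=0.8, seed=12):
    """Split rows into (train, valid) using one uniform draw per row."""
    rng = random.Random(seed)
    # One draw per row, analogous to the column returned by runif()
    draws = [rng.random() for _ in rows]
    train = [row for row, u in zip(rows, draws) if u < train_fraction]
    valid = [row for row, u in zip(rows, draws) if u >= train_fraction]
    return train, valid

rows = list(range(24421))  # same row count as the uploaded frame
train, valid = split_by_uniform(rows)
print(len(train), len(valid))
```

Because membership is decided row by row, the split sizes are only approximately 80/20 and change with the random draws.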
myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"
rf_no_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                              validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=False)
rf_no_bal.show()
drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel : Distributed RF
Model Key: DRFModel__81f49ff2c23a04a37e910bcc58fb4215
Model Summary:
number_of_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
10.0 | 149878.0 | 20.0 | 20.0 | 20.0 | 879.0 | 1140.0 | 1023.3 |
ModelMetricsBinomial: drf
** Reported on train data. **
MSE: 0.228111120262
R^2: 0.080357095838
LogLoss: 0.841536443761
AUC: 0.686450906072
Gini: 0.372901812144
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.286623619698:
Act/Pred | NO | YES | Error | Rate |
NO | 1569.0 | 7236.0 | 0.8218 | (7236.0/8805.0) |
YES | 651.0 | 9877.0 | 0.0618 | (651.0/10528.0) |
Total | 2220.0 | 17113.0 | 0.8836 | (0.8836/19333.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.286623619698 | 0.714663000615 | 325.0 |
f2 | 8.00260088661e-05 | 0.856701114818 | 399.0 |
f0point5 | 0.618131936925 | 0.67159241288 | 173.0 |
accuracy | 0.510547936844 | 0.641028293591 | 224.0 |
precision | 0.932283611338 | 0.819430814524 | 23.0 |
absolute_MCC | 0.620255349514 | 0.282106612071 | 172.0 |
min_per_class_accuracy | 0.574981335833 | 0.636493161094 | 194.0 |
tns | 1.0 | 8708.0 | 0.0 |
fns | 1.0 | 10192.0 | 0.0 |
fps | 8.00260088661e-05 | 8805.0 | 399.0 |
tps | 8.00260088661e-05 | 10528.0 | 399.0 |
tnr | 1.0 | 0.988983532084 | 0.0 |
fnr | 1.0 | 0.968085106383 | 0.0 |
fpr | 8.00260088661e-05 | 1.0 | 399.0 |
tpr | 8.00260088661e-05 | 1.0 | 399.0 |
ModelMetricsBinomial: drf
** Reported on validation data. **
MSE: 0.215203503359
R^2: 0.128107873875
LogLoss: 0.626040757587
AUC: 0.710222541095
Gini: 0.42044508219
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.351204755018:
Act/Pred | NO | YES | Error | Rate |
NO | 557.0 | 1608.0 | 0.7427 | (1608.0/2165.0) |
YES | 216.0 | 2504.0 | 0.0794 | (216.0/2720.0) |
Total | 773.0 | 4112.0 | 0.8221 | (0.8221/4885.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.351204755018 | 0.733021077283 | 322.0 |
f2 | 0.1770355165 | 0.863265826009 | 385.0 |
f0point5 | 0.642643034239 | 0.696284032377 | 169.0 |
accuracy | 0.515855323573 | 0.659160696008 | 238.0 |
precision | 0.957237901508 | 0.95 | 10.0 |
absolute_MCC | 0.642643034239 | 0.316924653321 | 169.0 |
min_per_class_accuracy | 0.574910362384 | 0.655889145497 | 205.0 |
tns | 0.998260494554 | 2164.0 | 0.0 |
fns | 0.998260494554 | 2717.0 | 0.0 |
fps | 0.104244194428 | 2165.0 | 399.0 |
tps | 0.104244194428 | 2720.0 | 399.0 |
tnr | 0.998260494554 | 0.999538106236 | 0.0 |
fnr | 0.998260494554 | 0.998897058824 | 0.0 |
fpr | 0.104244194428 | 1.0 | 399.0 |
tpr | 0.104244194428 | 1.0 | 399.0 |
Scoring History:
timestamp | duration | number_of_trees | training_MSE | training_logloss | training_AUC | training_classification_error | validation_MSE | validation_logloss | validation_AUC | validation_classification_error |
2015-05-22 13:26:25 | 0.334 sec | 1.0 | 0.257064819084 | 1.92745589155 | 0.645868499693 | 0.428229328074 | 0.400629273877 | 2.39647969232 | 0.655043642168 | 0.411463664278 |
2015-05-22 13:26:25 | 0.440 sec | 2.0 | 0.253175703449 | 1.99065654625 | 0.653143456319 | 0.420079146593 | 0.363333128699 | 1.40768739382 | 0.67876443418 | 0.386489252815 |
2015-05-22 13:26:25 | 0.552 sec | 3.0 | 0.250745679307 | 1.78042186773 | 0.653824627008 | 0.416884366836 | 0.329983010029 | 1.05156077229 | 0.692085654123 | 0.38792221085 |
2015-05-22 13:26:25 | 0.636 sec | 4.0 | 0.244652116515 | 1.55450013636 | 0.663639012346 | 0.417550274223 | 0.300502917089 | 0.87597034314 | 0.700898315446 | 0.373797338792 |
2015-05-22 13:26:26 | 0.727 sec | 5.0 | 0.240277797376 | 1.37589402537 | 0.669836126819 | 0.403455748175 | 0.276923048874 | 0.785013264077 | 0.698065904768 | 0.37011258956 |
2015-05-22 13:26:26 | 0.823 sec | 6.0 | 0.23645204618 | 1.17572966227 | 0.675103136318 | 0.404289161913 | 0.255862667171 | 0.726255343312 | 0.701924840375 | 0.378915046059 |
2015-05-22 13:26:26 | 0.923 sec | 7.0 | 0.23381891121 | 1.05318282497 | 0.677629857572 | 0.407652843095 | 0.239395686254 | 0.684070706264 | 0.704263771906 | 0.362128966223 |
2015-05-22 13:26:26 | 1.023 sec | 8.0 | 0.231583272153 | 0.975679020094 | 0.680780516942 | 0.391148954063 | 0.226867426377 | 0.653676735705 | 0.707648332428 | 0.367656090072 |
2015-05-22 13:26:26 | 1.129 sec | 9.0 | 0.22954871378 | 0.892928560467 | 0.684517249362 | 0.407442102524 | 0.219018616188 | 0.635320181722 | 0.708369107458 | 0.381166837257 |
2015-05-22 13:26:26 | 1.242 sec | 10.0 | 0.228111120262 | 0.841536443761 | 0.686450906072 | 0.407955309574 | 0.215203503359 | 0.626040757587 | 0.710222541095 | 0.373387922211 |
Variable Importances:
variable | relative_importance | scaled_importance | percentage |
Origin | 5074.82666016 | 1.0 | 0.332755556342 |
fDayofMonth | 3977.50952148 | 0.783772488766 | 0.260804650545 |
Dest | 2911.67407227 | 0.573748477978 | 0.19091799399 |
UniqueCarrier | 1222.79492188 | 0.240953042096 | 0.080178463575 |
fDayOfWeek | 1001.97784424 | 0.197440801694 | 0.0656995238122 |
Distance | 992.55279541 | 0.195583585781 | 0.0650815248979 |
fMonth | 69.5790481567 | 0.0137106255674 | 0.00456228683847 |
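The scaled_importance and percentage columns above appear to be simple normalizations of relative_importance: the first divides by the largest value, the second by the column sum. A short sketch that recomputes both from the values in the table (this is a check of the printed numbers, not a description of how H2O derives relative_importance itself):

```python
# relative_importance values copied from the table above
relative = {
    "Origin": 5074.82666016,
    "fDayofMonth": 3977.50952148,
    "Dest": 2911.67407227,
    "UniqueCarrier": 1222.79492188,
    "fDayOfWeek": 1001.97784424,
    "Distance": 992.55279541,
    "fMonth": 69.5790481567,
}

largest = max(relative.values())
total = sum(relative.values())

for name, value in relative.items():
    scaled = value / largest   # compare with the scaled_importance column
    share = value / total      # compare with the percentage column
    print(f"{name:14s} scaled={scaled:.6f} percentage={share:.6f}")
```

Read this way, Origin, fDayofMonth, and Dest together account for roughly 78% of the total importance, while fMonth contributes less than 1%.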
rf_bal = h2o.random_forest(x=air_train[myX], y=air_train[myY], validation_x=air_valid[myX],
                           validation_y=air_valid[myY], seed=12, ntrees=10, max_depth=20, balance_classes=True)
rf_bal.show()
drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel : Distributed RF
Model Key: DRFModel__924d4015f4c523d250c26449b16945f4
Model Summary:
number_of_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
10.0 | 161279.0 | 20.0 | 20.0 | 20.0 | 1027.0 | 1201.0 | 1095.7 |
ModelMetricsBinomial: drf
** Reported on train data. **
MSE: 0.227050947395
R^2: 0.084631242813
LogLoss: 0.78142590824
AUC: 0.704638202454
Gini: 0.409276404907
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.401939962958:
Act/Pred | NO | YES | Error | Rate |
NO | 3873.0 | 6648.0 | 0.6319 | (6648.0/10521.0) |
YES | 1460.0 | 9070.0 | 0.1387 | (1460.0/10530.0) |
Total | 5333.0 | 15718.0 | 0.7706 | (0.7706/21051.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.401939962958 | 0.691100274307 | 264.0 |
f2 | 0.0 | 0.833452058698 | 399.0 |
f0point5 | 0.605764139792 | 0.653672952435 | 170.0 |
accuracy | 0.586745397134 | 0.652368058525 | 180.0 |
precision | 0.978060912989 | 0.836501901141 | 5.0 |
absolute_MCC | 0.586745397134 | 0.304762140786 | 180.0 |
min_per_class_accuracy | 0.582231845092 | 0.650603554795 | 182.0 |
tns | 1.0 | 10452.0 | 0.0 |
fns | 1.0 | 10199.0 | 0.0 |
fps | 0.0 | 10521.0 | 399.0 |
tps | 0.0 | 10530.0 | 399.0 |
tnr | 1.0 | 0.993441688052 | 0.0 |
fnr | 1.0 | 0.968566001899 | 0.0 |
fpr | 0.0 | 1.0 | 399.0 |
tpr | 0.0 | 1.0 | 399.0 |
ModelMetricsBinomial: drf
** Reported on validation data. **
MSE: 0.216562495458
R^2: 0.122601948124
LogLoss: 0.623880060288
AUC: 0.708433551827
Gini: 0.416867103654
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.405045186133:
Act/Pred | NO | YES | Error | Rate |
NO | 731.0 | 1434.0 | 0.6624 | (1434.0/2165.0) |
YES | 295.0 | 2425.0 | 0.1085 | (295.0/2720.0) |
Total | 1026.0 | 3859.0 | 0.7709 | (0.7709/4885.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.405045186133 | 0.737194102447 | 289.0 |
f2 | 0.135270716497 | 0.863416804373 | 386.0 |
f0point5 | 0.600421656558 | 0.690548294549 | 192.0 |
accuracy | 0.505934494591 | 0.657932446264 | 243.0 |
precision | 0.996011784004 | 1.0 | 0.0 |
absolute_MCC | 0.620419824493 | 0.301529439421 | 180.0 |
min_per_class_accuracy | 0.591067527012 | 0.648161764706 | 197.0 |
tns | 0.996011784004 | 2165.0 | 0.0 |
fns | 0.996011784004 | 2719.0 | 0.0 |
fps | 0.0505929587793 | 2165.0 | 399.0 |
tps | 0.0684239245551 | 2720.0 | 398.0 |
tnr | 0.996011784004 | 1.0 | 0.0 |
fnr | 0.996011784004 | 0.999632352941 | 0.0 |
fpr | 0.0505929587793 | 1.0 | 399.0 |
tpr | 0.0684239245551 | 1.0 | 398.0 |
Scoring History:
timestamp | duration | number_of_trees | training_MSE | training_logloss | training_AUC | training_classification_error | validation_MSE | validation_logloss | validation_AUC | validation_classification_error |
2015-05-22 13:26:27 | 0.104 sec | 1.0 | 0.255954985197 | 2.08023432225 | 0.662981914014 | 0.461290738117 | 0.405418554538 | 2.54355159204 | 0.645635358647 | 0.412282497441 |
2015-05-22 13:26:27 | 0.155 sec | 2.0 | 0.25468472297 | 1.86352558991 | 0.661525255155 | 0.423780968913 | 0.370926635584 | 1.420635165 | 0.676132573699 | 0.414738996929 |
2015-05-22 13:26:27 | 0.233 sec | 3.0 | 0.249598393724 | 1.57169101175 | 0.668207041164 | 0.432992295061 | 0.338066318176 | 1.03915856891 | 0.691084601277 | 0.367860798362 |
2015-05-22 13:26:27 | 0.328 sec | 4.0 | 0.244778275953 | 1.39647444431 | 0.674642688382 | 0.430043687689 | 0.309120172949 | 0.890395753219 | 0.697113758321 | 0.380757420676 |
2015-05-22 13:26:27 | 0.424 sec | 5.0 | 0.240958193613 | 1.29019435107 | 0.681163088793 | 0.407940914567 | 0.284701124531 | 0.812070685409 | 0.696648128651 | 0.371136131013 |
2015-05-22 13:26:27 | 0.525 sec | 6.0 | 0.236190410816 | 1.10655931453 | 0.688713356476 | 0.403375314861 | 0.262138591924 | 0.736384904024 | 0.6989946169 | 0.367656090072 |
2015-05-22 13:26:27 | 0.632 sec | 7.0 | 0.233016124122 | 0.970814521363 | 0.693609936492 | 0.394177426481 | 0.244223116357 | 0.689816810281 | 0.702110956392 | 0.368884339816 |
2015-05-22 13:26:27 | 0.745 sec | 8.0 | 0.230513359481 | 0.901192589552 | 0.698361749513 | 0.385131547188 | 0.230494394698 | 0.657428244182 | 0.705417827062 | 0.362128966223 |
2015-05-22 13:26:27 | 0.866 sec | 9.0 | 0.22875793945 | 0.84306688264 | 0.701440488461 | 0.382718409482 | 0.221628858232 | 0.636504280496 | 0.705721114658 | 0.372569089048 |
2015-05-22 13:26:27 | 0.989 sec | 10.0 | 0.227050947395 | 0.78142590824 | 0.704638202454 | 0.385159849888 | 0.216562495458 | 0.623880060288 | 0.708433551827 | 0.353940634596 |
Variable Importances:
variable | relative_importance | scaled_importance | percentage |
Origin | 5626.28417969 | 1.0 | 0.327272142551 |
fDayofMonth | 4230.62353516 | 0.751939184023 | 0.246088747823 |
Dest | 3676.58178711 | 0.653465354698 | 0.213861006715 |
UniqueCarrier | 1316.92211914 | 0.234066050893 | 0.0766032979742 |
fDayOfWeek | 1163.80688477 | 0.206851777763 | 0.0676968244989 |
Distance | 1089.97692871 | 0.193729448051 | 0.063402251539 |
fMonth | 87.2591629028 | 0.0155091993429 | 0.00507572889822 |
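With balance_classes=True, the training confusion matrix above reports nearly equal class totals (10521 NO and 10530 YES), compared with 8805 NO and 10528 YES for the unbalanced model. The sketch below illustrates one way to get that effect, oversampling the minority class with replacement. It is a toy illustration on plain Python lists and an assumption about the general idea, not H2O's implementation; the function name oversample_minority is hypothetical.

```python
import random
from collections import Counter

def oversample_minority(labels, seed=12):
    """Return row indices in which every class appears as often as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = max(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        # Draw extra rows with replacement until the class reaches the target size
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced

# Class counts taken from the unbalanced training confusion matrix
labels = ["NO"] * 8805 + ["YES"] * 10528
balanced = oversample_minority(labels)
print(Counter(labels[i] for i in balanced))
```

Only the training rows are resampled in this sketch; validation and test frames keep their original class proportions, which matches the unchanged validation totals (2165 NO, 2720 YES) in both runs above.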
air_test = h2o.import_frame(path=h2o.locate("smalldata/airlines/AirlinesTest.csv.zip"))
Parse Progress: [##################################################] 100%
Imported /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols
def model(model_object, test):
    # Predict on the test frame and show the first rows of the predictions
    pred = model_object.predict(test)
    pred.head()
    # Compute test-set performance, then print the confusion matrix and key metrics
    perf = model_object.model_performance(test)
    perf.show()
    print(perf.confusion_matrix())
    print(perf.precision())
    print(perf.accuracy())
    print(perf.auc())
print("\n\nWITHOUT CLASS BALANCING\n")
model(rf_no_bal, air_test)
WITHOUT CLASS BALANCING

First 10 rows and first 3 columns:
Row ID | predict | NO | YES |
1 | YES | 0.2999211110174656 | 0.7000788889825345 |
2 | YES | 0.3735275126993656 | 0.6264724873006344 |
3 | YES | 0.22238414585590363 | 0.7776158541440964 |
4 | YES | 0.3962472975254059 | 0.6037527024745941 |
5 | YES | 0.6098413661122322 | 0.39015863388776784 |
6 | YES | 0.4950307622551918 | 0.5049692377448082 |
7 | NO | 0.6746981769800187 | 0.32530182301998134 |
8 | YES | 0.48598509430885317 | 0.5140149056911468 |
9 | NO | 0.6735334724187851 | 0.32646652758121486 |
10 | NO | 0.7184682190418243 | 0.28153178095817566 |
ModelMetricsBinomial: drf
** Reported on test data. **
MSE: 0.209662417539
R^2: 0.153945177679
LogLoss: 0.618837191629
AUC: 0.731046158615
Gini: 0.462092317229
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.403470018009:
Act/Pred | NO | YES | Error | Rate |
NO | 408.0 | 809.0 | 0.6647 | (809.0/1217.0) |
YES | 129.0 | 1345.0 | 0.0875 | (129.0/1474.0) |
Total | 537.0 | 2154.0 | 0.7522 | (0.7522/2691.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.403470018009 | 0.741455347299 | 293.0 |
f2 | 0.131483560801 | 0.858774178513 | 397.0 |
f0point5 | 0.577017590124 | 0.706999149901 | 203.0 |
accuracy | 0.545709063964 | 0.678558156819 | 219.0 |
precision | 0.949203286087 | 0.970588235294 | 13.0 |
absolute_MCC | 0.545709063964 | 0.348789541022 | 219.0 |
min_per_class_accuracy | 0.579830584209 | 0.672998643148 | 201.0 |
tns | 1.0 | 1216.0 | 0.0 |
fns | 1.0 | 1474.0 | 0.0 |
fps | 0.119230582317 | 1217.0 | 399.0 |
tps | 0.131483560801 | 1474.0 | 397.0 |
tnr | 1.0 | 0.999178307313 | 0.0 |
fnr | 1.0 | 1.0 | 0.0 |
fpr | 0.119230582317 | 1.0 | 399.0 |
tpr | 0.131483560801 | 1.0 | 397.0 |
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.403470018009:
Act/Pred | NO | YES | Error | Rate |
NO | 408.0 | 809.0 | 0.6647 | (809.0/1217.0) |
YES | 129.0 | 1345.0 | 0.0875 | (129.0/1474.0) |
Total | 537.0 | 2154.0 | 0.7522 | (0.7522/2691.0) |
[[0.9492032860871406, 0.9705882352941176]]
[[0.5457090639642307, 0.6785581568190264]]
0.731046158615
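The per-class error rates and the max-F1 value in the test-set output above can be recomputed from the four confusion-matrix counts. Note that the Error entry in the Total row (0.7522) appears to be the sum of the two per-class error rates rather than the overall misclassification rate, so the sketch below computes the overall rate directly from the counts. Variable names are illustrative.

```python
# Counts from the test-set confusion matrix above (rows are actual, columns are predicted)
tn, fp = 408, 809    # actual NO: predicted NO, predicted YES
fn, tp = 129, 1345   # actual YES: predicted NO, predicted YES

total = tn + fp + fn + tp
error_no = fp / (tn + fp)           # share of actual NO rows predicted YES
error_yes = fn / (fn + tp)          # share of actual YES rows predicted NO
overall_error = (fp + fn) / total   # share of all rows misclassified

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"error NO      = {error_no:.4f}")
print(f"error YES     = {error_yes:.4f}")
print(f"overall error = {overall_error:.4f}")
print(f"F1            = {f1:.6f}")
```

The recomputed F1 agrees with the f1 row of the Maximum Metrics table, which confirms that the confusion matrix is the one evaluated at the max-F1 threshold.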
print("\n\nWITH CLASS BALANCING\n")
model(rf_bal, air_test)
WITH CLASS BALANCING

First 10 rows and first 3 columns:
Row ID | predict | NO | YES |
1 | YES | 0.25423263730284795 | 0.7457673626971522 |
2 | YES | 0.3061814057479045 | 0.6938185942520956 |
3 | YES | 0.29582113197078996 | 0.7041788680292099 |
4 | YES | 0.24460687396796132 | 0.7553931260320388 |
5 | YES | 0.5550336349109918 | 0.44496636508900816 |
6 | YES | 0.5633564660627113 | 0.4366435339372887 |
7 | NO | 0.6514019680463551 | 0.3485980319536449 |
8 | YES | 0.41344391039693884 | 0.5865560896030612 |
9 | NO | 0.7010735205237005 | 0.2989264794762995 |
10 | NO | 0.5986760058318191 | 0.40132399416818093 |
ModelMetricsBinomial: drf
** Reported on test data. **
MSE: 0.215031073082
R^2: 0.13228093778
LogLoss: 0.623177900509
AUC: 0.716619152687
Gini: 0.433238305373
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.431652753871:
Act/Pred | NO | YES | Error | Rate |
NO | 456.0 | 761.0 | 0.6253 | (761.0/1217.0) |
YES | 172.0 | 1302.0 | 0.1167 | (172.0/1474.0) |
Total | 628.0 | 2063.0 | 0.742 | (0.742/2691.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.431652753871 | 0.736217133164 | 279.0 |
f2 | 0.168528514213 | 0.859950859951 | 380.0 |
f0point5 | 0.604570530568 | 0.697193500739 | 189.0 |
accuracy | 0.566351616261 | 0.669267930137 | 209.0 |
precision | 0.997537422127 | 1.0 | 0.0 |
absolute_MCC | 0.566351616261 | 0.330678623344 | 209.0 |
min_per_class_accuracy | 0.593757553889 | 0.661462612983 | 195.0 |
tns | 0.997537422127 | 1217.0 | 0.0 |
fns | 0.997537422127 | 1473.0 | 0.0 |
fps | 0.0562118133373 | 1217.0 | 399.0 |
tps | 0.115491940237 | 1474.0 | 394.0 |
tnr | 0.997537422127 | 1.0 | 0.0 |
fnr | 0.997537422127 | 0.999321573948 | 0.0 |
fpr | 0.0562118133373 | 1.0 | 399.0 |
tpr | 0.115491940237 | 1.0 | 394.0 |
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.431652753871:
Act/Pred | NO | YES | Error | Rate |
NO | 456.0 | 761.0 | 0.6253 | (761.0/1217.0) |
YES | 172.0 | 1302.0 | 0.1167 | (172.0/1474.0) |
Total | 628.0 | 2063.0 | 0.742 | (0.742/2691.0) |
[[0.997537422127463, 1.0]]
[[0.5663516162614709, 0.6692679301374953]]
0.716619152687