import h2o
h2o.init()
H2O cluster uptime: | 1 minutes 50 seconds 618 milliseconds |
H2O cluster version: | 3.1.0.99999 |
H2O cluster name: | ece |
H2O cluster total nodes: | 1 |
H2O cluster total memory: | 4.44 GB |
H2O cluster total cores: | 8 |
H2O cluster allowed cores: | 8 |
H2O cluster healthy: | True |
H2O Connection ip: | 127.0.0.1 |
H2O Connection port: | 54321 |
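# Not in the original run: if an H2O cluster is already up, h2o.init() can
# attach to it instead of launching a new one (assuming this h2o-py version
# accepts ip/port arguments; the values below are the defaults shown above).
# h2o.init(ip="127.0.0.1", port=54321)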
# importing the airlines training data into h2o
air = h2o.import_frame(path=h2o.locate("smalldata/airlines/AirlinesTrain.csv.zip"))
Parse Progress: [##################################################] 100%
Imported /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTrain.csv.zip. Parsed 24,421 rows and 12 cols
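# Not in the original run: a quick look at the parsed frame. head() is used
# later in this demo; describe() is assumed available in this h2o-py build.
air.head()        # first rows of the 24,421 x 12 frame
air.describe()    # per-column types and summary statistics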
# Constructing train and validation sets with an 80/20 random split
# runif() creates a column of uniform random numbers as tall as air.nrow()
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]
myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"
# training a gradient boosting machine (GBM)
gbm = h2o.gbm(x=air_train[myX],
              y=air_train[myY],
              validation_x=air_valid[myX],
              validation_y=air_valid[myY],
              distribution="bernoulli",
              ntrees=100,
              max_depth=3,
              learn_rate=0.01)
gbm.show()
gbm Model Build Progress: [##################################################] 100%
Model Details
=============
H2OBinomialModel : Gradient Boosting Machine
Model Key: GBMModel__83569002bd127b1b24610fe4ac52444c
Model Summary:
number_of_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
100.0 | 21889.0 | 3.0 | 3.0 | 3.0 | 8.0 | 8.0 | 8.0 |
ModelMetricsBinomial: gbm
** Reported on train data. **
MSE: 0.224935884507
R^2: 0.0917523735414
LogLoss: 0.641870843139
AUC: 0.700860264576
Gini: 0.401720529152
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.45100329685:
  | NO | YES | Error | Rate |
NO | 2703.0 | 6143.0 | 0.6944 | (6143.0/8846.0) |
YES | 1067.0 | 9680.0 | 0.0993 | (1067.0/10747.0) |
Total | 3770.0 | 15823.0 | 0.7937 | (0.7937/19593.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.45100329685 | 0.728641324802 | 331.0 |
f2 | 0.376803747622 | 0.859382506486 | 396.0 |
f0point5 | 0.538983613241 | 0.683115048095 | 218.0 |
accuracy | 0.521859623661 | 0.654366355331 | 240.0 |
precision | 0.681933134563 | 0.901162790698 | 8.0 |
absolute_MCC | 0.538983613241 | 0.299292001087 | 218.0 |
min_per_class_accuracy | 0.54865448394 | 0.644551967991 | 204.0 |
tns | 0.690888629343 | 8833.0 | 0.0 |
fns | 0.690888629343 | 10648.0 | 0.0 |
fps | 0.371575110378 | 8846.0 | 399.0 |
tps | 0.371575110378 | 10747.0 | 399.0 |
tnr | 0.690888629343 | 0.998530409225 | 0.0 |
fnr | 0.690888629343 | 0.990788126919 | 0.0 |
fpr | 0.371575110378 | 1.0 | 399.0 |
tpr | 0.371575110378 | 1.0 | 399.0 |
ModelMetricsBinomial: gbm
** Reported on validation data. **
MSE: 0.2275183899
R^2: 0.0842002842717
LogLoss: 0.647224224791
AUC: 0.68803214641
Gini: 0.37606429282
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.429662357774:
  | NO | YES | Error | Rate |
NO | 435.0 | 1785.0 | 0.8041 | (1785.0/2220.0) |
YES | 137.0 | 2471.0 | 0.0525 | (137.0/2608.0) |
Total | 572.0 | 4256.0 | 0.8566 | (0.8566/4828.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.429662357774 | 0.719988344988 | 356.0 |
f2 | 0.376803773922 | 0.854684009986 | 396.0 |
f0point5 | 0.539014213255 | 0.674244668246 | 217.0 |
accuracy | 0.526662150196 | 0.65057995029 | 232.0 |
precision | 0.67636982654 | 0.835443037975 | 18.0 |
absolute_MCC | 0.539014213255 | 0.292962334179 | 217.0 |
min_per_class_accuracy | 0.548567487854 | 0.631901840491 | 202.0 |
tns | 0.690888600455 | 2213.0 | 0.0 |
fns | 0.690888600455 | 2589.0 | 0.0 |
fps | 0.371575143654 | 2220.0 | 399.0 |
tps | 0.371575143654 | 2608.0 | 399.0 |
tnr | 0.690888600455 | 0.996846846847 | 0.0 |
fnr | 0.690888600455 | 0.992714723926 | 0.0 |
fpr | 0.371575143654 | 1.0 | 399.0 |
tpr | 0.371575143654 | 1.0 | 399.0 |
Scoring History:
timestamp | duration | number_of_trees | training_MSE | training_logloss | training_AUC | training_classification_error | validation_MSE | validation_logloss | validation_AUC | validation_classification_error | |
2015-05-22 13:19:39 | 0.073 sec | 1.0 | 0.247227169696 | 0.687586187163 | 0.662122035392 | 0.385698974123 | 0.248060763596 | 0.689258980589 | 0.650669457801 | 0.386909693455 | |
2015-05-22 13:19:39 | 0.111 sec | 2.0 | 0.246816106849 | 0.686756385519 | 0.66222330505 | 0.385698974123 | 0.247675119161 | 0.688480563888 | 0.650790619991 | 0.386909693455 | |
2015-05-22 13:19:39 | 0.142 sec | 3.0 | 0.246413615521 | 0.685943950168 | 0.66257594751 | 0.385698974123 | 0.247291514047 | 0.687706372792 | 0.65123718427 | 0.386909693455 | |
2015-05-22 13:19:39 | 0.158 sec | 4.0 | 0.246019285467 | 0.685148026987 | 0.662749723193 | 0.386464553667 | 0.246920994436 | 0.686958679422 | 0.651638409882 | 0.387116818558 | |
2015-05-22 13:19:39 | 0.178 sec | 5.0 | 0.245631966235 | 0.684366278301 | 0.662702378116 | 0.386464553667 | 0.246552236947 | 0.68621460685 | 0.651539096612 | 0.387116818558 | |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
2015-05-22 13:19:42 | 3.694 sec | 80.0 | 0.227535224089 | 0.647373511398 | 0.697257952158 | 0.371101924157 | 0.229781026205 | 0.651996607902 | 0.68473434132 | 0.381731565866 | |
2015-05-22 13:19:43 | 3.777 sec | 81.0 | 0.227384102324 | 0.647055090045 | 0.697614870507 | 0.371101924157 | 0.229651399848 | 0.651724668276 | 0.685020968054 | 0.381731565866 | |
2015-05-22 13:19:43 | 3.861 sec | 82.0 | 0.227239988942 | 0.646750400551 | 0.697702497293 | 0.370846730975 | 0.229522678951 | 0.651453477146 | 0.685221321782 | 0.381731565866 | |
2015-05-22 13:19:43 | 3.947 sec | 83.0 | 0.227073325763 | 0.646400341159 | 0.697905978041 | 0.370846730975 | 0.229386828575 | 0.651167496306 | 0.685343433925 | 0.381731565866 | |
2015-05-22 13:19:43 | 4.183 sec | 100.0 | 0.224935884507 | 0.641870843139 | 0.700860264576 | 0.367988567345 | 0.2275183899 | 0.647224224791 | 0.68803214641 | 0.398094449047 |
Variable Importances:
variable | relative_importance | scaled_importance | percentage |
Origin | 17213.3203125 | 1.0 | 0.685965839068 |
Dest | 4465.96972656 | 0.259448476266 | 0.177972791717 |
UniqueCarrier | 1887.43884277 | 0.109649899526 | 0.075216085332 |
fDayofMonth | 1266.3125 | 0.0735658476698 | 0.0504636584235 |
fMonth | 203.423248291 | 0.0118177809161 | 0.008106594002 |
fDayOfWeek | 57.0886230469 | 0.00331653754246 | 0.00227503145812 |
Distance | 0.0 | 0.0 | 0.0 |
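# Not in the original run: the validation metrics printed above can also be
# pulled programmatically, using the same model_performance()/auc() calls the
# demo applies to the test set below.
gbm_valid_perf = gbm.model_performance(air_valid)
print "GBM validation AUC: {0}".format(gbm_valid_perf.auc())   # ~0.688 on this split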
# training a generalized linear model (GLM) with the L-BFGS solver
glm = h2o.glm(x=air_train[myX],
              y=air_train[myY],
              validation_x=air_valid[myX],
              validation_y=air_valid[myY],
              family="binomial",
              solver="L_BFGS")
glm.pprint_coef()
glm Model Build Progress: [##################################################] 100%
Coefficients:
names | coefficients | standardized_coefficients |
Intercept | 0.0373707847069 | 0.195063579531 |
Origin.ABE | -0.0401578633536 | -0.0401578633536 |
Origin.ABQ | -0.0938267138619 | -0.0938267138619 |
Origin.ACY | -0.135339354063 | -0.135339354063 |
Origin.ALB | 0.0711798459683 | 0.0711798459683 |
--- | --- | --- |
fDayOfWeek.f6 | -0.156236716144 | -0.156236716144 |
fDayOfWeek.f7 | 0.0472831537707 | 0.0472831537707 |
fMonth.f1 | -0.221575958907 | -0.221575958907 |
fMonth.f10 | 0.208857303935 | 0.208857303935 |
Distance | 0.00020866333889 | 0.131663819411 |
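# Not in the original run: besides pprint_coef(), the coefficients can be
# retrieved as a Python dict, assuming coef() is available in this h2o-py build.
coefs = glm.coef()
print coefs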
# importing the airlines test data into h2o
air_test = h2o.import_frame(path=h2o.locate("smalldata/airlines/AirlinesTest.csv.zip"))
Parse Progress: [##################################################] 100%
Imported /Users/ece/0xdata/h2o-dev/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols
# predicting on the test set and measuring test-set performance
gbm_pred = gbm.predict(air_test)
print "GBM predictions: "
gbm_pred.head()
gbm_perf = gbm.model_performance(air_test)
print "GBM performance: "
gbm_perf.show()
glm_pred = glm.predict(air_test)
print "GLM predictions: "
glm_pred.head()
glm_perf = glm.model_performance(air_test)
print "GLM performance: "
glm_perf.show()
GBM predictions:
First 10 rows and first 3 columns:
Row ID | predict | NO | YES |
1 | YES | 0.47525141674393157 | 0.5247485832560684 |
2 | YES | 0.48024938136117845 | 0.5197506186388215 |
3 | YES | 0.48024938136117845 | 0.5197506186388215 |
4 | YES | 0.402168737810524 | 0.597831262189476 |
5 | YES | 0.5136446294303063 | 0.48635537056969363 |
6 | YES | 0.5136446294303063 | 0.48635537056969363 |
7 | YES | 0.5478525167901855 | 0.45214748320981446 |
8 | YES | 0.5580925509767907 | 0.4419074490232094 |
9 | YES | 0.5580925509767907 | 0.4419074490232094 |
10 | YES | 0.5580925509767907 | 0.4419074490232094 |
GBM performance:
ModelMetricsBinomial: gbm
** Reported on test data. **
MSE: 0.226368400011
R^2: 0.0865312024067
LogLoss: 0.644861693711
AUC: 0.692409878597
Gini: 0.384819757194
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.441901440341:
  | NO | YES | Error | Rate |
NO | 293.0 | 924.0 | 0.7592 | (924.0/1217.0) |
YES | 112.0 | 1362.0 | 0.076 | (112.0/1474.0) |
Total | 405.0 | 2286.0 | 0.8352 | (0.8352/2691.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.441901440341 | 0.724468085106 | 339.0 |
f2 | 0.383773415183 | 0.859786810355 | 391.0 |
f0point5 | 0.543468874795 | 0.6851506265 | 213.0 |
accuracy | 0.522296596314 | 0.657748049052 | 242.0 |
precision | 0.678096525394 | 0.847222222222 | 14.0 |
absolute_MCC | 0.543468874795 | 0.30464489118 | 213.0 |
min_per_class_accuracy | 0.549319523433 | 0.642469470828 | 206.0 |
tns | 0.690888600455 | 1213.0 | 0.0 |
fns | 0.690888600455 | 1461.0 | 0.0 |
fps | 0.371575143654 | 1217.0 | 399.0 |
tps | 0.371575143654 | 1474.0 | 399.0 |
tnr | 0.690888600455 | 0.996713229252 | 0.0 |
fnr | 0.690888600455 | 0.99118046133 | 0.0 |
fpr | 0.371575143654 | 1.0 | 399.0 |
tpr | 0.371575143654 | 1.0 | 399.0 |
GLM predictions:
First 10 rows and first 3 columns:
Row ID | predict | p0 | p1 |
1 | YES | 0.33138044246038023 | 0.6686195575396198 |
2 | YES | 0.3914744148501228 | 0.6085255851498772 |
3 | YES | 0.36039204225753896 | 0.639607957742461 |
4 | YES | 0.4304740051645429 | 0.5695259948354571 |
5 | YES | 0.5256165167500713 | 0.4743834832499287 |
6 | YES | 0.5562418812088273 | 0.44375811879117266 |
7 | YES | 0.48440139277691874 | 0.5155986072230813 |
8 | YES | 0.44487802611756044 | 0.5551219738824396 |
9 | YES | 0.5819723452658147 | 0.41802765473418535 |
10 | YES | 0.5685108555327485 | 0.4314891444672515 |
GLM performance:
ModelMetricsBinomialGLM: glm
** Reported on test data. **
MSE: 0.220260505275
R^2: 0.111178508566
LogLoss: 0.630774448994
Null degrees of freedom: 2690
Residual degrees of freedom: 2438
Null deviance: 3705.94255374
Residual deviance: 3394.82808448
AIC: 3900.82808448
AUC: 0.69739355066
Gini: 0.39478710132
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.443988379059:
  | NO | YES | Error | Rate |
NO | 391.0 | 826.0 | 0.6787 | (826.0/1217.0) |
YES | 161.0 | 1313.0 | 0.1092 | (161.0/1474.0) |
Total | 552.0 | 2139.0 | 0.7879 | (0.7879/2691.0) |
Maximum Metrics:
metric | threshold | value | idx |
f1 | 0.443988379059 | 0.726819817326 | 284.0 |
f2 | 0.247001441468 | 0.860535860536 | 382.0 |
f0point5 | 0.569158065903 | 0.685638454733 | 183.0 |
accuracy | 0.540614921318 | 0.655890003716 | 211.0 |
precision | 0.887237238744 | 1.0 | 0.0 |
absolute_MCC | 0.569158065903 | 0.303360041004 | 183.0 |
min_per_class_accuracy | 0.563183947037 | 0.644504748982 | 189.0 |
tns | 0.887237238744 | 1217.0 | 0.0 |
fns | 0.887237238744 | 1472.0 | 0.0 |
fps | 0.186084076673 | 1217.0 | 399.0 |
tps | 0.215917647428 | 1474.0 | 393.0 |
tnr | 0.887237238744 | 1.0 | 0.0 |
fnr | 0.887237238744 | 0.998643147897 | 0.0 |
fpr | 0.186084076673 | 1.0 | 399.0 |
tpr | 0.215917647428 | 1.0 | 393.0 |
# Building the confusion matrices for the test set
gbm_CM = gbm_perf.confusion_matrix()
print(gbm_CM)
print
glm_CM = glm_perf.confusion_matrix()
print(glm_CM)
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.441901440341:
  | NO | YES | Error | Rate |
NO | 293.0 | 924.0 | 0.7592 | (924.0/1217.0) |
YES | 112.0 | 1362.0 | 0.076 | (112.0/1474.0) |
Total | 405.0 | 2286.0 | 0.8352 | (0.8352/2691.0) |
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.443988379059:
  | NO | YES | Error | Rate |
NO | 391.0 | 826.0 | 0.6787 | (826.0/1217.0) |
YES | 161.0 | 1313.0 | 0.1092 | (161.0/1474.0) |
Total | 552.0 | 2139.0 | 0.7879 | (0.7879/2691.0) |
# Test-set precision, accuracy, and AUC
print('GBM Precision: {0}'.format(gbm_perf.precision()))
print('GBM Accuracy: {0}'.format(gbm_perf.accuracy()))
print('GBM AUC: {0}'.format(gbm_perf.auc()))
print
print('GLM Precision: {0}'.format(glm_perf.precision()))
print('GLM Accuracy: {0}'.format(glm_perf.accuracy()))
print('GLM AUC: {0}'.format(glm_perf.auc()))
GBM Precision: [[0.6780965253938488, 0.8472222222222222]]
GBM Accuracy: [[0.5222965963143628, 0.6577480490523968]]
GBM AUC: 0.692409878597

GLM Precision: [[0.8872372387438643, 1.0]]
GLM Accuracy: [[0.5406149213176982, 0.6558900037160906]]
GLM AUC: 0.69739355066
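# Not in the original run: a compact side-by-side of the test AUCs above,
# using the same perf objects and auc() calls as the prints just before.
print "Test AUC -- GBM: {0}, GLM: {1}".format(gbm_perf.auc(), glm_perf.auc())
# Optionally shut the cluster down when finished (assuming h2o.shutdown()
# is available in this h2o-py version):
# h2o.shutdown()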