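# This is a demo of H2O's Random Forest with and without class balancing
# It imports a data set, parses it, and splits it into training and validation frames
# Then, it trains two binomial Random Forest models that differ only in balance_classes and compares them on a test set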
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()
Warning: Version mismatch. H2O is version 3.5.0.99999, but the python package is version UNKNOWN.
H2O cluster uptime: | 44 minutes 50 seconds 74 milliseconds |
H2O cluster version: | 3.5.0.99999 |
H2O cluster name: | ludirehak |
H2O cluster total nodes: | 1 |
H2O cluster total memory: | 3.56 GB |
H2O cluster total cores: | 8 |
H2O cluster allowed cores: | 8 |
H2O cluster healthy: | True |
H2O Connection ip: | 127.0.0.1 |
H2O Connection port: | 54321 |
from h2o.utils.shared_utils import _locate # private function. used to find files within h2o git project directory.
air = h2o.upload_file(path=_locate("smalldata/airlines/AirlinesTrain.csv.zip"))
Parse Progress: [##################################################] 100%
Uploaded pya01a74e5-0aa6-4ef0-ae1a-0d3fe860eee9 into cluster with 24,421 rows and 12 cols
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]
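The three lines above split the frame by drawing one uniform random number per row and sending rows below 0.8 to training and the rest to validation. A minimal pure-Python sketch of the same idea follows. `split_by_uniform` is a hypothetical helper, not part of H2O, and the standard library's `random` module stands in for `runif()`.

```python
import random

def split_by_uniform(rows, train_fraction=0.8, seed=None):
    """Split rows into (train, valid) using one uniform draw per row."""
    rng = random.Random(seed)
    train, valid = [], []
    for row in rows:
        # Mirrors `r < 0.8` / `r >= 0.8`: each row lands in exactly one part.
        if rng.random() < train_fraction:
            train.append(row)
        else:
            valid.append(row)
    return train, valid

rows = list(range(24421))  # same row count as the uploaded frame
train, valid = split_by_uniform(rows, seed=12)
print(len(train), len(valid))
```

Because membership is decided row by row, the split is only approximately 80/20, and it changes from run to run unless the random draws are seeded.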
myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"
rf_no_bal = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=False)
rf_no_bal.train(x=myX, y=myY, training_frame=air_train, validation_frame=air_valid)
rf_no_bal.show()
drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2ORandomForestEstimator : Distributed RF
Model Key: DRF_model_python_1445557087082_2742
Model Summary:
number_of_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
10.0 | 287650.0 | 20.0 | 20.0 | 20.0 | 1664.0 | 2418.0 | 2103.5 |
ModelMetricsBinomial: drf
** Reported on train data. **
MSE: 0.269503006052
R^2: -0.0873991649123
LogLoss: 2.43382549553
AUC: 0.646622642412
Gini: 0.293245284825
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.402941766395:
Act/Pred | NO | YES | Error | Rate |
NO | 1948.0 | 6780.0 | 0.7768 | (6780.0/8728.0) |
YES | 936.0 | 9580.0 | 0.089 | (936.0/10516.0) |
Total | 2884.0 | 16360.0 | 0.401 | (7716.0/19244.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx |
max f1 | 0.4 | 0.7 | 299.0 |
max f2 | 0.0 | 0.9 | 399.0 |
max f0point5 | 0.6 | 0.7 | 190.0 |
max accuracy | 0.6 | 0.6 | 193.0 |
max precision | 0.9 | 0.7 | 30.0 |
max absolute_MCC | 0.6 | 0.2 | 190.0 |
max min_per_class_accuracy | 0.7 | 0.6 | 140.0 |
ModelMetricsBinomial: drf
** Reported on validation data. **
MSE: 0.245293478794
R^2: 0.00968032826017
LogLoss: 0.758757679035
AUC: 0.685987609758
Gini: 0.371975219515
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.42132409513:
Act/Pred | NO | YES | Error | Rate |
NO | 467.0 | 1781.0 | 0.7923 | (1781.0/2248.0) |
YES | 160.0 | 2566.0 | 0.0587 | (160.0/2726.0) |
Total | 627.0 | 4347.0 | 0.3902 | (1941.0/4974.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx |
max f1 | 0.4 | 0.7 | 315.0 |
max f2 | 0.2 | 0.9 | 396.0 |
max f0point5 | 0.7 | 0.7 | 174.0 |
max accuracy | 0.7 | 0.6 | 200.0 |
max precision | 1.0 | 0.9 | 0.0 |
max absolute_MCC | 0.7 | 0.3 | 174.0 |
max min_per_class_accuracy | 0.7 | 0.6 | 165.0 |
Scoring History:
timestamp | duration | number_of_trees | training_MSE | training_logloss | training_AUC | training_classification_error | validation_MSE | validation_logloss | validation_AUC | validation_classification_error |
2015-10-22 17:22:58 | 0.074 sec | 1.0 | 0.3 | 8.4 | 0.6 | 0.4 | 0.3 | 8.1 | 0.6 | 0.5 |
2015-10-22 17:22:58 | 0.163 sec | 2.0 | 0.3 | 7.4 | 0.6 | 0.4 | 0.3 | 4.0 | 0.6 | 0.4 |
2015-10-22 17:22:58 | 0.245 sec | 3.0 | 0.3 | 6.5 | 0.6 | 0.4 | 0.3 | 2.6 | 0.6 | 0.4 |
2015-10-22 17:22:58 | 0.311 sec | 4.0 | 0.3 | 5.6 | 0.6 | 0.5 | 0.3 | 1.9 | 0.7 | 0.4 |
2015-10-22 17:22:58 | 0.391 sec | 5.0 | 0.3 | 4.8 | 0.6 | 0.4 | 0.3 | 1.4 | 0.7 | 0.4 |
2015-10-22 17:22:58 | 0.480 sec | 6.0 | 0.3 | 4.0 | 0.6 | 0.4 | 0.3 | 1.1 | 0.7 | 0.4 |
2015-10-22 17:22:58 | 0.565 sec | 7.0 | 0.3 | 3.6 | 0.6 | 0.4 | 0.2 | 1.0 | 0.7 | 0.4 |
2015-10-22 17:22:58 | 0.659 sec | 8.0 | 0.3 | 3.1 | 0.6 | 0.4 | 0.2 | 0.9 | 0.7 | 0.4 |
2015-10-22 17:22:58 | 0.751 sec | 9.0 | 0.3 | 2.7 | 0.6 | 0.4 | 0.2 | 0.8 | 0.7 | 0.4 |
2015-10-22 17:22:58 | 0.851 sec | 10.0 | 0.3 | 2.4 | 0.6 | 0.4 | 0.2 | 0.8 | 0.7 | 0.4 |
Variable Importances:
variable | relative_importance | scaled_importance | percentage |
Origin | 6152.2 | 1.0 | 0.3 |
fDayofMonth | 5583.6 | 0.9 | 0.3 |
Dest | 4203.4 | 0.7 | 0.2 |
UniqueCarrier | 1609.3 | 0.3 | 0.1 |
fDayOfWeek | 1556.2 | 0.3 | 0.1 |
Distance | 1493.0 | 0.2 | 0.1 |
fMonth | 131.7 | 0.0 | 0.0 |
rf_bal = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=True)
rf_bal.train(x=myX, y=myY, training_frame=air_train, validation_frame=air_valid)
rf_bal.show()
drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2ORandomForestEstimator : Distributed RF
Model Key: DRF_model_python_1445557087082_2744
Model Summary:
number_of_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
10.0 | 299144.0 | 20.0 | 20.0 | 20.0 | 1750.0 | 2460.0 | 2168.2 |
ModelMetricsBinomial: drf
** Reported on train data. **
MSE: 0.268874582249
R^2: -0.0754992978501
LogLoss: 2.09200342169
AUC: 0.685292136376
Gini: 0.370584272753
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.538182890839:
Act/Pred | NO | YES | Error | Rate |
NO | 3925.0 | 6621.0 | 0.6278 | (6621.0/10546.0) |
YES | 1574.0 | 8952.0 | 0.1495 | (1574.0/10526.0) |
Total | 5499.0 | 15573.0 | 0.3889 | (8195.0/21072.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx |
max f1 | 0.5 | 0.7 | 226.0 |
max f2 | 0.0 | 0.8 | 399.0 |
max f0point5 | 0.8 | 0.6 | 124.0 |
max accuracy | 0.7 | 0.6 | 140.0 |
max precision | 0.9 | 0.7 | 28.0 |
max absolute_MCC | 0.7 | 0.3 | 151.0 |
max min_per_class_accuracy | 0.7 | 0.6 | 140.0 |
ModelMetricsBinomial: drf
** Reported on validation data. **
MSE: 0.249809873778
R^2: -0.00855364526058
LogLoss: 0.770654128805
AUC: 0.682375448104
Gini: 0.364750896207
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.56328826827:
Act/Pred | NO | YES | Error | Rate |
NO | 822.0 | 1426.0 | 0.6343 | (1426.0/2248.0) |
YES | 367.0 | 2359.0 | 0.1346 | (367.0/2726.0) |
Total | 1189.0 | 3785.0 | 0.3605 | (1793.0/4974.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx |
max f1 | 0.6 | 0.7 | 261.0 |
max f2 | 0.1 | 0.9 | 399.0 |
max f0point5 | 0.7 | 0.7 | 179.0 |
max accuracy | 0.6 | 0.6 | 235.0 |
max precision | 1.0 | 0.8 | 6.0 |
max absolute_MCC | 0.7 | 0.3 | 194.0 |
max min_per_class_accuracy | 0.7 | 0.6 | 167.0 |
Scoring History:
timestamp | duration | number_of_trees | training_MSE | training_logloss | training_AUC | training_classification_error | validation_MSE | validation_logloss | validation_AUC | validation_classification_error |
2015-10-22 17:22:59 | 0.093 sec | 1.0 | 0.3 | 7.3 | 0.6 | 0.4 | 0.3 | 7.9 | 0.6 | 0.5 |
2015-10-22 17:22:59 | 0.152 sec | 2.0 | 0.3 | 6.8 | 0.6 | 0.4 | 0.3 | 3.7 | 0.6 | 0.4 |
2015-10-22 17:22:59 | 0.210 sec | 3.0 | 0.3 | 5.9 | 0.6 | 0.4 | 0.3 | 2.2 | 0.6 | 0.4 |
2015-10-22 17:22:59 | 0.287 sec | 4.0 | 0.3 | 5.2 | 0.6 | 0.4 | 0.3 | 1.6 | 0.7 | 0.4 |
2015-10-22 17:22:59 | 0.377 sec | 5.0 | 0.3 | 4.3 | 0.7 | 0.4 | 0.3 | 1.3 | 0.7 | 0.4 |
2015-10-22 17:22:59 | 0.469 sec | 6.0 | 0.3 | 3.7 | 0.7 | 0.4 | 0.3 | 1.0 | 0.7 | 0.4 |
2015-10-22 17:22:59 | 0.571 sec | 7.0 | 0.3 | 3.2 | 0.7 | 0.4 | 0.3 | 0.9 | 0.7 | 0.4 |
2015-10-22 17:22:59 | 0.678 sec | 8.0 | 0.3 | 2.8 | 0.7 | 0.4 | 0.3 | 0.9 | 0.7 | 0.4 |
2015-10-22 17:22:59 | 0.784 sec | 9.0 | 0.3 | 2.4 | 0.7 | 0.4 | 0.2 | 0.8 | 0.7 | 0.4 |
2015-10-22 17:22:59 | 0.894 sec | 10.0 | 0.3 | 2.1 | 0.7 | 0.4 | 0.2 | 0.8 | 0.7 | 0.4 |
Variable Importances:
variable | relative_importance | scaled_importance | percentage |
Origin | 6811.1 | 1.0 | 0.3 |
fDayofMonth | 6129.0 | 0.9 | 0.3 |
Dest | 4860.0 | 0.7 | 0.2 |
UniqueCarrier | 1824.5 | 0.3 | 0.1 |
fDayOfWeek | 1634.1 | 0.2 | 0.1 |
Distance | 1591.5 | 0.2 | 0.1 |
fMonth | 129.6 | 0.0 | 0.0 |
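The training confusion matrices hint at what `balance_classes=True` changes: the per-class training totals go from 8,728 NO and 10,516 YES rows without balancing to 10,546 NO and 10,526 YES rows with it, which is close to equal. One common way to get that effect is to oversample the minority class. The sketch below illustrates that idea in plain Python; it is an illustration under that assumption, not H2O's actual resampling procedure, and `oversample_minority` is a hypothetical helper.

```python
import random
from collections import Counter

def oversample_minority(labels, seed=None):
    """Return row indices in which every class appears as often as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        # Draw extra rows with replacement until the class reaches the target size.
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced

# Class totals taken from the unbalanced model's training confusion matrix.
labels = ["NO"] * 8728 + ["YES"] * 10516
balanced = oversample_minority(labels, seed=12)
counts = Counter(labels[i] for i in balanced)
print(counts)
```

Oversampling with replacement repeats some minority rows, so the balanced sample carries no new information; it only changes how much weight each class gets during training.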
air_test = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTest.csv.zip"))
Parse Progress: [##################################################] 100%
Imported /Users/ludirehak/h2o-3/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols
def model(model_object, test):
    # Predict on the test frame and preview the first rows
    pred = model_object.predict(test)
    pred.head()
    # Compute test-set performance, including the confusion matrix
    perf = model_object.model_performance(test)
    perf.show()
    print(perf.confusion_matrix())
    print(perf.precision())
    print(perf.accuracy())
    print(perf.auc())
print("\n\nWITHOUT CLASS BALANCING\n")
model(rf_no_bal, air_test)
WITHOUT CLASS BALANCING
H2OFrame with 2691 rows and 3 columns:
predict | NO | YES |
YES | 0.1 | 0.9 |
YES | 0.0 | 1.0 |
YES | 0.225 | 0.775 |
YES | 0.175 | 0.825 |
YES | 0.5 | 0.5 |
YES | 0.4 | 0.6 |
NO | 0.6 | 0.4 |
YES | 0.3 | 0.7 |
YES | 0.3 | 0.7 |
YES | 0.4 | 0.6 |
ModelMetricsBinomial: drf
** Reported on test data. **
MSE: 0.242134967995
R^2: 0.0225448334417
LogLoss: 0.818660036508
AUC: 0.705312795104
Gini: 0.410625590208
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.51742125228:
Act/Pred | NO | YES | Error | Rate |
NO | 377.0 | 840.0 | 0.6902 | (840.0/1217.0) |
YES | 143.0 | 1331.0 | 0.097 | (143.0/1474.0) |
Total | 520.0 | 2171.0 | 0.3653 | (983.0/2691.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx |
max f1 | 0.5 | 0.7 | 276.0 |
max f2 | 0.2 | 0.9 | 381.0 |
max f0point5 | 0.7 | 0.7 | 174.0 |
max accuracy | 0.7 | 0.7 | 186.0 |
max precision | 1.0 | 0.9 | 7.0 |
max absolute_MCC | 0.7 | 0.3 | 174.0 |
max min_per_class_accuracy | 0.7 | 0.7 | 162.0 |
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.51742125228:
Act/Pred | NO | YES | Error | Rate |
NO | 377.0 | 840.0 | 0.6902 | (840.0/1217.0) |
YES | 143.0 | 1331.0 | 0.097 | (143.0/1474.0) |
Total | 520.0 | 2171.0 | 0.3653 | (983.0/2691.0) |
[[0.985450211376883, 0.8556701030927835]]
[[0.6939187561627477, 0.6651802303976218]]
0.705312795104
print("\n\nWITH CLASS BALANCING\n")
model(rf_bal, air_test)
WITH CLASS BALANCING
H2OFrame with 2691 rows and 3 columns:
predict | NO | YES |
YES | 0.0 | 1.0 |
YES | 0.3 | 0.7 |
YES | 0.1 | 0.9 |
YES | 0.0 | 1.0 |
NO | 0.4 | 0.6 |
NO | 0.5 | 0.5 |
NO | 0.7 | 0.3 |
YES | 0.1 | 0.9 |
YES | 0.3 | 0.7 |
NO | 0.5 | 0.5 |
ModelMetricsBinomial: drf
** Reported on test data. **
MSE: 0.24831550935
R^2: -0.00240489657592
LogLoss: 0.758488823047
AUC: 0.693547371085
Gini: 0.38709474217
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.475092852495:
Act/Pred | NO | YES | Error | Rate |
NO | 269.0 | 948.0 | 0.779 | (948.0/1217.0) |
YES | 85.0 | 1389.0 | 0.0577 | (85.0/1474.0) |
Total | 354.0 | 2337.0 | 0.3839 | (1033.0/2691.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
metric | threshold | value | idx |
max f1 | 0.5 | 0.7 | 307.0 |
max f2 | 0.3 | 0.9 | 379.0 |
max f0point5 | 0.7 | 0.7 | 184.0 |
max accuracy | 0.7 | 0.7 | 210.0 |
max precision | 1.0 | 0.85 | 1.0 |
max absolute_MCC | 0.7 | 0.3 | 210.0 |
max min_per_class_accuracy | 0.7 | 0.6 | 164.0 |
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.475092852495:
Act/Pred | NO | YES | Error | Rate |
NO | 269.0 | 948.0 | 0.779 | (948.0/1217.0) |
YES | 85.0 | 1389.0 | 0.0577 | (85.0/1474.0) |
Total | 354.0 | 2337.0 | 0.3839 | (1033.0/2691.0) |
[[0.9962384300103982, 0.85]]
[[0.6673053431289202, 0.6540319583797845]]
0.693547371085
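The Error column in each confusion matrix is the share of misclassified rows per actual class, and the Total row is the overall misclassification rate. As a quick check, the sketch below recomputes those rates for both models from the test-set counts printed above. `error_rates` is a hypothetical helper, and the counts are copied by hand from the two test confusion matrices.

```python
def error_rates(matrix):
    """matrix maps actual class -> (predicted NO, predicted YES) counts."""
    rates = {}
    wrong_total = 0
    row_total = 0
    for actual, (pred_no, pred_yes) in matrix.items():
        # A row is wrong when the predicted class differs from the actual class.
        wrong = pred_yes if actual == "NO" else pred_no
        rates[actual] = wrong / (pred_no + pred_yes)
        wrong_total += wrong
        row_total += pred_no + pred_yes
    rates["Total"] = wrong_total / row_total
    return rates

# Test-set counts at each model's max-F1 threshold.
no_bal = {"NO": (377, 840), "YES": (143, 1331)}
bal = {"NO": (269, 948), "YES": (85, 1389)}

for name, matrix in (("without balancing", no_bal), ("with balancing", bal)):
    rates = error_rates(matrix)
    print(name, {k: round(v, 4) for k, v in rates.items()})
```

At these max-F1 thresholds the balanced model misclassifies fewer YES rows (85 vs. 143) but more NO rows (948 vs. 840), and its overall test error is slightly higher (0.3839 vs. 0.3653). The two thresholds differ, so this compares operating points rather than the models at a common cutoff.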