In [1]:

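# This is a demo of H2O's Random Forest (DRF) estimator, with and without class balancing
# It uploads the airlines training data and splits it into training and validation frames
# Then, it trains two binomial classifiers and scores both on a separate test set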
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

In [2]:

h2o.init()

Warning: Version mismatch. H2O is version 3.5.0.99999, but the python package is version UNKNOWN.

H2O cluster uptime:	44 minutes 50 seconds 74 milliseconds
H2O cluster version:	3.5.0.99999
H2O cluster name:	ludirehak
H2O cluster total nodes:	1
H2O cluster total memory:	3.56 GB
H2O cluster total cores:	8
H2O cluster allowed cores:	8
H2O cluster healthy:	True
H2O Connection ip:	127.0.0.1
H2O Connection port:	54321

In [3]:

# _locate is a private helper that finds files within the h2o git project directory.
from h2o.utils.shared_utils import _locate

air = h2o.upload_file(path=_locate("smalldata/airlines/AirlinesTrain.csv.zip"))

Parse Progress: [##################################################] 100%
Uploaded pya01a74e5-0aa6-4ef0-ae1a-0d3fe860eee9 into cluster with 24,421 rows and 12 cols

In [4]:

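# Draw one uniform random number per row, then split roughly 80/20 into training and validation frames
r = air[0].runif()
air_train = air[r < 0.8]
air_valid = air[r >= 0.8]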
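
The split above relies on one uniform draw per row. As a rough illustration of the same idea without an H2O cluster, the following plain-Python sketch assigns row indices to two partitions from seeded uniform draws. The helper name `split_by_uniform` and the seed value are hypothetical; the row count is the one reported by the upload step. The cell above does not pass a seed to `runif()`, so its split can differ from run to run.

```python
import random

def split_by_uniform(n_rows, train_fraction=0.8, seed=12):
    """Return (train_idx, valid_idx) using one uniform draw per row."""
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(n_rows)]
    # Rows whose draw falls below the cutoff go to training, the rest to validation
    train_idx = [i for i, u in enumerate(draws) if u < train_fraction]
    valid_idx = [i for i, u in enumerate(draws) if u >= train_fraction]
    return train_idx, valid_idx

# 24,421 is the number of rows reported when the training file was uploaded
train_idx, valid_idx = split_by_uniform(24421)
print(len(train_idx), len(valid_idx))
```

Because membership depends on the draws, the partition sizes are only approximately 80% and 20% of the rows.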

In [5]:

# Predictor columns and the binary response column
myX = ["Origin", "Dest", "Distance", "UniqueCarrier", "fMonth", "fDayofMonth", "fDayOfWeek"]
myY = "IsDepDelayed"

In [6]:

# Baseline: random forest trained on the data as-is (no class balancing)
rf_no_bal = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=False)
rf_no_bal.train(x=myX, y=myY, training_frame=air_train, validation_frame=air_valid)
rf_no_bal.show()

drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2ORandomForestEstimator :  Distributed RF
Model Key:  DRF_model_python_1445557087082_2742

Model Summary:

	number_of_trees	model_size_in_bytes	min_depth	max_depth	mean_depth	min_leaves	max_leaves	mean_leaves
	10.0	287650.0	20.0	20.0	20.0	1664.0	2418.0	2103.5


ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.269503006052
R^2: -0.0873991649123
LogLoss: 2.43382549553
AUC: 0.646622642412
Gini: 0.293245284825

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.402941766395:

	NO	YES	Error	Rate
NO	1948.0	6780.0	0.7768	(6780.0/8728.0)
YES	936.0	9580.0	0.089	(936.0/10516.0)
Total	2884.0	16360.0	0.401	(7716.0/19244.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric	threshold	value	idx
max f1	0.4	0.7	299.0
max f2	0.0	0.9	399.0
max f0point5	0.6	0.7	190.0
max accuracy	0.6	0.6	193.0
max precision	0.9	0.7	30.0
max absolute_MCC	0.6	0.2	190.0
max min_per_class_accuracy	0.7	0.6	140.0

ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.245293478794
R^2: 0.00968032826017
LogLoss: 0.758757679035
AUC: 0.685987609758
Gini: 0.371975219515

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.42132409513:

	NO	YES	Error	Rate
NO	467.0	1781.0	0.7923	(1781.0/2248.0)
YES	160.0	2566.0	0.0587	(160.0/2726.0)
Total	627.0	4347.0	0.3902	(1941.0/4974.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric	threshold	value	idx
max f1	0.4	0.7	315.0
max f2	0.2	0.9	396.0
max f0point5	0.7	0.7	174.0
max accuracy	0.7	0.6	200.0
max precision	1.0	0.9	0.0
max absolute_MCC	0.7	0.3	174.0
max min_per_class_accuracy	0.7	0.6	165.0

Scoring History:

timestamp	duration	number_of_trees	training_MSE	training_logloss	training_AUC	training_classification_error	validation_MSE	validation_logloss	validation_AUC	validation_classification_error
2015-10-22 17:22:58	0.074 sec	1.0	0.3	8.4	0.6	0.4	0.3	8.1	0.6	0.5
2015-10-22 17:22:58	0.163 sec	2.0	0.3	7.4	0.6	0.4	0.3	4.0	0.6	0.4
2015-10-22 17:22:58	0.245 sec	3.0	0.3	6.5	0.6	0.4	0.3	2.6	0.6	0.4
2015-10-22 17:22:58	0.311 sec	4.0	0.3	5.6	0.6	0.5	0.3	1.9	0.7	0.4
2015-10-22 17:22:58	0.391 sec	5.0	0.3	4.8	0.6	0.4	0.3	1.4	0.7	0.4
2015-10-22 17:22:58	0.480 sec	6.0	0.3	4.0	0.6	0.4	0.3	1.1	0.7	0.4
2015-10-22 17:22:58	0.565 sec	7.0	0.3	3.6	0.6	0.4	0.2	1.0	0.7	0.4
2015-10-22 17:22:58	0.659 sec	8.0	0.3	3.1	0.6	0.4	0.2	0.9	0.7	0.4
2015-10-22 17:22:58	0.751 sec	9.0	0.3	2.7	0.6	0.4	0.2	0.8	0.7	0.4
2015-10-22 17:22:58	0.851 sec	10.0	0.3	2.4	0.6	0.4	0.2	0.8	0.7	0.4

Variable Importances:

variable	relative_importance	scaled_importance	percentage
Origin	6152.2	1.0	0.3
fDayofMonth	5583.6	0.9	0.3
Dest	4203.4	0.7	0.2
UniqueCarrier	1609.3	0.3	0.1
fDayOfWeek	1556.2	0.3	0.1
Distance	1493.0	0.2	0.1
fMonth	131.7	0.0	0.0
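
The Error and Rate columns in the confusion matrices above, and the Gini value, can be recomputed from the other reported numbers. The sketch below does this for the training report of rf_no_bal, assuming YES is the positive class; `confusion_rates` is a hypothetical helper, not part of the H2O API.

```python
def confusion_rates(tn, fp, fn, tp):
    """Derive per-class and overall error rates, precision, recall and F1."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "no_error": fp / (tn + fp),        # actual NO predicted as YES
        "yes_error": fn / (fn + tp),       # actual YES predicted as NO
        "total_error": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts copied from the training confusion matrix reported for rf_no_bal
rates = confusion_rates(tn=1948, fp=6780, fn=936, tp=9580)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")

# The reported Gini is consistent with Gini = 2 * AUC - 1
auc = 0.646622642412
gini = 2 * auc - 1
print(f"gini: {gini:.6f}")
```

The three error rates agree with the 0.7768, 0.089 and 0.401 shown in the Error column of that table.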

In [7]:

# Same settings as rf_no_bal, but with class balancing turned on
rf_bal = H2ORandomForestEstimator(seed=12, ntrees=10, max_depth=20, balance_classes=True)
rf_bal.train(x=myX, y=myY, training_frame=air_train, validation_frame=air_valid)
rf_bal.show()

drf Model Build Progress: [##################################################] 100%
Model Details
=============
H2ORandomForestEstimator :  Distributed RF
Model Key:  DRF_model_python_1445557087082_2744

Model Summary:

	number_of_trees	model_size_in_bytes	min_depth	max_depth	mean_depth	min_leaves	max_leaves	mean_leaves
	10.0	299144.0	20.0	20.0	20.0	1750.0	2460.0	2168.2


ModelMetricsBinomial: drf
** Reported on train data. **

MSE: 0.268874582249
R^2: -0.0754992978501
LogLoss: 2.09200342169
AUC: 0.685292136376
Gini: 0.370584272753

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.538182890839:

	NO	YES	Error	Rate
NO	3925.0	6621.0	0.6278	(6621.0/10546.0)
YES	1574.0	8952.0	0.1495	(1574.0/10526.0)
Total	5499.0	15573.0	0.3889	(8195.0/21072.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric	threshold	value	idx
max f1	0.5	0.7	226.0
max f2	0.0	0.8	399.0
max f0point5	0.8	0.6	124.0
max accuracy	0.7	0.6	140.0
max precision	0.9	0.7	28.0
max absolute_MCC	0.7	0.3	151.0
max min_per_class_accuracy	0.7	0.6	140.0

ModelMetricsBinomial: drf
** Reported on validation data. **

MSE: 0.249809873778
R^2: -0.00855364526058
LogLoss: 0.770654128805
AUC: 0.682375448104
Gini: 0.364750896207

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.56328826827:

	NO	YES	Error	Rate
NO	822.0	1426.0	0.6343	(1426.0/2248.0)
YES	367.0	2359.0	0.1346	(367.0/2726.0)
Total	1189.0	3785.0	0.3605	(1793.0/4974.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric	threshold	value	idx
max f1	0.6	0.7	261.0
max f2	0.1	0.9	399.0
max f0point5	0.7	0.7	179.0
max accuracy	0.6	0.6	235.0
max precision	1.0	0.8	6.0
max absolute_MCC	0.7	0.3	194.0
max min_per_class_accuracy	0.7	0.6	167.0

Scoring History:

timestamp	duration	number_of_trees	training_MSE	training_logloss	training_AUC	training_classification_error	validation_MSE	validation_logloss	validation_AUC	validation_classification_error
2015-10-22 17:22:59	0.093 sec	1.0	0.3	7.3	0.6	0.4	0.3	7.9	0.6	0.5
2015-10-22 17:22:59	0.152 sec	2.0	0.3	6.8	0.6	0.4	0.3	3.7	0.6	0.4
2015-10-22 17:22:59	0.210 sec	3.0	0.3	5.9	0.6	0.4	0.3	2.2	0.6	0.4
2015-10-22 17:22:59	0.287 sec	4.0	0.3	5.2	0.6	0.4	0.3	1.6	0.7	0.4
2015-10-22 17:22:59	0.377 sec	5.0	0.3	4.3	0.7	0.4	0.3	1.3	0.7	0.4
2015-10-22 17:22:59	0.469 sec	6.0	0.3	3.7	0.7	0.4	0.3	1.0	0.7	0.4
2015-10-22 17:22:59	0.571 sec	7.0	0.3	3.2	0.7	0.4	0.3	0.9	0.7	0.4
2015-10-22 17:22:59	0.678 sec	8.0	0.3	2.8	0.7	0.4	0.3	0.9	0.7	0.4
2015-10-22 17:22:59	0.784 sec	9.0	0.3	2.4	0.7	0.4	0.2	0.8	0.7	0.4
2015-10-22 17:22:59	0.894 sec	10.0	0.3	2.1	0.7	0.4	0.2	0.8	0.7	0.4

Variable Importances:

variable	relative_importance	scaled_importance	percentage
Origin	6811.1	1.0	0.3
fDayofMonth	6129.0	0.9	0.3
Dest	4860.0	0.7	0.2
UniqueCarrier	1824.5	0.3	0.1
fDayOfWeek	1634.1	0.2	0.1
Distance	1591.5	0.2	0.1
fMonth	129.6	0.0	0.0
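
With balance_classes=True, the training confusion matrix reports nearly equal NO and YES totals (10,546 and 10,526), compared with 8,728 and 10,516 without balancing. The following is a minimal sketch of one way to get such a distribution: random oversampling of the smaller class with replacement. It is not H2O's implementation, which may over- or under-sample differently; `oversample_minority` and the toy labels are hypothetical.

```python
import random
from collections import Counter

def oversample_minority(labels, seed=12):
    """Return row indices in which every class appears as often as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)
        # Draw extra rows with replacement until the class reaches the target size
        balanced.extend(rng.choice(rows) for _ in range(target - len(rows)))
    return balanced

# Toy labels with roughly the NO/YES proportions of the unbalanced training counts, scaled down
labels = ["NO"] * 87 + ["YES"] * 105
idx = oversample_minority(labels)
print(Counter(labels[i] for i in idx))
```

Only the training rows would be resampled in such a scheme; validation and test frames keep their original class proportions, as the unchanged 2,248 / 2,726 validation totals above suggest.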

In [8]:

air_test = h2o.import_file(path=_locate("smalldata/airlines/AirlinesTest.csv.zip"))

Parse Progress: [##################################################] 100%
Imported /Users/ludirehak/h2o-3/smalldata/airlines/AirlinesTest.csv.zip. Parsed 2,691 rows and 12 cols

In [9]:

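def model(model_object, test):
    # Predict on the test frame and preview the first rows
    pred = model_object.predict(test)
    pred.head()
    # Score the model on the test frame; show() prints the metrics and the confusion matrix
    perf = model_object.model_performance(test)
    perf.show()
    # Print individual metrics; precision() and accuracy() return [[threshold, value]]
    # at the threshold that maximizes each metric
    print(perf.confusion_matrix())
    print(perf.precision())
    print(perf.accuracy())
    print(perf.auc())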

In [10]:

print("\n\nWITHOUT CLASS BALANCING\n")
model(rf_no_bal, air_test)


WITHOUT CLASS BALANCING

H2OFrame with 2691 rows and 3 columns:

predict	YES	YES	YES	YES	YES	YES	NO	YES	YES	YES
NO	0.1	0.0	0.225	0.175	0.5	0.4	0.6	0.3	0.3	0.4
YES	0.9	1.0	0.775	0.825	0.5	0.6	0.4	0.7	0.7	0.6

ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.242134967995
R^2: 0.0225448334417
LogLoss: 0.818660036508
AUC: 0.705312795104
Gini: 0.410625590208

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.51742125228:

	NO	YES	Error	Rate
NO	377.0	840.0	0.6902	(840.0/1217.0)
YES	143.0	1331.0	0.097	(143.0/1474.0)
Total	520.0	2171.0	0.3653	(983.0/2691.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric	threshold	value	idx
max f1	0.5	0.7	276.0
max f2	0.2	0.9	381.0
max f0point5	0.7	0.7	174.0
max accuracy	0.7	0.7	186.0
max precision	1.0	0.9	7.0
max absolute_MCC	0.7	0.3	174.0
max min_per_class_accuracy	0.7	0.7	162.0

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.51742125228:

	NO	YES	Error	Rate
NO	377.0	840.0	0.6902	(840.0/1217.0)
YES	143.0	1331.0	0.097	(143.0/1474.0)
Total	520.0	2171.0	0.3653	(983.0/2691.0)

[[0.985450211376883, 0.8556701030927835]]
[[0.6939187561627477, 0.6651802303976218]]
0.705312795104

In [11]:

print("\n\nWITH CLASS BALANCING\n")
model(rf_bal, air_test)


WITH CLASS BALANCING

H2OFrame with 2691 rows and 3 columns:

predict	YES	YES	YES	YES	NO	NO	NO	YES	YES	NO
NO	0.0	0.3	0.1	0.0	0.4	0.5	0.7	0.1	0.3	0.5
YES	1.0	0.7	0.9	1.0	0.6	0.5	0.3	0.9	0.7	0.5

ModelMetricsBinomial: drf
** Reported on test data. **

MSE: 0.24831550935
R^2: -0.00240489657592
LogLoss: 0.758488823047
AUC: 0.693547371085
Gini: 0.38709474217

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.475092852495:

	NO	YES	Error	Rate
NO	269.0	948.0	0.779	(948.0/1217.0)
YES	85.0	1389.0	0.0577	(85.0/1474.0)
Total	354.0	2337.0	0.3839	(1033.0/2691.0)

Maximum Metrics: Maximum metrics at their respective thresholds

metric	threshold	value	idx
max f1	0.5	0.7	307.0
max f2	0.3	0.9	379.0
max f0point5	0.7	0.7	184.0
max accuracy	0.7	0.7	210.0
max precision	1.0	0.85	1.0
max absolute_MCC	0.7	0.3	210.0
max min_per_class_accuracy	0.7	0.6	164.0

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.475092852495:

	NO	YES	Error	Rate
NO	269.0	948.0	0.779	(948.0/1217.0)
YES	85.0	1389.0	0.0577	(85.0/1474.0)
Total	354.0	2337.0	0.3839	(1033.0/2691.0)

[[0.9962384300103982, 0.85]]
[[0.6673053431289202, 0.6540319583797845]]
0.693547371085
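
To read the two test reports side by side, the headline numbers printed above can be collected in a small table. This sketch only reuses values from those outputs; the dictionary layout and labels are assumptions made for illustration.

```python
# Test-set metrics copied from the two reports above
results = {
    "no balancing": {"AUC": 0.705312795104, "LogLoss": 0.818660036508, "max-F1 error": 0.3653},
    "balancing": {"AUC": 0.693547371085, "LogLoss": 0.758488823047, "max-F1 error": 0.3839},
}

metrics = ["AUC", "LogLoss", "max-F1 error"]
print("{:<14}".format("model") + "".join("{:>14}".format(m) for m in metrics))
for name, row in results.items():
    print("{:<14}".format(name) + "".join("{:>14.4f}".format(row[m]) for m in metrics))

# Higher AUC is better; lower LogLoss is better
best_auc = max(results, key=lambda k: results[k]["AUC"])
best_logloss = min(results, key=lambda k: results[k]["LogLoss"])
print("best AUC:", best_auc)
print("best LogLoss:", best_logloss)
```

On this run the unbalanced forest has the higher test AUC and the balanced forest has the lower LogLoss, so neither model is better on every metric.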