We will demonstrate how to use the contamination parameter to get predicted labels out of an Isolation Forest model. We will also experiment with a new feature added to the algorithm: using a validation frame to check the quality of the model and to do early stopping.
Using a validation frame is still an experimental feature in active development. Please send us your feedback and we will try to incorporate your comments!
Please also be careful about drawing conclusions from experiments with this notebook. The notebook is intentionally not seeded, and we encourage you to rerun the experiments several times to see how different random inputs change the performance of the model.
import sys
import h2o
from h2o.frame import H2OFrame
import numpy as np
import pandas as pd
h2o.init(strict_version_check=False)
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: | 52 mins 37 secs |
H2O_cluster_timezone: | America/New_York |
H2O_data_parsing_timezone: | UTC |
H2O_cluster_version: | 3.31.0.99999 |
H2O_cluster_version_age: | 53 minutes |
H2O_cluster_name: | mkurka |
H2O_cluster_total_nodes: | 1 |
H2O_cluster_free_memory: | 3.276 Gb |
H2O_cluster_total_cores: | 8 |
H2O_cluster_allowed_cores: | 8 |
H2O_cluster_status: | locked, healthy |
H2O_connection_url: | http://localhost:54321 |
H2O_connection_proxy: | {"http": null, "https": null} |
H2O_internal_security: | False |
H2O_API_Extensions: | Algos, AutoML, Core V3, TargetEncoder, Core V4 |
Python_version: | 2.7.14 final |
N = 1000
cont = 0.05 # ratio of outliers/anomalies
regular_data = np.random.normal(0, 0.5, (int(N*(1-cont)), 2))
anomaly_data = np.column_stack((np.random.normal(-1.5, 1, int(N*cont)), np.random.normal(1.5, 1, int(N*cont))))
import matplotlib.pyplot as plt
plt.scatter(anomaly_data[:,0], anomaly_data[:,1])
plt.scatter(regular_data[:,0], regular_data[:,1])
plt.show()
regular_pd = pd.DataFrame({'x': regular_data[:, 0], 'y': regular_data[:, 1], 'label': np.zeros(regular_data.shape[0])})
anomaly_pd = pd.DataFrame({'x': anomaly_data[:, 0], 'y': anomaly_data[:, 1], 'label': np.ones(anomaly_data.shape[0])})
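# pd.concat replaces DataFrame.append, which was removed in pandas 2.0; sample(frac=1) shuffles the rows
dataset = H2OFrame(pd.concat([regular_pd, anomaly_pd]).sample(frac=1))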
Parse progress: |█████████████████████████████████████████████████████████| 100%
train_with_label, test = dataset.split_frame([0.8])
train_with_label["label"].table()
label | Count |
---|---|
0 | 756 |
1 | 39 |
test["label"].table()
label | Count |
---|---|
0 | 194 |
1 | 11 |
train = train_with_label.drop(["label"])
test["label"] = test["label"].asfactor()
from h2o.estimators.isolation_forest import H2OIsolationForestEstimator
from h2o.model.metrics_base import H2OAnomalyDetectionModelMetrics, H2OBinomialModelMetrics
We will use a validation frame and enable early stopping. The observations in the validation frame are labeled: anomalies/outliers are marked with label "1" and regular observations with "0". This lets the model calculate binomial classification metrics on the validation frame and stop early based on the performance observed on the validation data.
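To make the early-stopping idea more concrete, here is a rough sketch of a moving-average stopping rule applied to a list of validation metric values, one per scoring event. The helper name `should_stop`, the example metric values, and the rule itself are assumptions made for illustration; H2O's actual stopping logic may differ in its details.

```python
def should_stop(history, stopping_rounds, tolerance=0.0):
    """Return True when the metric (lower is better) has stopped improving.

    Hypothetical helper: compares the mean of the last `stopping_rounds`
    values with the mean of the `stopping_rounds` values before them.
    """
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # not enough scoring events yet
    recent = sum(history[-k:]) / float(k)
    previous = sum(history[-2 * k:-k]) / float(k)
    # Stop unless the recent average improved by more than the tolerance.
    return recent >= previous * (1.0 - tolerance)


# Made-up metric values, one per scoring event (e.g. every 7 trees).
improving = [0.30, 0.26, 0.22, 0.19, 0.17, 0.15]
flat = [0.30, 0.16, 0.16, 0.16, 0.16, 0.17, 0.16]
print(should_stop(improving, stopping_rounds=3))
print(should_stop(flat, stopping_rounds=3))
```

With a rule like this, a larger `stopping_rounds` smooths out noisy scoring events but needs more trees before a decision can be made.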
if_model = H2OIsolationForestEstimator(seed=12, ntrees=200,
                                       score_tree_interval=7, stopping_rounds=3,
                                       stopping_metric="mean_per_class_error",
                                       validation_response_column="label")
if_model.train(training_frame=train, validation_frame=test)
isolationforest Model Build progress: |███████████████████████████████████| 100%
The trained model will have different kinds of metrics for the training and validation frames. For the training frame, where we don't have labeled data, anomaly metrics will be returned. For the validation frame we will see binomial model metrics.
if_model
Model Details
=============
H2OIsolationForestEstimator : Isolation Forest
Model Key: IsolationForest_model_python_1593529508000_107
Model Summary:
| | number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
---|---|---|---|---|---|---|---|---|---|
0 | 56.0 | 56.0 | 39085.0 | 8.0 | 8.0 | 8.0 | 25.0 | 78.0 | 50.767857 |
ModelMetricsAnomaly: isolationforest
** Reported on train data. **
Anomaly Score: 6.77065108782
Normalized Anomaly Score: 0.0467037784801
ModelMetricsBinomial: isolationforest
** Reported on validation data. **
MSE: 0.0547974682524
RMSE: 0.234088590607
LogLoss: 0.188428153792
Mean Per-Class Error: 0.157919400187
AUC: 0.832474226804
AUCPR: 0.256449326499
Gini: 0.664948453608
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.145454545455:
| | | 0 | 1 | Error | Rate |
---|---|---|---|---|---|
0 | 0 | 184.0 | 10.0 | 0.0515 | (10.0/194.0) |
1 | 1 | 4.0 | 7.0 | 0.3636 | (4.0/11.0) |
2 | Total | 188.0 | 17.0 | 0.0683 | (14.0/205.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
| | metric | threshold | value | idx |
---|---|---|---|---|
0 | max f1 | 0.145455 | 0.500000 | 13.0 |
1 | max f2 | 0.145455 | 0.573770 | 13.0 |
2 | max f0point5 | 0.167273 | 0.476190 | 10.0 |
3 | max accuracy | 0.778182 | 0.941463 | 0.0 |
4 | max precision | 0.167273 | 0.461538 | 10.0 |
5 | max recall | 0.003636 | 1.000000 | 46.0 |
6 | max specificity | 0.778182 | 0.994845 | 0.0 |
7 | max absolute_mcc | 0.145455 | 0.477875 | 13.0 |
8 | max min_per_class_accuracy | 0.080000 | 0.818182 | 25.0 |
9 | max mean_per_class_accuracy | 0.080000 | 0.842081 | 25.0 |
10 | max tns | 0.778182 | 193.000000 | 0.0 |
11 | max fns | 0.778182 | 11.000000 | 0.0 |
12 | max fps | 0.000000 | 194.000000 | 47.0 |
13 | max tps | 0.003636 | 11.000000 | 46.0 |
14 | max tnr | 0.778182 | 0.994845 | 0.0 |
15 | max fnr | 0.778182 | 1.000000 | 0.0 |
16 | max fpr | 0.000000 | 1.000000 | 47.0 |
17 | max tpr | 0.003636 | 1.000000 | 46.0 |
Scoring History:
| | timestamp | duration | number_of_trees | mean_tree_path_length | mean_anomaly_score |
---|---|---|---|---|---|
0 | 2020-06-30 11:57:47 | 0.001 sec | 0.0 | NaN | NaN |
1 | 2020-06-30 11:57:47 | 0.031 sec | 7.0 | 6.777928 | 0.044414 |
2 | 2020-06-30 11:57:47 | 0.062 sec | 14.0 | 6.764469 | 0.048492 |
3 | 2020-06-30 11:57:47 | 0.097 sec | 21.0 | 6.769009 | 0.047557 |
4 | 2020-06-30 11:57:47 | 0.124 sec | 28.0 | 6.763426 | 0.049805 |
5 | 2020-06-30 11:57:47 | 0.152 sec | 35.0 | 6.770500 | 0.045900 |
6 | 2020-06-30 11:57:47 | 0.179 sec | 42.0 | 6.772585 | 0.045267 |
7 | 2020-06-30 11:57:47 | 0.211 sec | 49.0 | 6.770547 | 0.046268 |
8 | 2020-06-30 11:57:47 | 0.229 sec | 56.0 | 6.770651 | 0.046704 |
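To see how the binomial metrics relate to the validation confusion matrix printed above, we can recompute a few of them by hand. This is a small sketch using only the four counts from that matrix; the variable names are ours.

```python
# Counts copied from the validation confusion matrix above.
tn, fp = 184.0, 10.0   # actual 0: predicted 0, predicted 1
fn, tp = 4.0, 7.0      # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Per-class error rates, as shown in the "Error" column.
error_0 = fp / (tn + fp)
error_1 = fn / (fn + tp)
mean_per_class_error_at_f1 = (error_0 + error_1) / 2

print(round(f1, 6))
print(round(error_0, 4), round(error_1, 4))
print(round(mean_per_class_error_at_f1, 4))
```

The F1 value agrees with the "max f1" row of the Maximum Metrics table. The mean per-class error at this threshold is higher than the reported Mean Per-Class Error of 0.1579, which appears to correspond to a different threshold: it equals 1 minus the "max mean_per_class_accuracy" value of 0.842081.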
predicted = if_model.predict(train)
predicted.head()
isolationforest prediction progress: |████████████████████████████████████| 100%
predict | score | mean_length |
---|---|---|
0 | 0.123636 | 6.39286 |
0 | 0 | 7 |
0 | 0.0145455 | 6.92857 |
1 | 0.421818 | 4.92857 |
0 | 0.0509091 | 6.75 |
0 | 0.0181818 | 6.91071 |
0 | 0 | 7 |
0 | 0.00727273 | 6.96429 |
0 | 0.134545 | 6.33929 |
0 | 0 | 7 |
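The `score` and `mean_length` columns above look like they are related by a min-max normalization of the mean path length: the shorter the path, the higher the score. The sketch below checks this on a few rows of the table. The bounds `max_length` and `min_length` are assumptions inferred by fitting the rows shown here; they are not values reported by the model.

```python
import numpy as np

# mean_length and score values copied from the prediction table above.
mean_length = np.array([6.39286, 7.0, 6.92857, 4.92857, 6.75])
reported_score = np.array([0.123636, 0.0, 0.0145455, 0.421818, 0.0509091])

# Assumed normalization bounds (inferred from the table, not reported by the model).
max_length = 7.0
min_length = 2.0893

# Shorter mean path length -> score closer to 1.
score = (max_length - mean_length) / (max_length - min_length)

print(np.round(score, 6))
print(np.allclose(score, reported_score, atol=1e-5))
```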
The output includes the predicted class of each observation: regular (0) or anomaly (1). This is made possible by the labeled validation frame. In the current implementation we pick the threshold that maximizes the F1 score.
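As an illustration of picking a threshold by F1, here is a minimal NumPy sketch that scans candidate thresholds over a set of scores and keeps the one with the highest F1. The function `best_f1_threshold` and the toy data are hypothetical; this only shows the idea and is not H2O's implementation.

```python
import numpy as np


def best_f1_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1 when predicting 1 for score >= threshold."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):  # every observed score is a candidate threshold
        predicted = scores >= t
        tp = np.sum(predicted & (labels == 1))
        fp = np.sum(predicted & (labels == 0))
        fn = np.sum(~predicted & (labels == 1))
        denom = 2 * tp + fp + fn
        f1 = 2.0 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Toy scores and labels (made up for illustration).
scores = [0.02, 0.05, 0.10, 0.15, 0.40, 0.45, 0.60]
labels = [0, 0, 0, 1, 0, 1, 1]
threshold, f1 = best_f1_threshold(scores, labels)
print(threshold, f1)
```

Because F1 balances precision and recall on the anomaly class, the chosen threshold depends on which labeled anomalies happen to be in the validation frame, which is one more reason to rerun the experiment several times.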
predicted_train_labels = predicted["predict"].as_data_frame(use_pandas=True)
train_pd = train.as_data_frame(use_pandas=True)
plt.scatter(train_pd["x"], train_pd["y"], c=predicted_train_labels["predict"])
plt.show()
if_model.model_performance(train_with_label)
ModelMetricsBinomial: isolationforest
** Reported on test data. **
MSE: 0.0436448796715
RMSE: 0.20891356986
LogLoss: 0.138389352082
Mean Per-Class Error: 0.105667480667
AUC: 0.936440103107
AUCPR: 0.348593089161
Gini: 0.872880206214
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.0909090909091:
| | | 0 | 1 | Error | Rate |
---|---|---|---|---|---|
0 | 0 | 692.0 | 64.0 | 0.0847 | (64.0/756.0) |
1 | 1 | 8.0 | 31.0 | 0.2051 | (8.0/39.0) |
2 | Total | 700.0 | 95.0 | 0.0906 | (72.0/795.0) |
Maximum Metrics: Maximum metrics at their respective thresholds
| | metric | threshold | value | idx |
---|---|---|---|---|
0 | max f1 | 0.090909 | 0.462687 | 53.0 |
1 | max f2 | 0.076364 | 0.629771 | 56.0 |
2 | max f0point5 | 0.138182 | 0.406360 | 41.0 |
3 | max accuracy | 0.927273 | 0.953459 | 1.0 |
4 | max precision | 1.000000 | 1.000000 | 0.0 |
5 | max recall | 0.018182 | 1.000000 | 72.0 |
6 | max specificity | 1.000000 | 1.000000 | 0.0 |
7 | max absolute_mcc | 0.076364 | 0.476273 | 56.0 |
8 | max min_per_class_accuracy | 0.065455 | 0.871795 | 59.0 |
9 | max mean_per_class_accuracy | 0.043636 | 0.894333 | 65.0 |
10 | max tns | 1.000000 | 756.000000 | 0.0 |
11 | max fns | 1.000000 | 38.000000 | 0.0 |
12 | max fps | 0.000000 | 756.000000 | 77.0 |
13 | max tps | 0.018182 | 39.000000 | 72.0 |
14 | max tnr | 1.000000 | 1.000000 | 0.0 |
15 | max fnr | 1.000000 | 0.974359 | 0.0 |
16 | max fpr | 0.000000 | 1.000000 | 77.0 |
17 | max tpr | 0.018182 | 1.000000 | 72.0 |
if_model_cont = H2OIsolationForestEstimator(seed=12, contamination=cont)
if_model_cont.train(training_frame=train)
if_model_cont
/Users/mkurka/git/h2o/h2o-3/h2o-py/h2o/estimators/estimator_base.py:200: RuntimeWarning: Stopping tolerance is ignored for _stopping_rounds=0. warnings.warn(mesg["message"], RuntimeWarning)
isolationforest Model Build progress: |███████████████████████████████████| 100%
Model Details
=============
H2OIsolationForestEstimator : Isolation Forest
Model Key: IsolationForest_model_python_1593529508000_108
Model Summary:
| | number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
---|---|---|---|---|---|---|---|---|---|
0 | 50.0 | 50.0 | 34937.0 | 8.0 | 8.0 | 8.0 | 25.0 | 78.0 | 50.84 |
ModelMetricsAnomaly: isolationforest
** Reported on train data. **
Anomaly Score: 6.77064148276
Normalized Anomaly Score: 0.0458717034484
Scoring History:
| | timestamp | duration | number_of_trees | mean_tree_path_length | mean_anomaly_score |
---|---|---|---|---|---|
0 | 2020-06-30 11:57:48 | 0.003 sec | 0.0 | NaN | NaN |
1 | 2020-06-30 11:57:48 | 0.012 sec | 1.0 | 6.864504 | 0.019357 |
2 | 2020-06-30 11:57:48 | 0.034 sec | 2.0 | 6.851371 | 0.027023 |
3 | 2020-06-30 11:57:48 | 0.057 sec | 3.0 | 6.820479 | 0.029920 |
4 | 2020-06-30 11:57:48 | 0.080 sec | 4.0 | 6.805021 | 0.038996 |
5 | 2020-06-30 11:57:48 | 0.092 sec | 5.0 | 6.807712 | 0.035609 |
6 | 2020-06-30 11:57:48 | 0.100 sec | 6.0 | 6.783165 | 0.043367 |
7 | 2020-06-30 11:57:48 | 0.108 sec | 7.0 | 6.777928 | 0.044414 |
8 | 2020-06-30 11:57:48 | 0.115 sec | 8.0 | 6.776496 | 0.048325 |
9 | 2020-06-30 11:57:48 | 0.127 sec | 9.0 | 6.782595 | 0.044469 |
10 | 2020-06-30 11:57:48 | 0.142 sec | 10.0 | 6.783853 | 0.043229 |
11 | 2020-06-30 11:57:48 | 0.151 sec | 11.0 | 6.783153 | 0.045006 |
12 | 2020-06-30 11:57:48 | 0.161 sec | 12.0 | 6.785637 | 0.042873 |
13 | 2020-06-30 11:57:48 | 0.171 sec | 13.0 | 6.773668 | 0.046703 |
14 | 2020-06-30 11:57:48 | 0.180 sec | 14.0 | 6.764469 | 0.048492 |
15 | 2020-06-30 11:57:48 | 0.188 sec | 15.0 | 6.766426 | 0.046715 |
16 | 2020-06-30 11:57:48 | 0.197 sec | 16.0 | 6.769173 | 0.045039 |
17 | 2020-06-30 11:57:48 | 0.204 sec | 17.0 | 6.763707 | 0.048397 |
18 | 2020-06-30 11:57:48 | 0.212 sec | 18.0 | 6.768759 | 0.046248 |
19 | 2020-06-30 11:57:48 | 0.221 sec | 19.0 | 6.769960 | 0.048030 |
See the whole table with table.as_data_frame()
predicted_train_labels_cont = if_model_cont.predict(train)["predict"].as_data_frame(use_pandas=True)
isolationforest prediction progress: |████████████████████████████████████| 100%
plt.scatter(train_pd["x"], train_pd["y"], c=predicted_train_labels_cont["predict"])
plt.show()
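One way to think about how a contamination ratio can turn anomaly scores into labels without any labeled data is a quantile threshold: flag the top `contamination` fraction of scores as anomalies. The sketch below illustrates that idea on stand-in scores. It is an assumption about the mechanism made for illustration, not a description of H2O's implementation, and the uniform random scores are made up.

```python
import numpy as np

rng = np.random.RandomState(0)  # seeded only so this sketch is repeatable
contamination = 0.05

# Stand-in anomaly scores; in the notebook these would come from predict().
scores = rng.uniform(0.0, 1.0, size=1000)

# Flag roughly the top `contamination` fraction of scores as anomalies.
threshold = np.percentile(scores, 100 * (1 - contamination))
predicted = (scores > threshold).astype(int)

print(threshold)
print(predicted.mean())
```

Unlike the F1-based threshold, this approach needs only an estimate of the share of anomalies, so the quality of the labels depends on how well `contamination` matches the data.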