Author: Kevin Yang
Contact: kyang@h2o.ai
This tutorial replicates Erin LeDell's H2O demo on the EEG Eye State dataset using Scikit-Learn and Pandas, and is intended to compare the syntax and performance of the sklearn and H2O implementations of Gradient Boosting Machines.
We'll be using Pandas, NumPy and the collections module for most of the data exploration.
import pandas as pd
import numpy as np
from collections import Counter
The following code downloads a copy of the EEG Eye State dataset. All data is from one continuous EEG measurement with the Emotiv EEG Neuroheadset. The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added to the file manually after analyzing the video frames: '1' indicates the eye-closed state and '0' the eye-open state. All values are in chronological order, with the first measured value at the top of the file.
Let's import the dataset directly with pandas:
csv_url = "http://www.stat.berkeley.edu/~ledell/data/eeg_eyestate_splits.csv"
data = pd.read_csv(csv_url)
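If the download is slow or you want to work offline, one option (a minimal sketch; the local filename is illustrative) is to cache a copy on disk first:
import os
import urllib.request

local_path = "eeg_eyestate_splits.csv"  # illustrative local filename
if not os.path.exists(local_path):
    urllib.request.urlretrieve(csv_url, local_path)  # download once
data = pd.read_csv(local_path)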
Once we have loaded the data, let's take a quick look. First the dimension of the frame:
data.shape
(14980, 16)
Now let's take a look at the top of the frame:
data.head()
 | AF3 | F7 | F3 | FC5 | T7 | P7 | O1 | O2 | P8 | T8 | FC6 | F4 | F8 | AF4 | eyeDetection | split |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 4329.23 | 4009.23 | 4289.23 | 4148.21 | 4350.26 | 4586.15 | 4096.92 | 4641.03 | 4222.05 | 4238.46 | 4211.28 | 4280.51 | 4635.90 | 4393.85 | 0 | valid |
1 | 4324.62 | 4004.62 | 4293.85 | 4148.72 | 4342.05 | 4586.67 | 4097.44 | 4638.97 | 4210.77 | 4226.67 | 4207.69 | 4279.49 | 4632.82 | 4384.10 | 0 | test |
2 | 4327.69 | 4006.67 | 4295.38 | 4156.41 | 4336.92 | 4583.59 | 4096.92 | 4630.26 | 4207.69 | 4222.05 | 4206.67 | 4282.05 | 4628.72 | 4389.23 | 0 | train |
3 | 4328.72 | 4011.79 | 4296.41 | 4155.90 | 4343.59 | 4582.56 | 4097.44 | 4630.77 | 4217.44 | 4235.38 | 4210.77 | 4287.69 | 4632.31 | 4396.41 | 0 | train |
4 | 4326.15 | 4011.79 | 4292.31 | 4151.28 | 4347.69 | 4586.67 | 4095.90 | 4627.69 | 4210.77 | 4244.10 | 4212.82 | 4288.21 | 4632.82 | 4398.46 | 0 | train |
The last two columns are the response ('eyeDetection') and a 'split' column indicating which partition (train/valid/test) each row belongs to; the remaining 14 columns are the EEG sensor measurements. Let's take a look at the column names.
data.columns.tolist()
['AF3', 'F7', 'F3', 'FC5', 'T7', 'P7', 'O1', 'O2', 'P8', 'T8', 'FC6', 'F4', 'F8', 'AF4', 'eyeDetection', 'split']
To select a subset of the columns to look at, typical Pandas indexing applies:
columns = ['AF3', 'eyeDetection', 'split']
data[columns].head(10)
 | AF3 | eyeDetection | split |
---|---|---|---
0 | 4329.23 | 0 | valid |
1 | 4324.62 | 0 | test |
2 | 4327.69 | 0 | train |
3 | 4328.72 | 0 | train |
4 | 4326.15 | 0 | train |
5 | 4321.03 | 0 | train |
6 | 4319.49 | 0 | test |
7 | 4325.64 | 0 | test |
8 | 4326.15 | 0 | test |
9 | 4326.15 | 0 | train |
Now let's select a single column -- for example, the response column -- and look at the data more closely:
data['eyeDetection'].head()
0    0
1    0
2    0
3    0
4    0
Name: eyeDetection, dtype: int64
It looks like a binary response, but let's validate that assumption:
data['eyeDetection'].unique()
array([0, 1])
We can query the number of unique levels as well:
data['eyeDetection'].nunique()
2
Since "diagnosis" column is the response we would like to predict, we may want to check if there are any missing values, so let's look for NAs. To figure out which, if any, values are missing, we can use the isna
method on the diagnosis column. The columns in an H2O Frame are also H2O Frames themselves, so all the methods that apply to a Frame also apply to a single column.
data.isnull()
 | AF3 | F7 | F3 | FC5 | T7 | P7 | O1 | O2 | P8 | T8 | FC6 | F4 | F8 | AF4 | eyeDetection | split |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14978 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
14979 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
14980 rows × 16 columns
data['eyeDetection'].isnull()
0        False
1        False
2        False
3        False
4        False
         ...
14975    False
14976    False
14977    False
14978    False
14979    False
Name: eyeDetection, dtype: bool
The isnull
method doesn't directly answer the question, "Does the eyeDetection column contain any NAs?"; rather, it returns False where a cell is not missing and True where it is. Since False counts as 0 and True as 1, if there are no missing values, then summing over the whole column should produce a sum equal to 0. Let's take a look:
data['eyeDetection'].isnull().sum()
0
Great, no missing labels.
Out of curiosity, let's see if there is any missing data in this frame:
data.isnull().sum()
AF3             0
F7              0
F3              0
FC5             0
T7              0
P7              0
O1              0
O2              0
P8              0
T8              0
FC6             0
F4              0
F8              0
AF4             0
eyeDetection    0
split           0
dtype: int64
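As a one-line sanity check of the same thing, we can ask whether any value anywhere in the frame is missing (a small sketch):
data.isnull().values.any()  # -> False here, since every per-column count above is 0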
The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called a class "imbalance" problem, where one of the classes has far fewer training examples than the other. Let's take a look at the distribution numerically.
Counter(data['eyeDetection'])
Counter({0: 8257, 1: 6723})
Ok, the data is not exactly evenly distributed between the two classes -- there are more 0's than 1's in the dataset. However, this level of imbalance shouldn't be much of an issue for the machine learning algos. (We will revisit this later in the modeling section below).
Let's calculate the percentage that each class represents:
n = data.shape[0]  # total number of rows
np.array(list(Counter(data['eyeDetection']).values()))/float(n)
array([ 0.5512016, 0.4487984])
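The same proportions can be computed with a more idiomatic pandas one-liner (a sketch of the same computation):
data['eyeDetection'].value_counts(normalize=True)  # fraction of 0s and 1s, matching the array above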
So far we have explored the original dataset (all rows). For the machine learning portion of this tutorial, we will break the dataset into three parts: a training set, validation set and a test set.
If you want scikit-learn to create random splits for you, you can use the train_test_split
helper, sketched below for reference. However, we have explicit splits that we want (for reproducibility reasons), so we can just subset the DataFrame on the 'split' column to get the partitions we want.
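A minimal sketch of the random alternative (the proportions, random_state, and variable names are illustrative; we don't use these splits below):
from sklearn.model_selection import train_test_split

rest, test_alt = train_test_split(data, test_size=0.2, random_state=42)     # 80/20 split
train_alt, valid_alt = train_test_split(rest, test_size=0.25, random_state=42)  # 60/20/20 overall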
train = data[data['split']=="train"]
train.shape
(8988, 16)
valid = data[data['split']=="valid"]
valid.shape
(2996, 16)
test = data[data['split']=="test"]
test.shape
(2996, 16)
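Before modeling, a quick sanity check (a small sketch) confirms that the three partitions cover the whole frame:
assert train.shape[0] + valid.shape[0] + test.shape[0] == data.shape[0]  # 8988 + 2996 + 2996 == 14980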
We will now do a quick demo of scikit-learn's GBM -- trying to predict eye state (open/closed) from the EEG data.
The response, y
, is the 'eyeDetection' column, and the predictors, x
, are all the columns aside from 'eyeDetection' and 'split'.
y = 'eyeDetection'
x = data.columns.drop(['eyeDetection','split'])
from sklearn.ensemble import GradientBoostingClassifier
import sklearn
model = GradientBoostingClassifier(n_estimators=100,
max_depth=4,
learning_rate=0.1)
X = train[x].reset_index(drop=True)
y = train[y].reset_index(drop=True)  # note: this rebinds y from the column name to the label Series
model.fit(X, y)
print(model)
GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance', max_depth=4, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False)
model.get_params()
{'init': None, 'learning_rate': 0.1, 'loss': 'deviance', 'max_depth': 4, 'max_features': None, 'max_leaf_nodes': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'presort': 'auto', 'random_state': None, 'subsample': 1.0, 'verbose': 0, 'warm_start': False}
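get_params has a companion, set_params, the standard scikit-learn way to change hyperparameters; combined with sklearn.base.clone it yields a fresh, unfitted copy with tweaked settings (a sketch; the values and the model2 name are illustrative):
from sklearn.base import clone

model2 = clone(model).set_params(n_estimators=200, learning_rate=0.05)  # unfitted copy with new settings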
from sklearn.metrics import r2_score, roc_auc_score, mean_squared_error
y_pred = model.predict(X)
r2_score(y, y_pred)
0.54512915254897387
roc_auc_score(y, y_pred)
0.89097094432760837
mean_squared_error(y, y_pred)
0.11103693813974187
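Since AUC measures how well the model ranks positive examples above negative ones, it is usually computed on predicted probabilities rather than hard class labels. A minimal sketch of the probability-based version (its value will differ from the label-based number above):
y_proba = model.predict_proba(X)[:, 1]  # predicted probability of class 1 (eye closed)
roc_auc_score(y, y_proba)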
from sklearn.model_selection import cross_val_score
cross_val_score(model, X, y, scoring='roc_auc', cv=5)
array([ 0.54945509, 0.55455629, 0.32538286, 0.38222385, 0.42590001])
cross_val_score(model, valid[x].reset_index(drop=True), valid['eyeDetection'].reset_index(drop=True), scoring='roc_auc', cv=5)
array([ 0.64409495, 0.55143686, 0.30297715, 0.36688253, 0.40355729])
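Note that cross_val_score refits a fresh copy of the model on each fold of whatever data it is given, so running it on the validation set above does not score the model we already trained; the low fold scores are also at least partly an artifact of the rows being in chronological order, so the unshuffled folds cover different time segments. A more direct check (a sketch reusing the fitted model from above; variable names are illustrative) is to score the held-out validation and test rows:
X_valid = valid[x].reset_index(drop=True)
y_valid = valid['eyeDetection'].reset_index(drop=True)
roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])  # validation AUC

X_test = test[x].reset_index(drop=True)
y_test = test['eyeDetection'].reset_index(drop=True)
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])    # test AUC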