Multinomial Partial Dependence Plot

Authors: Lauren DiPerna, Veronika Maurerova

Build a GLM with the Iris Dataset

In [7]:

# Import the Iris Dataset and Build a GLM
import h2o
h2o.init()
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# import the iris dataset:
# this dataset is used to classify the type of iris plant
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Iris
# iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
iris = h2o.import_file("../../smalldata/iris/iris_wheader.csv")

# convert response column to a factor
iris['class'] = iris['class'].asfactor()

# set the predictor names and the response column name
predictors = iris.col_names[:-1]
response = 'class'

# split into train and validation
train, valid = iris.split_frame(ratios = [.8], seed=1234)

# build model
model = H2OGeneralizedLinearEstimator(family = 'multinomial')
model.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

Checking whether there is an H2O instance running at http://localhost:54321 . connected.

H2O_cluster_uptime:	38 mins 55 secs
H2O_cluster_timezone:	Europe/Berlin
H2O_data_parsing_timezone:	UTC
H2O_cluster_version:	3.30.0.99999
H2O_cluster_version_age:	20 hours and 1 minute
H2O_cluster_name:	mori
H2O_cluster_total_nodes:	1
H2O_cluster_free_memory:	4.916 Gb
H2O_cluster_total_cores:	8
H2O_cluster_allowed_cores:	8
H2O_cluster_status:	locked, healthy
H2O_connection_url:	http://localhost:54321
H2O_connection_proxy:	{"http": null, "https": null}
H2O_internal_security:	False
H2O_API_Extensions:	Algos, Core V3, Core V4
Python_version:	3.7.3 candidate

Parse progress: |█████████████████████████████████████████████████████████| 100%
glm Model Build progress: |███████████████████████████████████████████████| 100%

Specify Feature of Interest

If you decide to use a different dataset, model, or feature, update the following variables in the cell below:

model
data_pdp
col

In [8]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

# hide progress bar
h2o.no_progress()

# specify the model to use:
model = model

# specify the dataframe to use
data_pdp = iris

# specify the feature of interest, available features include: 
# ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
# col = "sepal_len"
# col = 'sepal_wid'
col = 'petal_len'
# col = 'petal_wid'

# create a copy of the column of interest, so that values are preserved after each run
col_data = data_pdp[col]

Generate a PDP per class manually

In [9]:

# get a list of the classes in your target
classes = h2o.as_list(data_pdp['class'].unique(), use_pandas=False,header=False)
classes = [class_val[0] for class_val in classes]


# create bins for the pdp plot
bins = data_pdp[col].quantile(prob=list(np.linspace(0.05,1,19)))[:,1].unique()
bins = bins.as_data_frame().values.tolist()
bins = [bin_val[0] for bin_val in bins]
bins.sort()


# Loop over each class and plot the PDP for the given feature
for class_val in classes:
    mean_responses = []

    for bin_val in bins:
        # Warning: this line overwrites the column in the dataset.
        # Before rerunning on a different column, make sure all columns
        # have been restored to their original values.
        data_pdp[col] = bin_val
        # use a new name so the `response` column name set earlier is not overwritten
        preds = model.predict(data_pdp)
        mean_response = preds[:,class_val].mean()[0]
        mean_responses.append(mean_response)

    pdp_manual = pd.DataFrame({col: bins, 'mean_response': mean_responses}, columns=[col, 'mean_response'])
    plt.plot(pdp_manual[col], pdp_manual.mean_response)
    plt.xlabel(col)
    plt.ylabel('mean_response')
    plt.title('PDP for Class {0}'.format(class_val))
    plt.show()


# reset col value to original value for future runs of this cell
data_pdp[col] = col_data
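
The manual loop above amounts to a simple procedure: for each grid value, overwrite the feature in every row, score the modified rows, and average the predicted probability of one class. The sketch below illustrates that idea in plain Python, without H2O. It assumes a hypothetical softmax model with made-up coefficients (`COEFS`) in place of the fitted GLM, and the helper names `predict_proba` and `partial_dependence` are illustrative only, not H2O functions. It works on a copy of each row, so no reset step is needed.

```python
import math

# Hypothetical toy model: fixed softmax coefficients, NOT the fitted H2O GLM.
# Each class maps to (intercept, [one weight per feature]).
COEFS = {
    "a": (0.0, [1.0, -1.0]),
    "b": (0.5, [0.0, 0.5]),
    "c": (-0.5, [-1.0, 1.0]),
}

def predict_proba(row):
    """Return class probabilities for one row via a numerically stable softmax."""
    scores = {
        cls: bias + sum(w * x for w, x in zip(weights, row))
        for cls, (bias, weights) in COEFS.items()
    }
    top = max(scores.values())
    exps = {cls: math.exp(s - top) for cls, s in scores.items()}
    total = sum(exps.values())
    return {cls: e / total for cls, e in exps.items()}

def partial_dependence(rows, col_idx, grid, target):
    """Mean predicted probability of `target` with column `col_idx` fixed to each grid value."""
    means = []
    for value in grid:
        probs = []
        for row in rows:
            modified = list(row)  # copy, so the original data is never mutated
            modified[col_idx] = value
            probs.append(predict_proba(modified)[target])
        means.append(sum(probs) / len(probs))
    return means

# Made-up data and grid, for illustration only
rows = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [-1.0, 0.5]]
grid = [-1.0, 0.0, 1.0, 2.0]
pdp = {cls: partial_dependence(rows, 0, grid, cls) for cls in COEFS}
```

Because each row's class probabilities sum to 1, the per-class PDP values at any grid point also sum to 1, which is a handy sanity check on the manual plots.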

Use the targets parameter to plot H2O multinomial PDPs

In [11]:

# h2o multinomial PDP class setosa
data = model.partial_plot(data=iris, cols=["petal_len"], plot_stddev=False, plot=True, targets=["Iris-setosa"])

In [12]:

# h2o multinomial PDP class versicolor
data = model.partial_plot(data=iris, cols=["petal_len"], plot_stddev=False, plot=True, targets=["Iris-versicolor"])

In [13]:

# h2o multinomial PDP class virginica
data = model.partial_plot(data=iris, cols=["petal_len"], plot_stddev=False, plot=True, targets=["Iris-virginica"])

In [14]:

# h2o multinomial PDP all classes
data = model.partial_plot(data=iris, cols=["petal_len"], plot_stddev=False, plot=True, targets=["Iris-setosa", "Iris-versicolor", "Iris-virginica"])

In [16]:

# h2o multinomial PDP all classes with stddev
data = model.partial_plot(data=iris, cols=["petal_len"], plot_stddev=True, plot=True, targets=["Iris-setosa", "Iris-versicolor", "Iris-virginica"])
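
With `plot_stddev=True`, the plot adds a band around the mean response. One plausible reading, assumed here rather than taken from the H2O source, is that the band reflects the standard deviation of the per-row predictions at each grid value. The sketch below shows that calculation in plain Python. The predictor `toy_probability`, the helper `pdp_with_spread`, and the data are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is likewise an assumption.

```python
import math
import statistics

def toy_probability(row):
    """Hypothetical stand-in for a model's predicted probability of one class."""
    return 1.0 / (1.0 + math.exp(-(row[0] - row[1])))

def pdp_with_spread(rows, col_idx, grid):
    """For each grid value, return (value, mean, stddev) of the per-row predictions."""
    table = []
    for value in grid:
        # Build modified copies of each row with the feature fixed to `value`
        preds = [
            toy_probability(row[:col_idx] + [value] + row[col_idx + 1:])
            for row in rows
        ]
        table.append((value, statistics.mean(preds), statistics.pstdev(preds)))
    return table

# Made-up data and grid, for illustration only
rows = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [-1.0, 0.5]]
table = pdp_with_spread(rows, 0, [-1.0, 0.0, 1.0, 2.0])

for value, mean, std in table:
    print(f"{value:5.1f}  mean={mean:.3f}  band=[{mean - std:.3f}, {mean + std:.3f}]")
```

A wide band at a grid value means the other features still move the prediction a lot at that point, so the mean curve alone hides part of the picture.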