#!/usr/bin/env python
# coding: utf-8
# # Model Interpretability on Random Forest using SHAP
# ## Table of Contents
#
# 1. [Problem Statement](#section1)
# 2. [Importing Packages](#section2)
# 3. [Loading Data](#section3)
# - 3.1 [Description of the Dataset](#section301)
# 4. [Data train/test split](#section4)
# 5. [Random Forest Model](#section5)
# - 5.1 [Random Forest in scikit-learn](#section501)
# - 5.2 [Feature Importances](#section502)
# - 5.3 [Using the Model for Prediction](#section503)
# 6. [Model Evaluation](#section6)
# - 6.1 [R-Squared Value](#section601)
# 7. [Model Interpretability using SHAP](#section7)
# - 7.1 [Explain predictions](#section701)
# - 7.2 [Visualize a single prediction](#section702)
# - 7.3 [Visualize many predictions](#section703)
# - 7.4 [SHAP Dependence Plots](#section704)
# - 7.5 [SHAP Summary Plot](#section705)
# - 7.6 [Bar chart of Mean Importance](#section706)
#
# ## 1. Problem Statement
#
#
# - We have often found that **Machine Learning (ML)** algorithms capable of capturing **structural non-linearities** in training data - models that are sometimes referred to as **'black box' (e.g. Random Forests, Deep Neural Networks, etc.)** - perform far **better at prediction** than their **linear counterparts (e.g. Generalized Linear Models)**.
#
#
# - They are, however, much **harder to interpret** - in fact, quite often it is **not possible to gain any insight into why a particular prediction has been produced**, when given an **instance of input data (i.e. the model features)**.
#
#
# - Consequently, it has **not been possible to use 'black box' ML algorithms** in situations where clients seek **cause-and-effect explanations for model predictions**; instead, sub-optimal but interpretable predictive models have been used in their place, because their explanatory power is relatively more valuable.
#
#
# - The **problem with model explainability** is that it is **very hard to define a model's decision boundary in a human-understandable manner**.
#
#
# - **SHAP (SHapley Additive exPlanations)** is a unified approach to **explain the output of any machine learning model**.
#
#
# - We will use **SHAP** to **interpret** our **RandomForest model**.
# ---
#
# ## 2. Importing Packages
# In[ ]:
# Install SHAP using the following command.
get_ipython().system('pip install shap')
# In[1]:
import numpy as np
np.set_printoptions(precision=4) # To display values only up to four decimal places.
import pandas as pd
pd.set_option('mode.chained_assignment', None) # To suppress pandas warnings.
pd.set_option('display.max_colwidth', None) # To display all the data in the columns (-1 is deprecated in newer pandas; use None).
pd.options.display.max_columns = 40 # To display all the columns. (Set the value to a high number)
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid') # Seaborn whitegrid style for plots (named 'seaborn-v0_8-whitegrid' on matplotlib >= 3.6).
plt.rc('figure', figsize=(10, 8)) # Set the default figure size of plots.
get_ipython().run_line_magic('matplotlib', 'inline')
import warnings
warnings.filterwarnings('ignore') # To suppress all the warnings in the notebook.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
# ---
#
# ## 3. Loading Data
# In[2]:
df = pd.read_csv('../../data/Boston.csv')
df.head()
#
# ### 3.1 Description of the Dataset
# - This dataset contains information on **Housing Values in Suburbs of Boston**.
#
#
# - The column **medv** is the **target variable**. It is the **median** value of **owner-occupied homes in $1000s**.
#
# | Column Name | Description |
# | ---------------------------------|:----------------------------------------------------------------------------------------:|
# | crim | Per capita crime rate by town. |
# | zn | Proportion of residential land zoned for lots over 25,000 sq.ft. |
# | indus | Proportion of non-retail business acres per town. |
# | chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
# | nox | Nitrogen oxides concentration (parts per 10 million). |
# | rm | Average number of rooms per dwelling. |
# | age | Proportion of owner-occupied units built prior to 1940. |
# | dis | Weighted mean of distances to five Boston employment centres. |
# | rad | Index of accessibility to radial highways. |
# | tax | Full-value property-tax rate per 10,000 dollars. |
# | ptratio | Pupil-teacher ratio by town. |
# | black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
# | lstat | Lower status of the population (percent). |
# | medv | Target, median value of owner-occupied homes in $1000s. |
# In[3]:
df.info()
# In[4]:
df.describe()
# ---
#
# ## 4. Data train/test split
#
# - Now that the entire **dataset** is of **numeric datatype**, let's begin our modelling process.
#
#
# - Firstly, **splitting** the complete **dataset** into **training** and **testing** datasets.
# In[5]:
df.head()
# In[6]:
X = df.iloc[:, :-1]
X.head()
# In[7]:
y = df.iloc[:, -1]
y.head()
# In[8]:
# Using scikit-learn's train_test_split function to split the dataset into train and test sets.
# 80% of the data will be in the train set and 20% in the test set, as specified by test_size=0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# In[9]:
# Checking the shapes of all the training and test sets for the dependent and independent features.
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
# ---
#
# ## 5. Random Forest Model
#
#
# ### 5.1 Random Forest in scikit-learn
# In[12]:
# Creating a Random Forest Regressor.
regressor_rf = RandomForestRegressor(n_estimators=200, random_state=0, oob_score=True, n_jobs=-1)
# In[13]:
# Fitting the model on the dataset.
regressor_rf.fit(X_train, y_train)
# In[14]:
regressor_rf.oob_score_
# - The **OOB (out-of-bag) score** gives an estimate of how the model will perform on the **test set or new, unseen** samples.
# ---
#
# ### 5.2 Feature Importances
# In[15]:
X_train.columns
# In[16]:
# Checking the feature importances of various features.
# Sorting the importances by descending order (lowest importance at the bottom).
for score, name in sorted(zip(regressor_rf.feature_importances_, X_train.columns), reverse=True):
    print(f'Feature importance of {name}: {score * 100:.2f}%')
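# - The same importances can also be collected into a pandas Series and drawn as a bar chart. A minimal sketch (this cell is an illustrative addition, not part of the original notebook):
# In[ ]:
# Sketch: plot the impurity-based importances as a horizontal bar chart.
importances = pd.Series(regressor_rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh(title='Random Forest feature importances')
plt.show()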
# ---
#
# ### 5.3 Using the Model for Prediction
# In[17]:
# Making predictions on the training set.
y_pred_train = regressor_rf.predict(X_train)
y_pred_train[:10]
# In[18]:
# Making predictions on test set.
y_pred_test = regressor_rf.predict(X_test)
y_pred_test[:10]
# ---
#
# ## 6. Model Evaluation
#
# **Error** is the deviation of the values predicted by the model from the true values.
#
# ### 6.1 R-Squared Value
# In[19]:
# R-Squared Value on the training set.
print('R-Squared Value for train data is:', r2_score(y_train, y_pred_train))
# In[20]:
# R-Squared Value on the test set.
print('R-Squared Value for test data is:', r2_score(y_test, y_pred_test))
# - We get an **R-Squared Value** of **97.74%** on our train set and an **R-Squared Value** of **87.98%** on our test set.
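#
#
# - **R-squared** alone does not convey the **magnitude of the errors**, so it can be useful to also report **RMSE** and **MAE** on the test set. A minimal sketch using scikit-learn's metrics (an optional addition, not part of the original evaluation):
# In[ ]:
# Sketch: complement R-squared with error-magnitude metrics on the test set.
from sklearn.metrics import mean_squared_error, mean_absolute_error
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_pred_test)))
print('Test MAE :', mean_absolute_error(y_test, y_pred_test))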
#
# ## 7. Model Interpretability using SHAP
#
#
# - **SHAP (SHapley Additive exPlanations)** is a unified approach to **explain the output of any machine learning model**.
#
#
# - **SHAP connects game theory with local explanations**, uniting several previous methods and representing the only possible consistent and locally accurate additive feature attribution method based on expectations.
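#
#
# - Concretely, SHAP's **local accuracy** property says each prediction decomposes additively as $f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i$, where $\phi_0$ is the **base value** (the model's average output) and $\phi_i$ is the **SHAP value** of feature $i$. We verify this property in code after computing the SHAP values below.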
# In[21]:
import shap
# In[22]:
# Load JS visualization code to notebook
shap.initjs()
#
# ### 7.1 Explain predictions
# In[23]:
# Explain the model's predictions using SHAP values
explainer = shap.TreeExplainer(regressor_rf)
# In[25]:
shap_values = explainer.shap_values(X_train.values)
# In[27]:
shap_values.shape
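#
# - As a quick check of the **additive decomposition** described above: the **base value** plus the **sum of a row's SHAP values** should reproduce the model's prediction (TreeExplainer computes exact SHAP values for tree models). A minimal sketch:
# In[ ]:
# Sketch: verify SHAP's local-accuracy property on the training set.
# Base value + sum of each row's SHAP values should equal the model's
# prediction, so this should print True.
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print(np.allclose(reconstructed, regressor_rf.predict(X_train.values)))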
#
# ### 7.2 Visualize a single prediction
# In[23]:
# Visualize the first prediction's explanation
shap.force_plot(explainer.expected_value, shap_values[0, :], X_train.iloc[0, :])
# - The above explanation shows how each **feature** contributes to **pushing the model output** from the **base value** (the **average model output over the training dataset** we passed) **to the final model output**.
#
#
# - **Features pushing the prediction higher** are shown in **red**, those **pushing the prediction lower are in blue**.
#
#
# - The **value** written after each **feature** is its **actual value** in the **data** for this **particular sample (row)**.
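#
#
# - The force plot above renders via JavaScript; if a **static image** is needed (e.g. for export to PDF), a **single** prediction's force plot can also be drawn with matplotlib. A minimal sketch (the matplotlib=True option only supports single explanations):
# In[ ]:
# Sketch: render the first prediction's force plot as a static matplotlib figure.
shap.force_plot(explainer.expected_value, shap_values[0, :], X_train.iloc[0, :], matplotlib=True)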
#
# ### 7.3 Visualize many predictions
#
# - If we take **many explanations** such as the one shown above, **rotate them 90 degrees**, and then **stack them horizontally**, we can see **explanations for an entire dataset**.
# In[24]:
# Visualize the training set predictions
shap.force_plot(explainer.expected_value, shap_values, X_train)
#
# ### 7.4 SHAP Dependence Plots
#
# - **SHAP dependence plots** show the **effect of a single feature across the whole dataset**.
#
#
# - They **plot a feature's value vs. the SHAP value** of that feature **across many samples**.
#
#
# - The **vertical dispersion** of **SHAP values at a single feature value** is driven by **interaction effects**, and **another feature** is chosen for **coloring** to **highlight possible interactions**.
# In[25]:
# Create a SHAP dependence plot to show the effect of a single feature across the whole dataset
shap.dependence_plot('rm', shap_values, X_train)
# - Since **SHAP values** represent a **feature's responsibility for a change in the model output**, the plot above represents the **change in predicted house price** as **rm** (the average number of rooms per dwelling) **changes**.
#
# - **Vertical dispersion** at a single value of **rm** represents **interaction effects** with **other features**.
#
# - To help **reveal** these **interactions**, **dependence_plot automatically selects another feature for coloring**.
#
# - In this case **coloring** by **rad** (index of accessibility to radial highways) **highlights** that the **average number of rooms per house** has **less impact on home price for areas** with a **high rad** value.
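#
# - The **coloring feature** can also be **fixed explicitly** instead of letting SHAP pick one, via the interaction_index argument; a minimal sketch:
# In[ ]:
# Sketch: pin the coloring feature to 'rad' rather than using automatic selection.
shap.dependence_plot('rm', shap_values, X_train, interaction_index='rad')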
# In[26]:
# Plot a dependence plot for each feature in the training set.
for name in X_train.columns:
    shap.dependence_plot(name, shap_values, X_train)
# - This shows how to **plot** the **dependence plots** for **each feature** in the dataset in just a few lines of code.
#
# ### 7.5 SHAP Summary Plot
#
# - To get an overview of **which features** are **most important for a model** we can **plot** the **SHAP values** of **every feature** for **every sample**.
#
#
# - We use a **density scatter plot** of **SHAP values** for **each feature to identify** how much **impact each feature has on the model output** for **individual samples** in the **training dataset**.
#
#
# - **Features** are **sorted** by the **sum of the SHAP value magnitudes** across all samples.
#
#
# - Note that when the **scatter points don't fit on a line** they **pile up to show density**, and the **color of each point represents** the **feature value of that individual**.
# In[27]:
# Summarize the effects of all the features
shap.summary_plot(shap_values, X_train)
# - The **plot** above **sorts features** by the **sum of SHAP value magnitudes over all samples**, and **uses SHAP values** to show the **distribution of the impacts each feature has on the model output**.
#
#
# - The **color represents the feature value (red high, blue low)**.
#
#
# - This **reveals**, for example, that a **high lstat** (% lower status of the population) **lowers the predicted home price**.
#
# ### 7.6 Bar chart of Mean Importance
#
# - This takes the **average of the SHAP value magnitudes** across the dataset and **plots** it as a **simple bar chart**.
# In[28]:
shap.summary_plot(shap_values, X_train, plot_type="bar")
# - Here we can see that the **rm** column has the **highest feature importance**, followed by the **lstat** column.
#
#
# - This implies that the **number of rooms** and **lower status of the population** have the **highest** effect on the **price** of a **house** in **Boston**.
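#
#
# - The **bar heights** above are simply the **mean absolute SHAP value per feature**, which can be reproduced directly; a minimal sketch:
# In[ ]:
# Sketch: recompute the bar-chart values as mean(|SHAP value|) per feature.
mean_abs_shap = pd.Series(np.abs(shap_values).mean(axis=0), index=X_train.columns)
print(mean_abs_shap.sort_values(ascending=False))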