#!/usr/bin/env python
# coding: utf-8

# # Model Interpretability on Random Forest using LIME

# ## Table of Contents
#
# 1. [Problem Statement](#section1)
# 2. [Importing Packages](#section2)
# 3. [Loading Data](#section3)
#     - 3.1 [Description of the Dataset](#section301)
# 4. [Data train/test split](#section4)
# 5. [Random Forest Model](#section5)
#     - 5.1 [Random Forest with scikit-learn](#section501)
#     - 5.2 [Feature Importances](#section502)
#     - 5.3 [Using the Model for Prediction](#section503)
# 6. [Model Evaluation](#section6)
#     - 6.1 [R-Squared Value](#section601)
# 7. [Model Interpretability using LIME](#section7)
#     - 7.1 [Setup LIME Algorithm](#section701)
#     - 7.2 [Explore Key Features in Instance-by-Instance Predictions](#section702)
# ## 1. Problem Statement
#
# - We have often found that **Machine Learning (ML)** algorithms capable of capturing **structural non-linearities** in training data - models that are sometimes referred to as **'black box' (e.g. Random Forests, Deep Neural Networks, etc.)** - perform far **better at prediction** than their **linear counterparts (e.g. Generalised Linear Models)**.
#
# - They are, however, much **harder to interpret**. In fact, it is often **not possible to gain any insight into why a particular prediction has been produced** when given an **instance of input data (i.e. the model features)**.
#
# - Consequently, it has **not been possible to use 'black box' ML algorithms** in situations where clients require **cause-and-effect explanations for model predictions**; sub-optimal but interpretable models end up being used in their place, because their explanatory power is more valuable in relative terms.
#
# - The **problem with model explainability** is that it is **very hard to define a model's decision boundary in a human-understandable manner**.
#
# - **LIME** is a **Python library** that tackles **model interpretability by producing locally faithful explanations** - simple surrogate models that approximate the black-box model in the neighbourhood of a single prediction.

# - We will use **LIME** to **interpret** our **Random Forest model**.

# ---
# ## 2. Importing Packages

# In[ ]:


# Install LIME using the following command.
get_ipython().system('pip install lime')


# In[2]:


import numpy as np
np.set_printoptions(precision=4)                 # Display values only up to four decimal places.

import pandas as pd
pd.set_option('mode.chained_assignment', None)   # Suppress pandas chained-assignment warnings.
pd.set_option('display.max_colwidth', None)      # Display the full contents of each column.
pd.options.display.max_columns = 40              # Display all the columns (value set higher than the column count).

import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')               # Apply the seaborn whitegrid style to the plots.
plt.rc('figure', figsize=(10, 8))                # Set the default figure size of plots.
get_ipython().run_line_magic('matplotlib', 'inline')

import warnings
warnings.filterwarnings('ignore')                # Suppress all the warnings in the notebook.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score


# ---
# ## 3. Loading Data

# In[3]:


df = pd.read_csv('../../data/Boston.csv')
df.head()


# ### 3.1 Description of the Dataset
#
# - This dataset contains information on **Housing Values in Suburbs of Boston**.
#
# - The column **medv** is the **target variable**. It is the **median** value of **owner-occupied homes in $1000s**.
#
# | Column Name | Description                                                             |
# | ----------- | :---------------------------------------------------------------------:|
# | crim        | Per capita crime rate by town.                                          |
# | zn          | Proportion of residential land zoned for lots over 25,000 sq. ft.       |
# | indus       | Proportion of non-retail business acres per town.                       |
# | chas        | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).  |
# | nox         | Nitrogen oxides concentration (parts per 10 million).                   |
# | rm          | Average number of rooms per dwelling.                                   |
# | age         | Proportion of owner-occupied units built prior to 1940.                 |
# | dis         | Weighted mean of distances to five Boston employment centres.           |
# | rad         | Index of accessibility to radial highways.                              |
# | tax         | Full-value property-tax rate per $10,000.                               |
# | ptratio     | Pupil-teacher ratio by town.                                            |
# | black       | 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.        |
# | lstat       | Lower status of the population (percent).                               |
# | medv        | Target; median value of owner-occupied homes in $1000s.                 |

# In[4]:


df.info()


# In[5]:


df.describe()


# ---
# ## 4. Data train/test split
#
# - Now that the entire **dataset** is **numeric**, let's begin the modelling process.
#
# - First, **split** the complete **dataset** into **training** and **testing** sets.

# In[6]:


df.head()


# In[7]:


X = df.iloc[:, :-1]
X.head()


# In[8]:


y = df.iloc[:, -1]
y.head()


# In[9]:


# Using scikit-learn's train_test_split function to split the dataset into train and test sets.
# 80% of the data goes to the train set and 20% to the test set, as specified by test_size=0.2.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# In[10]:


# Checking the shapes of the train and test sets for the dependent and independent features.
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)


# ---
# ## 5. Random Forest Model
#
# ### 5.1 Random Forest with scikit-learn

# In[11]:


# Creating a Random Forest Regressor.
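# n_estimators=200 -> number of trees in the forest.
# oob_score=True   -> score each tree on the samples left out of its bootstrap sample,
#                     giving a built-in estimate of generalisation performance.
# n_jobs=-1        -> use all available CPU cores when fitting.
# random_state=0   -> fix the random seed so that results are reproducible.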
regressor_rf = RandomForestRegressor(n_estimators=200, random_state=0, oob_score=True, n_jobs=-1)


# In[12]:


# Fitting the model on the training set.
regressor_rf.fit(X_train, y_train)


# In[13]:


regressor_rf.oob_score_


# - The **OOB (out-of-bag) score** gives an estimate of how the model is likely to perform on the **test set or on new samples**.

# ---
# ### 5.2 Feature Importances

# In[14]:


X_train.columns


# In[15]:


# Checking the feature importances of the various features.
# Sorting the importances in descending order (lowest importance at the bottom).
for score, name in sorted(zip(regressor_rf.feature_importances_, X_train.columns), reverse=True):
    print('Feature importance of', name, ':', score*100, '%')


# ---
# ### 5.3 Using the Model for Prediction

# In[16]:


# Making predictions on the training set.
y_pred_train = regressor_rf.predict(X_train)
y_pred_train[:10]


# In[17]:


# Making predictions on the test set.
y_pred_test = regressor_rf.predict(X_test)
y_pred_test[:10]


# ---
# ## 6. Model Evaluation
#
# - **Error** is the deviation of the values predicted by the model from the true values.
#
# ### 6.1 R-Squared Value

# In[18]:


# R-Squared Value on the training set.
print('R-Squared Value for train data is:', r2_score(y_train, y_pred_train))


# In[19]:


# R-Squared Value on the test set.
print('R-Squared Value for test data is:', r2_score(y_test, y_pred_test))


# - We get an **R-Squared Value** of **97.74%** on the train set and **87.98%** on the test set.

# ---
# ## 7. Model Interpretability using LIME
#
# - **LIME** stands for **Local Interpretable Model-Agnostic Explanations**. It is a technique to **explain the predictions of any machine learning model** and to **evaluate its usefulness** in various **tasks** related to **trust**.

# In[20]:


# Selecting the indexes of the categorical features in the dataset
# (features with at most 10 distinct values are treated as categorical).
categorical_features = np.argwhere(
    np.array([len(set(X_train.values[:, x])) for x in range(X_train.shape[1])]) <= 10
).flatten()
categorical_features


# ### 7.1 Setup LIME Algorithm

# In[21]:


from lime.lime_tabular import LimeTabularExplainer


# In[22]:


# Creating the LIME explainer object.
explainer = LimeTabularExplainer(X_train.values,
                                 mode='regression',
                                 feature_names=X_train.columns,
                                 categorical_features=categorical_features,
                                 verbose=True,
                                 random_state=0)


# ---
# ### 7.2 Explore Key Features in Instance-by-Instance Predictions
#
# - **Start by choosing an instance** from the **test dataset**.
#
# - Use **LIME** to **estimate a local model** to use for **explaining our model's predictions**. The **outputs** will be:
#
#     1. The **intercept** estimated for the local model.
#     2. The **local model's estimate** of the **Random Forest's prediction**.
#     3. The **Random Forest's actual prediction**.
#
# - Note that the **actual value from the data does not enter into this**: the **idea of LIME** is to **gain insight** into **why the chosen model** - in our case the Random Forest regressor - **is predicting whatever it has been asked to predict**. Whether or not that prediction is actually any good is a separate issue.

# In[23]:


# Selecting a random instance from the test dataset.
i = np.random.randint(0, X_test.shape[0])
print('i =', i)


# In[24]:


# Using LIME to estimate a local model, using only 6 features to explain our model's prediction.
exp = explainer.explain_instance(X_test.values[i], regressor_rf.predict, num_features=6)
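# - Before looking at the output, it helps to sketch what `explain_instance` is doing under the hood: **LIME perturbs the chosen instance, queries the black-box model on the perturbed samples, weights each sample by its proximity to the instance, and fits a simple weighted linear surrogate model** whose coefficients act as the explanation. The cell below is a **simplified illustration of that idea only**, not the library's actual implementation (the real algorithm additionally discretizes continuous features, uses a specific kernel width, and selects features with a sparse linear model); the Gaussian noise scale, the RBF-style kernel, and the Ridge surrogate used here are illustrative assumptions.

# In[ ]:


# A simplified, illustrative sketch of the local-surrogate idea behind LIME.
# This is NOT the lime library's actual implementation.
from sklearn.linear_model import Ridge

instance = X_test.values[i]
rng = np.random.RandomState(0)

# 1. Perturb the instance with Gaussian noise scaled by each feature's std in the training data.
perturbed = instance + rng.normal(size=(5000, X_train.shape[1])) * X_train.values.std(axis=0)

# 2. Query the black-box model on the perturbed samples.
black_box_preds = regressor_rf.predict(perturbed)

# 3. Weight each perturbed sample by its proximity to the original instance.
distances = np.sqrt(((perturbed - instance) ** 2).sum(axis=1))
weights = np.exp(-(distances ** 2) / (2 * distances.std() ** 2))

# 4. Fit a weighted linear surrogate; its coefficients play the role of the explanation.
surrogate = Ridge(alpha=1.0).fit(perturbed, black_box_preds, sample_weight=weights)
for name, coef in sorted(zip(X_train.columns, surrogate.coef_), key=lambda t: -abs(t[1]))[:6]:
    print(name, ':', round(coef, 4))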
# - **Printing** the **DataFrame row** for the **chosen test instance**.

# In[25]:


# Here the index column is the original index from the df dataframe, and the number at the
# beginning of the row is the index after the reset.
X_test.reset_index().loc[[i]]


# - **LIME's interpretation** of our **Random Forest's prediction**.

# In[26]:


exp.show_in_notebook(show_table=True, show_all=False)


# - First, note that the **row** we **explained** is **displayed** on the **right-hand side** in **table** format. Since the **show_all parameter** is set to **False**, only the **features used in the explanation are displayed**.
#
# - The **value column** displays the **original value of each feature**.
#
# - To get the **output generated above** in the **form of a list**:

# In[27]:


exp.as_list()


# **Observations from LIME's interpretation of our Random Forest's prediction**:
#
# - The **value** shown after each condition is the **amount** by which the prediction is **shifted** from the **intercept** estimated for the local model.
#
# - When all of these values are **added** to the **intercept**, we obtain the **Prediction_local** (the local model's estimate of the Random Forest's prediction) calculated by **LIME**.

# In[28]:


print('Intercept =', exp.intercept[0])
print('Prediction_local =', exp.local_pred[0])


# In[29]:


# Calculating Prediction_local by adding the value obtained above for each condition to the intercept.
# The intercept can be obtained from exp.intercept using index 0.
intercept = exp.intercept[0]
prediction_local = intercept
for k in range(len(exp.as_list())):
    prediction_local += exp.as_list()[k][1]
print('Prediction_local =', prediction_local)


# - Choosing **another instance** from the **test dataset**.

# In[30]:


i = np.random.randint(0, X_test.shape[0])
print('i =', i)
exp = explainer.explain_instance(X_test.values[i], regressor_rf.predict, num_features=6)


# - **Printing** the **DataFrame row** for the **chosen test instance**.

# In[31]:


X_test.reset_index().loc[[i]]


# - **LIME's interpretation** of our **Random Forest's prediction**.

# In[32]:


exp.show_in_notebook(show_table=True, show_all=False)


# In[33]:


exp.as_list()


# In[34]:


print('Intercept =', exp.intercept[0])
print('Prediction_local =', exp.local_pred[0])


# In[35]:


intercept = exp.intercept[0]
prediction_local = intercept
for k in range(len(exp.as_list())):
    prediction_local += exp.as_list()[k][1]
print('Prediction_local =', prediction_local)


# - By **changing** the chosen **i**, we observe that the **narrative provided by LIME** also **changes, in response to the behaviour of the model** in the **local region** of the **feature space** in which it is working to **generate a given prediction**.
#
# - This is clearly an **improvement on relying purely** on the **Random Forest's (static) global feature importances**, and of **great benefit for models that provide no insight whatsoever**.
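# - Finally, if an explanation needs to be **shared outside the notebook**, the LIME `Explanation` object can also be rendered as a matplotlib figure or written to a standalone HTML file. The cell below is a brief sketch; the output file name is arbitrary.

# In[ ]:


# Rendering the LIME explanation outside the notebook.
fig = exp.as_pyplot_figure()               # Horizontal bar chart of the feature contributions.
fig.tight_layout()
exp.save_to_file('lime_explanation.html')  # Writes a self-contained HTML report (file name is arbitrary).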