A machine learning model is often a black box: we feed it input and get an output, with little insight into how that output was produced. Since humans and machines "interpret" things very differently, it helps to have tools that make a model's output understandable to humans, especially non-technical users. In a business context, model interpretation plays an important role in data-driven decision making: the better we can explain the output, the easier it is for non-technical users to understand it and act on it, which accelerates the business process. Hence, I will explain one of the packages widely used for interpreting black-box models, called LIME (Local Interpretable Model-agnostic Explanations), and we will apply it to a classification problem. Because the main focus of this kernel is understanding how LIME works, I only do minimal preprocessing of the categorical and numerical features; proper EDA and feature engineering would give better predictions.
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import lime
import lime.lime_tabular
from lime import submodular_pick
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder # For transforming categories to integer labels
df = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df[:5]
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
# Dropping customerID, an identifier with no predictive value
df.drop(columns=['customerID'], inplace=True)
df.head()
gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   gender            7043 non-null   object
 1   SeniorCitizen     7043 non-null   int64
 2   Partner           7043 non-null   object
 3   Dependents        7043 non-null   object
 4   tenure            7043 non-null   int64
 5   PhoneService      7043 non-null   object
 6   MultipleLines     7043 non-null   object
 7   InternetService   7043 non-null   object
 8   OnlineSecurity    7043 non-null   object
 9   OnlineBackup      7043 non-null   object
 10  DeviceProtection  7043 non-null   object
 11  TechSupport       7043 non-null   object
 12  StreamingTV       7043 non-null   object
 13  StreamingMovies   7043 non-null   object
 14  Contract          7043 non-null   object
 15  PaperlessBilling  7043 non-null   object
 16  PaymentMethod     7043 non-null   object
 17  MonthlyCharges    7043 non-null   float64
 18  TotalCharges      7043 non-null   object
 19  Churn             7043 non-null   object
dtypes: float64(1), int64(2), object(17)
memory usage: 1.1+ MB
# TotalCharges is stored as object; coerce it to numeric (non-numeric entries become NaN)
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
# Dropping missing values
df.dropna(inplace=True)
df.shape
(7032, 20)
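Eleven rows were dropped (7043 → 7032). As a quick sanity check, not part of the original kernel, we can re-read the raw file and inspect exactly which rows failed the conversion; in this dataset they turn out to be blank strings in TotalCharges belonging to brand-new customers with zero tenure:

# Sanity check (sketch): which raw rows failed the numeric conversion above?
raw = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
bad = raw[pd.to_numeric(raw['TotalCharges'], errors='coerce').isna()]
print(len(bad), bad['tenure'].unique())  # 11 rows, all with tenure == 0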
# Label Encoding features
categorical_feat =list(df.select_dtypes(include=["object"]))
# Using label encoder to transform string categories to integer labels
le = LabelEncoder()
for feat in categorical_feat:
df[feat] = le.fit_transform(df[feat]).astype('int')
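One caveat with the loop above: reusing a single LabelEncoder means each fit_transform call overwrites the previous fit, so the string-to-integer mappings are lost afterwards. A variant that could replace the loop (a sketch; the encoders dict is my own addition) keeps one fitted encoder per column, which makes it possible to translate the integer codes in LIME's output back into readable category names:

# Sketch: one encoder per column, so every mapping survives for later lookup
encoders = {}
for feat in categorical_feat:
    encoders[feat] = LabelEncoder()
    df[feat] = encoders[feat].fit_transform(df[feat]).astype('int')
# e.g. encoders['Contract'].classes_ -> array(['Month-to-month', 'One year', 'Two year'])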
df.head()
gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 29.85 | 29.85 | 0 |
1 | 1 | 0 | 0 | 0 | 34 | 1 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 56.95 | 1889.50 | 0 |
2 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 53.85 | 108.15 | 1 |
3 | 1 | 0 | 0 | 0 | 45 | 0 | 1 | 0 | 2 | 0 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | 42.30 | 1840.75 | 0 |
4 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 70.70 | 151.65 | 1 |
features = df.drop(columns=['Churn'])
labels = df['Churn']
# Dividing the data into training-test set with 80:20 split ratio
x_train,x_test,y_train,y_test = train_test_split(features,labels,test_size=0.2, random_state=123)
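Churn datasets are typically imbalanced (roughly a quarter of these customers churn), so a stratified split keeps the class ratio identical in both halves. A variant of the split above that one could use instead, assuming that guarantee is wanted:

# Variant: stratify on the label so train and test keep the same churn ratio
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=123, stratify=labels)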
model = XGBClassifier(n_estimators = 300, random_state = 123)
model.fit(x_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=300,
              n_jobs=0, num_parallel_tree=1, predictor='auto',
              random_state=123, reg_alpha=0, reg_lambda=1, ...)
model.score(x_test, y_test)
0.7839374555792467
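Accuracy alone can be misleading on an imbalanced target, so it is worth also checking per-class precision and recall; a quick check, not in the original kernel:

from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(x_test), target_names=['No churn', 'Churn']))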
# LIME expects a function that maps a 2-D array of samples to class probabilities
predict_fn = lambda x: model.predict_proba(x)
np.random.seed(123)
# Defining the LIME explainer object
explainer = lime.lime_tabular.LimeTabularExplainer(df[features.columns].astype(int).values,
mode='classification',
class_names=['Did not Churn', 'Churn'],
training_labels=df['Churn'],
feature_names=features.columns)
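One subtlety worth flagging: since the categorical columns were label-encoded, LIME treats those integers as continuous values and perturbs them accordingly, producing rules like 'Contract <= 1.00'. LimeTabularExplainer accepts categorical_features (a list of column indices) and categorical_names (an index-to-names mapping) so that such columns are perturbed as discrete categories instead. A sketch, assuming the per-column encoders dict from the earlier label-encoding variant is available (as an aside, passing x_train.values rather than the full dataset would also avoid leaking test-set statistics into the explainer's sampling):

# Sketch: tell LIME which columns are categorical so it perturbs them discretely
cat_cols = [c for c in categorical_feat if c != 'Churn']  # Churn is the label, not a feature
cat_idx = [features.columns.get_loc(c) for c in cat_cols]
cat_names = {features.columns.get_loc(c): list(encoders[c].classes_) for c in cat_cols}
explainer_cat = lime.lime_tabular.LimeTabularExplainer(df[features.columns].astype(int).values,
                                                       mode='classification',
                                                       class_names=['Did not Churn', 'Churn'],
                                                       feature_names=features.columns,
                                                       categorical_features=cat_idx,
                                                       categorical_names=cat_names)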
# Using LIME to explain a single instance
i = 5
# Note: .astype(int) matches the explainer's training data above, but it truncates the
# continuous charge columns; passing the raw float values would also work
exp = explainer.explain_instance(df.loc[i, features.columns].astype(int).values, predict_fn, num_features=5)
exp.show_in_notebook(show_table=True)
The left-most bar plot shows the prediction probabilities, which can be treated as the model's confidence in its prediction. In this case, the model is 99% confident that this particular customer would churn, and 1% confident that they would not.
The second visualization is arguably the most important one, as it provides the most explainability. It tells us that the feature with the highest local weight, about 21%, is 'MonthlyCharges', followed by 'Contract' at about 19%. As illustrated in orange, for the selected instance the features 'MonthlyCharges', 'Contract', 'tenure', 'TotalCharges', and 'OnlineSecurity' all contribute toward the model outcome of 'Churn', along with the threshold values LIME learned from the dataset. For other instances, the same features can instead contribute toward 'Did not Churn', as shown later among the global interpretation instances. The thresholds learned by the LIME model are also in alignment with common sense and a priori knowledge: the higher the MonthlyCharges, the higher the churn rate among customers. So the explanation provided is human-friendly and consistent with our prior beliefs.
The third visualization from the left shows the top five features and their respective values for this instance. Here, the features highlighted in orange contribute toward class 1 ('Churn'), while features highlighted in blue contribute toward class 0 ('Did not Churn').
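The same information that show_in_notebook renders can also be pulled out programmatically, which is handy when explanations need to be logged or post-processed rather than eyeballed; a small usage example of LIME's Explanation API:

# Each pair is (human-readable rule, local weight); positive weights push toward 'Churn'
for rule, weight in exp.as_list():
    print(f'{rule:35s} {weight:+.3f}')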
figure = exp.as_pyplot_figure(label = exp.available_labels()[0])
The visualization shows the top 5 features and their respective values in which the features highlighted in orange are contributing towards class Churn.
# Let's use SP-LIME to return explanations on a sample data set
# and obtain a non-redundant global decision perspective of the black-box model
sp_exp = submodular_pick.SubmodularPick(explainer,
df[features.columns].values,
predict_fn,
num_features=5,
num_exps_desired=5)
for exp in sp_exp.sp_explanations:
    exp.show_in_notebook()
print('SP-LIME Explanations.')
SP-LIME Explanations.
for exp in sp_exp.sp_explanations:
    exp.as_pyplot_figure(label=exp.available_labels()[0])
print('SP-LIME Local Explanations')
SP-LIME Local Explanations
The preceding plots give a global understanding of the model alongside the local explanations, courtesy of the SP-LIME algorithm: it explains a sample of the data and then picks a small, diverse set of instances whose explanations together cover the model's behavior, yielding a non-redundant global perspective on the black box. The red bars indicate ranges of feature values that contribute toward the 'Did not Churn' outcome, while the green ones contribute toward 'Churn'.
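Under the hood, SubmodularPick first explains a sample of instances (by default sample_size=1000 with method='sample') and then greedily selects num_exps_desired of them to maximize coverage of distinct important features. If a single global ranking is preferred over individual plots, the local weights can be aggregated across all sampled explanations; a sketch using the explanations attribute that SubmodularPick exposes:

# Aggregate local weights across all sampled explanations into one global ranking
weights = pd.DataFrame([dict(e.as_list(e.available_labels()[0])) for e in sp_exp.explanations])
global_importance = weights.abs().mean().sort_values(ascending=False)
print(global_importance.head(10))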