Customer churn (also known as customer attrition) occurs when a customer stops using a company's products or services.
Customer churn affects profitability, especially in industries where revenues are heavily dependent on subscriptions (e.g. banks, telephone and internet service providers, pay-TV companies, insurance firms, etc.). It is estimated that acquiring a new customer can cost up to five times more than retaining an existing one.
Therefore, customer churn analysis is essential, as it can help a business identify and retain customers who are at risk of leaving.
If you want to know more about this topic, please refer to Refs. [10-13].
This notebook aims to understand and predict customer churn for a bank. Specifically, we will initially perform Exploratory Data Analysis (EDA) to identify and visualise the factors contributing to customer churn. This analysis will later help us pre-process our data and build Machine Learning (ML) models to predict whether a customer will churn or not.
This problem is a typical binary classification task. The predictive power of our models will be assessed based on the area under the ROC curve (AUC). The same metric will be used during hyperparameter tuning to find the optimal parameters for the ML models.
Skills: Exploratory Data Analysis, Data Visualisation, Data Pre-processing (Feature Selection, Encoding Categorical Features, Feature Scaling), Addressing Class Imbalance, Model Tuning, Ensemble Learning.
Models Used: Logistic Regression, Support Vector Machines, Random Forests, Gradient Boosting, XGBoost, and Light Gradient Boosting Machine.
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Data Handling and Analysis
import numpy as np
import pandas as pd
# Data Visualisation
import matplotlib.pyplot as plt
import seaborn as sns
# Statistics
from scipy.stats import chi2_contingency
# Machine Learning
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.model_selection import learning_curve
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer
import scikitplot as skplt
import watermark
%load_ext watermark
# See version of system, Python, and libraries
%watermark -n -v -m -iv
Python implementation: CPython
Python version       : 3.10.2
IPython version      : 8.0.1
Compiler             : MSC v.1929 64 bit (AMD64)
OS                   : Windows
Release              : 10
Machine              : AMD64
Processor            : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
CPU cores            : 4
Architecture         : 64bit

json      : 2.0.9
matplotlib: 3.5.1
xgboost   : 1.5.2
pandas    : 1.4.1
scipy     : 1.8.0
seaborn   : 0.11.2
scikitplot: 0.3.7
lightgbm  : 3.3.2
numpy     : 1.22.2
watermark : 2.3.0
sklearn   : 1.0.2
imblearn  : 0.9.0
pd.set_option('display.precision', 3)
%config InlineBackend.figure_format = 'retina'
plt.style.use('fivethirtyeight')
edgecolor = 'white'
facecolor = '#f0f0f0'
plt.rcParams['axes.edgecolor'] = '0.15'
plt.rcParams['axes.linewidth'] = 1
plt.rcParams['axes.facecolor'] = facecolor
font_size = 20
plt.rcParams['axes.labelsize'] = font_size
plt.rcParams['axes.titlesize'] = font_size
plt.rcParams['xtick.labelsize'] = font_size - 2
plt.rcParams['ytick.labelsize'] = font_size - 2
plt.rcParams['legend.fontsize'] = font_size - 2
random_state = 0
colors = ['#00A5E0', '#DD403A']
colors_cat = ['#E8907E', '#D5CABD', '#7A6F86', '#C34A36', '#B0A8B9', '#845EC2',
'#8f9aaa', '#FFB86F', '#63BAAA', '#9D88B3', '#38c4e3']
colors_comp = ['steelblue', 'seagreen', 'black', 'darkorange', 'purple',
'firebrick', 'slategrey']
sns.palplot(colors)
plt.title('Palette #1 - Binary Comparison')
plt.tick_params(axis='x', bottom=False, labelbottom=False)
sns.palplot(colors_cat)
plt.title('Palette #2 - Categories')
plt.tick_params(axis='x', bottom=False, labelbottom=False)
sns.palplot(colors_comp)
plt.title('Palette #3 - Classifier Comparison')
plt.tick_params(axis='x', bottom=False, labelbottom=False)
def plot_continuous(df, feature):
'''Plot a histogram and boxplot for the churned and retained distributions for the specified feature.'''
df_churned = df[df['Exited'] == 1]
df_retained = df[df['Exited'] == 0]
df_func = df.copy()
df_func['Exited'] = df_func['Exited'].astype('category')
fig, (ax1, ax2) = plt.subplots(2,
figsize=(9, 7),
sharex=True,
gridspec_kw={'height_ratios': (.7, .3)})
for df, color, label in zip([df_retained, df_churned],
colors,
['Retained', 'Churned']):
sns.histplot(data=df,
x=feature,
bins=15,
color=color,
alpha=0.66,
edgecolor=edgecolor,
label=label,
kde=False,
ax=ax1)
ax1.legend()
sns.boxplot(x=feature,
y='Exited',
data=df_func,
linewidth=1.5,
palette=colors,
ax=ax2)
ax2.set_ylabel('')
ax2.set_yticklabels(['Retained', 'Churned'])
plt.tight_layout()
def plot_categorical(df, feature):
'''For a categorical feature, plot a seaborn.countplot for the total counts of each category next to a barplot for the churn rate.'''
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
sns.countplot(x=feature,
hue='Exited',
data=df,
palette=colors,
ax=ax1)
ax1.set_ylabel('Count')
ax1.legend(labels=['Retained', 'Churned'])
sns.barplot(x=feature,
y='Exited',
data=df,
palette=colors_cat, errwidth=2,
ax=ax2)
ax2.set_ylabel('Churn rate')
if (feature == 'HasCrCard' or feature == 'IsActiveMember'):
ax1.set_xticklabels(['No', 'Yes'])
ax2.set_xticklabels(['No', 'Yes'])
plt.tight_layout()
def plot_conf_mx(cm, ax):
'''Plot a confusion matrix in the specified axes object.'''
sns.heatmap(data=cm,
annot=True,
cmap='Blues',
annot_kws={'fontsize': font_size + 10},
ax=ax)
ax.set_xlabel('Predicted Label')
ax.set_xticks([0.5, 1.5])
ax.set_xticklabels(['Retained', 'Churned'])
ax.set_ylabel('True Label')
ax.set_yticks([0.25, 1.25])
ax.set_yticklabels(['Retained', 'Churned'])
comparison_dict = {}
def clf_performance(classifier, classifier_name, classifier_name_abv, cv):
'''Display the overall performance of a classifier with this template.'''
print('\n', classifier_name)
print('-------------------------------')
print(' Best Score ({}): '.format(scoring_metric) + str(np.round(classifier.best_score_, 3)))
print(' Best Parameters: ')
for key, value in classifier.best_params_.items():
print(' {}: {}'.format(key, value))
y_pred_pp = cross_val_predict(estimator=classifier.best_estimator_,
X=X_train,
y=y_train,
cv=cv,
method='predict_proba')[:, 1]
y_pred = y_pred_pp.round()
cm = confusion_matrix(y_train, y_pred, normalize='true')
fpr, tpr, _ = roc_curve(y_train, y_pred_pp)
comparison_dict[classifier_name_abv] = [
precision_score(y_train, y_pred),
recall_score(y_train, y_pred),
roc_auc_score(y_train, y_pred_pp), fpr, tpr
]
fig, ax = plt.subplots(figsize=(6, 5))
plot_conf_mx(cm, ax)
We start by importing the dataset as a Pandas DataFrame. We can also take a look at the top five rows using the head() method:
df = pd.read_csv('Data/Dataset.csv')
print('Dataset Imported Successfully!\n')
print('It contains {} rows and {} columns.'.format(df.shape[0], df.shape[1]))
print('Top five rows:')
df.head()
Dataset Imported Successfully! It contains 10000 rows and 14 columns. Top five rows:
| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
Our DataFrame has 14 columns and 10,000 instances. The last feature, 'Exited', is the target variable and indicates whether the customer has churned (0 = No, 1 = Yes). The meaning of the remaining features can easily be inferred from their names.
Features 'RowNumber', 'CustomerId', and 'Surname' are specific to each customer and can be dropped:
df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1, inplace=True)
df.columns
Index(['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary', 'Exited'], dtype='object')
The info() method can give us valuable information, such as the number of non-null values and the type of each feature:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   CreditScore      10000 non-null  int64
 1   Geography        10000 non-null  object
 2   Gender           10000 non-null  object
 3   Age              10000 non-null  int64
 4   Tenure           10000 non-null  int64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64
 7   HasCrCard        10000 non-null  int64
 8   IsActiveMember   10000 non-null  int64
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
Thankfully, there are no missing values in our DataFrame. The describe() method gives us a statistical summary of the numerical features:
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CreditScore | 10000.0 | 650.529 | 96.653 | 350.00 | 584.00 | 652.000 | 718.000 | 850.00 |
| Age | 10000.0 | 38.922 | 10.488 | 18.00 | 32.00 | 37.000 | 44.000 | 92.00 |
| Tenure | 10000.0 | 5.013 | 2.892 | 0.00 | 3.00 | 5.000 | 7.000 | 10.00 |
| Balance | 10000.0 | 76485.889 | 62397.405 | 0.00 | 0.00 | 97198.540 | 127644.240 | 250898.09 |
| NumOfProducts | 10000.0 | 1.530 | 0.582 | 1.00 | 1.00 | 1.000 | 2.000 | 4.00 |
| HasCrCard | 10000.0 | 0.706 | 0.456 | 0.00 | 0.00 | 1.000 | 1.000 | 1.00 |
| IsActiveMember | 10000.0 | 0.515 | 0.500 | 0.00 | 0.00 | 1.000 | 1.000 | 1.00 |
| EstimatedSalary | 10000.0 | 100090.240 | 57510.493 | 11.58 | 51002.11 | 100193.915 | 149388.247 | 199992.48 |
| Exited | 10000.0 | 0.204 | 0.403 | 0.00 | 0.00 | 0.000 | 0.000 | 1.00 |
The most important things to note are that only about 20% of customers have churned (the mean of 'Exited' is 0.204), so the classes are imbalanced; that at least a quarter of customers have a zero account balance; and that the numerical features span very different ranges, which means feature scaling will be needed later.
EDA will help us understand our dataset better. However, before we look at the data further, we need to create a test set, put it aside, and use it only to evaluate our ML models. This practice protects our models from data snooping bias (you can read more on page 51 of Ref. [1]) and ensures that evaluation will be performed using unseen data.
We will split our dataset into a train and a test set using scikit-learn's StratifiedShuffleSplit() class, which implements stratified sampling. In this method, the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population (Ref. [1]).
split = StratifiedShuffleSplit(n_splits=1,
test_size=0.2,
random_state=random_state)
for train_index, test_index in split.split(df, df['Exited']):
train_df = df.loc[train_index]
test_df = df.loc[test_index]
fig, ax = plt.subplots(figsize=(6, 6))
sns.countplot(x='Exited', data=train_df, palette=colors, ax=ax)
for index, value in enumerate(train_df['Exited'].value_counts()):
label = '{}%'.format(round((value / train_df['Exited'].shape[0]) * 100, 2))
ax.annotate(label,
xy=(index, value + 250),
ha='center',
va='center',
color=colors[index],
fontweight='bold',
size=font_size + 4)
ax.set_xticklabels(['Retained', 'Churned'])
ax.set(xlabel='Status', ylabel='Count')
ax.set_ylim([0, 7100]);
The bank kept 80% of its clientele.
Notice that our dataset is imbalanced: instances of the 'Retained' class outnumber those of the 'Churned' class by a factor of about four. Therefore, we need to account for this class imbalance before building any models.
Different visualisation techniques apply to different types of variables, so it is helpful to differentiate between continuous and categorical variables and look at them separately.
continuous = ['Age', 'CreditScore', 'Balance', 'EstimatedSalary']
categorical = ['Geography', 'Gender', 'Tenure', 'NumOfProducts', 'HasCrCard', 'IsActiveMember']
print('Continuous: ', ', '.join(continuous))
print('Categorical: ', ', '.join(categorical))
Continuous: Age, CreditScore, Balance, EstimatedSalary Categorical: Geography, Gender, Tenure, NumOfProducts, HasCrCard, IsActiveMember
By calling the hist() method, we can plot a histogram for each of the four continuous numeric features:
fig, ax = plt.subplots(figsize=(15, 10), facecolor=facecolor)
train_df[continuous].hist(bins=20,
layout=(2, 2),
color='lightsteelblue',
edgecolor=edgecolor,
linewidth=1.5,
ax=ax);
We can compute the standard correlation coefficient between every pair of (continuous) features using pandas' corr() method and plot it as a matrix:
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(train_df[continuous].corr(),
annot=True,
annot_kws={'fontsize': font_size},
cmap='Blues',
ax=ax)
ax.tick_params(axis='x', rotation=25)
ax.tick_params(axis='y', rotation=360);
There is no significant intercorrelation between our features, so we do not have to worry about multicollinearity.
Let's look at these features in greater detail.
plot_continuous(train_df, 'Age')
Interestingly, there is a clear difference between age groups, with older customers more likely to churn. This observation could potentially indicate that preferences change with age, and the bank hasn't adapted its strategy to meet the requirements of older customers.
plot_continuous(train_df, 'CreditScore')
There is no significant difference between retained and churned customers regarding their credit scores.
plot_continuous(train_df, 'Balance')
Again, the two distributions are pretty similar. There is a significant percentage of non-churned customers with a low account balance.
plot_continuous(train_df, 'EstimatedSalary')
ax = plt.gca()
ax.xaxis.set_major_locator(plt.MaxNLocator(5))
Both churned and retained customers display a similar uniform distribution in their salaries. Consequently, we can conclude that salary does not have a significant effect on the likelihood to churn.
Let's plot a seaborn.countplot for each categorical feature:
df_cat = train_df[categorical]
fig, ax = plt.subplots(2, 3, figsize=(14, 8))
for index, column in enumerate(df_cat.columns):
plt.subplot(2, 3, index + 1)
sns.countplot(x=column,
data=train_df,
edgecolor=edgecolor,
palette=colors_cat)
plt.ylabel('Count')
if (column == 'HasCrCard' or column == 'IsActiveMember'):
plt.xticks([0, 1], ['No', 'Yes'])
plt.tight_layout();
Important points: roughly 70% of customers hold a credit card, only about half are active members, and most customers hold one or two products.
Again, we will look at these features in greater detail.
plot_categorical(train_df, 'Geography')
Customers in Germany are more likely to churn than customers in the other two countries (the churn rate is almost double compared to Spain and France). Many reasons could explain this finding, such as higher competition or different preferences for German customers.
plot_categorical(train_df, 'Gender')
plot_categorical(train_df, 'Tenure')
plot_categorical(train_df, 'NumOfProducts')
Interestingly, having three or four products can significantly increase the likelihood of churn. I am not sure how to interpret this result. It could potentially mean that the bank cannot properly support customers with more products, thus increasing customer dissatisfaction.
plot_categorical(train_df, 'HasCrCard')
plot_categorical(train_df, 'IsActiveMember')
It is not a surprise that inactive customers are more likely to churn. A significant portion of the clientele is inactive; therefore, the bank will benefit from changing its policy so that more customers become active.
Data pre-processing is the process of converting raw data into a well-readable format that is suitable for building and training ML models.
Let's complete this process step-by-step.
We have already performed feature selection by dropping columns 'RowNumber', 'CustomerId', and 'Surname' at the beginning of our notebook. EDA revealed several more features that provide little value in predicting our target variable: 'Tenure', 'HasCrCard', and 'EstimatedSalary'. For the categorical candidates, we can back this up with a chi-square test of independence against 'Exited':
chi2_array, p_array = [], []
for column in categorical:
crosstab = pd.crosstab(train_df[column], train_df['Exited'])
chi2, p, dof, expected = chi2_contingency(crosstab)
chi2_array.append(chi2)
p_array.append(p)
df_chi = pd.DataFrame({
'Variable': categorical,
'Chi-square': chi2_array,
'p-value': p_array
})
df_chi.sort_values(by='Chi-square', ascending=False)
| | Variable | Chi-square | p-value |
|---|---|---|---|
| 3 | NumOfProducts | 1215.109 | 3.865e-263 |
| 0 | Geography | 237.940 | 2.148e-52 |
| 5 | IsActiveMember | 182.644 | 1.283e-41 |
| 1 | Gender | 95.713 | 1.328e-22 |
| 2 | Tenure | 19.438 | 3.504e-02 |
| 4 | HasCrCard | 0.560 | 4.544e-01 |
'Tenure' and 'HasCrCard' have by far the smallest chi-square statistics; 'HasCrCard' also has a p-value well above 0.05 (the standard cut-off), while 'Tenure' is only marginally below it. This supports our initial hypothesis that these two features convey little useful information.
We can use the drop() method to remove these three features from the train set:
features_drop = ['Tenure', 'HasCrCard', 'EstimatedSalary']
train_df = train_df.drop(columns=features_drop)
ML algorithms usually require all input (and output) features to be numeric. Consequently, categorical features must be converted (encoded) to numbers before building models.
Our dataset contains two features that require encoding:
- 'Gender': encoded with LabelEncoder(), which maps each unique label to an integer (Male --> 1 and Female --> 0).
- 'Geography': I initially one-hot encoded (get_dummies()) this feature, and the two new features for France and Spain had small feature importance, so 'Geography' is instead mapped to a single binary feature indicating whether the customer is based in Germany.
train_df['Gender'] = LabelEncoder().fit_transform(train_df['Gender'])
train_df['Geography'] = train_df['Geography'].map({
'Germany': 1,
'Spain': 0,
'France': 0
})
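For reference, the one-hot encoding alternative mentioned above could be sketched as follows; it is illustrative only (the geo_dummies and train_df_ohe names are not used anywhere else) and would have to run before the map() call above, while 'Geography' still holds the country names.
# Illustration only - one-hot encode 'Geography' with get_dummies() instead of the binary mapping
geo_dummies = pd.get_dummies(train_df['Geography'], prefix='Geography')
train_df_ohe = pd.concat([train_df.drop(columns='Geography'), geo_dummies], axis=1)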
Feature scaling is a technique used to normalise the range of features. Some algorithms are sensitive to feature scaling (e.g. SVMs), while others are invariant (e.g. Random Forests).
I decided to use scikit-learn's StandardScaler(), which standardises features by subtracting the mean and dividing by the standard deviation. This transformation results in features with zero mean and unit variance.
scaler = StandardScaler()
scl_columns = ['CreditScore', 'Age', 'Balance']
train_df[scl_columns] = scaler.fit_transform(train_df[scl_columns])
Finally, we will split the train set into 'X_train' and 'y_train' sets:
y_train = train_df['Exited']
X_train = train_df.drop(columns='Exited')
As we have seen previously, there is an imbalance in the classes to be predicted, with one class (0 – retained) much more prevalent than the other (1 - churned):
y_train.value_counts()
0    6370
1    1630
Name: Exited, dtype: int64
Class imbalance is usually a problem and occurs in many real-world tasks. Classification using imbalanced data favours the majority class, meaning that ML algorithms will likely result in models that do little more than predict the most common class. Also, common metrics can be misleading when handling class-imbalanced data (e.g. if a dataset contains 99.9% 0s and 0.1% 1s, a classifier that always predicts 0 will have 99.9% accuracy).
Thankfully, some methods can address this problem. Resampling is a widely adopted technique designed to alter the class distribution in the training dataset. It consists of removing instances of the majority class (Undersampling) and/or adding copies of instances from the minority class (Oversampling). Additionally, there are algorithms for generating synthetic instances of the minority class. For example, as we read in Ref. [2], the SMOTE ('Synthetic Minority Oversampling Technique') algorithm 'finds a record that is similar to the record being upsampled and creates a synthetic record that is a randomly weighted average of the original record and the neighbouring record, where the weight is generated separately for each predictor'.
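For illustration, a minimal SMOTE sketch with the imbalanced-learn package could look like the following; it is not part of this notebook's pipeline and simply assumes the X_train and y_train objects defined above.
# Illustration only - oversample the minority class with SMOTE (not used in this notebook)
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=random_state)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
print(y_resampled.value_counts())  # both classes now contain the same number of instances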
In this project, we will leave our dataset untouched and instead give the same algorithms a different perspective on the problem. Penalised classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class (Ref. [9]). For most algorithms, penalised classification is implemented by setting an extra hyperparameter called class_weight to 'balanced'.
We need to perform the same pre-processing steps for the test set as we did for the training set.
test_df = test_df.drop(features_drop, axis=1)
test_df['Gender'] = LabelEncoder().fit_transform(test_df['Gender'])  # a fresh LabelEncoder is safe here: 'Gender' has the same two categories, so the mapping matches the train set
test_df['Geography'] = test_df['Geography'].map({
'Germany': 1,
'Spain': 0,
'France': 0
})
test_df[scl_columns] = scaler.transform(test_df[scl_columns]) # not fit_transform, scaler has already been trained
y_test = test_df['Exited']
X_test = test_df.drop(columns='Exited')
In this section, we will create two simple models to estimate the baseline performance on the training set. Specifically, we will use the Gaussian Naïve Bayes and Logistic Regression algorithms.
We will use their default parameters and evaluate their (mean) AUC score by performing k-fold cross-validation. The idea behind k-fold cross-validation is simple: the (training) set is split into k subsets/folds, the model is trained on k-1 folds and evaluated on the remaining fold, and the process is repeated until every fold has been used for testing once. We will define the number of folds as a separate variable, cv, equal to 5. Because this variable is an integer, stratified sampling will be used to create the folds.
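To make the mechanics explicit, the sketch below (an illustration, not part of the modelling pipeline) reproduces roughly what cross_val_score does under the hood, using StratifiedKFold and the Logistic Regression baseline:
# Illustration only - manual 5-fold stratified cross-validation
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
fold_aucs = []
for train_idx, val_idx in skf.split(X_train, y_train):
    clf = LogisticRegression(random_state=random_state, class_weight='balanced')
    clf.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])  # train on k-1 folds
    proba = clf.predict_proba(X_train.iloc[val_idx])[:, 1]  # evaluate on the held-out fold
    fold_aucs.append(roc_auc_score(y_train.iloc[val_idx], proba))
print('Mean AUC: {:.3f}'.format(np.mean(fold_aucs)))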
scoring_metric = 'roc_auc'
cv = 5 # If int, StratifiedKFold is used.
clf_list = [('GNB', GaussianNB()),
(' LR', LogisticRegression(random_state=random_state,
class_weight='balanced'))]
cv_base_mean, cv_base_std = [], []
for clf in clf_list:
cv_scores = cross_val_score(estimator=clf[1],
X=X_train,
y=y_train,
scoring=scoring_metric,
cv=cv,
n_jobs=-1)
cv_base_mean.append(cv_scores.mean())
cv_base_std.append(cv_scores.std())
print('Baseline Models - CV ({}) Score:'.format(scoring_metric))
for i in range(len(clf_list)):
print(' {}: {} ± {}'.format(clf_list[i][0],
np.round(cv_base_mean[i], 3),
np.round(cv_base_std[i], 3)))
Baseline Models - CV (roc_auc) Score: GNB: 0.797 ± 0.005 LR: 0.771 ± 0.009
We are now ready to start building ML models. The seven classifiers we will use are: Logistic Regression, Support Vector Classifier, Random Forest, Balanced Random Forest, Gradient Boosting, XGBoost, and LightGBM.
We will not go into detail about how these algorithms work. More information can be found in Ref. [1] or the corresponding documentation.
Using default hyperparameters usually results in non-optimised models that overfit or underfit the dataset. Hyperparameter tuning is the process of finding the set of hyperparameter values that achieves optimal performance. For this purpose, we will first define which hyperparameters we want to experiment with and what values to try out, and pass this information to scikit-learn's GridSearchCV, which evaluates all possible combinations of hyperparameter values. As mentioned earlier, roc_auc will be the scoring metric for optimising the models. Note that GridSearchCV evaluates performance using k-fold cross-validation, so the number of folds must be provided; we will set it to 5, as we did earlier for the baseline performance.
lr = LogisticRegression(random_state=random_state, class_weight='balanced')
param_grid = {
'penalty': ['l1', 'l2'],
'C': [0.001, 0.01, 0.1, 1, 10],
'solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga'],
'max_iter': [50, 100, 500]
}
lr_clf = GridSearchCV(estimator=lr,
param_grid=param_grid,
scoring=scoring_metric,
cv=cv,
verbose=False,
n_jobs=-1)
best_lr_clf = lr_clf.fit(X_train, y_train)
clf_performance(best_lr_clf, 'Logistic Regression', 'LR', cv)
Logistic Regression ------------------------------- Best Score (roc_auc): 0.772 Best Parameters: C: 0.01 max_iter: 50 penalty: l2 solver: lbfgs
svc = SVC(probability=True, random_state=random_state, class_weight='balanced')
param_grid = {
'kernel': ['linear', 'rbf', 'sigmoid'],
'C': [0.01, 0.1, 1, 10],
'gamma': ['scale', 'auto']
}
svc_clf = GridSearchCV(estimator=svc,
param_grid=param_grid,
scoring=scoring_metric,
cv=cv,
verbose=False,
n_jobs=-1)
best_svc_clf = svc_clf.fit(X_train, y_train)
clf_performance(best_svc_clf, 'Support Vector Classifier', 'SVC', cv)
Support Vector Classifier ------------------------------- Best Score (roc_auc): 0.861 Best Parameters: C: 1 gamma: auto kernel: rbf
There are two variations of the standard Random Forest classifier that are suited for imbalanced classification (Ref. [8]):
Random Forest With Class Weighting: This variation is implemented by setting the class_weight
parameter to balanced
, which changes each class's weight when calculating the impurity score of a chosen split point.
Random Forest With Bootstrap Class Weighting: This variation is implemented by setting the class_weight
parameter to balanced_subsample
. This variation is different from the first, as it changes the class weighting based on the class distribution in each bootstrap sample instead of the entire training dataset.
rf = RandomForestClassifier(random_state=random_state)
param_grid = {
'n_estimators': [100], # default=100
'criterion': ['entropy', 'gini'],
'max_depth': [5], # default=None
'min_samples_split': [5, 10],
'min_samples_leaf': [5, 10],
'max_features': ['auto', 'sqrt'],
'bootstrap': [True, False],
'class_weight': ['balanced', 'balanced_subsample']
}
rf_clf = GridSearchCV(estimator=rf,
param_grid=param_grid,
scoring=scoring_metric,
cv=cv,
verbose=False,
n_jobs=-1)
best_rf_clf = rf_clf.fit(X_train, y_train)
clf_performance(best_rf_clf, 'Random Forest', 'RF', cv)
Random Forest ------------------------------- Best Score (roc_auc): 0.853 Best Parameters: bootstrap: True class_weight: balanced criterion: gini max_depth: 5 max_features: auto min_samples_leaf: 5 min_samples_split: 5 n_estimators: 100
A third variation of the Random Forest classifier performs data resampling on the bootstrap sample to explicitly change the class distribution. The BalancedRandomForestClassifier algorithm is provided by the imbalanced-learn package.
brf = BalancedRandomForestClassifier(random_state=random_state)
param_grid = {
'n_estimators': [100],
'criterion': ['entropy', 'gini'],
'max_depth': [5],
'min_samples_split': [5, 10],
'min_samples_leaf': [5, 10],
'max_features': ['auto', 'sqrt'],
'bootstrap': [True, False],
'sampling_strategy': ['auto'],
'class_weight': ['balanced', 'balanced_subsample']
}
brf_clf = GridSearchCV(estimator=brf,
param_grid=param_grid,
scoring=scoring_metric,
cv=cv,
verbose=False,
n_jobs=-1)
best_brf_clf = brf_clf.fit(X_train, y_train)
clf_performance(best_brf_clf, 'Balanced Random Forest', 'BRF', cv)
Balanced Random Forest ------------------------------- Best Score (roc_auc): 0.853 Best Parameters: bootstrap: True class_weight: balanced_subsample criterion: gini max_depth: 5 max_features: auto min_samples_leaf: 5 min_samples_split: 5 n_estimators: 100 sampling_strategy: auto
gbc = GradientBoostingClassifier(random_state=random_state)
param_grid = {
'learning_rate': [0.001, 0.01],
'n_estimators': [600],
'subsample': [0.66, 0.75],
'min_samples_split': [5, 7],
'min_samples_leaf': [5, 7],
'max_depth': [4], # default=3
'max_features': ['auto', 'log2', None],
'n_iter_no_change': [20],
'validation_fraction': [0.2],
'tol': [0.01]
}
gbc_clf = GridSearchCV(estimator=gbc,
param_grid=param_grid,
scoring=scoring_metric,
cv=cv,
verbose=False,
n_jobs=-1)
best_gbc_clf = gbc_clf.fit(X_train, y_train)
clf_performance(best_gbc_clf, 'Gradient Boosting Classifier', 'GBC', cv)
Gradient Boosting Classifier ------------------------------- Best Score (roc_auc): 0.859 Best Parameters: learning_rate: 0.01 max_depth: 4 max_features: log2 min_samples_leaf: 7 min_samples_split: 5 n_estimators: 600 n_iter_no_change: 20 subsample: 0.66 tol: 0.01 validation_fraction: 0.2
The number of estimators after early stopping is:
best_gbc_clf.best_estimator_.n_estimators_
186
ratio = y_train.value_counts()[0] / y_train.value_counts()[1] # Ratio of number of negative class to the positive class - Needed for XGB classifier
xgb = XGBClassifier(random_state=random_state)
param_grid = {
'n_estimators': [50],
'learning_rate': [0.001, 0.01],
'max_depth': [6], # default=6
'reg_alpha': [10],
'reg_lambda': [10],
'subsample': [0.5, 0.75],
'colsample_bytree': [0.50, 0.75],
'gamma': [0.1, 1, 10],
'min_child_weight': [1, 2],
'scale_pos_weight': [ratio, 5]
}
xgb_clf = GridSearchCV(estimator=xgb,
param_grid=param_grid,
scoring=scoring_metric,
cv=cv,
verbose=False,
n_jobs=-1)
best_xgb_clf = xgb_clf.fit(X_train, y_train)
clf_performance(best_xgb_clf, 'XGBoost Classifier', 'XGB', cv)
[17:11:50] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. XGBoost Classifier ------------------------------- Best Score (roc_auc): 0.857 Best Parameters: colsample_bytree: 0.75 gamma: 0.1 learning_rate: 0.01 max_depth: 6 min_child_weight: 1 n_estimators: 50 reg_alpha: 10 reg_lambda: 10 scale_pos_weight: 5 subsample: 0.75
lgbm = LGBMClassifier(random_state=random_state)
param_grid = {
'max_depth': [10],
'num_leaves': [20],
'learning_rate': [0.01],
'n_estimators': [100],
'feature_fraction': [0.75],
'subsample': [0.75],
'min_child_samples': [5, 10],
'reg_alpha': [0.1, 0.5],
'reg_lambda': [0.1, 0.5],
'is_unbalance': [True],
'class_weight': ['balanced', {0: 0.35, 1: 0.65}]
}
lgbm_clf = GridSearchCV(estimator=lgbm,
param_grid=param_grid,
scoring=scoring_metric,
cv=cv,
verbose=False,
n_jobs=-1)
best_lgbm_clf = lgbm_clf.fit(X_train, y_train)
clf_performance(best_lgbm_clf, 'LGBMClassifier', 'LGBM', cv)
[LightGBM] [Warning] feature_fraction is set=0.75, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.75 LGBMClassifier ------------------------------- Best Score (roc_auc): 0.866 Best Parameters: class_weight: {0: 0.35, 1: 0.65} feature_fraction: 0.75 is_unbalance: True learning_rate: 0.01 max_depth: 10 min_child_samples: 5 n_estimators: 100 num_leaves: 20 reg_alpha: 0.1 reg_lambda: 0.1 subsample: 0.75
We can combine the predictions of all these classifiers to determine if we get better predictive performance compared to each individual constituent classifier. This practice is the main motivation behind Ensemble Learning.
Specifically, we will use Soft Voting. In this case, every individual classifier provides a probability value that a specific data point belongs to a particular target class. The predictions are weighted by the classifier's importance and summed up. Then the target label with the greatest sum of weighted probabilities wins the vote.
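As a toy illustration of these mechanics (the probabilities below are made up), soft voting with equal weights simply averages the classifiers' predicted probabilities and picks the class with the highest mean:
# Toy example only - soft voting averages the per-class probabilities
probas = np.array([[0.30, 0.70],  # classifier 1: [P(retained), P(churned)]
                   [0.55, 0.45],  # classifier 2
                   [0.40, 0.60]])  # classifier 3
mean_proba = probas.mean(axis=0)  # equal weights -> simple average
print(mean_proba)  # [0.417 0.583]
print('Vote:', mean_proba.argmax())  # 1 -> 'Churned'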
estimators = [('LR', best_lr_clf.best_estimator_),
# ('SVC', best_svc_clf.best_estimator_),
('RF', best_rf_clf.best_estimator_),
('BRF', best_brf_clf.best_estimator_),
# ('GBC', best_gbc_clf.best_estimator_),
('XGB', best_xgb_clf.best_estimator_),
('LGBM', best_lgbm_clf.best_estimator_)]
tuned_voting_soft = VotingClassifier(estimators=estimators[1:],
voting='soft',
n_jobs=-1)
y_pred_pp = cross_val_predict(tuned_voting_soft,
X_train,
y_train,
cv=cv,
method='predict_proba')[:, 1]
y_pred = y_pred_pp.round()
cm = confusion_matrix(y_train, y_pred, normalize='true')
fpr, tpr, _ = roc_curve(y_train, y_pred_pp)
comparison_dict['SVot'] = [
precision_score(y_train, y_pred),
recall_score(y_train, y_pred),
roc_auc_score(y_train, y_pred_pp), fpr, tpr
]
print('Soft Voting\n-------------------------------')
print(' Best Score ({}): '.format(scoring_metric),np.round(roc_auc_score(y_train, y_pred_pp), 3))
fig, ax = plt.subplots(figsize=(6, 5))
plot_conf_mx(cm, ax)
Soft Voting ------------------------------- Best Score (roc_auc): 0.859
colors_fi = ['steelblue', 'darkgray', 'cadetblue', 'bisque']
fig = plt.subplots(2, 2, figsize=(12, 10))
for i, (name, clf) in enumerate(zip(['RF', 'GBC', 'XGB', 'LGBM'],
[best_rf_clf, best_gbc_clf, best_xgb_clf, best_lgbm_clf])):
fi = pd.DataFrame({
'Feature': X_train.columns,
'Importance': clf.best_estimator_.feature_importances_
}).sort_values('Importance', ascending=True).set_index('Feature')
ax = plt.subplot(2, 2, i + 1)
fi.plot.barh(color=colors_fi[i],
edgecolor=edgecolor,
legend=False,
ax=ax)
ax.set_title(name)
ax.set(xlabel='Importance', ylabel='')
plt.tight_layout();
The first three classifiers ('RF', 'GBC', and 'XGB') display similar behaviour: 'Age' and 'NumOfProducts' seem like the most useful features, followed by 'IsActiveMember'. On the other hand, 'CreditScore' is the least important feature with a small value close to zero.
In contrast, 'Balance' is the most important feature for the LGBM model, followed by 'Age'. For the same classifier, 'CreditScore' is placed in the middle of the leaderboard.
Initially, we can compare the performance of our classifiers in terms of three individual metrics (Precision, recall, and area under the ROC curve):
comparison_matrix = {}
for key, value in comparison_dict.items():
comparison_matrix[str(key)] = value[0:3]
comparison_df = pd.DataFrame(comparison_matrix,
index=['Precision', 'Recall', 'AUC']).T
comparison_df.style.highlight_max(color='indianred', axis=0).format('{:.3f}')
| | Precision | Recall | AUC |
|---|---|---|---|
| LR | 0.397 | 0.694 | 0.772 |
| SVC | 0.698 | 0.506 | 0.860 |
| RF | 0.503 | 0.723 | 0.853 |
| BRF | 0.498 | 0.743 | 0.852 |
| GBC | 0.828 | 0.372 | 0.859 |
| XGB | 0.457 | 0.796 | 0.856 |
| LGBM | 0.468 | 0.792 | 0.865 |
| SVot | 0.496 | 0.758 | 0.859 |
fig, ax = plt.subplots(figsize=(12, 5))
comparison_df.plot(kind='bar',
fontsize=12,
color=['#5081DE', '#424656', '#D85870'],
edgecolor=edgecolor,
ax=ax)
ax.legend(loc='upper center',
fontsize=font_size - 6,
ncol=len(comparison_df.columns),
bbox_to_anchor=(0.5, 1.12))
ax.set_ylim([0, 1])
ax.xaxis.set_tick_params(which='major', rotation=0)
ax.set_yticks([0, 0.5, 1])
ax.axhline(y=0.80, color='red', linestyle='--')
ax.text(x=-0.5, y=0.83, s='0.80', size=font_size + 2, color='red')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False);
Apart from Logistic Regression, all classifiers have an AUC score higher than 0.80, i.e. above the baseline performance. The LGBM model has the highest AUC score (0.865), but most classifiers perform similarly, with AUC scores of around 0.86. The main differences lie in the precision/recall trade-off: for example, GBC has the highest precision but by far the lowest recall.
Using single metrics is not the only way of comparing the predictive performance of classification models. The ROC curve (Receiver Operating Characteristic curve) is a graph showing the performance of a classifier at different classification thresholds. It plots the true positive rate (another name for recall) against the false positive rate.
fig, ax = plt.subplots(figsize=(10, 5))
for index, key in enumerate(comparison_dict.keys()):
if key == 'SVot': continue
auc = comparison_dict[key][2]
fpr, tpr = comparison_dict[key][3], comparison_dict[key][4]
ax.plot(fpr,
tpr,
color=colors_comp[index],
label='{}: {}'.format(key, np.round(auc, 3)))
ax.plot([0, 1], [0, 1], 'k--', label='Baseline')
ax.set(title='ROC Curve',
xlabel='False Positive Rate',
ylabel='True Positive Rate')
ax.set_xticks([0, 0.25, 0.5, 0.75, 1])
ax.set_yticks([0, 0.25, 0.5, 0.75, 1])
ax.autoscale(axis='both', tight=True)
ax.legend(fontsize=font_size - 6);
The dashed diagonal line represents a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner). In our case, all classifiers, apart from Logistic Regression, perform similarly.
Recently, I came across another tool for assessing the performance of a classification model. Simply put, a cumulative gains chart shows the percentage of targets reached when considering a certain percentage of the population with the highest probability to be targeted according to the model (see Ref. [6]). The scikitplot library offers an easy way of plotting this chart:
y_pred = cross_val_predict(best_lgbm_clf,
X_train,
y_train,
cv=cv,
method='predict_proba')
fig, ax = plt.subplots(figsize=(10, 6))
skplt.metrics.plot_cumulative_gain(y_train, y_pred, ax=ax)
ax.plot([0.5, 0.5], [0, 0.89], color='firebrick', lw=3)
ax.plot([0.0, 0.5], [0.88, 0.89], color='firebrick', lw=3)
ax.set_title('LGBM - Cumulative Gains Curve', size=font_size)
ax.set_xlabel('Percentage of Sample', size=font_size)
ax.set_xticks([0, 0.2, 0.4, 0.6, 0.8, 1.0])
ax.set_xticklabels([0, 0.2, 0.4, 0.6, 0.8, 1.0], fontsize=font_size - 2)
ax.set_ylabel('Gain', size=font_size)
ax.set_yticks([0, 0.2, 0.4, 0.6, 0.8, 1.0])
ax.set_yticklabels([0, 0.2, 0.4, 0.6, 0.8, 1.0], fontsize=font_size - 2)
ax.text(0.15, 0.90, '89%', size=font_size, color='firebrick')
ax.legend(fontsize=font_size - 4);
[LightGBM] [Warning] feature_fraction is set=0.75, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.75
This chart shows that if we target the 50% of customers the model considers most likely to churn, we will reach 89% of the customers who will actually churn, whereas a random selection would reach only 50% of them.
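As a sanity check, the short sketch below (illustrative only, reusing y_train and the y_pred probabilities from the cell above) recomputes that reading of the curve by hand and should land close to the ~89% figure:
# Illustration only - recompute the cumulative gain at 50% of the sample
gains = pd.DataFrame({'churned': y_train.values, 'proba': y_pred[:, 1]})
gains = gains.sort_values('proba', ascending=False)  # rank customers by predicted churn probability
top_half = gains.head(len(gains) // 2)  # target the top 50% of the ranking
captured = top_half['churned'].sum() / gains['churned'].sum()  # share of all churners captured
print('Churners captured in the top 50%: {:.0%}'.format(captured))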
An integral part of any ML project is testing our models on unseen data. For this purpose, we will use the AUC score and compare the training and test set performance for all tuned models.
train_auc_dict = {}
for key, value in comparison_dict.items():
train_auc_dict[str(key)] = value[2]
test_auc_dict = {}
for i, (name, clf) in enumerate(
zip(['LR', 'SVC', 'RF', 'BRF', 'GBC', 'XGB', 'LGBM', 'SVot'], [
best_lr_clf.best_estimator_, best_svc_clf.best_estimator_,
best_rf_clf.best_estimator_, best_brf_clf.best_estimator_,
best_gbc_clf.best_estimator_, best_xgb_clf.best_estimator_,
best_lgbm_clf.best_estimator_,
tuned_voting_soft.fit(X_train, y_train)
])):
y_pred = clf.predict_proba(X_test)[:, 1]
test_auc_dict[name] = roc_auc_score(y_test, y_pred)
auc_comparison_df = pd.merge(pd.Series(train_auc_dict, name='AUC-Train'),
pd.Series(test_auc_dict, name='AUC-Test'),
left_index=True,
right_index=True)
auc_comparison_df['Diff. %'] = 100 * (
auc_comparison_df['AUC-Train'] - auc_comparison_df['AUC-Test']) / auc_comparison_df['AUC-Train']
auc_comparison_df
| | AUC-Train | AUC-Test | Diff. % |
|---|---|---|---|
| LR | 0.772 | 0.767 | 0.657 |
| SVC | 0.860 | 0.850 | 1.194 |
| RF | 0.853 | 0.843 | 1.191 |
| BRF | 0.852 | 0.843 | 1.030 |
| GBC | 0.859 | 0.847 | 1.331 |
| XGB | 0.856 | 0.855 | 0.116 |
| LGBM | 0.865 | 0.858 | 0.829 |
| SVot | 0.859 | 0.852 | 0.858 |
The test-set performance of every model is very close to its training-set performance, which indicates that we are not overfitting the training set. Therefore, we can predict customer churn with an AUC of approximately 0.86.
We assume that the bank intends to invest £1,000 in services and offers for each customer likely to churn. If the customer remains, the bank expects a return approximately five times the initial investment, hence £5,000. We can write a function that calculates the total profit for our predictions (Ref. [7]).
def calculate_profit(y_true, y_pred, **kwargs):
tp = np.where((y_pred == 1) & (y_true == 1), (5000 - 1000), 0)
fp = np.where((y_pred == 1) & (y_true == 0), -1000, 0)
return np.sum([tp, fp])
profit = make_scorer(calculate_profit, greater_is_better=True)
estimators = [('GNB', GaussianNB().fit(X_train, y_train)),
('LR', best_lr_clf.best_estimator_),
('SVC', best_svc_clf.best_estimator_),
('RF', best_rf_clf.best_estimator_),
('BRF', best_brf_clf.best_estimator_),
('GBC', best_gbc_clf.best_estimator_),
('XGB', best_xgb_clf.best_estimator_),
('LGBM', best_lgbm_clf.best_estimator_),
('Ensemble', tuned_voting_soft.fit(X_train, y_train))]
profit_test = []
for estimator in estimators:
y_pred = estimator[1].predict(X_test)
profit_test.append(calculate_profit(y_test, y_pred))
profit_df = pd.DataFrame(data={'Profit-Test': profit_test},
index=[est[0] for est in estimators])
profit_df.style.highlight_max(color='indianred', axis=0)
| | Profit-Test |
|---|---|
| GNB | 456000 |
| LR | 677000 |
| SVC | 882000 |
| RF | 830000 |
| BRF | 825000 |
| GBC | 530000 |
| XGB | 890000 |
| LGBM | 900000 |
| Ensemble | 847000 |
The LGBM classifier, the classifier with the highest AUC score, yields the highest profit, around £900,000! This amount is almost double the profit we would get with the baseline Gaussian Naive Bayes classifier.
The main resources I used are the following two books:
[1] Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron (2019)
[2] Practical Statistics for Data Scientists, 2nd Edition, by Peter Bruce, Andrew Bruce, and Peter Gedeck (2020)
The following resources also helped me in my analysis:
[3] Bank Customer Churn, by Zicheng Shi (same dataset but analysis in R) (Accessed: 21 December 2020)
[4] Metrics and scoring: quantifying the quality of predictions on scikit-learn.org (Accessed: 20 December 2020)
[5] Easy Guide To Data Preprocessing In Python, by Ahmad Anis (Accessed: 20 December 2020)
[6] Meaningful Metrics: Cumulative Gains and Lyft Charts, by Raffi Sahakyan (Accessed: 20 December 2020)
[7] Predict Customer Churn (the right way) using PyCaret, by Moez Ali (Accessed: 08 December 2021)
[8] Bagging and Random Forest for Imbalanced Classification by Jason Brownlee (Accessed: 06 December 2021)
[9] 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset by Jason Brownlee (Accessed: 12 December 2021)
If you would like to read more about customer churn:
[10] Churn Rate, by Jake Frankenfield on Investopedia (Accessed: 15 December 2020)
[11] Customer attrition on Wikipedia (Accessed: 15 December 2020)
[12] A Survey on Churn Prediction Techniques in Communication Sector, by N. Kamalraj and A. Malathi (Accessed: 15 December 2020)
[13] Customer Attrition on Optimove (Accessed: 15 December 2020)
Our notebook has come to an end! Our final report to the bank should be built around two main points: the churn drivers identified during EDA (older customers, customers based in Germany, inactive members, and customers holding three or four products are all more likely to churn), and the tuned LGBM model, which predicts churn with an AUC of about 0.86 and yields the highest expected profit of the models tested.