#!/usr/bin/env python
# coding: utf-8

# # Telco Customer Churn Prediction 😁😊🙁😠😡
# ## Table of Contents
#
# - [1. Problem Statement](#Problem_Statement)
#     - [1.1. Introduction](#Introduction)
#     - [1.2. Objectives](#Objectives)
#     - [1.3. Dataset Features](#Dataset_Features)
# - [2. Import Libraries and Data](#Import_Libraries_and_Data)
# - [3. Handling Missing Values](#Handling_Missing_Values)
# - [4. Data Analysis and Visualization](#Data_Analysis_and_Visualization)
# - [5. Outlier Detection](#Outlier_Detection)
# - [6. Check for Rare Categories](#Check_for_Rare_Categories)
# - [7. Categorical Variables Encoding](#Categorical_Variables_Encoding)
# - [8. Balance Data](#Balance_Data)
# - [9. Dataset Splitting](#Dataset_Splitting)
# - [10. Feature Scaling](#Feature_Scaling)
# - [11. Modeling and Parameter Optimization](#Modeling_and_Parameter_Optimization)
# - [12. Feature Importance](#Feature_Importance)
# - [13. Results](#Results)

# # 1. Problem Statement
#
# Back to Table of Contents

# ## 1.1. Introduction
#
# **What is Customer Churn?**
#
# Customer churn is the percentage of customers that stopped using a company's product or service during a certain time frame. It is one of the most important metrics for a growing business to evaluate, since retaining existing customers is much less expensive than acquiring new ones. Customers in the telecom industry can choose from a variety of service providers and actively switch from one to the next. In this highly competitive market, the telecommunications business has an annual churn rate of 15-25 percent.
#
# Customer churn is extremely costly for companies. Based on a churn rate of just under two percent for top companies, one source estimates that carriers lose $65 million per month from churn. To reduce customer churn, telecom companies need to predict which customers are highly prone to churn.
#
# Individualized customer retention is demanding because most companies have a large number of customers and cannot afford to devote much time to each of them; the costs would outweigh the additional revenue. However, if a company could forecast which customers are likely to leave ahead of time, it could concentrate its retention efforts on these "high risk" clients.

# ## 1.2. Objectives
#
# This project answers the following questions:
#
# * What percentage of customers churn, and what percentage keep their active services?
# * Are there any patterns in customer churn based on gender?
# * Are there any patterns/preferences in customer churn based on the type of service provided?
# * What are the most profitable service types?
# * Which features and services are most profitable?
# * Which features have the most impact on predicting customer churn?
# * Which model is the best for predicting churn?

# ## 1.3. Dataset Features
#
# * `Customer ID`: A unique ID that identifies each customer.
#
# Demographic info about customers:
#
# * `gender`: Whether the customer is a male or a female
# * `SeniorCitizen`: Whether the customer is a senior citizen or not (1, 0)
# * `Partner`: Whether the customer has a partner or not (Yes, No)
# * `Dependents`: Whether the customer has dependents or not (Yes, No)
#
# Services that each customer has signed up for:
#
# * `PhoneService`: Whether the customer has a phone service or not (Yes, No)
# * `MultipleLines`: Whether the customer has multiple lines or not (Yes, No, No phone service)
# * `InternetService`: Customer's internet service provider (DSL, Fiber optic, No)
# * `OnlineSecurity`: Whether the customer has online security or not (Yes, No, No internet service)
# * `OnlineBackup`: Whether the customer has online backup or not (Yes, No, No internet service)
# * `DeviceProtection`: Whether the customer has device protection or not (Yes, No, No internet service)
# * `TechSupport`: Whether the customer has tech support or not (Yes, No, No internet service)
# * `StreamingTV`: Whether the customer has streaming TV or not (Yes, No, No internet service)
# * `StreamingMovies`: Whether the customer has streaming movies or not (Yes, No, No internet service)
#
# Customer account information:
#
# * `tenure`: Number of months the customer has stayed with the company
# * `Contract`: The contract term of the customer (Month-to-month, One year, Two year)
# * `PaperlessBilling`: Whether the customer has paperless billing or not (Yes, No)
# * `PaymentMethod`: The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
# * `MonthlyCharges`: The amount charged to the customer monthly
# * `TotalCharges`: The total amount charged to the customer
# * **`Churn`**: Target; whether the customer has left within the last month or not (Yes or No)

# # 2. Import Libraries and Data
#
# Back to Table of Contents

# In[46]:

get_ipython().system('pip install mlens')

# In[47]:

# handle table-like data and matrices
import pandas as pd
import numpy as np

# visualisation
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode(connected=True)

# preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score

# balance data
from imblearn.over_sampling import BorderlineSMOTE

# models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from mlens.ensemble import SuperLearner
from sklearn.neural_network import MLPClassifier

# evaluations
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_auc_score, roc_curve, auc
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# display all columns of the dataset
pd.set_option('display.max_columns', None)

# In[48]:

data = pd.read_csv('Telco Customer Churn.csv')

# # 3. Handling Missing Values
#
# Back to Table of Contents

# Let's check whether the dataset contains missing values. Blank strings are first converted to NaN so that pandas can detect them.

# In[49]:

data = data.replace(r'^\s*$', np.nan, regex=True)

# In[50]:

data.isnull().sum()

# In[51]:

msno.matrix(data);

# If we examine the data carefully, we can actually estimate the value of the missing `TotalCharges` entries:
#
# contract length in months * tenure (if not 0) * monthly charges
#
# This is more accurate than filling the missing values with the mean or median.

# In[52]:

data[data['TotalCharges'].isnull()].index.tolist()

# In[53]:

ind = data[data['TotalCharges'].isnull()].index.tolist()
# Months per contract term; max(tenure, 1) guards against customers with tenure 0.
contract_months = {'Two year': 24, 'One year': 12, 'Month-to-month': 1}
for i in ind:
    months = contract_months[data.loc[i, 'Contract']]
    data.loc[i, 'TotalCharges'] = max(int(data.loc[i, 'tenure']), 1) * data.loc[i, 'MonthlyCharges'] * months

# In[54]:

data.isnull().sum()

# Let's check whether there are duplicate rows.

# In[55]:

data.duplicated().sum()

# # 4. Data Analysis and Visualization
#
# Back to Table of Contents

# In[56]:

data.head(3)

# In[57]:

data.shape

# There are 7043 customers and 21 features in the dataset.

# In[58]:

for i in data.columns[6:-3]:
    print(f'Number of categories in the variable {i}: {len(data[i].unique())}')

# In[59]:

data.info()

# In[60]:

data.describe()

# In[61]:

data.describe(include=object).T

# In[62]:

fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])
# Use value_counts().index for the labels so they stay aligned with the counts.
fig.add_trace(go.Pie(labels=data['gender'].value_counts().index, values=data['gender'].value_counts(),
                     name='Gender', marker_colors=['gold', 'mediumturquoise']), 1, 1)
fig.add_trace(go.Pie(labels=data['Churn'].value_counts().index, values=data['Churn'].value_counts(),
                     name='Churn', marker_colors=['darkorange', 'lightgreen']), 1, 2)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
    title_text='Gender and Churn Distributions',
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Gender', x=0.19, y=0.5, font_size=20, showarrow=False),
                 dict(text='Churn', x=0.8, y=0.5, font_size=20, showarrow=False)])
iplot(fig)

# * We have imbalanced data.
#
# * $26.6\%$ of customers switched to another company.
#
# * Customers are $49.5\%$ female and $50.5\%$ male.

# In[63]:

fig = px.sunburst(data, path=['Churn', 'gender'], title='Sunburst Plot of Gender and Churn')
iplot(fig)

# In[64]:

print(f'A female customer has a probability of {round(data[(data["gender"] == "Female") & (data["Churn"] == "Yes")].count()[0] / data[(data["gender"] == "Female")].count()[0] * 100, 2)}% churn')
print(f'A male customer has a probability of {round(data[(data["gender"] == "Male") & (data["Churn"] == "Yes")].count()[0] / data[(data["gender"] == "Male")].count()[0] * 100, 2)}% churn')

# * There is a negligible difference in the percentage of customers who changed service provider. Both genders behaved similarly when it comes to migrating to another service provider.

# In[65]:

fig = px.histogram(data, x='Churn', color='Contract', barmode='group',
                   title='Customer Contract Distribution w.r.t. Churn',
                   color_discrete_sequence=['#EC7063', '#E9F00B', '#0BF0D1'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# In[66]:

print(f'A customer with a month-to-month contract has a probability of {round(data[(data["Contract"] == "Month-to-month") & (data["Churn"] == "Yes")].count()[0] / data[(data["Contract"] == "Month-to-month")].count()[0] * 100, 2)}% churn')
print(f'A customer with a one year contract has a probability of {round(data[(data["Contract"] == "One year") & (data["Churn"] == "Yes")].count()[0] / data[(data["Contract"] == "One year")].count()[0] * 100, 2)}% churn')
print(f'A customer with a two year contract has a probability of {round(data[(data["Contract"] == "Two year") & (data["Churn"] == "Yes")].count()[0] / data[(data["Contract"] == "Two year")].count()[0] * 100, 2)}% churn')

# * About $43\%$ of customers with a Month-to-Month contract opted to move out, compared to $11\%$ of customers with a One Year contract and $3\%$ with a Two Year contract. A major share of the people who left the company had a Month-to-Month contract. This is actually logical, since people who have a long-term contract are more loyal to the company.
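
# The probability printouts in this section all repeat the same filter-and-divide pattern. As a sketch, a small helper can compute the same percentages; the name `churn_rate` is ours and does not appear in the original notebook.

# In[ ]:

def churn_rate(df, column, value):
    """Percentage of churned customers among rows where df[column] == value."""
    subset = df[df[column] == value]
    return round((subset['Churn'] == 'Yes').mean() * 100, 2)

# Example: churn_rate(data, 'Contract', 'Month-to-month') reproduces the month-to-month figure above.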
# In[67]:

fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['PaymentMethod'].value_counts().index, values=data['PaymentMethod'].value_counts(),
                     name='Payment Method', marker_colors=['gold', 'mediumturquoise', 'darkorange', 'lightgreen']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
    title_text='Payment Method Distributions',
    annotations=[dict(text='Payment Method', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)

# In[68]:

fig = px.histogram(data, x='Churn', color='PaymentMethod', barmode='group',
                   title='Payment Method Distribution w.r.t. Churn',
                   color_discrete_sequence=['#EC7063', '#0BF0D1', '#E9F00B', '#5DADE2'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# In[69]:

print(f'A customer that uses Electronic check for paying has a probability of {round(data[(data["PaymentMethod"] == "Electronic check") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaymentMethod"] == "Electronic check")].count()[0] * 100, 2)}% churn')
print(f'A customer that uses Mailed check for paying has a probability of {round(data[(data["PaymentMethod"] == "Mailed check") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaymentMethod"] == "Mailed check")].count()[0] * 100, 2)}% churn')
print(f'A customer that uses Bank transfer (automatic) for paying has a probability of {round(data[(data["PaymentMethod"] == "Bank transfer (automatic)") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaymentMethod"] == "Bank transfer (automatic)")].count()[0] * 100, 2)}% churn')
print(f'A customer that uses Credit card (automatic) for paying has a probability of {round(data[(data["PaymentMethod"] == "Credit card (automatic)") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaymentMethod"] == "Credit card (automatic)")].count()[0] * 100, 2)}% churn')

# * Most customers who moved out paid by Electronic Check.
#
# * Customers who chose Credit Card (automatic), Bank Transfer (automatic), or Mailed Check as their payment method were less likely to move out.

# In[70]:

data[data['gender'] == 'Male'][['InternetService', 'Churn']].value_counts()

# In[71]:

data[data['gender'] == 'Female'][['InternetService', 'Churn']].value_counts()

# In[72]:

fig = go.Figure()
fig.add_trace(go.Bar(
    x=[['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'], ['Female', 'Male', 'Female', 'Male']],
    y=[965, 992, 219, 240],
    name='DSL',
))
fig.add_trace(go.Bar(
    x=[['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'], ['Female', 'Male', 'Female', 'Male']],
    y=[889, 910, 664, 633],
    name='Fiber optic',
))
fig.add_trace(go.Bar(
    x=[['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'], ['Female', 'Male', 'Female', 'Male']],
    y=[690, 717, 56, 57],
    name='No Internet',
))
fig.update_layout(title_text='Churn Distribution w.r.t. Internet Service and Gender')
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# * Many customers choose the Fiber optic service, and it is also evident that Fiber optic customers have a high churn rate; this might suggest dissatisfaction with this type of internet service.
#
# * Customers with the DSL service churn at a much lower rate than Fiber optic customers.
# In[73]:

fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['Dependents'].value_counts().index, values=data['Dependents'].value_counts(),
                     name='Dependents', marker_colors=['#E5527A', '#AAB7B8']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
    title_text='Dependents Distribution',
    annotations=[dict(text='Dependents', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)

# In[74]:

fig = px.histogram(data, x='Dependents', color='Churn', barmode='group', title='Dependents Distribution w.r.t. Churn',
                   color_discrete_sequence=['#00CC96', '#FFA15A'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# In[75]:

print(f'A customer with dependents has a probability of {round(data[(data["Dependents"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["Dependents"] == "Yes")].count()[0] * 100, 2)}% churn')
print(f'A customer without dependents has a probability of {round(data[(data["Dependents"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["Dependents"] == "No")].count()[0] * 100, 2)}% churn')

# * Customers without dependents are more likely to churn.

# In[76]:

fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['Partner'].value_counts().index, values=data['Partner'].value_counts(),
                     name='Partner', marker_colors=['gold', 'purple']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
    title_text='Partner Distribution',
    annotations=[dict(text='Partner', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)

# In[77]:

fig = px.histogram(data, x='Churn', color='Partner', barmode='group', title='Partner Distribution w.r.t. Churn',
                   color_discrete_sequence=['#C82735', '#BCC827'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# In[78]:

print(f'A customer with a partner has a probability of {round(data[(data["Partner"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["Partner"] == "Yes")].count()[0] * 100, 2)}% churn')
print(f'A customer without a partner has a probability of {round(data[(data["Partner"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["Partner"] == "No")].count()[0] * 100, 2)}% churn')

# * Customers who don't have a partner are more likely to churn.

# In[79]:

fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=['No', 'Yes'], values=data['SeniorCitizen'].value_counts(),
                     name='Senior Citizen', marker_colors=['#56E11A', '#1A87E1']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
    title_text='Senior Citizen Distribution',
    annotations=[dict(text='Senior Citizen', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)

# In[80]:

fig = px.histogram(data, x='Churn', color='SeniorCitizen', barmode='group',
                   title='Senior Citizen Distribution w.r.t. Churn',
                   color_discrete_sequence=['#E11AC6', '#BAE11A'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# In[81]:

print(f'A customer that is a senior citizen has a probability of {round(data[(data["SeniorCitizen"] == 1) & (data["Churn"] == "Yes")].count()[0] / data[(data["SeniorCitizen"] == 1)].count()[0] * 100, 2)}% churn')
print(f'A customer that is not a senior citizen has a probability of {round(data[(data["SeniorCitizen"] == 0) & (data["Churn"] == "Yes")].count()[0] / data[(data["SeniorCitizen"] == 0)].count()[0] * 100, 2)}% churn')

# * The fraction of senior citizens is very small.
#
# * About $42\%$ of the senior citizens churn.

# In[82]:

fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['OnlineSecurity'].value_counts().index, values=data['OnlineSecurity'].value_counts(),
                     name='OnlineSecurity', marker_colors=['#1AE178', '#2CECE6', 'red']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
    title_text='Online Security Distribution',
    annotations=[dict(text='Online Security', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)

# In[83]:

fig = px.histogram(data, x='Churn', color='OnlineSecurity', barmode='group',
                   title='Online Security Distribution w.r.t. Churn',
                   color_discrete_sequence=['#EB984E', 'yellow', '#5499C7'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# In[84]:

print(f'A customer with online security has a probability of {round(data[(data["OnlineSecurity"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["OnlineSecurity"] == "Yes")].count()[0] * 100, 2)}% churn')
print(f'A customer without online security has a probability of {round(data[(data["OnlineSecurity"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["OnlineSecurity"] == "No")].count()[0] * 100, 2)}% churn')
print(f'A customer with no internet service has a probability of {round(data[(data["OnlineSecurity"] == "No internet service") & (data["Churn"] == "Yes")].count()[0] / data[(data["OnlineSecurity"] == "No internet service")].count()[0] * 100, 2)}% churn')

# * Most customers churn in the absence of online security.

# In[85]:

fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['PaperlessBilling'].value_counts().index, values=data['PaperlessBilling'].value_counts(),
                     name='PaperlessBilling', marker_colors=['LightCoral', '#CCCCFF']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
    title_text='Paperless Billing Distribution',
    annotations=[dict(text='Paperless Billing', x=0.5, y=0.5, font_size=14, showarrow=False)])
iplot(fig)

# In[86]:

fig = px.histogram(data, x='Churn', color='PaperlessBilling', barmode='group',
                   title='Paperless Billing Distribution w.r.t. Churn',
                   color_discrete_sequence=['#9FE2BF', '#FF7F50'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# In[87]:

print(f'A customer with paperless billing has a probability of {round(data[(data["PaperlessBilling"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaperlessBilling"] == "Yes")].count()[0] * 100, 2)}% churn')
print(f'A customer without paperless billing has a probability of {round(data[(data["PaperlessBilling"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["PaperlessBilling"] == "No")].count()[0] * 100, 2)}% churn')

# * Customers with paperless billing are more likely to churn.

# In[88]:

fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['TechSupport'].value_counts().index, values=data['TechSupport'].value_counts(),
                     name='TechSupport', marker_colors=['#DE3163', '#DFFF00', '#40E0D0']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
    title_text='TechSupport Distribution',
    annotations=[dict(text='Tech Support', x=0.5, y=0.5, font_size=18, showarrow=False)])
iplot(fig)

# In[89]:

fig = px.histogram(data, x='Churn', color='TechSupport', barmode='group', title='Tech Support Distribution w.r.t. Churn',
                   color_discrete_sequence=['#FFBF00', 'IndianRed', 'red'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# In[90]:

print(f'A customer with tech support has a probability of {round(data[(data["TechSupport"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["TechSupport"] == "Yes")].count()[0] * 100, 2)}% churn')
print(f'A customer without tech support has a probability of {round(data[(data["TechSupport"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["TechSupport"] == "No")].count()[0] * 100, 2)}% churn')
print(f'A customer with no internet service has a probability of {round(data[(data["TechSupport"] == "No internet service") & (data["Churn"] == "Yes")].count()[0] / data[(data["TechSupport"] == "No internet service")].count()[0] * 100, 2)}% churn')

# * Customers with no tech support are most likely to migrate to another service provider.

# In[91]:

fig = make_subplots(rows=1, cols=1, specs=[[{'type':'domain'}]])
fig.add_trace(go.Pie(labels=data['PhoneService'].value_counts().index, values=data['PhoneService'].value_counts(),
                     name='PhoneService', marker_colors=['LightSalmon', '#7FB3D5']), 1, 1)
fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))
fig.update_layout(
    title_text='Phone Service Distribution',
    annotations=[dict(text='Phone Service', x=0.5, y=0.5, font_size=20, showarrow=False)])
iplot(fig)

# In[92]:

fig = px.histogram(data, x='Churn', color='PhoneService', barmode='group',
                   title='Phone Service Distribution w.r.t. Churn',
                   color_discrete_sequence=['#FFBF00', 'IndianRed'], text_auto=True)
fig.update_layout(width=1100, height=500, bargap=0.3)
fig.update_traces(marker_line_width=2, marker_line_color='black')
iplot(fig)

# In[93]:

print(f'A customer with phone service has a probability of {round(data[(data["PhoneService"] == "Yes") & (data["Churn"] == "Yes")].count()[0] / data[(data["PhoneService"] == "Yes")].count()[0] * 100, 2)}% churn')
print(f'A customer without phone service has a probability of {round(data[(data["PhoneService"] == "No") & (data["Churn"] == "Yes")].count()[0] / data[(data["PhoneService"] == "No")].count()[0] * 100, 2)}% churn')

# * Only a very small fraction of customers don't have a phone service, and of those, about $25\%$ churn.

# In[94]:

fig = px.histogram(data, x='MonthlyCharges', color='Churn', marginal='box',
                   title='Monthly Charges Distribution w.r.t. Churn',
                   color_discrete_sequence=['#84D57F', '#C959DA'])
iplot(fig)

# * Customers with higher Monthly Charges are more likely to churn.

# In[95]:

fig = px.histogram(data, x='TotalCharges', color='Churn', marginal='box',
                   title='Total Charges Distribution w.r.t. Churn',
                   color_discrete_sequence=['blue', 'red'])
iplot(fig)

# * Churned customers tend to have lower Total Charges, which reflects their shorter tenure with the company.

# In[96]:

fig = px.histogram(data, x='tenure', color='Churn', marginal='box', title='Tenure Distribution w.r.t. Churn',
                   color_discrete_sequence=['orange', 'green'])
iplot(fig)

# * Customers who have stayed with the company for a longer time are less likely to churn.
# # 5. Outlier Detection
#
# Back to Table of Contents

# The presence of outliers in a classification or regression dataset can result in a poor fit and lower predictive modeling performance, so we should check whether there are outliers in the data.

# In[97]:

data = data.drop(labels=['customerID'], axis=1)

# In[98]:

sns.distplot(data.TotalCharges);

# In[99]:

sns.distplot(data.MonthlyCharges);

# In[100]:

sns.distplot(data.tenure);

# Another way of visualising outliers is using box-and-whisker plots, which show the quartiles (box) and the inter-quartile range (whiskers), with the outliers sitting outside the whiskers.
#
# Any dots in the plot below are outliers according to the quartiles ± 1.5 IQR rule.
#
# First, let's cast `TotalCharges` to a numeric dtype.

# In[101]:

data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# In[102]:

fig = make_subplots(rows=1, cols=3)
fig.add_trace(go.Box(y=data['MonthlyCharges'], notched=True, name='Monthly Charges', marker_color='#6699ff', boxmean=True, boxpoints='suspectedoutliers'), 1, 2)
fig.add_trace(go.Box(y=data['TotalCharges'], notched=True, name='Total Charges', marker_color='#ff0066', boxmean=True, boxpoints='suspectedoutliers'), 1, 1)
fig.add_trace(go.Box(y=data['tenure'], notched=True, name='Tenure', marker_color='lightseagreen', boxmean=True, boxpoints='suspectedoutliers'), 1, 3)
fig.update_layout(title_text='Box Plots for Numerical Variables')
iplot(fig)

# In[103]:

def detect_outliers(cols):
    # Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column.
    for i in cols:
        Q3, Q1 = np.percentile(data[i], [75, 25])
        IQR = Q3 - Q1
        ul = Q3 + 1.5 * IQR
        ll = Q1 - 1.5 * IQR
        outliers = data[i][(data[i] > ul) | (data[i] < ll)]
        print(f'*** {i} outlier points ***', '\n', outliers, '\n')

# In[104]:

detect_outliers(['tenure', 'MonthlyCharges', 'TotalCharges'])

# There are no outliers.
# # 6. Check for Rare Categories
#
# Back to Table of Contents

# Some categories appear in many observations, whereas others appear in only a few.
#
# * Rare values in categorical variables tend to cause over-fitting, particularly in tree-based methods.
# * Rare labels may be present in the training set but not in the test set, causing over-fitting to the training set.
# * Rare labels may appear in the test set but not in the training set; the machine learning model will then not know how to evaluate them.

# In[105]:

categorical = [var for var in data.columns if data[var].dtype == 'O']

# In[106]:

# check the frequency of each label
for var in categorical:
    print(data[var].value_counts() / float(len(data)))
    print()
    print()

# As shown above, there are no rare categories in the categorical variables.
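
# A threshold-based variant of the frequency check above; the helper name and the 1% cutoff are our assumptions, not part of the original notebook.

# In[ ]:

def rare_labels(df, var, threshold=0.01):
    """Return the labels of `var` whose relative frequency is below `threshold`."""
    freq = df[var].value_counts(normalize=True)
    return freq[freq < threshold].index.tolist()

# Prints nothing here, confirming that no label falls below the 1% cutoff.
for var in categorical:
    rare = rare_labels(data, var)
    if rare:
        print(f'Rare labels in {var}: {rare}')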
# # 7. Categorical Variables Encoding
#
# Back to Table of Contents

# In[107]:

data['Churn'] = data['Churn'].map({'Yes': 1, 'No': 0})

# In[108]:

data.dtypes

# This step is key to achieving a high accuracy. We use target-guided ordinal encoding: each category is assigned a number, and this ordering is informed by the mean of the target within the category. Briefly, we calculate the mean of the target for each label/category, order the labels by these means from smallest to biggest, and number them accordingly.
#
# Advantages:
# * Captures information within the label, therefore rendering more predictive features
# * Creates a monotonic relationship between the variable and the target
# * Does not expand the feature space
#
# Disadvantage:
# * Prone to cause over-fitting
#
# This process should be done on the training data, with the resulting ordered labels then mapped onto the test data. (Since the dataset is large enough, the ordered categories will be the same whether we consider the whole data or just the training set; a leakage-free sketch follows the next cell.)

# In[109]:

categorical = [var for var in data.columns if data[var].dtype == 'O']

# In[110]:

def category(df):
    # Replace each label with its rank, ordered by mean churn within the label.
    for var in categorical:
        ordered_labels = df.groupby([var])['Churn'].mean().sort_values().index
        ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}
        df[var] = df[var].map(ordinal_label)

category(data)

# In[111]:

data.head(5)
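
# As noted above, a leakage-free variant would learn the ordering on the training split only and map it onto the test split. A sketch, assuming `train` and `test` DataFrames that still hold the raw string labels (the names below are ours):

# In[ ]:

def fit_target_encoding(train_df, var, target='Churn'):
    """Learn a label -> rank mapping from the training data only."""
    ordered_labels = train_df.groupby(var)[target].mean().sort_values().index
    return {k: i for i, k in enumerate(ordered_labels)}

# mapping = fit_target_encoding(train, 'Contract')
# train['Contract'] = train['Contract'].map(mapping)
# test['Contract'] = test['Contract'].map(mapping)  # labels unseen in training become NaN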
# # 8. Balance Data
#
# Back to Table of Contents

# In[112]:

fig = px.bar(x=data['Churn'].unique()[::-1],
             y=[data[data['Churn'] == 1].count()[0], data[data['Churn'] == 0].count()[0]],
             text=[np.round(data[data['Churn'] == 1].count()[0] / data.shape[0], 4),
                   np.round(data[data['Churn'] == 0].count()[0] / data.shape[0], 4)],
             color_discrete_sequence=['#ff9999'])
fig.update_layout(title_text='Churn Count Plot',
                  xaxis=dict(tickmode='linear', tick0=0, dtick=1),
                  width=700, height=400, bargap=0.4)
fig.update_layout({'yaxis': {'title': 'Count'}, 'xaxis': {'title': 'Churn'}})
iplot(fig)

# As shown in the plot above, we are dealing with an imbalanced dataset. We use the `BorderlineSMOTE` method, which selects those instances of the minority class that are misclassified, e.g. by a k-nearest neighbor model, and oversamples just those difficult instances, providing more resolution only where it may be required.

# In[113]:

X = data.drop(['Churn'], axis=1)
y = data['Churn']
oversample = BorderlineSMOTE()
X, y = oversample.fit_resample(X, y)
# # 9. Dataset Splitting
#
# Back to Table of Contents

# Let's separate the data into training and test sets.

# In[114]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train.shape, X_test.shape
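
# Note that `BorderlineSMOTE` was applied before this split, so synthetic minority samples can appear in the test set. A common alternative, sketched below under the assumption that `X_raw` and `y_raw` hold the features and target *before* resampling (names are ours), is to oversample only the training portion:

# In[ ]:

def split_then_oversample(X_raw, y_raw, test_size=0.1, random_state=42):
    """Split first, then oversample only the training part, so the test set contains only real customers."""
    X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=test_size, random_state=random_state)
    X_tr, y_tr = BorderlineSMOTE(random_state=random_state).fit_resample(X_tr, y_tr)
    return X_tr, X_te, y_tr, y_te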
# # 10. Feature Scaling
#
# Back to Table of Contents

# In this section, the numerical features are scaled:
#
# StandardScaler: $z = \frac{x - \mu}{s}$

# In[115]:

scaler = StandardScaler()
X_train[['TotalCharges', 'MonthlyCharges', 'tenure']] = scaler.fit_transform(X_train[['TotalCharges', 'MonthlyCharges', 'tenure']])
X_test[['TotalCharges', 'MonthlyCharges', 'tenure']] = scaler.transform(X_test[['TotalCharges', 'MonthlyCharges', 'tenure']])
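
# A quick sanity check (ours, not in the original notebook): after fitting, the scaled training columns should have mean ≈ 0 and standard deviation ≈ 1, matching the formula above.

# In[ ]:

print(X_train[['TotalCharges', 'MonthlyCharges', 'tenure']].mean().round(3))
print(X_train[['TotalCharges', 'MonthlyCharges', 'tenure']].std().round(3))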
# # 11. Modeling and Parameter Optimization
#
# Back to Table of Contents

# In[116]:

CV = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)

# **Model 1: LR**

# In[117]:

LR_S = LogisticRegression(random_state=42)
params_LR = {'C': list(np.arange(1, 12)),
             'penalty': ['l2', 'none'],  # 'elasticnet' would require the saga solver
             'class_weight': ['balanced', None]}
grid_LR = RandomizedSearchCV(LR_S, param_distributions=params_LR, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_LR.fit(X_train, y_train)
print('Best parameters:', grid_LR.best_estimator_)

# In[118]:

LR = LogisticRegression(random_state=42, penalty='l2', class_weight='balanced', C=6)
cross_val_LR_Acc = cross_val_score(LR, X_train, y_train, cv=CV, scoring='accuracy')
cross_val_LR_f1 = cross_val_score(LR, X_train, y_train, cv=CV, scoring='f1')
cross_val_LR_AUC = cross_val_score(LR, X_train, y_train, cv=CV, scoring='roc_auc')

# **Model 2: Random Forest**

# In[119]:

RF_S = RandomForestClassifier(random_state=42)
params_RF = {'n_estimators': list(range(50, 100)),
             'min_samples_leaf': list(range(1, 5)),
             'min_samples_split': list(range(2, 5))}  # min_samples_split must be at least 2
grid_RF = RandomizedSearchCV(RF_S, param_distributions=params_RF, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_RF.fit(X_train, y_train)
print('Best parameters:', grid_RF.best_estimator_)

# In[120]:

RF = RandomForestClassifier(n_estimators=70, random_state=42)
cross_val_RF_Acc = cross_val_score(RF, X_train, y_train, cv=CV, scoring='accuracy')
cross_val_RF_f1 = cross_val_score(RF, X_train, y_train, cv=CV, scoring='f1')
cross_val_RF_AUC = cross_val_score(RF, X_train, y_train, cv=CV, scoring='roc_auc')

# **Model 3: KNN**

# In[121]:

KNN_S = KNeighborsClassifier()
params_KNN = {'n_neighbors': list(range(1, 20))}
grid_KNN = RandomizedSearchCV(KNN_S, param_distributions=params_KNN, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_KNN.fit(X_train, y_train)
print('Best parameters:', grid_KNN.best_estimator_)

# In[122]:

KNN = KNeighborsClassifier(n_neighbors=1)
cross_val_KNN_Acc = cross_val_score(KNN, X_train, y_train, cv=CV, scoring='accuracy')
cross_val_KNN_f1 = cross_val_score(KNN, X_train, y_train, cv=CV, scoring='f1')
cross_val_KNN_AUC = cross_val_score(KNN, X_train, y_train, cv=CV, scoring='roc_auc')

# **Model 4: Decision Tree**

# In[123]:

DT_S = DecisionTreeClassifier(random_state=42)
params_DT = {'min_samples_leaf': list(range(1, 6)),
             'min_samples_split': list(range(2, 6))}
grid_DT = RandomizedSearchCV(DT_S, param_distributions=params_DT, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_DT.fit(X_train, y_train)
print('Best parameters:', grid_DT.best_estimator_)

# In[124]:

DT = DecisionTreeClassifier(random_state=42)
cross_val_DT_Acc = cross_val_score(DT, X_train, y_train, cv=CV, scoring='accuracy')
cross_val_DT_f1 = cross_val_score(DT, X_train, y_train, cv=CV, scoring='f1')
cross_val_DT_AUC = cross_val_score(DT, X_train, y_train, cv=CV, scoring='roc_auc')

# **Model 5: Ada Boost**

# In[125]:

AB_S = AdaBoostClassifier(random_state=42)
params_AB = {'n_estimators': list(np.arange(50, 100, 10)), 'learning_rate': [0.01, 0.1, 1]}
grid_AB = RandomizedSearchCV(AB_S, param_distributions=params_AB, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_AB.fit(X_train, y_train)
print('Best parameters:', grid_AB.best_estimator_)

# In[126]:

AB = AdaBoostClassifier(learning_rate=1, n_estimators=90, random_state=42)
cross_val_AB_Acc = cross_val_score(AB, X_train, y_train, cv=CV, scoring='accuracy')
cross_val_AB_f1 = cross_val_score(AB, X_train, y_train, cv=CV, scoring='f1')
cross_val_AB_AUC = cross_val_score(AB, X_train, y_train, cv=CV, scoring='roc_auc')

# **Model 6: XG Boost**

# In[127]:

XG_S = XGBClassifier(random_state=42)
params_XG = {'n_estimators': list(np.arange(50, 150, 10)), 'learning_rate': [0.01, 0.1, 1]}
grid_XG = RandomizedSearchCV(XG_S, param_distributions=params_XG, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_XG.fit(X_train, y_train)
print('Best parameters:', grid_XG.best_estimator_)

# In[128]:

XG = XGBClassifier(learning_rate=1, n_estimators=120, random_state=42)
cross_val_XG_Acc = cross_val_score(XG, X_train, y_train, cv=CV, scoring='accuracy')
cross_val_XG_f1 = cross_val_score(XG, X_train, y_train, cv=CV, scoring='f1')
cross_val_XG_AUC = cross_val_score(XG, X_train, y_train, cv=CV, scoring='roc_auc')

# **Model 7: Extra Tree Classifier**

# In[129]:

ET_S = ExtraTreesClassifier(random_state=42)
params_ET = {'n_estimators': list(np.arange(50, 150, 10))}
grid_ET = RandomizedSearchCV(ET_S, param_distributions=params_ET, cv=5, n_jobs=-1, n_iter=20, random_state=42, return_train_score=True)
grid_ET.fit(X_train, y_train)
print('Best parameters:', grid_ET.best_estimator_)

# In[130]:

ET = ExtraTreesClassifier(n_estimators=140, random_state=42)
cross_val_ET_Acc = cross_val_score(ET, X_train, y_train, cv=CV, scoring='accuracy')
cross_val_ET_f1 = cross_val_score(ET, X_train, y_train, cv=CV, scoring='f1')
cross_val_ET_AUC = cross_val_score(ET, X_train, y_train, cv=CV, scoring='roc_auc')

# **Super Learner**

# In[131]:

SL = SuperLearner(folds=5, random_state=42)

# In[132]:

SL.add([RF, XG, ET])

# In[133]:

SL.add_meta(MLPClassifier())

# In[134]:

cross_val_SL_Acc = cross_val_score(SL, X_train, y_train, cv=CV, scoring='accuracy')
cross_val_SL_f1 = cross_val_score(SL, X_train, y_train, cv=CV, scoring='f1')
cross_val_SL_AUC = cross_val_score(SL, X_train, y_train, cv=CV, scoring='roc_auc')

# **Stacking**

# In[135]:

estimators = [('DT', DT), ('RF', RF), ('ET', ET), ('LR', LR), ('KNN', KNN), ('XG', XG), ('AB', AB)]
Stack = StackingClassifier(estimators=estimators, final_estimator=MLPClassifier())

# In[136]:

cross_val_ST_Acc = cross_val_score(Stack, X_train, y_train, cv=CV, scoring='accuracy')
cross_val_ST_f1 = cross_val_score(Stack, X_train, y_train, cv=CV, scoring='f1')
cross_val_ST_AUC = cross_val_score(Stack, X_train, y_train, cv=CV, scoring='roc_auc')
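
# The seven search blocks above repeat one pattern; as a sketch, a reusable helper (the name `tune` is ours) with the same settings:

# In[ ]:

def tune(estimator, param_distributions):
    """Randomized hyperparameter search with the settings used throughout this section."""
    search = RandomizedSearchCV(estimator, param_distributions=param_distributions,
                                cv=5, n_jobs=-1, n_iter=20, random_state=42,
                                return_train_score=True)
    search.fit(X_train, y_train)
    print('Best parameters:', search.best_estimator_)
    return search

# Example: grid_RF = tune(RandomForestClassifier(random_state=42), params_RF)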
# # 12. Feature Importance
#
# Back to Table of Contents

# Which features contribute most to predicting the target (Churn)? Let's find out how useful they are at predicting the target variable.
#
# The Random Forest algorithm offers importance scores based on the reduction in the criterion used to select split points, such as Gini impurity or entropy.

# In[153]:

RF_I = RandomForestClassifier(n_estimators=70, random_state=42)
RF_I.fit(X, y)

# In[154]:

d = {'Features': X_train.columns, 'Feature Importance': RF_I.feature_importances_}
df = pd.DataFrame(d)
df_sorted = df.sort_values(by='Feature Importance', ascending=True)
df_sorted.style.background_gradient(cmap='Blues')

# In[155]:

fig = px.bar(x=df_sorted['Feature Importance'], y=df_sorted['Features'],
             color_continuous_scale=px.colors.sequential.Blues,
             title='Feature Importance Based on Random Forest',
             text_auto='.4f', color=df_sorted['Feature Importance'])
fig.update_traces(marker=dict(line=dict(color='black', width=2)))
fig.update_layout({'yaxis': {'title': 'Features'}, 'xaxis': {'title': 'Feature Importance'}})
iplot(fig)
# # 13. Results
#
# Back to Table of Contents

# In[156]:

compare_models = [('Logistic Regression', cross_val_LR_Acc.mean(), cross_val_LR_f1.mean(), cross_val_LR_AUC.mean(), ''),
                  ('Random Forest', cross_val_RF_Acc.mean(), cross_val_RF_f1.mean(), cross_val_RF_AUC.mean(), ''),
                  ('KNN', cross_val_KNN_Acc.mean(), cross_val_KNN_f1.mean(), cross_val_KNN_AUC.mean(), ''),
                  ('Decision Tree', cross_val_DT_Acc.mean(), cross_val_DT_f1.mean(), cross_val_DT_AUC.mean(), ''),
                  ('Ada Boost', cross_val_AB_Acc.mean(), cross_val_AB_f1.mean(), cross_val_AB_AUC.mean(), ''),
                  ('XG Boost', cross_val_XG_Acc.mean(), cross_val_XG_f1.mean(), cross_val_XG_AUC.mean(), ''),
                  ('Extra Tree', cross_val_ET_Acc.mean(), cross_val_ET_f1.mean(), cross_val_ET_AUC.mean(), ''),
                  ('Super Learner', cross_val_SL_Acc.mean(), cross_val_SL_f1.mean(), cross_val_SL_AUC.mean(), ''),
                  ('Stacking', cross_val_ST_Acc.mean(), cross_val_ST_f1.mean(), cross_val_ST_AUC.mean(), 'best model')]

# In[157]:

compare = pd.DataFrame(data=compare_models, columns=['Model', 'Accuracy Mean', 'F1 Score Mean', 'AUC Score Mean', 'Description'])
compare.style.background_gradient(cmap='YlGn')

# In[158]:

d1 = {'Logistic Regression': cross_val_LR_Acc, 'Random Forest': cross_val_RF_Acc, 'KNN': cross_val_KNN_Acc,
      'Decision Tree': cross_val_DT_Acc, 'Ada Boost': cross_val_AB_Acc, 'XG Boost': cross_val_XG_Acc,
      'Extra Tree': cross_val_ET_Acc, 'Super Learner': cross_val_SL_Acc, 'Stacking': cross_val_ST_Acc}
d_accuracy = pd.DataFrame(data=d1)

# In[159]:

d2 = {'Logistic Regression': cross_val_LR_f1, 'Random Forest': cross_val_RF_f1, 'KNN': cross_val_KNN_f1,
      'Decision Tree': cross_val_DT_f1, 'Ada Boost': cross_val_AB_f1, 'XG Boost': cross_val_XG_f1,
      'Extra Tree': cross_val_ET_f1, 'Super Learner': cross_val_SL_f1, 'Stacking': cross_val_ST_f1}
d_f1 = pd.DataFrame(data=d2)

# In[160]:

d3 = {'Logistic Regression': cross_val_LR_AUC, 'Random Forest': cross_val_RF_AUC, 'KNN': cross_val_KNN_AUC,
      'Decision Tree': cross_val_DT_AUC, 'Ada Boost': cross_val_AB_AUC, 'XG Boost': cross_val_XG_AUC,
      'Extra Tree': cross_val_ET_AUC, 'Super Learner': cross_val_SL_AUC, 'Stacking': cross_val_ST_AUC}
d_auc = pd.DataFrame(data=d3)

# In[161]:

fig = go.Figure()
# One box per model; looping over the columns keeps all nine models in sync.
for name in d_accuracy.columns:
    fig.add_trace(go.Box(name=name, y=d_accuracy[name]))
fig.update_traces(boxpoints='all', boxmean=True)
fig.update_layout(title_text='Box Plots for Models Accuracy (train)')
iplot(fig)

# In[162]:

fig = go.Figure()
for name in d_f1.columns:
    fig.add_trace(go.Box(name=name, y=d_f1[name]))
fig.update_traces(boxpoints='all', boxmean=True)
fig.update_layout(title_text='Box Plots for Models F1 Score (train)')
iplot(fig)

# In[163]:

fig = go.Figure()
for name in d_auc.columns:
    fig.add_trace(go.Box(name=name, y=d_auc[name]))
fig.update_traces(boxpoints='all', boxmean=True)
fig.update_layout(title_text='Box Plots for Models AUC (train)')
iplot(fig)

# The Stacking model is the most stable and accurate model. As a result, Stacking is selected for predicting churn.

# In[164]:

Stack.fit(X_train, y_train)
y_pred = Stack.predict(X_test)

# In[165]:

print(classification_report(y_test, y_pred))

# In[166]:

y_prob = Stack.predict_proba(X_test)
roc_auc_score(y_test, y_prob[:, 1], average='macro')

# In[167]:

fpr, tpr, thresholds = roc_curve(y_test, y_prob[:, 1])
fig = px.area(x=fpr, y=tpr,
              title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
              labels=dict(x='False Positive Rate', y='True Positive Rate'),
              width=700, height=500, color_discrete_sequence=['#DA598A'])
fig.add_shape(type='line', line=dict(dash='dash'), x0=0, x1=1, y0=0, y1=1)
fig.update_yaxes(scaleanchor='x', scaleratio=1)
fig.update_xaxes(constrain='domain')
iplot(fig)

# In[168]:

cm = confusion_matrix(y_test, y_pred)
cm = cm.astype(int)
fig = ff.create_annotated_heatmap(z=cm[::-1], x=['No', 'Yes'], y=['Yes', 'No'], colorscale='Blues', annotation_text=cm[::-1])
fig.update_layout(title_text='Confusion Matrix of Stacking Model',
                  xaxis_title='Predicted value', yaxis_title='Real value',
                  width=800, height=500)
iplot(fig)

# We achieved about $86\%$ accuracy on the test set.
#
# Customer churn is clearly damaging to a firm's profitability, and various strategies can be implemented to reduce it. The best way to avoid customer churn is for a company to truly know its customers. This includes identifying customers who are at risk of churning and working to improve their satisfaction. Improving customer service is, of course, at the top of the priority list for tackling this issue. Building customer loyalty through relevant experiences and specialized service is another strategy to reduce customer churn. Some firms survey customers who have already churned to understand their reasons for leaving, in order to adopt a proactive approach to avoiding future churn.