We will analyse the data and answer the following questions:
#importing all the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import seaborn as sns
We will start by loading the dataset.
os.chdir("/Users/sarabjotsingh/Documents/Github/Projects/Logistic")
data=pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn 2.csv")
data.head()
customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
data.dtypes
customerID object gender object SeniorCitizen int64 Partner object Dependents object tenure int64 PhoneService object MultipleLines object InternetService object OnlineSecurity object OnlineBackup object DeviceProtection object TechSupport object StreamingTV object StreamingMovies object Contract object PaperlessBilling object PaymentMethod object MonthlyCharges float64 TotalCharges object Churn object dtype: object
TotalCharges should be float rather than object. Let us change its datatype. But before that, let us explore the column more.
data["TotalCharges"].isnull().sum()
0
At this point, we cannoy observe any missing value. Let us explore the frequencies of various charges in the column.
data["TotalCharges"].value_counts()
11 20.2 11 19.75 9 20.05 8 19.9 8 .. 6849.4 1 692.35 1 130.15 1 3211.9 1 6844.5 1 Name: TotalCharges, Length: 6531, dtype: int64
The first category is blank. It is not coming up as a missing value because its actually a string " ". Let us replace that with a numpy missing value i.e. np.nan .
data["TotalCharges"]=data["TotalCharges"].replace(r'^\s*$',np.nan,regex=True)
data["TotalCharges"].isnull().mean()
0.001561834445548772
data.dropna(inplace=True)
Thus the percentage of missing values in TotalCharges which were masked by emply space character is 0.15% Let us now change the data type of the TotalCharges column.
data["TotalCharges"]=data["TotalCharges"].astype(float)
Also, SeniorCitizen has int as its type. Let us change it to object
data["SeniorCitizen"]=data["SeniorCitizen"].astype(object)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 7032 entries, 0 to 7042 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customerID 7032 non-null object 1 gender 7032 non-null object 2 SeniorCitizen 7032 non-null object 3 Partner 7032 non-null object 4 Dependents 7032 non-null object 5 tenure 7032 non-null int64 6 PhoneService 7032 non-null object 7 MultipleLines 7032 non-null object 8 InternetService 7032 non-null object 9 OnlineSecurity 7032 non-null object 10 OnlineBackup 7032 non-null object 11 DeviceProtection 7032 non-null object 12 TechSupport 7032 non-null object 13 StreamingTV 7032 non-null object 14 StreamingMovies 7032 non-null object 15 Contract 7032 non-null object 16 PaperlessBilling 7032 non-null object 17 PaymentMethod 7032 non-null object 18 MonthlyCharges 7032 non-null float64 19 TotalCharges 7032 non-null float64 20 Churn 7032 non-null object dtypes: float64(2), int64(1), object(18) memory usage: 1.2+ MB
The data does not have any missing values.
Let us look for outliers
data.boxplot("TotalCharges")
<AxesSubplot:>
data.boxplot("MonthlyCharges")
<AxesSubplot:>
No outliers
Let us have a look at the distribution of numerical variables
data.hist("MonthlyCharges",bins=10)
array([[<AxesSubplot:title={'center':'MonthlyCharges'}>]], dtype=object)
data.hist("TotalCharges",bins=10)
array([[<AxesSubplot:title={'center':'TotalCharges'}>]], dtype=object)
data.hist("tenure",bins=10)
array([[<AxesSubplot:title={'center':'tenure'}>]], dtype=object)
The distributions of the numerical variables are highly skewed.
There are many categorical variables in our data. Let us visualize the effect of these categories on the churn rate.
sns.set(style='darkgrid')
from scipy.stats import chi2_contingency
#creating a function to create a stacked barplot using cross tabulations.
def count_barplot(category):
CrosstabResult=pd.crosstab(index=data[category],columns=data['Churn'],normalize="index")
%matplotlib inline
CrosstabResult.plot.bar(stacked=True)
print(CrosstabResult)
count_barplot("gender")
Churn No Yes gender Female 0.730405 0.269595 Male 0.737954 0.262046
The churn rate doesn't really change from male to female. So gender does not seem like a significant variable for our analysis.
count_barplot("SeniorCitizen")
Churn No Yes SeniorCitizen 0 0.763497 0.236503 1 0.583187 0.416813
The churn rate is a bit high for senior citizens as compared to that of non senior citizens.
count_barplot("Partner")
Churn No Yes Partner No 0.670239 0.329761 Yes 0.802829 0.197171
The churn rate is a bit high for people not having a partner.
count_barplot("Dependents")
Churn No Yes Dependents No 0.687209 0.312791 Yes 0.844688 0.155312
The churn rate is high for people with no dependents.
count_barplot("PhoneService")
Churn No Yes PhoneService No 0.750000 0.250000 Yes 0.732525 0.267475
count_barplot("MultipleLines")
Churn No Yes MultipleLines No 0.749188 0.250812 No phone service 0.750000 0.250000 Yes 0.713515 0.286485
The churn rate seems more or less similar, for customers having or not having multiple lines and no phone service.
count_barplot("InternetService")
Churn No Yes InternetService DSL 0.810017 0.189983 Fiber optic 0.581072 0.418928 No 0.925658 0.074342
The churn rate is high for customers using Fiber Optic, as compared to those using DSL or not having an internet service.
count_barplot("OnlineSecurity")
Churn No Yes OnlineSecurity No 0.582213 0.417787 No internet service 0.925658 0.074342 Yes 0.853598 0.146402
The churn rate is high for customers with no online security as compared to those opting for online security or even people with no internet service.
count_barplot("OnlineBackup")
Churn No Yes OnlineBackup No 0.600583 0.399417 No internet service 0.925658 0.074342 Yes 0.784330 0.215670
The churn rate is high for customers with no online backup as compared to those opting for online backup or even people with no internet service.
count_barplot("DeviceProtection")
Churn No Yes DeviceProtection No 0.608597 0.391403 No internet service 0.925658 0.074342 Yes 0.774607 0.225393
The churn rate is high for customers with no device protection as compared to those opting for device protection or even people with no internet service.
count_barplot("TechSupport")
Churn No Yes TechSupport No 0.583525 0.416475 No internet service 0.925658 0.074342 Yes 0.848039 0.151961
The churn rate is high for customers with no technical support as compared to those opting for technical support or even people with no internet service.
count_barplot("StreamingTV")
Churn No Yes StreamingTV No 0.664649 0.335351 No internet service 0.925658 0.074342 Yes 0.698853 0.301147
The churn rate is quite low for customers who did not buy an internet service with us. However, people with no streaming tv service or with, have similar churn rates.
count_barplot("StreamingMovies")
Churn No Yes StreamingMovies No 0.662711 0.337289 No internet service 0.925658 0.074342 Yes 0.700476 0.299524
The churn rate is quite low for customers who did not buy an internet service with us. However, people with or without streaming movies service, have similar churn rates.
count_barplot("Contract")
Churn No Yes Contract Month-to-month 0.572903 0.427097 One year 0.887228 0.112772 Two year 0.971513 0.028487
The churn rates are high for customers with month-to-month contract and decreases as the contract term increases. This was kind of expected as people with higher contract lengths don't have high flexibility of leaving the subscription as compared to those with month-to-month conracts.
count_barplot("PaperlessBilling")
Churn No Yes PaperlessBilling No 0.836243 0.163757 Yes 0.664107 0.335893
Churn rate for customers using paperless billing is higher than that of customers not using paperless billing.
count_barplot("PaymentMethod")
Churn No Yes PaymentMethod Bank transfer (automatic) 0.832685 0.167315 Credit card (automatic) 0.847469 0.152531 Electronic check 0.547146 0.452854 Mailed check 0.807980 0.192020
Churn rate for customers using electronic check is higher than that of customers using other methods of payment.
Although, the visualizations do give us some insights on the relationship between churn rates and different categories, it is important that we use something more robust for our feature selection. We will use the Chi-Square contingency test to determine, if the relationship between the category and churn rate is statistically significant or not.
Null Hypotheses: Ho: There is no significant relationship between the category and churn rate
Alternate Hypothesis: H1: There is a significant relationship between the category and churn rate
Threshold of p-value-0.05
# replacing no internet service and no phone service by no
data["StreamingMovies"]=np.where(data["StreamingMovies"]=="No internet service","No",data["StreamingMovies"])
data["StreamingTV"]=np.where(data["StreamingTV"]=="No internet service","No",data["StreamingTV"])
data["TechSupport"]=np.where(data["TechSupport"]=="No internet service","No",data["TechSupport"])
data["DeviceProtection"]=np.where(data["DeviceProtection"]=="No internet service","No",data["DeviceProtection"])
data["OnlineBackup"]=np.where(data["OnlineBackup"]=="No internet service","No",data["OnlineBackup"])
data["OnlineSecurity"]=np.where(data["OnlineSecurity"]=="No internet service","No",data["OnlineSecurity"])
data["MultipleLines"]=np.where(data["MultipleLines"]=="No phone service","No",data["MultipleLines"])
# creating a function to check whethere the given category
# has a significant relationship with our target variable: Churn
rel_check=[]
p_value=[]
def rel_chi(category):
CrosstabResult=pd.crosstab(data['Churn'],data[category])
chi_res = chi2_contingency(pd.crosstab(data['Churn'], data[category]))
p_value.append(chi_res[1])
if chi_res[1]<0.05:
rel_check.append("Significant relationship")
else:
rel_check.append("No Significant relationship")
categories=['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport','StreamingTV',
'StreamingMovies', 'Contract','PaperlessBilling','PaymentMethod']
#determining significance of relationships using rel_check
for cat in categories:
rel_chi(cat)
check=pd.DataFrame(data=[categories,rel_check,p_value]).T
check.columns=["Categorical Variable","Relationship","p_value"]
check
Categorical Variable | Relationship | p_value | |
---|---|---|---|
0 | gender | No Significant relationship | 0.490488 |
1 | SeniorCitizen | Significant relationship | 0.0 |
2 | Partner | Significant relationship | 0.0 |
3 | Dependents | Significant relationship | 0.0 |
4 | PhoneService | No Significant relationship | 0.349924 |
5 | MultipleLines | Significant relationship | 0.000869 |
6 | InternetService | Significant relationship | 0.0 |
7 | OnlineSecurity | Significant relationship | 0.0 |
8 | OnlineBackup | Significant relationship | 0.0 |
9 | DeviceProtection | Significant relationship | 0.0 |
10 | TechSupport | Significant relationship | 0.0 |
11 | StreamingTV | Significant relationship | 0.0 |
12 | StreamingMovies | Significant relationship | 0.0 |
13 | Contract | Significant relationship | 0.0 |
14 | PaperlessBilling | Significant relationship | 0.0 |
15 | PaymentMethod | Significant relationship | 0.0 |
Because the relationships between Gender and Churn and Phone Service and Churn are not statistically significant, including GENDER and PHONE SERVICE in our model, would not be beneficial. So let us delete these columns.
data=data.drop(["gender","PhoneService"],axis=1)
categories.remove("gender")
categories.remove("PhoneService")
data.head()
customerID | SeniorCitizen | Partner | Dependents | tenure | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7590-VHVEG | 0 | Yes | No | 1 | No | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
1 | 5575-GNVDE | 0 | No | No | 34 | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
2 | 3668-QPYBK | 0 | No | No | 2 | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
3 | 7795-CFOCW | 0 | No | No | 45 | No | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
4 | 9237-HQITU | 0 | No | No | 2 | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
CustomerID should also be removed as it is just an unique identifier of a customer. It doesn't give any information on customer behaviour.
data.drop(["customerID"],axis=1,inplace=True)
The Chi-Square tests helped us determine which categorical variables to remove. With the help of concept of collinearity, we will see which variables are highly correlated and remove the variable which is highly correlated with the most. This will ensure that multicollinearity is minimum in our model.
sns.set_style("white")
sns.heatmap(data.corr(),cmap="BuPu",annot=True)
data.drop(["TotalCharges"],axis=1,inplace=True)
TotalCharges has a really high correlation with Tenure as well as Monthly Charges. So, we will delete this column.
Now its time for, splitting the data into test and train data.
from sklearn.model_selection import train_test_split
y=data["Churn"]
X=data.drop(["Churn"],axis=1)
x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
25% of the data is now the test data and the rest 75% is now the training data.
# label encoding (binary variables) in data
label_encoding_columns = ['Partner', 'Dependents', 'PaperlessBilling']
# ohe (binary variables) in data
one_hot_encoding_columns = ['MultipleLines','InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract','PaymentMethod']
# encode categorical binary features using label encoding in training data
for column in label_encoding_columns:
x_train[column] = x_train[column].map({'Yes': 1, 'No': 0})
y_train=y_train.map({'Yes': 1, 'No': 0})
# encode categorical variables with more than two levels using one-hot encoding n training data
x_train= pd.get_dummies(x_train, columns = one_hot_encoding_columns)
# encode categorical binary features using label encoding in test data
for column in label_encoding_columns:
x_test[column] = x_test[column].map({'Yes': 1, 'No': 0})
y_test=y_test.map({'Yes': 1, 'No': 0})
# encode categorical variables with more than two levels using one-hot encoding in test data
x_test= pd.get_dummies(x_test, columns = one_hot_encoding_columns)
I am not removing the extra dummy variables (which help us avoid dummy trap) because sklearn can handle multiple dummy variables and avoid dummy trap.
Because our numerical features, tenure and monthly charges, have different units of measurement, we will normalize them using Min-Max scaling technique. We will calculate min and max of the respective column using training data and transform training and test data using those values. This helps us protect our model from information leakage.
min_max_columns = ['tenure', 'MonthlyCharges']
# scale numerical variables using min max scaler
for column in min_max_columns:
# minimum value of the column
min_column = x_train[column].min()
# maximum value of the column
max_column = x_train[column].max()
# min max scaler
x_train[column] = (x_train[column] - min_column) / (max_column - min_column)
for column in min_max_columns:
x_test[column] = (x_test[column] - min_column) / (max_column - min_column)
Creating a Logistic Regression Model that can help us predict if a customer will churn or not.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear')
model.fit(x_train, y_train)
LogisticRegression(solver='liblinear')In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LogisticRegression(solver='liblinear')
Predicted probabilities of Churn on our test data. According to the concept of Logistic Regression, if the probability is >0.5, the customer will churn else not.
model.predict_proba(x_test)[:,1]
array([0.32549275, 0.2250375 , 0.22242904, ..., 0.61365948, 0.13201921, 0.0199735 ])
Predicted values of Churn on our test data.
predictions = model.predict(x_test)
print(predictions)
[0 0 0 ... 1 0 0]
Variables and their coefficients.
model_features = x_train.columns
coefficient_df = pd.DataFrame({"Variable" : model_features, "Coefficient": model.coef_[0]})
print(coefficient_df.sort_values(by="Coefficient",ascending=False, key=abs))
Variable Coefficient 3 tenure -2.445148 5 MonthlyCharges -1.430608 10 InternetService_No -1.179425 9 InternetService_Fiber optic 1.109856 25 Contract_Two year -0.734331 23 Contract_Month-to-month 0.646045 4 PaperlessBilling 0.391355 0 SeniorCitizen 0.289957 28 PaymentMethod_Electronic check 0.277548 21 StreamingMovies_No -0.268153 27 PaymentMethod_Credit card (automatic) -0.257476 19 StreamingTV_No -0.244995 6 MultipleLines_No -0.226745 2 Dependents -0.220407 18 TechSupport_Yes -0.201836 12 OnlineSecurity_Yes -0.174241 22 StreamingMovies_Yes 0.156106 15 DeviceProtection_No -0.152368 20 StreamingTV_Yes 0.132948 7 MultipleLines_Yes 0.114698 14 OnlineBackup_Yes -0.106037 29 PaymentMethod_Mailed check -0.097024 17 TechSupport_No 0.089789 11 OnlineSecurity_No 0.062194 8 InternetService_DSL -0.042478 16 DeviceProtection_Yes 0.040321 1 Partner 0.040256 26 PaymentMethod_Bank transfer (automatic) -0.035096 24 Contract_One year -0.023761 13 OnlineBackup_No -0.006010
score = model.score(x_test, y_test)
print("Accuracy score: ",score)
Accuracy score: 0.7718998862343572
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
report=classification_report(y_test, predictions)
print(report,sep="\n")
precision recall f1-score support 0 0.88 0.80 0.84 1311 1 0.54 0.70 0.61 447 accuracy 0.77 1758 macro avg 0.71 0.75 0.72 1758 weighted avg 0.80 0.77 0.78 1758
true positives (TP): These are cases in which we predicted yes (the customer churned), and they actually churned.
true negatives (TN): We predicted no, and the customers did not churn.
false positives (FP): We predicted yes, but the customers did not actually churn. (Also known as a "Type I error.")
false negatives (FN): We predicted no, but the customers churned. (Also known as a "Type II error.")
cm =confusion_matrix(y_test,predictions)
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
cm.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
cm.flatten()/np.sum(cm)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cm, annot=labels, fmt='', cmap='Blues')
plt.title("Confusion Matrix")
plt.show()