Binomial Logistic Regression
Waze is a mobile app that provides real-time traffic information. The app is available on both Android and iOS. The objective of this project is to predict the churn rate of Waze users so the company can make relevant decisions to improve the customer experience and retain users.
The purpose of this project is to demonstrate knowledge of exploratory data analysis (EDA) and a binomial logistic regression model.
The goal is to build a binomial logistic regression model and evaluate the model's performance.
This activity has three parts:
Part 1: EDA & Checking Model Assumptions
Part 2: Model Building and Evaluation
Part 3: Interpreting Model Results
This Notebook will follow the PACE stages: Plan, Analyze, Construct, and Execute.
Import the data and packages that you've learned are needed for building logistic regression models.
# Packages for numerics + dataframes
import pandas as pd
import numpy as np
# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Packages for Logistic Regression & Confusion Matrix
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression
Import the dataset.
# Load the dataset by running this cell
df = pd.read_csv('waze_dataset.csv')
In this stage, let's consider the following:
*Outliers and extreme data values can significantly impact logistic regression models. After visualizing the data, make a plan for addressing outliers by dropping rows, substituting extreme data with average data, and/or removing data values greater than 3 standard deviations.*
EDA activities also include identifying missing data, which helps the analyst decide whether to exclude those observations or to impute values using the dataset's mean, median, or another similar method.
Additionally, it can be useful to create variables by multiplying variables together or calculating the ratio between two variables. For example, in this dataset you can create a drives_sessions_ratio variable by dividing drives by sessions.
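For instance, a ratio feature like the one just described could be computed in a single line. This is only a sketch; the hypothetical `drives_sessions_ratio` is not used elsewhere in this notebook.
# Hypothetical example (not kept for modeling): ratio of drives to sessions per user
# Adding 1 to the denominator guards against division by zero for users with 0 sessions
drives_sessions_ratio = df['drives'] / (df['sessions'] + 1)
drives_sessions_ratio.describe()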
Analyze the data, looking for correlations, missing data, potential outliers, columns that need to be transformed, and/or duplicates.
Start with `shape` and `info()`.
print(df.shape)
df.info()
(14999, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   ID                       14999 non-null  int64
 1   label                    14299 non-null  object
 2   sessions                 14999 non-null  int64
 3   drives                   14999 non-null  int64
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64
 6   total_navigations_fav1   14999 non-null  int64
 7   total_navigations_fav2   14999 non-null  int64
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64
 11  driving_days             14999 non-null  int64
 12  device                   14999 non-null  object
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB
Check point: Are there any missing values in your data?
- There are 700 missing values in the `label` column.
Use `head()`.
df.head()
| | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.920550 | 3160.472914 | 13 | 11 | iPhone |
2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |
3 | 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone |
4 | 4 | retained | 84 | 68 | 168.247020 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android |
Use the `drop()` method to remove the ID column since you don't need this information for your analysis.
df = df.drop('ID', axis=1)
Now, check the class balance of the dependent (target) variable, `label`.
df['label'].value_counts(normalize=True)
retained    0.822645
churned     0.177355
Name: label, dtype: float64
Call `describe()` on the data.
df.describe()
| | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|
count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
mean | 80.633776 | 67.281152 | 189.964447 | 1749.837789 | 121.605974 | 29.672512 | 4039.340921 | 1860.976012 | 15.537102 | 12.179879 |
std | 80.699065 | 65.913872 | 136.405128 | 1008.513876 | 148.121544 | 45.394651 | 2502.149334 | 1446.702288 | 9.004655 | 7.824036 |
min | 0.000000 | 0.000000 | 0.220211 | 4.000000 | 0.000000 | 0.000000 | 60.441250 | 18.282082 | 0.000000 | 0.000000 |
25% | 23.000000 | 20.000000 | 90.661156 | 878.000000 | 9.000000 | 0.000000 | 2212.600607 | 835.996260 | 8.000000 | 5.000000 |
50% | 56.000000 | 48.000000 | 159.568115 | 1741.000000 | 71.000000 | 9.000000 | 3493.858085 | 1478.249859 | 16.000000 | 12.000000 |
75% | 112.000000 | 93.000000 | 254.192341 | 2623.500000 | 178.000000 | 43.000000 | 5289.861262 | 2464.362632 | 23.000000 | 19.000000 |
max | 743.000000 | 596.000000 | 1216.154633 | 3500.000000 | 1236.000000 | 415.000000 | 21183.401890 | 15851.727160 | 31.000000 | 30.000000 |
Checkpoint: Any outliers?
*The following columns all seem to have outliers, as their max values are above the upper fence (Q3 + 1.5 * IQR): `sessions`, `drives`, `total_sessions`, `total_navigations_fav1`, `total_navigations_fav2`, `driven_km_drives`, and `duration_minutes_drives`.*
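A quick way to verify this claim (a sketch, not part of the original analysis) is to compare each numeric column's maximum to its upper fence:
# Sketch: flag numeric columns whose max value exceeds the upper fence (Q3 + 1.5 * IQR)
for column in df.select_dtypes(include='number').columns:
    q1, q3 = df[column].quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    if df[column].max() > upper_fence:
        print(f'{column}: max = {df[column].max():.2f}, upper fence = {upper_fence:.2f}')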
Create features that may be of interest to the stakeholder and/or that are needed to address the business scenario/problem.
km_per_driving_day
The EDA showed that churn rate correlates with the distance driven per driving day in the last month. It might be helpful to engineer a feature that captures this information.
Create a new column in `df` called `km_per_driving_day`, which represents the mean distance driven per driving day for each user. Then call the `describe()` method on the new column.
# 1. Create `km_per_driving_day` column
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']
# 2. Call `describe()` on the new column
df['km_per_driving_day'].describe()
count    1.499900e+04
mean              inf
std               NaN
min      3.022063e+00
25%      1.672804e+02
50%      3.231459e+02
75%      7.579257e+02
max               inf
Name: km_per_driving_day, dtype: float64
Some values are infinite. This is the result of there being values of zero in the `driving_days` column. Pandas imputes a value of infinity in the corresponding rows of the new column because division by zero is undefined.
# 1. Convert infinite values to zero
df.loc[df['km_per_driving_day']==np.inf, 'km_per_driving_day'] = 0
# 2. Confirm that it worked
df['km_per_driving_day'].describe()
count    14999.000000
mean       578.963113
std       1030.094384
min          0.000000
25%        136.238895
50%        272.889272
75%        558.686918
max      15420.234110
Name: km_per_driving_day, dtype: float64
professional_driver
Let's create a new binary feature called `professional_driver` that is 1 for users who had 60 or more drives and drove on 15 or more days in the last month, and 0 otherwise.
Note: The objective is to create a new feature that separates professional drivers from other drivers. In this scenario, domain knowledge and intuition are used to determine these deciding thresholds, but ultimately they are arbitrary.
# Create `professional_driver` column
df['professional_driver'] = np.where((df['drives'] >= 60) & (df['driving_days'] >= 15), 1, 0)
Let's inspect the new variable.
1. Check the count of professional drivers and non-professionals.
2. Within each class (professional and non-professional), calculate the churn rate.
# 1. Check count of professionals and non-professionals
print(df['professional_driver'].value_counts())
# 2. Check in-class churn rate
df.groupby(['professional_driver'])['label'].value_counts(normalize=True)
0    12405
1     2594
Name: professional_driver, dtype: int64
professional_driver  label
0                    retained    0.801202
                     churned     0.198798
1                    retained    0.924437
                     churned     0.075563
Name: label, dtype: float64
The churn rate for professional drivers is 7.6%, while the churn rate for non-professionals is 19.9%. This seems like it could add predictive signal to the model.
In this stage, we will consider the following question: why were these X variables selected?
Initial variable selection was based on the business objective and insights from prior EDA, and columns were dropped because of high multicollinearity. Later, variable selection can be fine-tuned by running and rerunning models to look at changes in accuracy, recall, and precision.
Call `info()` on the dataframe to check the data type of the `label` variable and to verify whether there are any missing values.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   label                    14299 non-null  object
 1   sessions                 14999 non-null  int64
 2   drives                   14999 non-null  int64
 3   total_sessions           14999 non-null  float64
 4   n_days_after_onboarding  14999 non-null  int64
 5   total_navigations_fav1   14999 non-null  int64
 6   total_navigations_fav2   14999 non-null  int64
 7   driven_km_drives         14999 non-null  float64
 8   duration_minutes_drives  14999 non-null  float64
 9   activity_days            14999 non-null  int64
 10  driving_days             14999 non-null  int64
 11  device                   14999 non-null  object
 12  km_per_driving_day       14999 non-null  float64
 13  professional_driver      14999 non-null  int64
dtypes: float64(4), int64(8), object(2)
memory usage: 1.6+ MB
Since there is no evidence of a non-random cause of the 700 missing values in the `label` column, and since these observations comprise less than 5% of the data, let's drop the rows that are missing this data.
# Drop rows with missing data in `label` column
df = df.dropna(subset=['label'])
Generally, we don't drop outliers unless it's necessary.
At times, outliers can be changed to the median, mean, 95th percentile, etc.
The potential outliers are:
sessions
drives
total_sessions
total_navigations_fav1
total_navigations_fav2
driven_km_drives
duration_minutes_drives
For this analysis, impute the outlying values for these columns: calculate the 95th percentile of each column, then replace any value that exceeds it with that 95th-percentile value.
# Impute outliers
for column in ['sessions', 'drives', 'total_sessions', 'total_navigations_fav1',
               'total_navigations_fav2', 'driven_km_drives', 'duration_minutes_drives']:
    threshold = df[column].quantile(0.95)
    df.loc[df[column] > threshold, column] = threshold
Call `describe()`.
df.describe()
| | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | km_per_driving_day | professional_driver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 14299.000000 | 14299.000000 | 14299.000000 | 14299.000000 | 14299.000000 | 14299.000000 | 14299.000000 | 14299.000000 | 14299.000000 | 14299.000000 | 14299.000000 | 14299.000000 |
mean | 76.539688 | 63.964683 | 183.717304 | 1751.822505 | 114.562767 | 27.187216 | 3944.558631 | 1792.911210 | 15.544653 | 12.182530 | 581.942399 | 0.173998 |
std | 67.243178 | 55.127927 | 118.720520 | 1008.663834 | 124.378550 | 36.715302 | 2218.358258 | 1224.329759 | 9.016088 | 7.833835 | 1038.254509 | 0.379121 |
min | 0.000000 | 0.000000 | 0.220211 | 4.000000 | 0.000000 | 0.000000 | 60.441250 | 18.282082 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 23.000000 | 20.000000 | 90.457733 | 878.500000 | 10.000000 | 0.000000 | 2217.319909 | 840.181344 | 8.000000 | 5.000000 | 136.168003 | 0.000000 |
50% | 56.000000 | 48.000000 | 158.718571 | 1749.000000 | 71.000000 | 9.000000 | 3496.545617 | 1479.394387 | 16.000000 | 12.000000 | 273.301012 | 0.000000 |
75% | 111.000000 | 93.000000 | 253.540450 | 2627.500000 | 178.000000 | 43.000000 | 5299.972162 | 2466.928876 | 23.000000 | 19.000000 | 558.018761 | 0.000000 |
max | 243.000000 | 200.000000 | 455.439492 | 3500.000000 | 422.000000 | 124.000000 | 8898.716275 | 4668.180092 | 31.000000 | 30.000000 | 15420.234110 | 1.000000 |
Change the data type of the `label` column to be binary. This change is needed to train a logistic regression model.
- Assign a `0` for all `retained` users.
- Assign a `1` for all `churned` users.
Save this variable as `label2` so as not to overwrite the original `label` variable.
# Create binary `label2` column
df['label2'] = np.where(df['label']=='churned', 1, 0)
df[['label', 'label2']].tail()
| | label | label2 |
|---|---|---|
14994 | retained | 0 |
14995 | retained | 0 |
14996 | retained | 0 |
14997 | churned | 1 |
14998 | retained | 0 |
The following are the assumptions for logistic regression:
Independent observations (This refers to how the data was collected.)
No extreme outliers (This has been addressed above)
Little to no multicollinearity among X predictors (we are about to look into this)
Linear relationship between X and the logit of y (This will be verified after modeling)
Check the correlation among predictor variables. First, generate a correlation matrix.
# Generate a correlation matrix
df.corr(method='pearson', numeric_only=True)
| | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | km_per_driving_day | professional_driver | label2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sessions | 1.000000 | 0.996942 | 0.597189 | 0.007101 | 0.001858 | 0.008536 | 0.002996 | -0.004545 | 0.025113 | 0.020294 | -0.011569 | 0.443654 | 0.034911 |
drives | 0.996942 | 1.000000 | 0.595285 | 0.006940 | 0.001058 | 0.009505 | 0.003445 | -0.003889 | 0.024357 | 0.019608 | -0.010989 | 0.444425 | 0.035865 |
total_sessions | 0.597189 | 0.595285 | 1.000000 | 0.006596 | 0.000187 | 0.010371 | 0.001016 | -0.000338 | 0.015755 | 0.012953 | -0.016167 | 0.254433 | 0.024568 |
n_days_after_onboarding | 0.007101 | 0.006940 | 0.006596 | 1.000000 | -0.002450 | -0.004968 | -0.004652 | -0.010167 | -0.009418 | -0.007321 | 0.011764 | 0.003770 | -0.129263 |
total_navigations_fav1 | 0.001858 | 0.001058 | 0.000187 | -0.002450 | 1.000000 | 0.002866 | -0.007368 | 0.005646 | 0.010902 | 0.010419 | -0.000197 | -0.000224 | 0.052322 |
total_navigations_fav2 | 0.008536 | 0.009505 | 0.010371 | -0.004968 | 0.002866 | 1.000000 | 0.003559 | -0.003009 | -0.004425 | 0.002000 | 0.006751 | 0.007126 | 0.015032 |
driven_km_drives | 0.002996 | 0.003445 | 0.001016 | -0.004652 | -0.007368 | 0.003559 | 1.000000 | 0.690515 | -0.007441 | -0.009549 | 0.344811 | -0.000904 | 0.019767 |
duration_minutes_drives | -0.004545 | -0.003889 | -0.000338 | -0.010167 | 0.005646 | -0.003009 | 0.690515 | 1.000000 | -0.007895 | -0.009425 | 0.239627 | -0.012128 | 0.040407 |
activity_days | 0.025113 | 0.024357 | 0.015755 | -0.009418 | 0.010902 | -0.004425 | -0.007441 | -0.007895 | 1.000000 | 0.947687 | -0.397433 | 0.453825 | -0.303851 |
driving_days | 0.020294 | 0.019608 | 0.012953 | -0.007321 | 0.010419 | 0.002000 | -0.009549 | -0.009425 | 0.947687 | 1.000000 | -0.407917 | 0.469776 | -0.294259 |
km_per_driving_day | -0.011569 | -0.010989 | -0.016167 | 0.011764 | -0.000197 | 0.006751 | 0.344811 | 0.239627 | -0.397433 | -0.407917 | 1.000000 | -0.165966 | 0.148583 |
professional_driver | 0.443654 | 0.444425 | 0.254433 | 0.003770 | -0.000224 | 0.007126 | -0.000904 | -0.012128 | 0.453825 | 0.469776 | -0.165966 | 1.000000 | -0.122312 |
label2 | 0.034911 | 0.035865 | 0.024568 | -0.129263 | 0.052322 | 0.015032 | 0.019767 | 0.040407 | -0.303851 | -0.294259 | 0.148583 | -0.122312 | 1.000000 |
Now, plot a correlation heatmap.
# Plot correlation heatmap
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(method='pearson', numeric_only=True), vmin=-1, vmax=1, annot=True, cmap='coolwarm')
plt.title('Correlation heatmap indicates many low correlated variables',
fontsize=18)
plt.show();
If any pair of predictor variables has a Pearson correlation coefficient with an absolute value greater than 0.7, those variables are strongly multicollinear. Therefore, only one of them should be used in your model.
Note: 0.7 is an arbitrary threshold. Some industries may use 0.6, 0.8, etc.
The following variable pairs are multicollinear with each other:
- `sessions` and `drives`: 1.0
- `activity_days` and `driving_days`: 0.95
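These pairs can also be pulled out programmatically (a sketch assuming the 0.7 threshold above):
# Sketch: list variable pairs with |Pearson r| > 0.7
corr = df.corr(method='pearson', numeric_only=True)
upper_triangle = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # ignore the diagonal and duplicate pairs
high_corr_pairs = corr.where(upper_triangle).stack()            # (var1, var2) -> r
print(high_corr_pairs[high_corr_pairs.abs() > 0.7])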
device
Let's create a binary column called `device2` that encodes user devices as follows:
- `Android` -> `0`
- `iPhone` -> `1`
# Create new `device2` variable
df['device2'] = np.where(df['device']=='Android', 0, 1)
df[['device', 'device2']].tail()
| | device | device2 |
|---|---|---|
14994 | iPhone | 1 |
14995 | Android | 0 |
14996 | iPhone | 1 |
14997 | iPhone | 1 |
14998 | iPhone | 1 |
To build the model, we need to determine which X variables to include in order to predict the target, `label2`.
Drop the following variables and assign the result to `X`:
- `label` (this is the target)
- `label2` (this is the target)
- `device` (this is the non-binary-encoded categorical variable)
- `sessions` (this had high multicollinearity)
- `driving_days` (this had high multicollinearity)

Note: `sessions` and `driving_days` were selected to be dropped, rather than `drives` and `activity_days`, because the features that were kept for modeling had slightly stronger correlations with the target variable than the features that were dropped.
# Isolate predictor variables
X = df.drop(columns = ['label', 'label2', 'device', 'sessions', 'driving_days'])
Now, isolate the dependent (target) variable. Assign it to a variable called `y`.
# Isolate target variable
y = df['label2']
Use scikit-learn's `train_test_split()` function to perform a train/test split on the data using the X and y variables assigned above.
Fit the model on the training set and evaluate it on the test set to avoid data leakage.
*IMPORTANT: Because the target class is imbalanced (82% retained vs. 18% churned), set the function's `stratify` parameter to `y` to ensure that the minority class appears in both the train and test sets in the same proportion that it does in the overall dataset.*
# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
# Use .head()
X_train.head()
| | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | km_per_driving_day | professional_driver | device2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
152 | 108 | 186.192746 | 3116 | 243 | 124 | 8898.716275 | 4668.180092 | 24 | 612.305861 | 1 | 1 |
11899 | 2 | 3.487590 | 794 | 114 | 18 | 3286.545691 | 1780.902733 | 5 | 3286.545691 | 0 | 1 |
10937 | 139 | 347.106403 | 331 | 4 | 7 | 7400.838975 | 2349.305267 | 15 | 616.736581 | 0 | 0 |
669 | 108 | 455.439492 | 2320 | 11 | 4 | 6566.424830 | 4558.459870 | 18 | 410.401552 | 1 | 1 |
8406 | 10 | 89.475821 | 2478 | 135 | 0 | 1271.248661 | 938.711572 | 27 | 74.779333 | 0 | 1 |
Use scikit-learn to instantiate a logistic regression model. Add the argument `penalty=None`.
It is important to set `penalty=None` (no regularization) since the predictors are unscaled.
Fit the model on `X_train` and `y_train`.
model = LogisticRegression(penalty=None, max_iter=400)
model.fit(X_train, y_train)
LogisticRegression(max_iter=400, penalty=None)
Call the `.coef_` attribute on the model to get the coefficients of each variable. The coefficients appear in the same order as the columns of `X`. Remember that each coefficient represents the change in the log odds of the target variable for every one-unit increase in that predictor.
pd.Series(model.coef_[0], index=X.columns)
drives                     0.001911
total_sessions             0.000328
n_days_after_onboarding   -0.000407
total_navigations_fav1     0.001231
total_navigations_fav2     0.000933
driven_km_drives          -0.000015
duration_minutes_drives    0.000109
activity_days             -0.106034
km_per_driving_day         0.000018
professional_driver       -0.001529
dtype: float64
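Since these coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate (a brief sketch):
# Sketch: convert log-odds coefficients to odds ratios
odds_ratios = np.exp(pd.Series(model.coef_[0], index=X.columns))
odds_ratios.sort_values()
For example, the `activity_days` coefficient of about -0.106 corresponds to an odds ratio of roughly 0.90, meaning each additional activity day multiplies a user's odds of churning by about 0.90, holding the other features constant.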
Call the model's `intercept_` attribute to get the intercept of the model.
model.intercept_
array([-0.00170711])
Verify the linear relationship between X and the estimated log odds (known as logits) by making a regplot.
Call the model's `predict_proba()` method to generate the probability of response for each sample in the training data. The first column is the probability of the user not churning, and the second column is the probability of the user churning.
# Get the predicted probabilities of the training data
training_probabilities = model.predict_proba(X_train)
training_probabilities
array([[0.93964321, 0.06035679],
       [0.6195027 , 0.3804973 ],
       [0.76473108, 0.23526892],
       ...,
       [0.91906683, 0.08093317],
       [0.85087326, 0.14912674],
       [0.93515221, 0.06484779]])
In logistic regression, the relationship between a predictor variable and the dependent variable does not need to be linear; however, the log-odds (a.k.a. logit) of the dependent variable with respect to the predictor variable should be linear.
Create a dataframe called `logit_data` that is a copy of `X_train`.
Create a new column called `logit` in the `logit_data` dataframe. The data in this column should represent the logit (log-odds) for each user.
# 1. Copy the `X_train` dataframe and assign to `logit_data`
logit_data = X_train.copy()
# 2. Create a new `logit` column in the `logit_data` df
logit_data['logit'] = [np.log(prob[1] / prob[0]) for prob in training_probabilities]
Plot a regplot where the x-axis represents an independent variable and the y-axis represents the log-odds of the predicted probabilities.
In an exhaustive analysis, this would be plotted for each continuous or discrete predictor variable. Here we show only `activity_days`.
# Plot regplot of `activity_days` log-odds
sns.regplot(x='activity_days', y='logit', data=logit_data, scatter_kws={'s': 2, 'alpha': 0.5})
plt.title('Log-odds: activity_days');
If the logistic assumptions are met, the model results can be appropriately interpreted.
# Generate predictions on X_test
y_preds = model.predict(X_test)
Now, use the `score()` method on the model with `X_test` and `y_test` as its two arguments. The default score in scikit-learn is accuracy.
# Score the model (accuracy) on the test data
model.score(X_test, y_test)
0.8237762237762237
Let's use the `confusion_matrix` function to obtain a confusion matrix. Use `y_test` and `y_preds` as arguments.
cm = confusion_matrix(y_test, y_preds)
Next, use the `ConfusionMatrixDisplay()` function to display the confusion matrix from the above cell, passing the confusion matrix you just created as its argument.
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=['retained', 'churned'],
)
disp.plot();
Then we compute the precision and recall as follows:
# Calculate precision
precision = precision_score(y_test, y_preds)
precision
0.5178571428571429
# Calculate recall
recall = recall_score(y_test, y_preds)
recall
0.0914826498422713
# Create a classification report
target_labels = ['retained', 'churned']
print(classification_report(y_test, y_preds, target_names=target_labels))
              precision    recall  f1-score   support

    retained       0.83      0.98      0.90      2941
     churned       0.52      0.09      0.16       634

    accuracy                           0.82      3575
   macro avg       0.68      0.54      0.53      3575
weighted avg       0.78      0.82      0.77      3575
Note: The model has mediocre precision and very low recall, which means that it makes a lot of false negative predictions and fails to capture users who will churn.
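To make the false-negative problem concrete, the raw counts can be unpacked from the confusion matrix computed above (a sketch):
# Sketch: unpack confusion-matrix counts to show how few churned users the model catches
tn, fp, fn, tp = cm.ravel()
print(f'True negatives: {tn}   False positives: {fp}')
print(f'False negatives: {fn}   True positives: {tp}')
print(f'Recall = TP / (TP + FN) = {tp / (tp + fn):.3f}')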
Let's generate a bar graph of the model's coefficients to visualize feature importance.
# Create a list of (column_name, coefficient) tuples
feature_importance = list(zip(X_train.columns, model.coef_[0]))
# Sort the list by coefficient value
feature_importance = sorted(feature_importance, key=lambda x: x[1], reverse=True)
feature_importance
[('drives', 0.001910838978860374),
 ('total_navigations_fav1', 0.0012305740479785518),
 ('total_navigations_fav2', 0.0009325794127833079),
 ('total_sessions', 0.0003281500839471898),
 ('duration_minutes_drives', 0.00010908852029217041),
 ('km_per_driving_day', 1.826097364101309e-05),
 ('driven_km_drives', -1.4932626945818142e-05),
 ('n_days_after_onboarding', -0.0004065688207715244),
 ('professional_driver', -0.001528706592921126),
 ('activity_days', -0.10603360669096709)]
# Plot the feature importances
sns.barplot(x=[x[1] for x in feature_importance],
y=[x[0] for x in feature_importance],
orient='h')
plt.title('Feature importance');
Now that we have built the logistic regression model, it's time to share our findings with the Waze leadership team.
Highlights:
`activity_days` was by far the most important feature in the model. It had a negative correlation with user churn. This was not surprising, as this variable was very strongly correlated with `driving_days`, which was known from EDA to have a negative correlation with churn.
In previous EDA, user churn rate increased as the values in `km_per_driving_day` increased. The correlation heatmap in this notebook revealed this variable to have the strongest positive correlation with churn of any of the predictor variables by a relatively large margin. In the model, it was the second-least-important variable.
In a multiple logistic regression model, features can interact with each other and these interactions can result in seemingly counterintuitive relationships. This is both a strength and a weakness of predictive models, as capturing these interactions typically makes a model more predictive while at the same time making the model more difficult to explain.
It depends. What would the model be used for? If it's used to drive consequential business decisions, then no. The model is not a strong enough predictor, as made clear by its poor recall score. However, if the model is only being used to guide further exploratory efforts, then it can have value.
New features could be engineered to try to generate better predictive signal, as they often do when you have domain knowledge. In the case of this model, one of the engineered features (`professional_driver`) was the third-most-predictive feature. It could also be helpful to scale the predictor variables, and/or to reconstruct the model with different combinations of predictor variables to reduce noise from unpredictive features.
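A minimal sketch of that scaling step, assuming a standard scikit-learn pipeline (illustrative only, not part of the analysis above):
# Sketch: standardize predictors and refit; with scaled features, the default L2 penalty is reasonable
from sklearn.pipeline import make_pipeline

scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=400))
scaled_model.fit(X_train, y_train)
print(scaled_model.score(X_test, y_test))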
It would be helpful to have drive-level information for each user (such as drive times, geographic locations, etc.). It would probably also be helpful to have more granular data to know how users interact with the app. For example, how often do they report or confirm road hazard alerts? Finally, it could be helpful to know the monthly count of unique starting and ending locations each driver inputs.