#!/usr/bin/env python
# coding: utf-8

# # **Waze Project**
# **Binomial Logistic Regression**

# ## About Waze
# Waze is a mobile app that provides real-time traffic information. The app is available on both Android and iOS. The objective of this project is to predict churn among Waze users so the company can make relevant decisions to improve the customer experience and retain users.
#
# **Objective**
#
# **The purpose** of this project is to demonstrate knowledge of exploratory data analysis (EDA) and a binomial logistic regression model.
#
# **The goal** is to build a binomial logistic regression model and evaluate the model's performance.
#
#
# *This activity has three parts:*
#
# **Part 1:** EDA & Checking Model Assumptions
#
# **Part 2:** Model Building and Evaluation
#
# **Part 3:** Interpreting Model Results
#
#
# **Build a regression model**
#
# **PACE stages**
#
# This notebook follows the PACE stages: Plan, Analyze, Construct, and Execute.

# ## **PACE: Plan**

# ### **Step 1. Imports and data loading**
# Import the data and the packages needed for building logistic regression models.

# In[2]:

# Packages for numerics + dataframes
import pandas as pd
import numpy as np

# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Packages for Logistic Regression & Confusion Matrix
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
    recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression

# Import the dataset.

# In[3]:

# Load the dataset by running this cell
df = pd.read_csv('waze_dataset.csv')

# ## **PACE: Analyze**
#
# In this stage, let's consider the following question:
#
# * Why should we conduct EDA before building our binomial logistic regression model?
#
# > *Outliers and extreme data values can significantly impact logistic regression models. After visualizing data, make a plan for addressing outliers by dropping rows, substituting extreme data with average data, and/or removing values more than 3 standard deviations from the mean.*
#
# > *EDA activities also include identifying missing data to help the analyst make decisions on their exclusion or inclusion by substituting values with dataset means, medians, and other similar methods.*
#
# > *Additionally, it can be useful to create variables by multiplying variables together or calculating the ratio between two variables. For example, in this dataset you can create a drives_sessions_ratio variable by dividing drives by sessions.*

# ### **Step 2a. Explore data with EDA**
#
# Analyze and discover data, looking for correlations, missing data, potential outliers, columns that need to be transformed, and/or duplicates.
#
# Start with `shape` and `info()`.

# In[4]:

print(df.shape)

df.info()

# **Checkpoint:** Are there any missing values in your data?
#
# > * There are 700 missing values in the `label` column.

# Use `head()`.

# In[5]:

df.head()

# Use the `drop()` method to remove the ID column since you don't need this information for your analysis.

# In[6]:

df = df.drop('ID', axis=1)

# Now, check the class balance of the dependent (target) variable, `label`.

# In[7]:

df['label'].value_counts(normalize=True)

# Call `describe()` on the data.

# In[8]:

df.describe()

# **Checkpoint:** Any outliers?
#
# > **The following columns all seem to have outliers, as their max values are above the upper fence (`Q3 + 1.5 * IQR`):**
#
# * `sessions`
# * `drives`
# * `total_sessions`
# * `total_navigations_fav1`
# * `total_navigations_fav2`
# * `driven_km_drives`
# * `duration_minutes_drives`

# ### **Step 2b. Create features**
#
# Create features that may be of interest to the stakeholder and/or that are needed to address the business scenario/problem.

# #### **`km_per_driving_day`**
#
# Earlier EDA showed that churn rate correlates with distance driven per driving day in the last month. It might be helpful to engineer a feature that captures this information.
#
# 1. Create a new column in `df` called `km_per_driving_day`, which represents the mean distance driven per driving day for each user.
#
# 2. Call the `describe()` method on the new column.

# In[9]:

# 1. Create `km_per_driving_day` column
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']

# 2. Call `describe()` on the new column
df['km_per_driving_day'].describe()

# Some values are infinite because some rows have a value of zero in the `driving_days` column. Division by zero is undefined, so pandas returns infinity in the corresponding rows of the new column.

# In[10]:

# 1. Convert infinite values to zero
df.loc[df['km_per_driving_day']==np.inf, 'km_per_driving_day'] = 0

# 2. Confirm that it worked
df['km_per_driving_day'].describe()

# #### **`professional_driver`**
#
# Let's create a new binary feature called `professional_driver` that flags users who had 60 or more drives **and** drove on 15 or more days in the last month.
#
# **Note:** The objective is to create a new feature that separates professional drivers from other drivers. In this scenario, domain knowledge and intuition are used to determine these deciding thresholds, but ultimately they are arbitrary.

# In[11]:

# Create `professional_driver` column
df['professional_driver'] = np.where((df['drives'] >= 60) & (df['driving_days'] >= 15), 1, 0)

# Let's inspect the new variable.
#
# 1. Check the count of professional drivers and non-professionals
#
# 2. Within each class (professional and non-professional) calculate the churn rate

# In[12]:

# 1. Check count of professionals and non-professionals
print(df['professional_driver'].value_counts())

# 2. Check in-class churn rate
df.groupby(['professional_driver'])['label'].value_counts(normalize=True)

# The churn rate for professional drivers is 7.6%, while the churn rate for non-professionals is 19.9%. This seems like it could add predictive signal to the model.

# ## **PACE: Construct**
#
# In this stage, we will consider the following question:
#
# * Why did we select the X variables?
#
# > *Initially, columns were dropped based on high multicollinearity. Later, variable selection can be fine-tuned by running and rerunning models to look at changes in accuracy, recall, and precision.*
#

# > *Initial variable selection was based on the business objective and insights from prior EDA.*

# ### **Step 3a. Preparing variables**

# Call `info()` on the dataframe to check the data type of the `label` variable and to verify if there are any missing values.

# In[13]:

df.info()

# Since there is no evidence of a non-random cause of the 700 missing values in the `label` column, and these observations comprise less than 5% of the data, let's drop the rows that are missing this data.

# In[14]:

# Drop rows with missing data in `label` column
df = df.dropna(subset=['label'])

# #### **Impute outliers**
#
# Generally, we don't drop outliers unless it's necessary.
#
# At times, outliers can be changed to the **median, mean, 95th percentile, etc.**
#
# The potential outliers are:
#
# * `sessions`
# * `drives`
# * `total_sessions`
# * `total_navigations_fav1`
# * `total_navigations_fav2`
# * `driven_km_drives`
# * `duration_minutes_drives`
#
# For this analysis, impute the outlying values for these columns. Calculate the **95th percentile** of each column and replace any value that exceeds it with that percentile.

# In[15]:

# Impute outliers
for column in ['sessions', 'drives', 'total_sessions', 'total_navigations_fav1',
               'total_navigations_fav2', 'driven_km_drives', 'duration_minutes_drives']:
    threshold = df[column].quantile(0.95)
    df.loc[df[column] > threshold, column] = threshold

# Call `describe()`.

# In[16]:

df.describe()

# #### **Encode categorical variables**

# Change the data type of the `label` column to be binary. This change is needed to train a logistic regression model.
#
# Assign a `0` for all `retained` users.
#
# Assign a `1` for all `churned` users.
#
# Save this variable as `label2` so as not to overwrite the original `label` variable.

# In[20]:

# Create binary `label2` column
df['label2'] = np.where(df['label']=='churned', 1, 0)
df[['label', 'label2']].tail()

# ### **Step 3b. Determine whether assumptions have been met**
#
# The following are the assumptions for logistic regression:
#
# * Independent observations (This refers to how the data was collected.)
#
# * No extreme outliers (This has been addressed above.)
#
# * Little to no multicollinearity among X predictors (We are about to look into this.)
#
# * Linear relationship between X and the **logit** of y (This will be verified after modeling.)

# #### **Collinearity**
#
# Check the correlation among predictor variables. First, generate a correlation matrix.

# In[22]:

# Generate a correlation matrix
# numeric_only=True skips the string columns (`label`, `device`), which pandas 2.0+ would otherwise reject
df.corr(method='pearson', numeric_only=True)

# Now, plot a correlation heatmap.

# In[23]:

# Plot correlation heatmap
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(method='pearson', numeric_only=True), vmin=-1, vmax=1, annot=True, cmap='coolwarm')
plt.title('Correlation heatmap indicates many low correlated variables',
          fontsize=18)
plt.show();

# If two predictor variables have a Pearson correlation coefficient with an **absolute value greater than 0.7**, they are strongly collinear. Therefore, only one of these variables should be used in your model.
#
# **Note:** 0.7 is an arbitrary threshold. Some industries may use 0.6, 0.8, etc.
#
# **The following variables are multicollinear with each other:**
#
# > * *`sessions` and `drives`: 1.0*
#
# > * *`driving_days` and `activity_days`: 0.95*

# ### **Step 3c. Create a binary variable for `device`**
#
# Let's create a binary column called `device2` that encodes user devices as follows:
#
# * `Android` -> `0`
# * `iPhone` -> `1`

# In[19]:

# Create new `device2` variable
df['device2'] = np.where(df['device']=='Android', 0, 1)
df[['device', 'device2']].tail()

# ### **Step 3d. Model building**

# #### **Assign predictor variables and target**
#
# To build the model, we need to determine which X variables to include to predict the target, `label2`.
#
# Drop the following variables and assign the results to `X`:
#
# * `label` (this is the target)
# * `label2` (this is the target)
# * `device` (this is the non-binary-encoded categorical variable)
# * `sessions` (this had high multicollinearity)
# * `driving_days` (this had high multicollinearity)
#
# `sessions` and `driving_days` were selected to be dropped, rather than `drives` and `activity_days`. The reason for this is that the features that were kept for modeling had slightly stronger correlations with the target variable than the features that were dropped.

# In[24]:

# Isolate predictor variables
X = df.drop(columns = ['label', 'label2', 'device', 'sessions', 'driving_days'])

# Now, isolate the dependent (target) variable. Assign it to a variable called `y`.

# In[25]:

# Isolate target variable
y = df['label2']

# #### **Split the data**
#
# Use scikit-learn's `train_test_split()` function to perform a train/test split on your data using the X and y variables assigned above.
#
# *Fit the model on the training set and evaluate it on the test set to avoid data leakage.*
#
# ***IMPORTANT:** Because the target class is imbalanced (82% retained vs. 18% churned), set the function's `stratify` parameter to `y` to ensure that the minority class appears in both train and test sets in the same proportion that it does in the overall dataset.*

# In[26]:

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# In[23]:

# Use .head()
X_train.head()

# Use scikit-learn to instantiate a logistic regression model. Add the argument `penalty=None`.
#
# It is important to set `penalty=None` since your predictors are unscaled: a regularization penalty acts on the size of the coefficients, which depends on the scale of each predictor.
#
# Fit the model on `X_train` and `y_train`.

# In[28]:

model = LogisticRegression(penalty=None, max_iter=400)

model.fit(X_train, y_train)

# Call the `.coef_` attribute on the model to get the coefficients of each variable. The coefficients are in the same order as the columns of `X`. Remember that the coefficients represent the change in the **log odds** of the target variable for **every one unit increase in X**.

# In[29]:

pd.Series(model.coef_[0], index=X.columns)

# Call the model's `intercept_` attribute to get the intercept of the model.

# In[30]:

model.intercept_

# #### **Check final assumption**
#
# Verify the linear relationship between X and the estimated log odds (known as logits) by making a regplot.
#
# Call the model's `predict_proba()` method to generate the probability of response for each sample in the training data. The first column is the probability of the user not churning, and the second column is the probability of the user churning.

# In[31]:

# Get the predicted probabilities of the training data
training_probabilities = model.predict_proba(X_train)
training_probabilities

# In logistic regression, the relationship between a predictor variable and the dependent variable does not need to be linear. However, the log-odds (a.k.a. logit) of the dependent variable with respect to the predictor variable should be linear.
#
# 1. Create a dataframe called `logit_data` that is a copy of `X_train`.
#
# 2. Create a new column called `logit` in the `logit_data` dataframe. The data in this column should represent the logit for each user.

# In[28]:

# 1. Copy the `X_train` dataframe and assign to `logit_data`
logit_data = X_train.copy()

# 2. Create a new `logit` column in the `logit_data` df
logit_data['logit'] = [np.log(prob[1] / prob[0]) for prob in training_probabilities]

# Plot a regplot where the x-axis represents an independent variable and the y-axis represents the log-odds of the predicted probabilities.
#
# In an exhaustive analysis, this would be plotted for each continuous or discrete predictor variable. Here we show only `activity_days`.

# In[29]:

# Plot regplot of `activity_days` log-odds
sns.regplot(x='activity_days', y='logit', data=logit_data, scatter_kws={'s': 2, 'alpha': 0.5})
plt.title('Log-odds: activity_days');

# ## **PACE: Execute**

# ### **Step 4a. Results and evaluation**
#
# If the logistic assumptions are met, the model results can be appropriately interpreted.

# In[32]:

# Generate predictions on X_test
y_preds = model.predict(X_test)

# Now, use the `score()` method on the model with `X_test` and `y_test` as its two arguments. The default score in scikit-learn is **accuracy**.

# In[33]:

# Score the model (accuracy) on the test data
model.score(X_test, y_test)

# ### **Step 4b. Show results with a confusion matrix**

# Let's use the `confusion_matrix` function to obtain a confusion matrix. Use `y_test` and `y_preds` as arguments.

# In[32]:

cm = confusion_matrix(y_test, y_preds)

# Next, use the `ConfusionMatrixDisplay` class to display the confusion matrix from the above cell, passing the confusion matrix you just created as its argument.
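
# Before plotting, it can help to see how the cells of a binary confusion matrix map to precision and recall. The sketch below uses a small set of made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical and not taken from the Waze data) and relies on the scikit-learn layout in which rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels (not from the Waze data): 1 = churned, 0 = retained
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes, in sorted label order
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)

# For a binary problem, ravel() unpacks the cells as TN, FP, FN, TP
tn, fp, fn, tp = cm_toy.ravel()

# Precision: share of predicted churners who actually churned
precision_toy = tp / (tp + fp)

# Recall: share of actual churners the predictions caught
recall_toy = tp / (tp + fn)

print(cm_toy)
print(precision_toy, recall_toy)
```

# Many false negatives (the `fn` cell) pull recall down even when accuracy looks acceptable. The plotting step described above applies the same layout to `cm`.
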
# In[33]:

disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=['retained', 'churned'],
                              )
disp.plot();

# Then we compute the precision and recall as follows:

# In[36]:

# Calculate precision
precision = precision_score(y_test, y_preds)
precision

# In[37]:

# Calculate recall
recall = recall_score(y_test, y_preds)
recall

# In[38]:

# Create a classification report
target_labels = ['retained', 'churned']
print(classification_report(y_test, y_preds, target_names=target_labels))

# **Note:** The model has mediocre precision and very low recall, which means that it makes a lot of false negative predictions and fails to capture users who will churn.

# ### **Model's coefficients and feature importance**
#
# Let's generate a bar graph of the model's coefficients.

# In[39]:

# Create a list of (column_name, coefficient) tuples
feature_importance = list(zip(X_train.columns, model.coef_[0]))

# Sort the list by coefficient value
feature_importance = sorted(feature_importance, key=lambda x: x[1], reverse=True)
feature_importance

# In[40]:

# Plot the feature importances (seaborn is already imported as `sns` above)
sns.barplot(x=[x[1] for x in feature_importance],
            y=[x[0] for x in feature_importance],
            orient='h')
plt.title('Feature importance');

# ### **Step 4c. Conclusion**
#
# Now that we have built the logistic regression model, it's time to share our findings with the Waze leadership team.
#
# **Highlights:**
#
# 1. The variable that influenced the model's prediction the most:
#
# > _`activity_days` was by far the most important feature in the model. It had a negative correlation with user churn. This was not surprising, as this variable was very strongly correlated with `driving_days`, which was known from EDA to have a negative correlation with churn._
#
# 2. Variables expected to be stronger predictors:
#
# > _In previous EDA, user churn rate increased as the values in `km_per_driving_day` increased. The correlation heatmap in this notebook revealed this variable to have the strongest positive correlation with churn of any of the predictor variables by a relatively large margin. In the model, it was the second-least-important variable._
#
# > _In a multiple logistic regression model, features can interact with each other and these interactions can result in seemingly counterintuitive relationships. This is both a strength and a weakness of predictive models, as capturing these interactions typically makes a model more predictive while at the same time making the model more difficult to explain._
#
# 3. Should we recommend this model to the Waze leadership team?
#
# > _It depends. What would the model be used for? If it's used to drive consequential business decisions, then no. The model is not a strong enough predictor, as made clear by its poor recall score. However, if the model is only being used to guide further exploratory efforts, then it can have value._
#
# 4. Potential ways to improve the model:
#
# > _New features could be engineered to try to generate better predictive signal, which they often do when informed by domain knowledge. In the case of this model, one of the engineered features (`professional_driver`) was the third-most-predictive predictor. It could also be helpful to scale the predictor variables, and/or to reconstruct the model with different combinations of predictor variables to reduce noise from unpredictive features._
#
# 5. Additional features that might help improve the model:
#
# > _It would be helpful to have drive-level information for each user (such as drive times, geographic locations, etc.). It would probably also be helpful to have more granular data to know how users interact with the app. For example, how often do they report or confirm road hazard alerts? Finally, it could be helpful to know the monthly count of unique starting and ending locations each driver inputs._

# In[ ]:
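
# As a follow-up to the suggestion above about scaling the predictor variables, here is a minimal sketch of one way to do it with a scikit-learn pipeline. Synthetic data from `make_classification` stands in for the Waze features (an assumption made only so the example runs on its own), and the names `X_demo`, `y_demo`, and `scaled_model` are hypothetical. Note that this sketch keeps scikit-learn's default L2 penalty, which differs from the `penalty=None` model built in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (hypothetical; not the Waze dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=6, n_informative=4,
                                     weights=[0.82, 0.18], random_state=42)

# Put one feature on a much larger scale, as raw app metrics often are
X_demo[:, 0] *= 1000

# Stratified split, mirroring the split used earlier in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          random_state=42)

# Standardize each feature, then fit logistic regression on the scaled values
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=400))
scaled_model.fit(X_tr, y_tr)

# Accuracy on the held-out split
test_accuracy = scaled_model.score(X_te, y_te)

# Coefficients are now per standard deviation of each feature,
# which makes their magnitudes more comparable across features
scaled_coefs = scaled_model.named_steps['logisticregression'].coef_[0]
print(test_accuracy)
print(scaled_coefs)
```

# Because the scaler sits inside the pipeline, it is fit on the training split only, so no information from the test split leaks into the scaling step.
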