#!/usr/bin/env python
# coding: utf-8

# #Churn Case Study
#
# ##Introduction and Motivation
#
# In this notebook we present an end-to-end analysis of a data mining task that is fairly common in the field. Most companies that provide subscription services (whether free or paid) face the issue of customer churn. This is loosely defined as customers disengaging from the company's service and, in many cases, no longer paying monthly or yearly fees. The issue is pervasive across industries (such as Telecom, Media, Music/Video Streaming, Gym Memberships, Credit Cards, Insurance, etc.), and it is common to find entire departments focused on improving churn rates.
#
#
# Churn reduction is a perfect environment for employing data mining techniques. Most companies maintain very detailed customer databases that host all of the data required to predict churn. An analysis of customer churn usually involves exploring the drivers of churn as well as predicting its likelihood. The goals are often multi-fold: on one level, predicting churn enables companies to proactively reach out to at-risk customers to prevent it. Additionally, understanding the drivers (i.e., the causes) of churn gives the company valuable feedback for addressing the underlying product or service issues that may be causing customer dissatisfaction.
#
#
# This notebook more than anything aims to illustrate a complete data mining workflow. Churn is the application, but the emphasis here is on the steps one might take when performing a data mining task. There is Python code that generates the models and illustrations shown, but the emphasis is more on the process than the code, so the code will mostly be hidden behind unexposed functions. We also provide generous discussion on how the results of the various analyses impact decisions one would make when focusing on churn reduction.
#
# Using a relatively straightforward churn dataset, we cover the following steps:
# The first thing we should do, as this guides the design of the entire analysis, is define the problem more formally. We know in general we want to reduce churn for the telecom firm. This could be implemented in several ways across the firm.
#
#
# The first thing one should do when starting an analysis is understand the data. "Data understanding" can, of course, mean multiple things. Here are some common questions to ask when approaching a data set:
#
# How was the data sampled, collected and aggregated?
# Unless you planned and executed every step of the data sampling, logging and transformation processes, you should always spend some time exploring this question. The best analyses can be sabotaged by poorly sampled and logged data, so a data scientist always needs to understand what biases may exist in the data. The data for this churn analysis is composed of all churners plus a subset of non-churners, sampled so that there is a roughly even distribution between the two groups. Each sampled customer is represented by a vector of predictors $X = (X_1, \ldots, X_p)$ and a binary churn outcome $Y$.
# When creating a dataset for churn modeling and analysis, one has to be mindful of the future application of the results. I.e., we don't want to predict past churn, we want to predict potential future churn. In building a data set to predict future churn, one can sample customers today and wait some period of time to observe whether or not each customer churns. While this is technically accurate, you might not want to wait what could be weeks or months to start the analysis. Additionally, you might want to study several cohorts of sampled customers so your analysis isn't biased towards any time- or season-dependent factors that are happening at the moment. The other option then is to simulate the future by going into the past. The following scheme shows how this is done:
#
The point "t=now" is the day you're doing the analysis. We need to go back in time and create a virtual present, which we call $t=0$. We then look backwards to aggregate the predictor variables, and we look forwards in time to observe churn. The above represents one cohort, but this can be replicated across multiple cohorts (or multiple dates considered $t=0$ to either get access to more data or build dataset that has less dependency on any single seasonal effect).
#
#
# What are the variables in the dataset?
# This is a simple question with a lot of implications behind it. Before digging into the distribution of the variables, one should first understand what they are from a domain perspective. Technically, you can build a model having no qualitative understanding of the data set. This works fine in homework assignments or data mining competitions. In industrial settings, however, having a contextual and qualitative understanding of the data is generally critical to conducting an effective analysis, and even more critical to communicating and applying the results. The following links to a schema for the dataset we'll be using:
#
# Churn Data Schema
#
# Each of the independent predictor variables represents either some static demographic of the customer, or a summary of some aspect of the customer's usage of a cellular phone service (where the summary ends at time $t=0$).
#
# An important question to highlight at this point is: why are we using these specific variables?
#
# The implications of this question are often the difference between a decent model and a great model. Separate from the more technical topic of variable selection is the topic of feature engineering (note: a "variable" and a "feature" are generally the same thing, but the more common phrase is "feature engineering"). The variables in this dataset were conceived and programmed by someone, and this part of the process is as much art as it is science. Most data infrastructure systems are designed to accommodate fast, production-level transaction processing. Analytics and modeling are often a secondary consideration. A data scientist is often responsible for pulling raw, log-level data from the transaction systems and designing the feature engineering pipelines that turn raw data into modeling variables. Designing the right variables requires domain knowledge of the system and insight into the behavior one is trying to predict. In this case, we'll assume that someone with deep knowledge of the problem has designed these specific variables with the hypothesis that they would both be effective predictors and scalable to create. Data scientists often find themselves working with pre-existing modeling systems. A very common research question is: what new variables can be added to the existing set to improve prediction performance? (Note: we won't actually be covering this question in this case study.)
#
# What does the data look like?
# This is where we start to explore the data from a quantitative perspective. More formally, we want to gain an understanding of the joint distribution of $X$ and $Y$ (because this is ultimately a predictive modeling problem, we follow convention and label the dependent variable/outcome $Y$). We'll focus here on three aspects of the joint distribution: 1) descriptive statistics of each variable, 2) covariance of the predictors, and 3) covariance of each predictor with the outcome. Within each aspect we'll cover the implications of the results.
#
# The first thing we'll do is look at descriptive statistics of the data. Common statistics to look at are the mean, variance, range, and distribution (via its various quantiles).
#
# In[1]:

import sys
sys.path.append("./utils/")
from course_utils import *
import churn_analysis as churn
from IPython.display import display, HTML

reload(churn)
data = pd.read_csv('/Users/briand/Desktop/ds course/hw/Cell2Cell_data.csv', header=0, sep=',')
data_descr = churn.getDfSummary(data)
HTML(data_descr.to_html())


# Putting the summary in tabular form is convenient and allows for easy comparison. The table shows the descriptive statistics we discussed above, and also includes useful additions such as the count of missing values and the cardinality of each variable. This last point is helpful for identifying which variables are binary.
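# The getDfSummary function above is a helper from the course utilities, and its implementation is not shown here. As an assumption about what it computes, a rough stand-in using plain pandas might look like this:

# In[ ]:

import pandas as pd

def df_summary(df):
    # Per-column descriptive stats, missing-value counts and cardinality
    summ = df.describe().T                  # mean, std, min/max and quantiles
    summ['number_nan'] = df.isnull().sum()  # count of missing values
    summ['number_distinct'] = df.nunique()  # cardinality (2 suggests a binary variable)
    return summ

# df_summary(data) should roughly mirror the table produced above.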
#
# So what is the use of looking at this data?
# Just glancing at the above, we can make the following observations about the data:
#
# Another useful exploratory technique is to understand the correlation of the variables. If the dimensionality of the data is reasonable enough (a very subjective call admittedly), we can visualize the correlation matrix in a nicely intuitive way:

# In[2]:

get_ipython().run_line_magic('matplotlib', 'inline')
reload(churn)
churn.plotCorr(data, 'churndep', 18, 8)

# The above illustration lets us quickly see which variables have high degrees of correlation amongst themselves. In particular, we see that 'incalls' and 'outcalls' have alarmingly high correlation. Intuitively this might make sense, as someone who makes frequent calls would also be more likely to receive more calls. An important question of course is: does this matter? The answer, of course (and this is the answer to many data science questions), is that it depends. In general, we usually want to be cautious when several variables are highly correlated (and even more so if they are collinear). Certain supervised learning algorithms, such as tree-based algorithms, can automatically deal with highly correlated variables. Tree-based algorithms are particularly good at handling data 'as-is,' and one might choose to use them based on that property. However, many industrial systems are built around linear models, such as logistic regression. These algorithms are definitely sensitive to data irregularities, and can certainly give strange results when two or more features are highly correlated.
#
# For the purpose of this case study, we're going to assume that we don't have restrictions on our choice of supervised learning algorithm. We can thus think ahead and acknowledge that our final model will likely be tree-based, and as a consequence we won't explicitly correct for the high degree of correlation present here. If we wanted to, though, some possible solutions would be simply dropping one of the correlated variables, or transforming the data using Principal Components.
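# If we did want to decorrelate the features, a Principal Components transform is straightforward. The following is a minimal sketch using sklearn, assuming the 'data' DataFrame loaded above with 'churndep' as the label column; it is not part of the modeling pipeline used in this study.

# In[ ]:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = data.drop('churndep', axis=1)

# PCA is scale-sensitive, so standardize the predictors first
X_std = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_std)
print(X.shape[1], 'original features ->', X_pca.shape[1], 'principal components')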
#
# Above we looked at how the various predictor variables correlated with each other. That is an exercise with its own motivation and implications. When faced with a supervised learning task, we often also care deeply about how all of the covariates correlate with our outcome of interest. This step of exploratory analysis has several benefits.
#
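# The chart below was generated with a plotting helper that isn't reproduced here. As an approximation of the underlying computation (not the author's exact code), one could estimate the mutual information of each predictor with the churn outcome using sklearn:

# In[ ]:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif

X = data.drop('churndep', axis=1)
y = data['churndep']

# Estimated mutual information between each predictor and the churn label
mi = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns)
mi.sort_values().plot(kind='barh', figsize=(10, 6))
plt.xlabel('Mutual information with churn')
plt.show()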
# This chart makes it pretty obvious which variables have a more correlated relationship with the churn outcome. For this analysis, we can ignore the absolute value of the mutual information and focus more on the relative values. Of the 11 variables, 3 look like they carry very little information about the churn outcome. Of the remaining 8, 'revenue' is clearly the leader by far. This is interesting, but to truly interpret it, we need to know one more piece of information: does the variable in question have a positive or negative relationship with churn? One issue with Mutual Information is that it doesn't tell us this.
#
# With a little hacking, we can show the same plot but color the bars according to the direction of the correlation.
#
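# Again as a sketch rather than the original helper: one simple hack is to compute the sign of each predictor's correlation with churn and use it to color the mutual information bars.

# In[ ]:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif

X = data.drop('churndep', axis=1)
y = data['churndep']

mi = pd.Series(mutual_info_classif(X, y, random_state=42), index=X.columns).sort_values()
corr_sign = X.corrwith(y)  # positive or negative association with churn

colors = ['green' if corr_sign[c] > 0 else 'red' for c in mi.index]
mi.plot(kind='barh', color=colors, figsize=(10, 6))
plt.xlabel('Mutual information with churn (green = positive correlation, red = negative)')
plt.show()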
# The above chart shows Mutual Information, but with the bars colored according to the direction of the correlation. So to read this, we can say that as 'revenue' increases, the likelihood to churn goes down. We'll leave it as a thought experiment for the reader to consider the following: "How would you explain the qualitative implications of this chart to your marketing colleagues?"
# ##Predictive Modeling
#
# ###Setup
#
# At this stage we should have a strong sense of where our data came from and what the data looks like. We know from our planning that we aim to learn a supervised model with a binary outcome. When we set up the problem, we did not specify any constraints around the learning algorithm we can use. There are many industrial situations where one might only be able to choose from a small class of algorithms. This could be motivated by scalability concerns, legacy issues or regulatory mandates.
#
# This is the stage where we need to decide whether or not to act on any of the findings we have made above. We know we have at least one highly redundant feature as well as three features that seem to have no relationship with our outcome. If we were constrained to using a linear model like logistic regression, we would definitely want to consider dropping these variables. We know, though, that we are free to use more sophisticated tree-based algorithms, which conveniently enables us to use the data as is. Nonetheless, we'll test in this study the effect of dropping these variables.
#
# Building the final model is one area where we can put the 'science' in data science. Specifically, choosing a model can be thought of as running a controlled experiment across the multitude of design options that can be built into the modeling process. It is generally helpful to be organized and specify the exact modeling options we will test. Here is the plan for this dataset:
#
# Modeling Design Parameters
#
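# The actual search in this notebook is wrapped in course helper functions. As an illustration of the general approach (the estimator and parameter grid below are placeholders, not the exact design used in this study), a cross-validated grid search scored on LogLoss could look like:

# In[ ]:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# In practice this would be run on the training portion only, not the full dataset
X = data.drop('churndep', axis=1)
y = data['churndep']

param_grid = {'n_estimators': [100, 300],
              'max_depth': [2, 3, 5],
              'learning_rate': [0.05, 0.1]}

gs = GridSearchCV(GradientBoostingClassifier(random_state=42),
                  param_grid,
                  scoring='neg_log_loss',  # sklearn maximizes, so LogLoss is negated
                  cv=5)
gs.fit(X, y)
print(gs.best_params_, 'LogLoss:', -gs.best_score_)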
# Now let's plot the results of each algorithm against each other.
# In[325]:

reload(churn)
churn.plotGridSearchMulti([('All', summ_all), ('Sub', summ_set)], True)


# The above bar chart shows the cross-validation results of all algorithms tested and their respective hyper-parameter configurations. For convenience, we plotted two reference lines. The green line is just the LogLoss of our best classifier. The red line is the value of the best option plus 1 standard error of the mean of that option. The red line gives us the ability to choose a classifier using the 1-Standard-Error rule, a generally good rule of thumb for being more conservative in one's selection. In our case, the optimal option uses all of the data, but the option using the same algorithm and hyper-parameters on less data is well within 1 standard error. When selecting a model, we generally want to promote parsimony (i.e., don't let things be more complex than they need to be), so we'll go with the option that uses less data. In the long run, this could have multiple benefits: 1) it is cheaper to execute and maintain, and 2) there is less over-fitting risk.
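# For reference, the 1-standard-error logic itself is simple. A sketch of how it could be applied to cross-validation results (using the hypothetical GridSearchCV object 'gs' from the sketch above, not the notebook's own helpers):

# In[ ]:

import numpy as np

res = gs.cv_results_
scores = -res['mean_test_score']             # back to LogLoss, lower is better
se = res['std_test_score'] / np.sqrt(gs.cv)  # standard error of each mean

best = np.argmin(scores)
threshold = scores[best] + se[best]          # best LogLoss plus one standard error

# All candidates within one standard error of the best; among these we would
# pick the simplest / cheapest configuration.
within_1se = [res['params'][i] for i in range(len(scores)) if scores[i] <= threshold]
print(within_1se)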
#
# Now let us refit the model on the entire training set and do evaluations on the test set.
#
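# The refit step itself is a one-liner. A sketch, assuming the selected estimator 'best_model', the label column name 'lab' ('churndep' here), and the 'train_subset'/'test_subset' frames used in the cells below:

# In[ ]:

from sklearn.metrics import log_loss

lab = 'churndep'

# Refit the chosen configuration on the full training set
best_model.fit(train_subset.drop(lab, 1), train_subset[lab])

# Sanity check: out-of-sample LogLoss on the held-out test set
test_probs = best_model.predict_proba(test_subset.drop(lab, 1))[:, 1]
print('Test LogLoss:', log_loss(test_subset[lab], test_probs))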
# We discussed above that our core metric here will be LogLoss. This worked well for model selection; however, simply looking at the value of the LogLoss metric without any reference points isn't always helpful. Our main motivation for using LogLoss was that we ultimately want good estimates of churn probabilities. One tool for understanding how well we achieved this goal is the probability calibration plot. Our evaluation plan will be to start with calibration plots. For additional insight into our model's capabilities, we'll also produce a lift chart. The lift chart is a good evaluation tool that is fairly accessible to non data scientists.
# In[405]:

preds_train = best_model.predict_proba(train_subset.drop(lab, 1))[:, 1]
preds_test = best_model.predict_proba(test_subset.drop(lab, 1))[:, 1]
val.plotCalib(train_subset[lab].values.ravel(), preds_train, bins=30, f=4, l='Train',
              w=16, h=8, fig_i=1, fig_j=2, fig_k=1)
val.plotCalib(test_subset[lab].values.ravel(), preds_test, bins=30, f=4, l='Test',
              w=16, h=8, fig_i=1, fig_j=2, fig_k=2)
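# The plotCalib function is another course helper. For reference, sklearn's calibration_curve provides a comparable view; a sketch using the test-set predictions computed above:

# In[ ]:

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

frac_pos, mean_pred = calibration_curve(test_subset[lab].values.ravel(),
                                        preds_test, n_bins=30)

plt.plot(mean_pred, frac_pos, 'o-', label='Test')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')  # identity line
plt.xlabel('Mean predicted probability')
plt.ylabel('Empirical churn rate')
plt.legend()
plt.show()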
#
# Interpreting the calibration plots
#
# Remember, the point of the calibration plot is to understand whether or not our probability predictions correspond to empirical probabilities that are near the prediction. Because we want to use our model predictions to estimate expected values, we want the calibration plot to be as close to the identity line as possible. Our main focus should be on the right chart above, which shows results for our out-of-sample dataset. The dominant trend we see here is that for the extremes (lower and higher estimated probabilities), our estimator tends to be more conservative. I.e., it underestimates the higher probabilities and overestimates the lower probabilities. We plot the 95% confidence intervals of the empirical probabilities (y-axis), and most of these overlap with the identity line. However, this might just be driven by sample size. We can see the same trend in the training data, where the sample sizes are sufficient.
#
# What should we do about this?
#
# When model calibration is an issue, we can deploy one of two tools to correct the problem. The first is a non-parametric method called Isotonic Regression, and the other is a parametric method called the Platt Correction. More details of these methods (in sklearn) can be found here. Because our calibration error here is very mild, we'll hold off on running these corrections for now. This topic will be shelved in our 'further research' to-do list.
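# If we did decide to correct the calibration, sklearn wraps both methods in CalibratedClassifierCV. A sketch (not something we run as part of this study):

# In[ ]:

from sklearn.calibration import CalibratedClassifierCV

# method='isotonic' for Isotonic Regression, method='sigmoid' for the Platt correction
calibrated = CalibratedClassifierCV(best_model, method='isotonic', cv=5)
calibrated.fit(train_subset.drop(lab, 1), train_subset[lab])

calibrated_preds = calibrated.predict_proba(test_subset.drop(lab, 1))[:, 1]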
#
# We'll now make a secondary evaluation using lift and recall curves. These curves can help us answer other questions about our model, and they are generally accessible to non data scientists.
#
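# The plotting helpers for these curves aren't reproduced here, but the underlying quantities are easy to compute. A sketch of cumulative recall and lift by targeting depth, using the test-set predictions from above:

# In[ ]:

import numpy as np
import matplotlib.pyplot as plt

y_test = test_subset[lab].values.ravel()

# Sort customers by predicted churn probability, highest risk first
order = np.argsort(preds_test)[::-1]
y_sorted = y_test[order]

depth = np.arange(1, len(y_sorted) + 1) / float(len(y_sorted))  # fraction of base contacted
recall = np.cumsum(y_sorted) / float(y_sorted.sum())            # fraction of churners captured
lift = recall / depth                                           # vs. random targeting

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(depth, recall)
axes[0].set_title('Cumulative recall')
axes[1].plot(depth, lift)
axes[1].set_title('Lift')
for ax in axes:
    ax.set_xlabel('Fraction of customers targeted')
plt.show()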
#
#
# Interpreting the lift and recall plots
#
# These plots look pretty normal and they don't suggest any major hiccups or issues with our model. Apart from being diagnostic, they can help answer questions that might come up when we apply our model. For instance, if we wanted to design an intervention for the most at-risk churners, we'd of course need to choose a cut-off point. There are advanced ways to do this (we'll discuss this later), but one could be to reach out to the at-risk churners subject to some budget constraint. For instance, we might have a call center that can reach out to 20% of the customer base up for renewal. If we chose our cut-off as the top 20%, then with the recall plot we could see that we'd be reaching about 34% of the would-be churners. Using the lift curve, we could say that this would make us about 36% more efficient at reaching would-be churners than if we had just randomly reached out to people.
#
# Since this is a case study motivated by a single, static dataset, we cannot actually deploy our churn prediction model. However, we can certainly consider how we might intend to put the model to use. There are many important considerations when deploying a prediction model in an industrial application. The two we'll discuss specifically are:
#
# Here is a quick description of each component in the above:
#
# This extension should be fairly cheap, as we have already built out the feature engineering pipeline that performs much of the data processing. One interesting implication of the above design is that we need to automate a lot of the decision making that we originally did manually. This is outside the scope of this case study, but it is one of the most exciting areas of industrial data science research!
#
# ##Wrap Up
#
# We have said a lot already, so this will be short! In this case study we tried to be thorough enough to cover all stages of defining, building and using a data science product in an industrial setting. Of course, an entire course could be taught on every element of this case study, so invariably, certain details have been left out. If you have any suggestions on how to improve this case study, feel free to reach out at: briand at gmail dot com.
# In[ ]: