#!/usr/bin/env python # coding: utf-8 # # "Decision intelligence from historical observations for optimal marketing resource use" # > "In this article we analyze direct marketing data and prototype a decision model to optimize future marketing uplift." # - toc: true # - branch: master # - badges: true # - comments: true # - categories: [python, numpy, scikit-learn, marketing, causal inference, uplift] # # Summary # # Marketing is a key success and revenue driver in B2C markets: An appropriate message placed at the appropriate time with a prospective customer will increase your business success. # # However, marketing is also a major cost driver for businesses: Marketing efforts that are too broad, target the wrong audience, or convey the wrong message waste resources. # # In the case of direct marketing via phone conversations a key cost factor is the amount of time a sales call agent spends with the prospective customer on the phone. # # In this article we explore, rudimentarily, direct marketing data of a Portuguese financial institution. # # We explore the relationship between call duration and success (purchase of offered financial product), and show that consideration of customer-specific factors influences how you should allocate your marketing resources. # # Our prototypical analysis can be usueful in devising **data-driven marketing and sales** strategies that offer **decision intelligence** for your call agents. # # Fetch the data # # For our prototype we use the openly accessible [Bank Marketing Data Set](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). # In[ ]: get_ipython().system('wget --quiet https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip') # In[ ]: get_ipython().system('unzip -oqq bank.zip') # # Load Python libraries # In[1]: import graphviz import numpy as np import pandas as pd import seaborn as sns from sklearn.cluster import KMeans from sklearn.preprocessing import LabelEncoder # In[2]: np.random.seed(42) # In[3]: #hide def gv(s): return graphviz.Source('digraph G{ rankdir="LR"' + s + '; }') # # Prepare data # # The data we work with here contain a number of categorical and numerical variables. To keep our analysis and prototype simple we will focus on only a handful and remove the remainder. # In[ ]: #collapse df = pd.read_csv('bank.csv', delimiter=';') df['success'] = df['y'] del df['y'] df['success'] = df['success'].replace('no', 0) df['success'] = df['success'].replace('yes', 1) del df['education'] del df['default'] del df['housing'] del df['loan'] del df['contact'] del df['day'] del df['month'] del df['campaign'] del df['pdays'] del df['previous'] del df['poutcome'] # Our tabular data set now looks as follows: Each prospective (and in some cases eventual) customer whom a call agent conversed with fills a row. On each row we have numerical variables (age, account balance, duration of sales interaction) and categorical variables (job / employment status and marital status). Our data set contains **4,521 sales interactions**. # In[67]: df # # High-level model: more is better # # A blanket approach to marketing and sales may be: More resources lead to greater success. # # So in the case of direct marketing on the phone we could expect that the more time we spend with a prospective customer on the phone, the bigger our success rate. # In[6]: #hide_input gv(''' duration->success ''') # To test our model, we discretize the duration of our interaction with the customer into six duration buckets: bucket 1 holds the shortest interactions while bucket 6 holds the longest interactions. # In[ ]: no_buckets = 6 df['duration_bucket'] = pd.qcut(df['duration'], no_buckets, labels=[f'bucket {b + 1}' for b in range(no_buckets)]) # In[201]: df.groupby('duration_bucket').agg({'success': 'mean'}) # Looking at the average success rate in each duration bucket shows us that there is positive correlation between the duration of a sales interaction and our success rate - just as our model predicted. # # Hence, more marketing spend appears to lead to greater success in general. # # From a data perspective this is a pretty disappointing result as we expect to glean more intelligent insights from all the data we collected. # # Nuanced model: more isn't always better and there are always tradeoffs # # Let's dig deeper into what is going on here: Yes, the duration of the interaction between call agent and prospective customer likely influences our success rate. # # However, call agents also probably choose to spend more time on the phone with customers whose account balance is higher - hoping for a greater chance of a sale. That same account balance also likely influences how affine the customer is for spending more money on financial products. # # Both present job status and marital status are also likely candidates for influencing an affinity for financial products. # # And age of the customer probably influences both their job and marital status. # In[202]: #hide_input gv(''' age->job; age->marital; job->balance; balance->duration; marital->success; job->success; balance->success; duration->success ''') # Since a customer's account balance probably influences both how much time we spend with them on the phone and their likelihood of purchasing another financial product we will **control for account balance**. # # We control for account balance by training a cluster algorithm that segments our data set into three groups of similar account balance. # In[ ]: df['job'] = LabelEncoder().fit_transform(df['job']) df['marital'] = LabelEncoder().fit_transform(df['marital']) # In[ ]: segmenter = KMeans(n_clusters=3, random_state=42) # In[ ]: df['segment'] = segmenter.fit_predict(df[['balance']]) # In[ ]: df['segment'] = df['segment'].replace({0: 'low balance', 1: 'high balance', 2: 'medium balance'}) # Looking at both the average account balance and age in our three segments, we notice that our clustering algorithm picked out low, medium, and high balance segments. # # We also notice that average age correlates with average balance in these three segments hence our intuition codified in our above model seems valid. # In[225]: df.groupby('segment').agg({'age': 'mean', 'balance': 'mean'}) # Now, what about the effectiveness of our marketing resources in each segment? # # Visualizing our rate of success in the six duration buckets broken down by account balance segment we see a more nuanced picture: # # - Customers with low account balances really need to be worked on and only show success rates greater than 20% in the highest duration bucket 6, # - customers with medium balances already show a greater than 20% purchase likelihood in duration bucket 4, and # - customers with high balances actually max out in duration bucket 5 and drop below a 20% success rate in bucket 6. # In[ ]: success_rates = df.groupby(['segment', 'duration_bucket']).agg({'success': 'mean'}).reset_index() # In[228]: sns.set(rc={'figure.figsize': (10,6)}) sns.barplot( x='duration_bucket', y='success', hue='segment', data=success_rates, hue_order=['low balance', 'medium balance', 'high balance'] ); # Our more nuanced model and analysis provide us with **data-driven insights that provide actionable and testable advice**: # # - We should probably re-evaluate whether low balance individuals are sensible targets for our marketing campaigns given how resource-intensive they are, # - compute the profit and loss tradeoff between spending bucket 4 and bucket 6 resources on medium balance individuals, and # - ensure that we do not overdo it with our calls for high balance individuals.