As a Junior Data Analyst in the analytics department at Yandex, I have been given data on Yandex.Afisha visits, orders, and marketing costs. My first task is to help optimize marketing expenses.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from simple_colors import *
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
def data_preprocessing(visits_data_path, costs_data_path, orders_data_path):
    visits = pd.read_csv(visits_data_path, dtype={'Device': 'category', 'source Id': 'int32'},
                         parse_dates=['Start Ts', 'End Ts'])
    costs = pd.read_csv(costs_data_path, dtype={'source Id': 'int32', 'costs': 'float32'},
                        parse_dates=['dt'])
    orders = pd.read_csv(orders_data_path, dtype={'Revenue': 'float32'},
                         parse_dates=['Buy Ts'])
    # Normalize column names: lowercase with underscores
    visits.columns = visits.columns.str.replace(' ', '_').str.lower()
    orders.columns = orders.columns.str.replace(' ', '_').str.lower()
    # Remove duplicate rows
    visits.drop_duplicates(inplace=True)
    costs.drop_duplicates(inplace=True)
    orders.drop_duplicates(inplace=True)
    # Truncate the session start timestamp to day, week, month, and year
    visits['session_date'] = visits['start_ts'].astype("datetime64[D]")
    visits['session_week'] = visits['start_ts'].astype("datetime64[W]")
    visits['session_month'] = visits['start_ts'].astype("datetime64[M]")
    visits['session_year'] = visits['start_ts'].astype("datetime64[Y]")
    return visits, costs, orders
visits, costs, orders = data_preprocessing("visits_log_us.csv","costs_us.csv","orders_log_us.csv")
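Note: the astype("datetime64[M]")-style truncation above works on the pandas version this notebook was run with, but newer pandas releases reject week/month/year units in astype. A version-robust sketch of the same truncation (my addition, not part of the original run) would be:
# Portable alternative to the astype-based truncation above
visits['session_date'] = visits['start_ts'].dt.normalize()
visits['session_week'] = visits['start_ts'].dt.to_period('W').dt.start_time
visits['session_month'] = visits['start_ts'].dt.to_period('M').dt.to_timestamp()
visits['session_year'] = visits['start_ts'].dt.to_period('Y').dt.to_timestamp()
Note that period-based weeks start on Monday, whereas numpy's datetime64[W] weeks are anchored to the Unix epoch, so the exact week labels can differ slightly.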
# Displaying the data
print()
print(green("Visits data:", 'bold'))
print()
display(visits.info())
display(visits.head(2))
print()
print(green("Costs data:", 'bold'))
print()
display(costs.info())
display(costs.head(2))
print()
print(green("Orders data:", 'bold'))
print()
display(orders.info())
display(orders.head(2))
print()
print(green("Summary Discription:", 'bold'))
print()
print(green("Orders:", 'bold'))
display(orders.describe(exclude= ['uint64', 'datetime64[ns]']).round().T)
print(green("Costs:", 'bold'))
display(costs.describe().round().T)
Visits data:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 359400 entries, 0 to 359399
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 device 359400 non-null category
1 end_ts 359400 non-null datetime64[ns]
2 source_id 359400 non-null int64
3 start_ts 359400 non-null datetime64[ns]
4 uid 359400 non-null uint64
5 session_date 359400 non-null datetime64[ns]
6 session_week 359400 non-null datetime64[ns]
7 session_month 359400 non-null datetime64[ns]
8 session_year 359400 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](6), int64(1), uint64(1)
memory usage: 25.0 MB
None
| | device | end_ts | source_id | start_ts | uid | session_date | session_week | session_month | session_year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | touch | 2017-12-20 17:38:00 | 4 | 2017-12-20 17:20:00 | 16879256277535980062 | 2017-12-20 | 2017-12-14 | 2017-12-01 | 2017-01-01 |
| 1 | desktop | 2018-02-19 17:21:00 | 2 | 2018-02-19 16:53:00 | 104060357244891740 | 2018-02-19 | 2018-02-15 | 2018-02-01 | 2018-01-01 |
Costs data:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2542 entries, 0 to 2541
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 source_id 2542 non-null int64
1 dt 2542 non-null datetime64[ns]
2 costs 2542 non-null float32
dtypes: datetime64[ns](1), float32(1), int64(1)
memory usage: 69.5 KB
None
| | source_id | dt | costs |
|---|---|---|---|
| 0 | 1 | 2017-06-01 | 75.199997 |
| 1 | 1 | 2017-06-02 | 62.250000 |
Orders data:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50415 entries, 0 to 50414
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 buy_ts 50415 non-null datetime64[ns]
1 revenue 50415 non-null float32
2 uid 50415 non-null uint64
dtypes: datetime64[ns](1), float32(1), uint64(1)
memory usage: 1.3 MB
None
| | buy_ts | revenue | uid |
|---|---|---|---|
| 0 | 2017-06-01 00:10:00 | 17.00 | 10329302124590727494 |
| 1 | 2017-06-01 00:25:00 | 0.55 | 11627257723692907447 |
Summary Description:

Orders:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| revenue | 50415.0 | 5.0 | 22.0 | 0.0 | 1.0 | 2.0 | 5.0 | 2633.0 |
Costs:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| source_id | 2542.0 | 5.0 | 3.0 | 1.0 | 2.0 | 4.0 | 9.0 | 10.0 |
| costs | 2542.0 | 129.0 | 156.0 | 1.0 | 22.0 | 77.0 | 170.0 | 1788.0 |
On average, revenue per order is about 5.00, with a maximum of 2,633. The daily cost per source has a mean of about 129.00 and a maximum of about 1,788. The data have been optimized: each column now has an appropriate data type, duplicates have been removed, and the data are ready for analysis.
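As a quick sanity check (a minimal sketch, not part of the original output), the cleaning claims can be verified directly:
# Confirm there are no remaining duplicate rows and inspect the resulting dtypes
for name, df in {'visits': visits, 'costs': costs, 'orders': orders}.items():
    print(name, '- duplicated rows:', df.duplicated().sum())
print(orders.dtypes)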
Number of visits in a day
dau_total = visits.groupby('session_date').agg({'uid': 'nunique'}).mean()
print(int(dau_total))
907
On average, about 907 people use Yandex.Afisha every day.
Visualizing daily visits
fig = px.line(visits.groupby(visits['start_ts'].dt.date)['uid'].nunique().reset_index(),
x="start_ts", y="uid", title="<b>Daily Visits to Yandex Afisha</b>")
fig.update_layout(
showlegend=True,
margin=dict(t=30,l=8,b=8,r=10))
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Date</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Visits</b>")
fig.show()
print("Date with the highest number of visit is", visits.groupby(visits['start_ts'].dt.date)['uid']\
.nunique().reset_index().sort_values(by= "uid", ascending=False).iloc[0,0])
print("Date with the lowest number of visit is", visits.groupby(visits['start_ts'].dt.date)['uid']\
.nunique().reset_index().sort_values(by= "uid", ascending=False).iloc[-1,0])
Date with the highest number of visit is 2017-11-24 Date with the lowest number of visit is 2018-03-31
Remarkably, 24 November 2017 was Black Friday, and it had the highest number of visits to Yandex.Afisha. 31 March 2018 coincided with holidays and observances worldwide: it was the Saturday before Easter in many countries, Freedom Day in Malta, and National Sunflower Day in Argentina. This suggests that holidays depressed visits to Yandex.Afisha, while Black Friday stimulated them.
Number of visits in a week
wau_total = visits.groupby(['session_year', 'session_week']).agg({'uid': 'nunique'}).mean()
print(int(wau_total))
5621
On average, about 5621 people use Yandex.Afisha every week.
Number of visits in a month
mau_total = visits.groupby(['session_year', 'session_month']).agg({'uid': 'nunique'}).mean()
print(int(mau_total))
23228
On average, about 23228 people use Yandex.Afisha every month.
The number of sessions per user per day
sessions_per_user = visits.groupby('session_date').agg({'uid': ['count','nunique']})
sessions_per_user.columns = ['n_sessions', 'n_users']
sessions_per_user['sessions_per_user'] = sessions_per_user['n_sessions'] / sessions_per_user['n_users']
print()
print("Average session Per Day:",sessions_per_user['sessions_per_user'].mean())
Average session Per Day: 1.0821696440039723
On average, there is about 1 session per day.
Comparing sessions per day and daily visits with a graph
fig = go.Figure()
fig.add_trace(go.Scatter(x=visits.groupby(visits['start_ts'].dt.date)['uid'].nunique().reset_index()['start_ts'],
y=visits.groupby(visits['start_ts'].dt.date)['uid'].nunique().reset_index()['uid'],
mode='lines',
name='DAU',line = dict(color='blue', width=2 )))
fig.add_trace(go.Scatter(x=visits.groupby(visits['start_ts'].dt.date)['uid'].count().reset_index()['start_ts'],
y=visits.groupby(visits['start_ts'].dt.date)['uid'].count().reset_index()['uid'],
mode='lines',
name='sessions',line = dict(color='black', width=2 )))
fig.update_layout(
showlegend=True,
plot_bgcolor="gray",
margin=dict(t=30,l=8,b=8,r=10),title= "<b>Sessions Per Day Vs Visits Per Day</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Date</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Sessions/Visits</b>")
fig.show()
Unsurprisingly, sessions per day closely track visits per day.
Length of each session
visits['session_length_sec'] = (visits['end_ts']- visits['start_ts']).dt.seconds
print(visits['session_length_sec'].mean())
643.506488592098
Visualising session length
fig = px.histogram(data_frame=visits,x= "session_length_sec", nbins=25, title= "<b>Session Length</b>")
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Session Length(seconds)</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Frequency</b>")
fig.show()
Session length is highly skewed, so the mean is not a good measure of a typical session; I will use the mode instead.
print(int(visits['session_length_sec'].mode()))
60
The most common session length is about 60 seconds.
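As a complementary check (my addition, not part of the original analysis), the median is also robust to the long right tail and tells a similar story:
# Median session length in seconds
print(visits['session_length_sec'].median())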
Visualising session length by device type
visits['session_length']=((pd.to_datetime(visits['end_ts'])-visits['start_ts'])/ np.timedelta64(1, 'm')).round()
fig = px.line(visits.groupby([visits['start_ts'].dt.date,'device'])['session_length'].mean().reset_index(),
x="start_ts", y="session_length",color='device', title = "<b>Sessions by Device Type</b>")
fig.update_layout(
showlegend=True,plot_bgcolor="gray",
margin=dict(t=30,l=8,b=8,r=10))
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Date</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Session Length</b>")
fig.show()
Desktop sessions were consistently longer than touch sessions.
How often do users come back? I will run a retention-rate cohort analysis and then use the sticky factor to check weekly and monthly engagement.
Retention rate cohort analysis
Here the focus is on the number of users from a cohort who remain active, compared to the cohort's initial size.
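In formula form, matching the calculation below:
Retention (cohort, lifetime n) = unique cohort users active in month n / unique users in the cohort at month 0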
first_activity_date = visits.groupby(['uid'])['start_ts'].min()
first_activity_date.name = 'first_activity_date'
visits = visits.join(first_activity_date, on='uid')
visits['activity_month'] = visits['start_ts'].astype("datetime64[M]")
visits['first_month'] = visits['first_activity_date'].astype("datetime64[M]")
visits['cohort_lifetime'] = ( visits['activity_month'] - visits['first_month'] )
visits['cohort_lifetime'] = (visits[ 'cohort_lifetime' ] / np.timedelta64(1, 'M')).round().astype(int)
cohorts = (visits.groupby(['first_month', 'cohort_lifetime']).agg({'uid': 'nunique'})\
.reset_index())
initial_visits_count = cohorts[cohorts['cohort_lifetime'] == 0][['first_month', 'uid']]
initial_visits_count = initial_visits_count.rename(columns={'uid': 'cohort_users'})
cohorts = cohorts.merge(initial_visits_count, on='first_month')
cohorts['retention'] = cohorts['uid'] / cohorts['cohort_users']
retention_pivot = cohorts.pivot_table(index=cohorts['first_month'].dt.date,columns='cohort_lifetime',values='retention',\
aggfunc='sum')
Visualizing the cohort user retention
sns.set(style='white')
plt.figure(figsize=(13, 9))
plt.title('Cohorts: User Retention')
sns.heatmap(retention_pivot, annot=True, fmt='.1%', linewidths=1, linecolor='gray')
plt.show()
The June 2017 cohort is the only one with a retention rate at month 11 (it is the oldest cohort, so it is the only one observed for that long). By month 1, all cohorts had retention rates below 10%. None of the May 2018 cohort came back after their first visit, and only about 4.8% of the April 2018 cohort came back in month 1.
Adding sticky factor to the analysis
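In formula form, matching the calculation below:
Sticky factor (weekly) = DAU / WAU × 100, and Sticky factor (monthly) = DAU / MAU × 100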
sticky_wau = dau_total / wau_total * 100
sticky_mau = dau_total / mau_total * 100
print("Sticky factor weekly:", round(int(sticky_wau)))
print("Sticky factor Monthly:", round(int(sticky_mau)))
Sticky factor weekly: 16 Sticky factor Monthly: 3
About 16% of users come back in a week, and about 3% of users come back in a month
Interim Conclusion
On average, about 907 people use Yandex.Afisha every day, about 5,621 every week, and about 23,228 every month. The highest daily visits occurred on Black Friday and the lowest on a holiday. On average, each user has about one session per day, and a typical session lasts about 60 seconds. By month 1, all cohorts had retention rates below 10%; none of the May 2018 cohort came back after their first visit, and only the June 2017 cohort was retained through month 11.
When do people start buying?
Since the data covers almost a year, it is best to build monthly cohorts. I will compare the month of each user's first order with the month of their first visit.
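A minimal sketch of that visit-to-order calculation (my addition; it assumes the visits and orders tables produced by data_preprocessing above, and conversion_days is a hypothetical column name):
# Time from each user's first visit to their first order
first_visit = visits.groupby('uid')['start_ts'].min().rename('first_visit_ts')
first_order = orders.groupby('uid')['buy_ts'].min().rename('first_order_ts')
conversion = pd.concat([first_visit, first_order], axis=1, join='inner')
conversion['conversion_days'] = (
    (conversion['first_order_ts'] - conversion['first_visit_ts']) / np.timedelta64(1, 'D'))
print(conversion['conversion_days'].describe())
The cohort-based steps below follow the original analysis.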
First order month
orders['order_month'] = orders['buy_ts'].astype('datetime64[M]')
first_orders = orders.groupby('uid').agg({'order_month': 'min'}).reset_index()
first_orders.columns = ['uid', 'first_month']
first_orders.head()
| | uid | first_month |
|---|---|---|
| 0 | 313578113262317 | 2018-01-01 |
| 1 | 1575281904278712 | 2017-06-01 |
| 2 | 2429014661409475 | 2017-10-01 |
| 3 | 2464366381792757 | 2018-01-01 |
| 4 | 2551852515556206 | 2017-11-01 |
Merging first orders and orders
orders_ = pd.merge(orders,first_orders, on='uid')
print("Total rows in the data:", orders_.shape[0])
orders_.head()
Total rows in the data: 50415
| | buy_ts | revenue | uid | order_month | first_month |
|---|---|---|---|---|---|
| 0 | 2017-06-01 00:10:00 | 17.00 | 10329302124590727494 | 2017-06-01 | 2017-06-01 |
| 1 | 2017-06-01 00:25:00 | 0.55 | 11627257723692907447 | 2017-06-01 | 2017-06-01 |
| 2 | 2017-06-01 00:27:00 | 0.37 | 17903680561304213844 | 2017-06-01 | 2017-06-01 |
| 3 | 2017-06-01 00:29:00 | 0.55 | 16109239769442553005 | 2017-06-01 | 2017-06-01 |
| 4 | 2017-06-01 07:58:00 | 0.37 | 14200605875248379450 | 2017-06-01 | 2017-06-01 |
Calculating duration in minutes and visualising
orders_['span'] = (orders_['order_month'] - orders_['first_month']) / np.timedelta64(1, 'm')
orders_['span'] = orders_['span'].round().astype('int')
fig = px.histogram(data_frame=orders_,x= "span", nbins=25, title= "<b>Initial Buying Timespan</b>")
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Minutes</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Frequency</b>")
fig.show()
print("Minutes:", round(int(orders_['span'].mode())))
Minutes: 0
On average, people start buying within 0 minutes (immediately) on Yandex.Afisha.
How many orders do they make during a given period of time?
cohort_sizes = first_orders.groupby('first_month').agg({'uid': 'nunique'}).reset_index()
cohort_sizes.columns = ['first_month', 'n_buyers']
cohort_orders=cohort_sizes.sort_values(ascending =False, by= "n_buyers",ignore_index=True )
fig = px.bar(cohort_orders, x="first_month", y="n_buyers", title= "<b>Number of Orders Per Month</b>", text='n_buyers')
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Number of orders</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Order month</b>")
fig.show()
In December 2017, users placed about 4,400 orders, followed by October and November 2017. This implies users order more in the run-up to Christmas. June, July, and August had the fewest orders; in the summer, users go on holiday and use the service less.
What is the average purchase size?
print("Total revenue:", orders_.revenue.sum())
print("Average purchase size:",orders_.revenue.mean())
Total revenue: 252057.19
Average purchase size: 4.999446868896484
On average, the purchase size is 5.00 dollars.
How much money do they bring? (LTV) To answer this question, I will create cohorts and calculate how much each cohort brings on average.
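In formula form, matching the cohort calculation below (with a 50% margin rate assumed in the code):
LTV (cohort, age n) = cumulative gross profit through age n / number of buyers in the cohort, where gross profit = revenue × margin_rate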
LTV cohort analysis with heatmap
cohort_size = orders_.groupby('first_month').agg({'uid': 'nunique'}).reset_index().rename(columns={'uid': 'n_buyers'})
cohorts= orders_.groupby(['first_month', 'order_month']).agg({'revenue': 'sum'}).reset_index()
monthly_orders = pd.merge(cohorts,cohort_size, on='first_month')
margin_rate = .5
monthly_orders['gp'] = monthly_orders['revenue'] * margin_rate
monthly_orders['age'] = (monthly_orders['order_month'] - monthly_orders['first_month']) / np.timedelta64(1, 'M')
monthly_orders['age'] = monthly_orders['age'].round().astype('int')
monthly_orders['ltv'] = monthly_orders['gp'] / monthly_orders['n_buyers']
ltv_pivot= monthly_orders.pivot_table(
index=monthly_orders['first_month'].dt.date,
columns='age',
values='ltv',
aggfunc='mean').cumsum(axis=1)
plt.figure(figsize=(13, 9))
plt.title('Cohorts Analysis: LTV')
sns.heatmap(ltv_pivot, annot=True, fmt='.2f', linewidths=1, linecolor='gray')
plt.show()
The June 2017 cohort has the longest observed LTV history (up to age 11), so it has contributed over the longest period. However, the September 2017 cohort has the highest LTV (a cumulative 6.72 over seven months). The cohorts with the lowest LTVs are May and June 2018.
Calculating and visualising LTV by source
Retrieve source_id from visits and merge it into the orders_ table. This will help analyse LTV by source.
get_source = visits[["uid", "source_id"]]
get_source.shape[0]
359400
Dropping duplicates
get_source =get_source.drop_duplicates()
get_source.shape[0]
277870
get_source= get_source.drop_duplicates(subset=["uid"], keep="first")
get_source.shape[0]
228169
Merging orders_ and get_source to have source_id in one table
orders_ = orders_.merge(get_source, on= "uid")
orders_=orders_.drop_duplicates()
orders_.shape[0]
50415
LTV analysis by source and visualising
cohort_ltv_size = orders_.groupby(['source_id'], as_index=False).agg({'uid':'nunique'}).\
rename(columns={'uid':'n_buyers', 'source_id': 'source_id'})
source_revenue = orders_.groupby('source_id').agg({'revenue': 'sum'}).reset_index()
source_ltv_table = cohort_ltv_size.merge(source_revenue, on='source_id')
margin_rate = .5
source_ltv_table['gp'] = source_ltv_table['revenue'] * margin_rate
source_ltv_table['ltv'] = source_ltv_table['gp'] / source_ltv_table['n_buyers']
fig = px.bar(source_ltv_table.sort_values(ascending =True, by= "ltv" ),x="ltv", y= "source_id", title= "<b>Lifetime Value by Source</b>",
text='ltv',orientation ="h", color_continuous_scale=px.colors.diverging.Temps)
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_yaxes(type='category')
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>LTV</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source Id</b>")
fig.show()
Sources 1 and 2 bring in the highest LTV (revenue per buyer); sources 3, 7, and 10 bring in the lowest, with source 7 the lowest of all.
Interim Conclusion
In December 2017, users placed about 4,400 orders (the highest), followed by October and November 2017; June, July, and August had the fewest orders. On average, people start buying within 0 minutes (immediately), and the average purchase size is about 5.00 dollars. The June 2017 cohort contributed over the longest period, but the September 2017 cohort had the highest LTV; the May and June 2018 cohorts had the lowest. Users from sources 1 and 2 bring in the highest revenue per buyer, and users from source 7 bring in the least.
How much money was spent? Overall/per source/over time
costs.costs.sum()
329131.6
Overall, about 329,131.60 dollars was spent on marketing.
Per Source
n_users =costs.agg({"source_id": "nunique"})
per_source= costs.pivot_table(index = "source_id", values="costs",aggfunc= "sum").reset_index()
per_source.columns = ["source_id","costs"]
fig = px.bar(per_source.sort_values(ascending =True, by= "costs" ),x="costs", y= "source_id", title= "<b>Money Spent Per Source Id</b>",orientation ="h", text='costs',
color_continuous_scale=px.colors.diverging.Temps)
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_yaxes(type='category')
fig.update_layout(xaxis_tickprefix = '$')
fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Cost</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source Id</b>")
fig.show()
There are 7 different sources. The source we spent the most on is source 3 (about $141,000). We spent less than $6,000 each on sources 9 and 10.
Cost over time
over_time= costs.pivot_table(index = "dt", values="costs",aggfunc= "sum").reset_index()
over_time.head()
| | dt | costs |
|---|---|---|
| 0 | 2017-06-01 | 735.260010 |
| 1 | 2017-06-02 | 721.190002 |
| 2 | 2017-06-03 | 450.849976 |
| 3 | 2017-06-04 | 761.159973 |
| 4 | 2017-06-05 | 821.440002 |
364 daily rows are a lot, so I will regroup them by month.
costs['cost_month'] = costs['dt'].astype('datetime64[M]')
over_time= costs.pivot_table(index = "cost_month", values="costs",aggfunc= "sum").reset_index()
fig = px.bar(over_time.sort_values(ascending =False, by= "costs" ),x="cost_month",
y= "costs", title= "<b>Money Spent Over Time</b>", text='costs')
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_layout(yaxis_tickprefix = '$')
fig.update_traces(texttemplate='%{text:.4s}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Cost</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Month</b>")
fig.show()
Over time, similarly to orders, we spent the most during December, November, and October 2017, and the least in June, July, and August 2017.
How much did customer acquisition (CAC) from each of the sources cost?
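In formula form, matching the calculation below:
CAC (source) = total marketing costs for the source / number of buyers acquired from the source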
CAC by Source
cohort_cac_size = orders_.groupby(['source_id'], as_index=False).agg({'uid':'nunique'}).\
rename(columns={'uid':'n_buyers', 'source_id': 'source_id'})
source_costs = costs.groupby('source_id').agg({'costs': 'sum'}).reset_index()
source_cac_table = cohort_cac_size.merge(source_costs, on='source_id')
source_cac_table['cac'] = source_cac_table['costs'] / source_cac_table['n_buyers']
fig = px.bar(source_cac_table.sort_values(ascending =True, by= "cac" ),y="source_id", x= "cac",orientation ="h",
title= "<b>Customer Aquisition Cost by Source </b>", text='cac')
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_yaxes(type='category')
fig.update_layout(xaxis_tickprefix = '$', xaxis_tickformat = ',.2f')
fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Customer Aquisition Cost</b>")
fig.show()
The source with the highest acquisition cost is source 3; interestingly, source 3 is among the sources with the lowest revenues. Source 9 has the lowest cost but is also among the lowest revenue generators. Sources 1 and 2 have comparatively lower costs than source 3 while generating the highest revenue. This hints that sources 1 and 2 will be the most profitable and source 3 the least.
CAC by month/cohort
Grouping cost by month
monthly_costs = costs.groupby('cost_month')["costs"].sum()
monthly_costs.head()
cost_month
2017-06-01    18015.000000
2017-07-01    18240.589844
2017-08-01    14790.540039
2017-09-01    24368.910156
2017-10-01    36322.878906
Name: costs, dtype: float32
Merging monthly cost with report on orders
report_ = pd.merge(monthly_orders, monthly_costs, left_on='first_month', right_on='cost_month')
report_.head()
| | first_month | order_month | revenue | n_buyers | gp | age | ltv | costs |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017-06-01 | 2017-06-01 | 9557.490234 | 2023 | 4778.745117 | 0 | 2.362207 | 18015.0 |
| 1 | 2017-06-01 | 2017-07-01 | 981.820007 | 2023 | 490.910004 | 1 | 0.242664 | 18015.0 |
| 2 | 2017-06-01 | 2017-08-01 | 885.340027 | 2023 | 442.670013 | 2 | 0.218819 | 18015.0 |
| 3 | 2017-06-01 | 2017-09-01 | 1931.300049 | 2023 | 965.650024 | 3 | 0.477336 | 18015.0 |
| 4 | 2017-06-01 | 2017-10-01 | 2068.580078 | 2023 | 1034.290039 | 4 | 0.511265 | 18015.0 |
CAC cohort analysis
report_['cac'] = report_['costs'] / report_['n_buyers']
cac_pivot= report_.pivot_table(
index=report_['first_month'].dt.date,
columns='age',
values='cac',
aggfunc='mean')
plt.figure(figsize=(13, 9))
plt.title('Cohorts Analysis: CAC')
sns.heatmap(cac_pivot, annot=True, fmt='.2f', linewidths=1, linecolor='gray')
plt.show()
CAC per cohort is constant across each cohort's lifetime but differs between cohorts. The August 2017 cohort had the highest CAC, while the May 2018 cohort had the lowest.
How worthwhile were the investments?
ROI(ROMI) = LTV / CAC
ROI(ROMI) by source
romi_df = source_cac_table.merge(source_ltv_table, on= "source_id")
romi_df["romi"] = romi_df['ltv']/ romi_df["cac"]
fig = px.bar(data_frame=romi_df.sort_values(ascending=True, by="romi"),x="romi", y= "source_id", title= "<b>Return on Marketing Investment by Source </b>",
orientation ="h" , text='romi')
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_layout( xaxis_tickformat = ',.0%')
fig.update_yaxes(type='category')
fig.update_traces(texttemplate='%{text:.1%}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source Id</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Return on Marketing Investment</b>")
fig.show()
The returns on investment for sources 1 and 2 are the highest, while source 3 has the lowest. The investments in the sources are not yet worthwhile overall, since even the best source (source 1) has yet to recoup 100% of its investment.
ROI(ROMI) by cohort
report_['romi'] = report_['ltv'] / report_['cac']
output = report_.pivot_table(
index=report_['first_month'].dt.date,
columns='age',
values='romi',
aggfunc='mean').cumsum(axis=1).round(2)
plt.figure(figsize=(13, 9))
plt.title('Cohorts Analysis: ROMI')
sns.heatmap(output, annot=True, fmt='.0%', linewidths=1, linecolor='gray')
plt.show()
The September 2017 cohort had the highest return on investment, followed by the June 2017 cohort; the May 2018 cohort had the lowest. No cohort has recouped 100% of its investment.
How worthwhile were the investments? Return on investment (ROI) in general
ROI = (income − expenses) / expenses
ROI = (orders.revenue.sum() - costs.costs.sum())/costs.costs.sum()
print("Return on Investments in General: {:.2%}".format(ROI))
Return on Investments in General: -23.42%
We are yet to recoup about 23.42% of our investment in Yandex.Afisha. For now, the investment has not been worthwhile, but with an improved marketing strategy we should recoup it.
Comparing results
This will help answer the question in the conclusion: "What sources/platforms would you recommend? Back up your choice: what metrics did you focus on? Why? What conclusions did you draw after finding the metric values?"
Total profit per source
romi_df["profit"] = romi_df["revenue"]-romi_df["costs"]
romi_df["colors"] = np.where(romi_df["profit"]<0, 'lightred', 'lightblue')
fig = px.bar(data_frame=romi_df,x="profit", y= "source_id", orientation ="h", title= "<b>Profit by Source</b>", color='colors', text='profit',
)
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_yaxes(type='category')
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source Id</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Profit ($)</b>")
fig.show()
As already hinted, the most profitable sources are 1 and 2. Source 3 is among the least profitable, with a loss of about $97,004.
Interim Conclusion
Overall, about 329,131.60 dollars was spent on marketing, while total revenue was about 252,057.19 dollars.
Source 3 is among the sources with the lowest revenue, yet it has the highest cost. Source 9 has the lowest cost but is also among the lowest revenue generators. Sources 1 and 2 have the highest revenue and comparatively lower costs than source 3.
The September 2017 cohort had the highest return on investment, followed by the June 2017 cohort; the May 2018 cohort had the lowest, and no cohort has recouped 100% of its investment. The returns on investment for sources 1 and 2 are the highest, and source 3 has the lowest. The most profitable sources are 1 and 2, while source 3 is making the largest losses. The investments are not yet worthwhile overall, since even the best source (source 1) has yet to recoup 100% of its investment.
Conclusion
Recommendations