As a Junior Data Analyst in the analytics department at Yandex, I have been given data on Yandex.Afisha visits, orders, and marketing costs. My first task is to help optimize marketing expenses.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from simple_colors import *
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
def data_preprocessing(visits_data_path, costs_data_path, orders_data_path):
    visits = pd.read_csv(visits_data_path, dtype={'Device': 'category', 'source Id': 'int32'},
                         parse_dates=['Start Ts', 'End Ts'])
    costs = pd.read_csv(costs_data_path, dtype={'source Id': 'int32', 'costs': 'float32'},
                        parse_dates=['dt'])
    orders = pd.read_csv(orders_data_path, dtype={'Revenue': 'float32'},
                         parse_dates=['Buy Ts'])
    # Normalize column names: lowercase with underscores
    visits.columns = visits.columns.str.replace(' ', '_').str.lower()
    orders.columns = orders.columns.str.replace(' ', '_').str.lower()
    # Remove duplicate rows
    visits.drop_duplicates(inplace=True)
    costs.drop_duplicates(inplace=True)
    orders.drop_duplicates(inplace=True)
    # Truncate the session start timestamp to day, week, month, and year
    visits['session_date'] = visits['start_ts'].astype("datetime64[D]")
    visits['session_week'] = visits['start_ts'].astype("datetime64[W]")
    visits['session_month'] = visits['start_ts'].astype("datetime64[M]")
    visits['session_year'] = visits['start_ts'].astype("datetime64[Y]")
    return visits, costs, orders
visits, costs, orders = data_preprocessing("visits_log_us.csv","costs_us.csv","orders_log_us.csv")
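Note: the astype("datetime64[M]")-style truncation above works on the pandas version this notebook was run with, but newer pandas releases reject week/month/year units in astype. A version-robust sketch of the same truncation (my addition, not part of the original run) would be:
# Portable alternative to the astype-based truncation above
visits['session_date'] = visits['start_ts'].dt.normalize()
visits['session_week'] = visits['start_ts'].dt.to_period('W').dt.start_time
visits['session_month'] = visits['start_ts'].dt.to_period('M').dt.to_timestamp()
visits['session_year'] = visits['start_ts'].dt.to_period('Y').dt.to_timestamp()
Note that period-based weeks start on Monday, whereas numpy's datetime64[W] weeks are anchored to the Unix epoch, so the exact week labels can differ slightly.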
# Displaying the data
print()
print(green("Visits data:", 'bold'))
print()
display(visits.info())
display(visits.head(2))
print()
print(green("Costs data:", 'bold'))
print()
display(costs.info())
display(costs.head(2))
print()
print(green("Orders data:", 'bold'))
print()
display(orders.info())
display(orders.head(2))
print()
print(green("Summary Discription:", 'bold'))
print()
print(green("Orders:", 'bold'))
display(orders.describe(exclude= ['uint64', 'datetime64[ns]']).round().T)
print(green("Costs:", 'bold'))
display(costs.describe().round().T)
Visits data:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 359400 entries, 0 to 359399
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 device 359400 non-null category
1 end_ts 359400 non-null datetime64[ns]
2 source_id 359400 non-null int64
3 start_ts 359400 non-null datetime64[ns]
4 uid 359400 non-null uint64
5 session_date 359400 non-null datetime64[ns]
6 session_week 359400 non-null datetime64[ns]
7 session_month 359400 non-null datetime64[ns]
8 session_year 359400 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](6), int64(1), uint64(1)
memory usage: 25.0 MB
None
| | device | end_ts | source_id | start_ts | uid | session_date | session_week | session_month | session_year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | touch | 2017-12-20 17:38:00 | 4 | 2017-12-20 17:20:00 | 16879256277535980062 | 2017-12-20 | 2017-12-14 | 2017-12-01 | 2017-01-01 |
| 1 | desktop | 2018-02-19 17:21:00 | 2 | 2018-02-19 16:53:00 | 104060357244891740 | 2018-02-19 | 2018-02-15 | 2018-02-01 | 2018-01-01 |
Costs data:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2542 entries, 0 to 2541
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 source_id 2542 non-null int64
1 dt 2542 non-null datetime64[ns]
2 costs 2542 non-null float32
dtypes: datetime64[ns](1), float32(1), int64(1)
memory usage: 69.5 KB
None
| | source_id | dt | costs |
|---|---|---|---|
| 0 | 1 | 2017-06-01 | 75.199997 |
| 1 | 1 | 2017-06-02 | 62.250000 |
Orders data:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50415 entries, 0 to 50414
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 buy_ts 50415 non-null datetime64[ns]
1 revenue 50415 non-null float32
2 uid 50415 non-null uint64
dtypes: datetime64[ns](1), float32(1), uint64(1)
memory usage: 1.3 MB
None
| | buy_ts | revenue | uid |
|---|---|---|---|
| 0 | 2017-06-01 00:10:00 | 17.00 | 10329302124590727494 |
| 1 | 2017-06-01 00:25:00 | 0.55 | 11627257723692907447 |
Summary Description:

Orders:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| revenue | 50415.0 | 5.0 | 22.0 | 0.0 | 1.0 | 2.0 | 5.0 | 2633.0 |
Costs:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| source_id | 2542.0 | 5.0 | 3.0 | 1.0 | 2.0 | 4.0 | 9.0 | 10.0 |
| costs | 2542.0 | 129.0 | 156.0 | 1.0 | 22.0 | 77.0 | 170.0 | 1788.0 |
On average, revenue per order is about 5.00, with a maximum of 2,633. The daily cost per source has a mean of about 129.00 and a maximum of about 1,788. The data have been optimized: each column now has an appropriate data type, duplicates have been removed, and the data are ready for analysis.
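As a quick sanity check (a minimal sketch, not part of the original output), the cleaning claims can be verified directly:
# Confirm there are no remaining duplicate rows and inspect the resulting dtypes
for name, df in {'visits': visits, 'costs': costs, 'orders': orders}.items():
    print(name, '- duplicated rows:', df.duplicated().sum())
print(orders.dtypes)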
Number of visits in a day
dau_total = visits.groupby('session_date').agg({'uid': 'nunique'}).mean()
print(int(dau_total))
907
On average, about 907 people use Yandex.Afisha every day.
Visualizing daily visits
fig = px.line(visits.groupby(visits['start_ts'].dt.date)['uid'].nunique().reset_index(),
x="start_ts", y="uid", title="<b>Daily Visits to Yandex Afisha</b>")
fig.update_layout(
showlegend=True,
margin=dict(t=30,l=8,b=8,r=10))
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Date</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Visits</b>")
fig.show()
print("Date with the highest number of visit is", visits.groupby(visits['start_ts'].dt.date)['uid']\
.nunique().reset_index().sort_values(by= "uid", ascending=False).iloc[0,0])
print("Date with the lowest number of visit is", visits.groupby(visits['start_ts'].dt.date)['uid']\
.nunique().reset_index().sort_values(by= "uid", ascending=False).iloc[-1,0])
Date with the highest number of visit is 2017-11-24 Date with the lowest number of visit is 2018-03-31
Remarkably, 24 November 2017 was Black Friday, and it had the highest number of visits to Yandex.Afisha. 31 March 2018 coincided with holidays and observances worldwide: it was the Saturday before Easter in many countries, Freedom Day in Malta, and National Sunflower Day in Argentina. This suggests that holidays depressed visits to Yandex.Afisha, while Black Friday stimulated them.
Number of visits in a week
wau_total = visits.groupby(['session_year', 'session_week']).agg({'uid': 'nunique'}).mean()
print(int(wau_total))
5621
On average, about 5621 people use Yandex.Afisha every week.
Number of visits in a month
mau_total = visits.groupby(['session_year', 'session_month']).agg({'uid': 'nunique'}).mean()
print(int(mau_total))
23228
On average, about 23228 people use Yandex.Afisha every month.
The number of sessions per user per day
sessions_per_user = visits.groupby('session_date').agg({'uid': ['count','nunique']})
sessions_per_user.columns = ['n_sessions', 'n_users']
sessions_per_user['sessions_per_user'] = sessions_per_user['n_sessions'] / sessions_per_user['n_users']
print()
print("Average session Per Day:",sessions_per_user['sessions_per_user'].mean())
Average session Per Day: 1.0821696440039723
On average, there is about 1 session per day.
Comparing sessions per day and daily visits with a graph
fig = go.Figure()
fig.add_trace(go.Scatter(x=visits.groupby(visits['start_ts'].dt.date)['uid'].nunique().reset_index()['start_ts'],
y=visits.groupby(visits['start_ts'].dt.date)['uid'].nunique().reset_index()['uid'],
mode='lines',
name='DAU',line = dict(color='blue', width=2 )))
fig.add_trace(go.Scatter(x=visits.groupby(visits['start_ts'].dt.date)['uid'].count().reset_index()['start_ts'],
y=visits.groupby(visits['start_ts'].dt.date)['uid'].count().reset_index()['uid'],
mode='lines',
name='sessions',line = dict(color='black', width=2 )))
fig.update_layout(
showlegend=True,
plot_bgcolor="gray",
margin=dict(t=30,l=8,b=8,r=10),title= "<b>Sessions Per Day Vs Visits Per Day</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Date</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Sessions/Visits</b>")
fig.show()
Unsurprisingly, sessions per day closely track visits per day.
Length of each session
visits['session_length_sec'] = (visits['end_ts']- visits['start_ts']).dt.seconds
print(visits['session_length_sec'].mean())
643.506488592098
Visualising session length
fig = px.histogram(data_frame=visits,x= "session_length_sec", nbins=25, title= "<b>Session Length</b>")
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Session Length(seconds)</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Frequency</b>")
fig.show()
Session length is highly skewed, so the mean is not a good measure of a typical session; I will use the mode instead.
print(int(visits['session_length_sec'].mode()))
60
The most common session length is about 60 seconds.
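As a complementary check (my addition, not part of the original analysis), the median is also robust to the long right tail and tells a similar story:
# Median session length in seconds
print(visits['session_length_sec'].median())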
Visualising session length by device type
visits['session_length']=((pd.to_datetime(visits['end_ts'])-visits['start_ts'])/ np.timedelta64(1, 'm')).round()
fig = px.line(visits.groupby([visits['start_ts'].dt.date,'device'])['session_length'].mean().reset_index(),
x="start_ts", y="session_length",color='device', title = "<b>Sessions by Device Type</b>")
fig.update_layout(
showlegend=True,plot_bgcolor="gray",
margin=dict(t=30,l=8,b=8,r=10))
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Date</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Session Length</b>")
fig.show()
Desktop sessions were consistently longer than touch sessions.
How often do users come back? I will run a retention-rate cohort analysis and then use the sticky factor to check weekly and monthly engagement.
Retention rate cohort analysis
Here the focus is on the number of users from a cohort who remain active, compared to the cohort's initial size.
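In formula form, matching the calculation below:
Retention (cohort, lifetime n) = unique cohort users active in month n / unique users in the cohort at month 0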
first_activity_date = visits.groupby(['uid'])['start_ts'].min()
first_activity_date.name = 'first_activity_date'
visits = visits.join(first_activity_date, on='uid')
visits['activity_month'] = visits['start_ts'].astype("datetime64[M]")
visits['first_month'] = visits['first_activity_date'].astype("datetime64[M]")
visits['cohort_lifetime'] = ( visits['activity_month'] - visits['first_month'] )
visits['cohort_lifetime'] = (visits[ 'cohort_lifetime' ] / np.timedelta64(1, 'M')).round().astype(int)
cohorts = (visits.groupby(['first_month', 'cohort_lifetime']).agg({'uid': 'nunique'})\
.reset_index())
initial_visits_count = cohorts[cohorts['cohort_lifetime'] == 0][['first_month', 'uid']]
initial_visits_count = initial_visits_count.rename(columns={'uid': 'cohort_users'})
cohorts = cohorts.merge(initial_visits_count, on='first_month')
cohorts['retention'] = cohorts['uid'] / cohorts['cohort_users']
retention_pivot = cohorts.pivot_table(index=cohorts['first_month'].dt.date,columns='cohort_lifetime',values='retention',\
aggfunc='sum')
Visualizing the cohort user retention
sns.set(style='white')
plt.figure(figsize=(13, 9))
plt.title('Cohorts: User Retention')
sns.heatmap(retention_pivot, annot=True, fmt='.1%', linewidths=1, linecolor='gray')
plt.show()
The June 2017 cohort is the only one with a retention rate at month 11 (it is the oldest cohort, so it is the only one observed for that long). By month 1, all cohorts had retention rates below 10%. None of the May 2018 cohort came back after their first visit, and only about 4.8% of the April 2018 cohort came back in month 1.
Adding sticky factor to the analysis
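In formula form, matching the calculation below:
Sticky factor (weekly) = DAU / WAU × 100, and Sticky factor (monthly) = DAU / MAU × 100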
sticky_wau = dau_total / wau_total * 100
sticky_mau = dau_total / mau_total * 100
print("Sticky factor weekly:", round(int(sticky_wau)))
print("Sticky factor Monthly:", round(int(sticky_mau)))
Sticky factor weekly: 16 Sticky factor Monthly: 3
About 16% of users come back in a week, and about 3% of users come back in a month
Interim Conclusion
On average, about 907 people use Yandex.Afisha every day, about 5,621 every week, and about 23,228 every month. The highest daily visits occurred on Black Friday and the lowest on a holiday. On average, each user has about one session per day, and a typical session lasts about 60 seconds. By month 1, all cohorts had retention rates below 10%; none of the May 2018 cohort came back after their first visit, and only the June 2017 cohort was retained through month 11.
When do people start buying?
Since the data covers almost a year, it is best to build monthly cohorts. I will compare the month of each user's first order with the month of their first visit.
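A minimal sketch of that visit-to-order calculation (my addition; it assumes the visits and orders tables produced by data_preprocessing above, and conversion_days is a hypothetical column name):
# Time from each user's first visit to their first order
first_visit = visits.groupby('uid')['start_ts'].min().rename('first_visit_ts')
first_order = orders.groupby('uid')['buy_ts'].min().rename('first_order_ts')
conversion = pd.concat([first_visit, first_order], axis=1, join='inner')
conversion['conversion_days'] = (
    (conversion['first_order_ts'] - conversion['first_visit_ts']) / np.timedelta64(1, 'D'))
print(conversion['conversion_days'].describe())
The cohort-based steps below follow the original analysis.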
First order month
orders['order_month'] = orders['buy_ts'].astype('datetime64[M]')
first_orders = orders.groupby('uid').agg({'order_month': 'min'}).reset_index()
first_orders.columns = ['uid', 'first_month']
first_orders.head()
| | uid | first_month |
|---|---|---|
| 0 | 313578113262317 | 2018-01-01 |
| 1 | 1575281904278712 | 2017-06-01 |
| 2 | 2429014661409475 | 2017-10-01 |
| 3 | 2464366381792757 | 2018-01-01 |
| 4 | 2551852515556206 | 2017-11-01 |
Merging first orders and orders
orders_ = pd.merge(orders,first_orders, on='uid')
print("Total rows in the data:", orders_.shape[0])
orders_.head()
Total rows in the data: 50415
| | buy_ts | revenue | uid | order_month | first_month |
|---|---|---|---|---|---|
| 0 | 2017-06-01 00:10:00 | 17.00 | 10329302124590727494 | 2017-06-01 | 2017-06-01 |
| 1 | 2017-06-01 00:25:00 | 0.55 | 11627257723692907447 | 2017-06-01 | 2017-06-01 |
| 2 | 2017-06-01 00:27:00 | 0.37 | 17903680561304213844 | 2017-06-01 | 2017-06-01 |
| 3 | 2017-06-01 00:29:00 | 0.55 | 16109239769442553005 | 2017-06-01 | 2017-06-01 |
| 4 | 2017-06-01 07:58:00 | 0.37 | 14200605875248379450 | 2017-06-01 | 2017-06-01 |
Calculating duration in minutes and visualising
orders_['span'] = (orders_['order_month'] - orders_['first_month']) / np.timedelta64(1, 'm')
orders_['span'] = orders_['span'].round().astype('int')
fig = px.histogram(data_frame=orders_,x= "span", nbins=25, title= "<b>Initial Buying Timespan</b>")
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Minutes</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Frequency</b>")
fig.show()
print("Minutes:", round(int(orders_['span'].mode())))
Minutes: 0
On average, people start buying within 0 minutes (immediately) on Yandex.Afisha.
How many orders do they make during a given period of time?
cohort_sizes = first_orders.groupby('first_month').agg({'uid': 'nunique'}).reset_index()
cohort_sizes.columns = ['first_month', 'n_buyers']
cohort_orders=cohort_sizes.sort_values(ascending =False, by= "n_buyers",ignore_index=True )
fig = px.bar(cohort_orders, x="first_month", y="n_buyers", title= "<b>Number of Orders Per Month</b>", text='n_buyers')
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Number of orders</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Order month</b>")
fig.show()
In December 2017, users placed about 4,400 orders, followed by October and November 2017. This implies users order more in the run-up to Christmas. June, July, and August had the fewest orders; in the summer, users go on holiday and use the service less.
What is the average purchase size?
print("Total revenue:", orders_.revenue.sum())
print("Average purchase size:",orders_.revenue.mean())
Total revenue: 252057.19
Average purchase size: 4.999446868896484
On average, the purchase size is 5.00 dollars.
How much money do they bring? (LTV) To answer this question, I will create cohorts and calculate how much each cohort brings on average.
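In formula form, matching the cohort calculation below (with a 50% margin rate assumed in the code):
LTV (cohort, age n) = cumulative gross profit through age n / number of buyers in the cohort, where gross profit = revenue × margin_rate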
LTV cohort analysis with heatmap
cohort_size = orders_.groupby('first_month').agg({'uid': 'nunique'}).reset_index().rename(columns={'uid': 'n_buyers'})
cohorts= orders_.groupby(['first_month', 'order_month']).agg({'revenue': 'sum'}).reset_index()
monthly_orders = pd.merge(cohorts,cohort_size, on='first_month')
margin_rate = .5
monthly_orders['gp'] = monthly_orders['revenue'] * margin_rate
monthly_orders['age'] = (monthly_orders['order_month'] - monthly_orders['first_month']) / np.timedelta64(1, 'M')
monthly_orders['age'] = monthly_orders['age'].round().astype('int')
monthly_orders['ltv'] = monthly_orders['gp'] / monthly_orders['n_buyers']
ltv_pivot= monthly_orders.pivot_table(
index=monthly_orders['first_month'].dt.date,
columns='age',
values='ltv',
aggfunc='mean').cumsum(axis=1)
plt.figure(figsize=(13, 9))
plt.title('Cohorts Analysis: LTV')
sns.heatmap(ltv_pivot, annot=True, fmt='.2f', linewidths=1, linecolor='gray')
plt.show()
The June 2017 cohort has the longest observed LTV history (up to age 11), so it has contributed over the longest period. However, the September 2017 cohort has the highest LTV (a cumulative 6.72 over seven months). The cohorts with the lowest LTVs are May and June 2018.
Calculating and visualising LTV by source
Retrieve source_id from visits and merge it into the orders_ table. This will help analyse LTV by source.
get_source = visits[["uid", "source_id"]]
get_source.shape[0]
359400
Dropping duplicates
get_source =get_source.drop_duplicates()
get_source.shape[0]
277870
get_source= get_source.drop_duplicates(subset=["uid"], keep="first")
get_source.shape[0]
228169
Merging orders_ and get_source to have source_id in one table
orders_ = orders_.merge(get_source, on= "uid")
orders_=orders_.drop_duplicates()
orders_.shape[0]
50415
LTV analysis by source and visualising
cohort_ltv_size = orders_.groupby(['source_id'], as_index=False).agg({'uid':'nunique'}).\
rename(columns={'uid':'n_buyers', 'source_id': 'source_id'})
source_revenue = orders_.groupby('source_id').agg({'revenue': 'sum'}).reset_index()
source_ltv_table = cohort_ltv_size.merge(source_revenue, on='source_id')
margin_rate = .5
source_ltv_table['gp'] = source_ltv_table['revenue'] * margin_rate
source_ltv_table['ltv'] = source_ltv_table['gp'] / source_ltv_table['n_buyers']
fig = px.bar(source_ltv_table.sort_values(ascending =True, by= "ltv" ),x="ltv", y= "source_id", title= "<b>Lifetime Value by Source</b>",
text='ltv',orientation ="h", color_continuous_scale=px.colors.diverging.Temps)
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_yaxes(type='category')
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>LTV</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source Id</b>")
fig.show()
Sources 1 and 2 bring in the highest LTV (revenue per buyer); sources 3, 7, and 10 bring in the lowest, with source 7 the lowest of all.
Interim Conclusion
In December 2017, users placed about 4,400 orders (the highest), followed by October and November 2017; June, July, and August had the fewest orders. On average, people start buying within 0 minutes (immediately), and the average purchase size is about 5.00 dollars. The June 2017 cohort contributed over the longest period, but the September 2017 cohort had the highest LTV; the May and June 2018 cohorts had the lowest. Users from sources 1 and 2 bring in the highest revenue per buyer, and users from source 7 bring in the least.
How much money was spent? Overall/per source/over time
costs.costs.sum()
329131.6
Overall, about 329,131.60 dollars was spent on marketing.
Per Source
n_users =costs.agg({"source_id": "nunique"})
per_source= costs.pivot_table(index = "source_id", values="costs",aggfunc= "sum").reset_index()
per_source.columns = ["source_id","costs"]
fig = px.bar(per_source.sort_values(ascending =True, by= "costs" ),x="costs", y= "source_id", title= "<b>Money Spent Per Source Id</b>",orientation ="h", text='costs',
color_continuous_scale=px.colors.diverging.Temps)
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_yaxes(type='category')
fig.update_layout(xaxis_tickprefix = '$')
fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Cost</b>")
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source Id</b>")
fig.show()
There are 7 different sources. The source we spent the most on is source 3 (about $141,000). We spent less than $6,000 each on sources 9 and 10.
Cost over time
over_time= costs.pivot_table(index = "dt", values="costs",aggfunc= "sum").reset_index()
over_time.head()
| | dt | costs |
|---|---|---|
| 0 | 2017-06-01 | 735.260010 |
| 1 | 2017-06-02 | 721.190002 |
| 2 | 2017-06-03 | 450.849976 |
| 3 | 2017-06-04 | 761.159973 |
| 4 | 2017-06-05 | 821.440002 |
364 daily rows are a lot, so I will regroup them by month.
costs['cost_month'] = costs['dt'].astype('datetime64[M]')
over_time= costs.pivot_table(index = "cost_month", values="costs",aggfunc= "sum").reset_index()
fig = px.bar(over_time.sort_values(ascending =False, by= "costs" ),x="cost_month",
y= "costs", title= "<b>Money Spent Over Time</b>", text='costs')
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_layout(yaxis_tickprefix = '$')
fig.update_traces(texttemplate='%{text:.4s}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Cost</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Month</b>")
fig.show()
Over time, similarly to orders, we spent the most during December, November, and October 2017, and the least in June, July, and August 2017.
How much did customer acquisition (CAC) from each of the sources cost?
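In formula form, matching the calculation below:
CAC (source) = total marketing costs for the source / number of buyers acquired from the source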
CAC by Source
cohort_cac_size = orders_.groupby(['source_id'], as_index=False).agg({'uid':'nunique'}).\
rename(columns={'uid':'n_buyers', 'source_id': 'source_id'})
source_costs = costs.groupby('source_id').agg({'costs': 'sum'}).reset_index()
source_cac_table = cohort_cac_size.merge(source_costs, on='source_id')
source_cac_table['cac'] = source_cac_table['costs'] / source_cac_table['n_buyers']
fig = px.bar(source_cac_table.sort_values(ascending =True, by= "cac" ),y="source_id", x= "cac",orientation ="h",
title= "<b>Customer Aquisition Cost by Source </b>", text='cac')
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_yaxes(type='category')
fig.update_layout(xaxis_tickprefix = '$', xaxis_tickformat = ',.2f')
fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Customer Aquisition Cost</b>")
fig.show()
The source with the highest acquisition cost is source 3; interestingly, source 3 is among the sources with the lowest revenues. Source 9 has the lowest cost but is also among the lowest revenue generators. Sources 1 and 2 have comparatively lower costs than source 3 while generating the highest revenue. This hints that sources 1 and 2 will be the most profitable and source 3 the least.
CAC by month/cohort
Grouping cost by month
monthly_costs = costs.groupby('cost_month')["costs"].sum()
monthly_costs.head()
cost_month
2017-06-01    18015.000000
2017-07-01    18240.589844
2017-08-01    14790.540039
2017-09-01    24368.910156
2017-10-01    36322.878906
Name: costs, dtype: float32
Merging monthly cost with report on orders
report_ = pd.merge(monthly_orders, monthly_costs, left_on='first_month', right_on='cost_month')
report_.head()
| | first_month | order_month | revenue | n_buyers | gp | age | ltv | costs |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017-06-01 | 2017-06-01 | 9557.490234 | 2023 | 4778.745117 | 0 | 2.362207 | 18015.0 |
| 1 | 2017-06-01 | 2017-07-01 | 981.820007 | 2023 | 490.910004 | 1 | 0.242664 | 18015.0 |
| 2 | 2017-06-01 | 2017-08-01 | 885.340027 | 2023 | 442.670013 | 2 | 0.218819 | 18015.0 |
| 3 | 2017-06-01 | 2017-09-01 | 1931.300049 | 2023 | 965.650024 | 3 | 0.477336 | 18015.0 |
| 4 | 2017-06-01 | 2017-10-01 | 2068.580078 | 2023 | 1034.290039 | 4 | 0.511265 | 18015.0 |
CAC cohort analysis
report_['cac'] = report_['costs'] / report_['n_buyers']
cac_pivot= report_.pivot_table(
index=report_['first_month'].dt.date,
columns='age',
values='cac',
aggfunc='mean')
plt.figure(figsize=(13, 9))
plt.title('Cohorts Analysis: CAC')
sns.heatmap(cac_pivot, annot=True, fmt='.2f', linewidths=1, linecolor='gray')
plt.show()
CAC per cohort is constant across each cohort's lifetime but differs between cohorts. The August 2017 cohort had the highest CAC, while the May 2018 cohort had the lowest.
How worthwhile were the investments?
ROI(ROMI) = LTV / CAC
ROI(ROMI) by source
romi_df = source_cac_table.merge(source_ltv_table, on= "source_id")
romi_df["romi"] = romi_df['ltv']/ romi_df["cac"]
fig = px.bar(data_frame=romi_df.sort_values(ascending=True, by="romi"),x="romi", y= "source_id", title= "<b>Return on Marketing Investment by Source </b>",
orientation ="h" , text='romi')
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_layout( xaxis_tickformat = ',.0%')
fig.update_yaxes(type='category')
fig.update_traces(texttemplate='%{text:.1%}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source Id</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Return on Marketing Investment</b>")
fig.show()
The returns on investment for sources 1 and 2 are the highest, while source 3 has the lowest. The investments in the sources are not yet worthwhile overall, since even the best source (source 1) has yet to recoup 100% of its investment.
ROI(ROMI) by cohort
report_['romi'] = report_['ltv'] / report_['cac']
output = report_.pivot_table(
index=report_['first_month'].dt.date,
columns='age',
values='romi',
aggfunc='mean').cumsum(axis=1).round(2)
plt.figure(figsize=(13, 9))
plt.title('Cohorts Analysis: ROMI')
sns.heatmap(output, annot=True, fmt='.0%', linewidths=1, linecolor='gray')
plt.show()
The September 2017 cohort had the highest return on investment, followed by the June 2017 cohort; the May 2018 cohort had the lowest. No cohort has recouped 100% of its investment.
How worthwhile were the investments? Return on investment (ROI) in general
ROI = (income − expenses) / expenses
ROI = (orders.revenue.sum() - costs.costs.sum())/costs.costs.sum()
print("Return on Investments in General: {:.2%}".format(ROI))
Return on Investments in General: -23.42%
We are yet to recoup about 23.42% of our investment in Yandex.Afisha. For now, the investment has not been worthwhile, but with an improved marketing strategy we should recoup it.
Comparing results
This will help answer the question in the conclusion: "What sources/platforms would you recommend? Back up your choice: what metrics did you focus on? Why? What conclusions did you draw after finding the metric values?"
Total profit per source
romi_df["profit"] = romi_df["revenue"]-romi_df["costs"]
romi_df["colors"] = np.where(romi_df["profit"]<0, 'lightred', 'lightblue')
fig = px.bar(data_frame=romi_df,x="profit", y= "source_id", orientation ="h", title= "<b>Profit by Source</b>", color='colors', text='profit',
)
fig.update_layout(
showlegend=False,
margin=dict(t=30,l=8,b=8,r=1))
fig.update_yaxes(type='category')
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_yaxes(tickfont_family="Arial Black", title_text="<b>Source Id</b>")
fig.update_xaxes(tickfont_family="Arial Black", title_text="<b>Profit ($)</b>")
fig.show()
As already hinted, the most profitable sources are 1 and 2. Source 3 is among the least profitable, with a loss of about $97,004.
Interim Conclusion
Overall, about 329,131.60 dollars was spent on marketing, while total revenue was about 252,057.19 dollars.
Source 3 is among the sources with the lowest revenue, yet it has the highest cost. Source 9 has the lowest cost but is also among the lowest revenue generators. Sources 1 and 2 have the highest revenue and comparatively lower costs than source 3.
The September 2017 cohort had the highest return on investment, followed by the June 2017 cohort; the May 2018 cohort had the lowest, and no cohort has recouped 100% of its investment. The returns on investment for sources 1 and 2 are the highest, and source 3 has the lowest. The most profitable sources are 1 and 2, while source 3 is making the largest losses. The investments are not yet worthwhile overall, since even the best source (source 1) has yet to recoup 100% of its investment.
Conclusion
Recommendations