I work at a startup that sells food products. I need to investigate user behavior for the company's app. First, I need to study the sales funnel and find out how users reach the purchase stage.
Then, I'll look at the results of an A/A/B test. The designers would like to change the fonts for the entire app, but the managers are afraid the users might find the new design intimidating. They decide to make a decision based on the results of the test.
The users are split into three groups: two control groups get the old fonts and one test group gets the new ones. I need to find out which set of fonts produces better results.
Creating two A groups has certain advantages. We can make it a principle that we will only be confident in the accuracy of our testing when the two control groups are similar. If there are significant differences between the A groups, this can help us uncover factors that may be distorting the results. Comparing control groups also tells us how much time and data we'll need when running further tests.
The data is stored in /datasets/logs_exp_us.csv. Each log entry is a user action or an event, with the following columns:

- EventName: name of the event
- DeviceIDHash: unique user identifier
- EventTimestamp: event date and time (Unix timestamp)
- ExpId: experiment number (246 and 247 are the control groups, 248 is the test group)
# import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as st
import numpy as np
from plotly import graph_objects as go
# try-except handles the two possible file locations (local vs. /datasets)
try:
    logs = pd.read_csv('logs_exp_us.csv', sep='\t')
except FileNotFoundError:
    logs = pd.read_csv('/datasets/logs_exp_us.csv', sep='\t')
# study general info
logs.info()
display(logs.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244126 entries, 0 to 244125
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   EventName       244126 non-null  object
 1   DeviceIDHash    244126 non-null  int64
 2   EventTimestamp  244126 non-null  int64
 3   ExpId           244126 non-null  int64
dtypes: int64(3), object(1)
memory usage: 7.5+ MB
|   | EventName | DeviceIDHash | EventTimestamp | ExpId |
|---|---|---|---|---|
| 0 | MainScreenAppear | 4575588528974610257 | 1564029816 | 246 |
| 1 | MainScreenAppear | 7416695313311560658 | 1564053102 | 246 |
| 2 | PaymentScreenSuccessful | 3518123091307005509 | 1564054127 | 248 |
| 3 | CartScreenAppear | 3518123091307005509 | 1564054127 | 248 |
| 4 | PaymentScreenSuccessful | 6217807653094995999 | 1564055322 | 248 |
Immediately we can see there are a few issues we need to correct:

- DeviceIDHash and ExpId are identifiers, so they should be stored as strings rather than integers
- EventTimestamp is a Unix timestamp and needs to be converted to datetime
- the column names could be shorter and easier to work with
- separate date and time columns will make it easier to aggregate by day
# change 'DeviceIDHash' & 'ExpId' to strings, since they are identifiers
logs['DeviceIDHash'] = logs['DeviceIDHash'].astype(str)
logs['ExpId'] = logs['ExpId'].astype(str)
# convert 'EventTimestamp' from Unix seconds to datetime
logs['EventTimestamp'] = pd.to_datetime(logs['EventTimestamp'], unit='s')
# change column names
logs.columns = ['event', 'uid', 'datetime', 'group']
# add separate time and date columns
logs['time'] = logs['datetime'].dt.time
logs['date'] = logs['datetime'].dt.date
# verify results
logs.info()
display(logs.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244126 entries, 0 to 244125
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype
---  ------    --------------   -----
 0   event     244126 non-null  object
 1   uid       244126 non-null  object
 2   datetime  244126 non-null  datetime64[ns]
 3   group     244126 non-null  object
 4   time      244126 non-null  object
 5   date      244126 non-null  object
dtypes: datetime64[ns](1), object(5)
memory usage: 11.2+ MB
|   | event | uid | datetime | group | time | date |
|---|---|---|---|---|---|---|
| 0 | MainScreenAppear | 4575588528974610257 | 2019-07-25 04:43:36 | 246 | 04:43:36 | 2019-07-25 |
| 1 | MainScreenAppear | 7416695313311560658 | 2019-07-25 11:11:42 | 246 | 11:11:42 | 2019-07-25 |
| 2 | PaymentScreenSuccessful | 3518123091307005509 | 2019-07-25 11:28:47 | 248 | 11:28:47 | 2019-07-25 |
| 3 | CartScreenAppear | 3518123091307005509 | 2019-07-25 11:28:47 | 248 | 11:28:47 | 2019-07-25 |
| 4 | PaymentScreenSuccessful | 6217807653094995999 | 2019-07-25 11:48:42 | 248 | 11:48:42 | 2019-07-25 |
# count number of events
total_events = logs['event'].count()
print(f'There were {total_events} total events')
There were 244126 total events
# count number of users
total_users = logs['uid'].nunique()
print(f'There were {total_users} unique users')
There were 7551 unique users
# average events per user
avg_events_per_user = total_events / total_users
print(f'Each user triggered an average of {int(avg_events_per_user)} events')
Each user triggered an average of 32 events
min_datetime = logs['datetime'].min()
max_datetime = logs['datetime'].max()
print(f'The earliest event was at {min_datetime}')
print(f'The latest event was at {max_datetime}')
The earliest event was at 2019-07-25 04:43:36
The latest event was at 2019-08-07 21:15:17
# plot distribution of all events & dates
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
plt.figure(figsize=(8, 5))
plt.hist(logs['datetime'], bins=200)
plt.xlabel('Date')
plt.ylabel('Count')
plt.grid(True)
plt.xticks(logs['date'].unique(), rotation=45)
plt.show()
There is a very low number of events prior to August 1st. Maybe this was a stage where the app hadn't been fully launched yet or was still in beta testing. We should ignore the data prior to this period by creating a new filtered DataFrame.
# new filtered DataFrame
filtered_logs = logs[logs['date'] >= pd.to_datetime('2019-08-01').date()]
filtered_logs.head()
|   | event | uid | datetime | group | time | date |
|---|---|---|---|---|---|---|
| 2828 | Tutorial | 3737462046622621720 | 2019-08-01 00:07:28 | 246 | 00:07:28 | 2019-08-01 |
| 2829 | MainScreenAppear | 3737462046622621720 | 2019-08-01 00:08:00 | 246 | 00:08:00 | 2019-08-01 |
| 2830 | MainScreenAppear | 3737462046622621720 | 2019-08-01 00:08:55 | 246 | 00:08:55 | 2019-08-01 |
| 2831 | OffersScreenAppear | 3737462046622621720 | 2019-08-01 00:08:58 | 246 | 00:08:58 | 2019-08-01 |
| 2832 | MainScreenAppear | 1433840883824088890 | 2019-08-01 00:08:59 | 247 | 00:08:59 | 2019-08-01 |
# plot distribution of filtered events & dates
plt.hist(filtered_logs['datetime'], bins=200)
plt.xlabel('Date')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# count number of events after filter
filtered_total_events = filtered_logs['event'].count()
print(f'There were {filtered_total_events} total events after filtering')
percent_lost_events = (((total_events - filtered_total_events) / total_events) * 100).round(1)
print(f'The total number of events decreased by {percent_lost_events} % after filtering')
There were 241298 total events after filtering
The total number of events decreased by 1.2 % after filtering
# count number of users after filtering
filtered_total_users = filtered_logs['uid'].nunique()
print(f'There were {filtered_total_users} unique users after filtering')
percent_lost_users = round((((total_users - filtered_total_users) / total_users) * 100), 1)
print(f'The total number of users decreased by {percent_lost_users} % after filtering')
There were 7534 unique users after filtering
The total number of users decreased by 0.2 % after filtering
# checking groups in filtered data
print(filtered_logs.group.unique())
['246' '247' '248']
We still have users present in all 3 groups after removing the data from before August 1st, 2019. We can proceed with the analysis using the filtered data.
# sort events by frequency
event_counts = filtered_logs.groupby('event')['uid'].count().sort_values(ascending=False)
event_counts
event
MainScreenAppear           117431
OffersScreenAppear          46350
CartScreenAppear            42365
PaymentScreenSuccessful     34113
Tutorial                     1039
Name: uid, dtype: int64
# calculate number of unique users who triggered each event
event_user_counts = filtered_logs.groupby('event')['uid'].nunique().sort_values(ascending=False)
event_user_counts
event
MainScreenAppear           7419
OffersScreenAppear         4593
CartScreenAppear           3734
PaymentScreenSuccessful    3539
Tutorial                    840
Name: uid, dtype: int64
# calculate the percentage of total users who triggered each event
percent_users_events = (event_user_counts / filtered_total_users) * 100
percent_users_events
event
MainScreenAppear           98.473586
OffersScreenAppear         60.963632
CartScreenAppear           49.561986
PaymentScreenSuccessful    46.973719
Tutorial                   11.149456
Name: uid, dtype: float64
It looks like the main sequence is: MainScreenAppear → OffersScreenAppear → CartScreenAppear → PaymentScreenSuccessful.
The tutorial is most likely optional and could appear just before or after the main screen. We can leave it out of the funnel for now, since it isn't part of the critical sequence of events.
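As a quick, optional check of that assumption (a sketch on top of the existing filtered_logs DataFrame, not part of the core funnel analysis), we can compare each tutorial viewer's first Tutorial timestamp with their first MainScreenAppear timestamp; the helper names here are ours.
# optional check: for users who saw the Tutorial, was their first Tutorial event
# before or after their first MainScreenAppear event?
first_events = (
    filtered_logs[filtered_logs['event'].isin(['Tutorial', 'MainScreenAppear'])]
    .groupby(['uid', 'event'])['datetime']
    .min()
    .unstack()
    .dropna()
)
tutorial_first = (first_events['Tutorial'] < first_events['MainScreenAppear']).mean()
print(f'Share of tutorial viewers who opened the tutorial before the main screen: {tutorial_first:.1%}')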
# plot funnel using plotly library
fig = go.Figure(go.Funnel(
y = event_user_counts.reset_index()['event'][:4],
x = event_user_counts.reset_index()['uid'][:4]
))
fig.show()
Plotly makes it easy to plot funnel diagrams, and even does the calculations for you. Here, we can see that initially we had 7419 users who triggered the MainScreenAppear event.
Of those initial users, roughly 62% made it to the next step, the offers screen. This represents a 38% loss of users between these two stages, the biggest drop in the funnel. Boosting conversion at this stage could help the overall conversion rate the most.
Of the initial 7419 users, about 48% reach the successful payment screen, so nearly half of the users become customers.
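The percentages quoted above can also be reproduced directly from event_user_counts. A minimal sketch (assuming, as above, that the top four events in event_user_counts reflect the funnel order) that computes each step's share of the first step and of the previous step:
# each funnel step as a share of the first step and of the previous step
funnel_steps = event_user_counts[:4]
funnel_summary = pd.DataFrame({
    'users': funnel_steps,
    'pct_of_first_step': (funnel_steps / funnel_steps.iloc[0] * 100).round(1),
    'pct_of_previous_step': (funnel_steps / funnel_steps.shift(1) * 100).round(1),
})
funnel_summary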
# unique users per test group
users_per_group = filtered_logs.groupby('group')['uid'].nunique()
users_per_group
group
246    2484
247    2513
248    2537
Name: uid, dtype: int64
We can write a function that uses a z-test for two proportions to compare conversion rates at every stage and determine whether the differences between samples are statistically significant. The arguments are the two sample groups, a list of events to compare conversion rates for, and the critical significance level for the test. The function prints the null and alternative hypotheses and returns a DataFrame summarizing the results of each comparison.
# function computes the statistical significance of the difference in conversion rates
# at each stage of the funnel using a z-test for two proportions
def conversions_z_test(sample_A, sample_B, events, alpha):
    event_list = []
    sample_a_conv_list = []
    sample_b_conv_list = []
    reject_null_list = []
    # total unique users in each sample
    total_users_A = sample_A['uid'].nunique()
    total_users_B = sample_B['uid'].nunique()
    # unique users per event in each sample
    users_per_event_A = sample_A.groupby('event')['uid'].nunique()
    users_per_event_B = sample_B.groupby('event')['uid'].nunique()
    # share of users who triggered each event in each sample
    percent_users_events_A = users_per_event_A / total_users_A
    percent_users_events_B = users_per_event_B / total_users_B
    for event in events:
        # sample proportions and pooled proportion
        p1 = percent_users_events_A[event]
        p2 = percent_users_events_B[event]
        p = (users_per_event_A[event] + users_per_event_B[event]) / (total_users_A + total_users_B)
        # z statistic for the difference between the two proportions
        z = (p1 - p2) / np.sqrt(p * (1 - p) * ((1 / total_users_A) + (1 / total_users_B)))
        # critical value from alpha
        z_critical = st.norm.ppf(1 - alpha)
        # reject the null hypothesis when the difference is large enough
        reject_null = abs(z) >= z_critical
        # append data & results to lists
        event_list.append(event)
        sample_a_conv_list.append(round(p1 * 100, 2))
        sample_b_conv_list.append(round(p2 * 100, 2))
        reject_null_list.append(reject_null)
    # create dictionary of results for DataFrame
    result_dict = {'event': event_list,
                   'sample_a_conv': sample_a_conv_list,
                   'sample_b_conv': sample_b_conv_list,
                   'difference': np.array(sample_a_conv_list) - np.array(sample_b_conv_list),
                   'reject_null': reject_null_list
                   }
    # print hypotheses
    print('Null hypothesis:\n The conversion rates of the two samples are equal')
    print('Alt hypothesis:\n The conversion rates of the two samples differ\n')
    print('alpha =', alpha)
    # return results DataFrame
    return pd.DataFrame(result_dict)
# split into groups, set events to compare, & set alpha
group_A1 = filtered_logs.query('group =="246"').drop('group', axis=1)
group_A2 = filtered_logs.query('group =="247"').drop('group', axis=1)
group_B0 = filtered_logs.query('group =="248"').drop('group', axis=1)
group_A0 = filtered_logs.query('(group =="246") | (group =="247")')
events = percent_users_events.index[:4]
alpha = 0.05
# run z-tests comparing conversion rates between control groups for all events
conversions_z_test(group_A1, group_A2, events, alpha)
Null hypothesis:
 The conversion rates of the two samples are equal
Alt hypothesis:
 The conversion rates of the two samples differ

alpha = 0.05
|   | event | sample_a_conv | sample_b_conv | difference | reject_null |
|---|---|---|---|---|---|
| 0 | MainScreenAppear | 98.63 | 98.53 | 0.10 | False |
| 1 | OffersScreenAppear | 62.08 | 60.49 | 1.59 | False |
| 2 | CartScreenAppear | 50.97 | 49.26 | 1.71 | False |
| 3 | PaymentScreenSuccessful | 48.31 | 46.08 | 2.23 | False |
After running our z-test comparing the control samples' conversion rates for each event, we fail to reject the null hypothesis at every stage of the funnel: none of the differences in conversion rates are statistically significant. This is good news and tells us that the A/B test was most likely set up properly.
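As an extra sanity check on the setup (a sketch that isn't part of the original comparison), we can also confirm that no user was assigned to more than one experiment group:
# extra sanity check: make sure no user appears in more than one experiment group
groups_per_user = filtered_logs.groupby('uid')['group'].nunique()
print('Users assigned to more than one group:', (groups_per_user > 1).sum())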
# run z-tests for comparing conversion rates for control group A1 with test group B0 for all events
conversions_z_test(group_A1, group_B0, events, alpha)
Null hypothesis:
 The conversion rates of the two samples are equal
Alt hypothesis:
 The conversion rates of the two samples differ

alpha = 0.05
|   | event | sample_a_conv | sample_b_conv | difference | reject_null |
|---|---|---|---|---|---|
| 0 | MainScreenAppear | 98.63 | 98.27 | 0.36 | False |
| 1 | OffersScreenAppear | 62.08 | 60.35 | 1.73 | False |
| 2 | CartScreenAppear | 50.97 | 48.48 | 2.49 | True |
| 3 | PaymentScreenSuccessful | 48.31 | 46.55 | 1.76 | False |
# run z-tests for comparing conversion rates for control group A2 with test group B0 for all events
conversions_z_test(group_A2, group_B0, events, alpha)
Null hypothesis:
 The conversion rates of the two samples are equal
Alt hypothesis:
 The conversion rates of the two samples differ

alpha = 0.05
|   | event | sample_a_conv | sample_b_conv | difference | reject_null |
|---|---|---|---|---|---|
| 0 | MainScreenAppear | 98.53 | 98.27 | 0.26 | False |
| 1 | OffersScreenAppear | 60.49 | 60.35 | 0.14 | False |
| 2 | CartScreenAppear | 49.26 | 48.48 | 0.78 | False |
| 3 | PaymentScreenSuccessful | 46.08 | 46.55 | -0.47 | False |
We have differing results depending on which control group was used. Using control group A1 (246), we rejected the null hypothesis for the CartScreenAppear event: the difference of 2.49 percentage points was statistically significant. However, when we used control group A2 (247), we failed to reject the null hypothesis at every stage.
# run z-tests for comparing conversion rates for control group A0 with test group B0 for all events
conversions_z_test(group_A0, group_B0, events, alpha)
Null hypothesis:
 The conversion rates of the two samples are equal
Alt hypothesis:
 The conversion rates of the two samples differ

alpha = 0.05
|   | event | sample_a_conv | sample_b_conv | difference | reject_null |
|---|---|---|---|---|---|
| 0 | MainScreenAppear | 98.58 | 98.27 | 0.31 | False |
| 1 | OffersScreenAppear | 61.28 | 60.35 | 0.93 | False |
| 2 | CartScreenAppear | 50.11 | 48.48 | 1.63 | False |
| 3 | PaymentScreenSuccessful | 47.19 | 46.55 | 0.64 | False |
When comparing the combined control group A0 (246 & 247) with the test group B0 (248), we again fail to reject the null hypothesis: the differences in conversion rates between the combined control group and the test group are not statistically significant at any stage of the funnel.
For these tests, we used a critical significance level of 0.05. In total, we carried out 16 tests across 4 combinations of samples (A1 vs A2, A1 vs B0, A2 vs B0, & A0 vs B0). As the number of tests increases, so does the probability of getting at least one false positive. We can counteract this by adjusting our critical significance level.
The probability of making at least one mistake in the course of k comparisons will be:
\begin{align}
1 - (1 - \alpha)^k
\end{align}
We carried out 16 tests and our alpha was 0.05, so the probability of at least one of the results being incorrect is:
\begin{align}
1 - (1 - 0.05)^{16}
\end{align}
# calculate probability of false results
p_false = 1 - (1 - alpha) ** 16
print(round(p_false * 100), '%')
56 %
There is roughly a 56% chance that at least one of our results was a false positive! That's pretty high. We should adjust our critical significance level to a lower value to decrease the probability that one of our test results is false. Let's change alpha to 0.01 and see what happens.
# calculate probability of false results
alpha = 0.01
p_false = 1 - (1 - alpha) ** 16
print(round(p_false * 100), '%')
15 %
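For reference, instead of picking a new alpha by eye, we could solve for the per-test significance level that keeps the family-wise error rate at a chosen target. A minimal sketch, assuming a 5% target across our 16 comparisons (the Šidák formula, with the simpler Bonferroni approximation shown for comparison; target_fwer and k are our own variables):
# per-test alpha that keeps the family-wise error rate at the chosen target across k tests
target_fwer = 0.05
k = 16
alpha_sidak = 1 - (1 - target_fwer) ** (1 / k)   # Sidak: exact under independence
alpha_bonferroni = target_fwer / k               # Bonferroni: conservative approximation
print(f'Sidak-adjusted alpha: {alpha_sidak:.4f}')
print(f'Bonferroni-adjusted alpha: {alpha_bonferroni:.4f}')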
A 15% chance of there being an error across 16 tests seems a little more reasonable. Now, let's rerun our tests with the new alpha value to see if this made any difference in the results of our 16 z-tests.
# A1 vs A2 with new alpha
conversions_z_test(group_A1, group_A2, events, alpha)
Null hypothesis:
 The conversion rates of the two samples are equal
Alt hypothesis:
 The conversion rates of the two samples differ

alpha = 0.01
|   | event | sample_a_conv | sample_b_conv | difference | reject_null |
|---|---|---|---|---|---|
| 0 | MainScreenAppear | 98.63 | 98.53 | 0.10 | False |
| 1 | OffersScreenAppear | 62.08 | 60.49 | 1.59 | False |
| 2 | CartScreenAppear | 50.97 | 49.26 | 1.71 | False |
| 3 | PaymentScreenSuccessful | 48.31 | 46.08 | 2.23 | False |
# A1 vs B0 with new alpha
conversions_z_test(group_A1, group_B0, events, alpha)
Null hypothesis:
 The conversion rates of the two samples are equal
Alt hypothesis:
 The conversion rates of the two samples differ

alpha = 0.01
|   | event | sample_a_conv | sample_b_conv | difference | reject_null |
|---|---|---|---|---|---|
| 0 | MainScreenAppear | 98.63 | 98.27 | 0.36 | False |
| 1 | OffersScreenAppear | 62.08 | 60.35 | 1.73 | False |
| 2 | CartScreenAppear | 50.97 | 48.48 | 2.49 | False |
| 3 | PaymentScreenSuccessful | 48.31 | 46.55 | 1.76 | False |
# A2 vs B0 with new alpha
conversions_z_test(group_A2, group_B0, events, alpha)
Null hypothesis:
 The conversion rates of the two samples are equal
Alt hypothesis:
 The conversion rates of the two samples differ

alpha = 0.01
|   | event | sample_a_conv | sample_b_conv | difference | reject_null |
|---|---|---|---|---|---|
| 0 | MainScreenAppear | 98.53 | 98.27 | 0.26 | False |
| 1 | OffersScreenAppear | 60.49 | 60.35 | 0.14 | False |
| 2 | CartScreenAppear | 49.26 | 48.48 | 0.78 | False |
| 3 | PaymentScreenSuccessful | 46.08 | 46.55 | -0.47 | False |
# A0 vs B0 with new alpha
conversions_z_test(group_A0, group_B0, events, alpha)
Null hypothesis:
 The conversion rates of the two samples are equal
Alt hypothesis:
 The conversion rates of the two samples differ

alpha = 0.01
|   | event | sample_a_conv | sample_b_conv | difference | reject_null |
|---|---|---|---|---|---|
| 0 | MainScreenAppear | 98.58 | 98.27 | 0.31 | False |
| 1 | OffersScreenAppear | 61.28 | 60.35 | 0.93 | False |
| 2 | CartScreenAppear | 50.11 | 48.48 | 1.63 | False |
| 3 | PaymentScreenSuccessful | 47.19 | 46.55 | 0.64 | False |
Changing alpha from 0.05 to 0.01 changed the result of the A1 vs B0 test for the CartScreenAppear event. Previously, we rejected the null hypothesis and could say the new font made a difference at this stage of the funnel. Now, we fail to reject it. The earlier result was likely a false positive caused by an alpha value that was too high for the number of tests we ran.
We can now say that the new fonts didn't make a statistically significant difference in conversion rates at any stage of the funnel for any combination of groups. It's safe to say that changing the fonts isn't an effective strategy for increasing conversion rates.