Shopify Summer 2022 Data Science Challenge¶

Candidate Name: Praveen Chandrasekaran¶

Question 1:¶

Given some sample data, write a program to answer the following:

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

Think about what could be going wrong with our calculation.
Think about a better way to evaluate this data. What metric would you report for this dataset?
What is its value?

Answer 1:¶

Calculating the mean/average direclty on the order_amount column to report the Average Order Value for the given dataset leads to a wrong and a very high AOV of USD 3,145.13. This high value is due to the skeweness that is caused by the total order_amounts/revenue generated by Shop_id: 42 and 78 respectively. This skeweness has affected the mean of order_amount column or AOV. This cannot be a logical approach to come up with a metric.
Better way to evalute the given dataset is by determining the shop_ids that carrys highest order_amounts and further investigate if the respective order_amounts and total_items makes sense. This step will allow us to validate the data of shops with high order_amounts by considering the order_amount of other shops. The best metric in this case woud be the median.

Reasons for choosing median as a metric:

The mode and median price per item is USD 153
Most of the customers are buying 2 pairs of sneakers per order
From above two statements we can infer that one pair of sneakers cost around USD 300
The median value of the given dataset i:e USD 284 makes more sense as it is close to USD 300

The in-depth analysis that was performed on the dataset helped me conclude that median is the appropriate metric for given dataset and the value is USD 284

Important Definitions:¶

Order Value => Amount the customer pays per transaction

Average Order Value => Total revenue of a store/total number of orders for that store

Functions used for the program below:¶

In [1]:

# Function to convert the input cell from an array to a float value
# params: input_cell in a df 
def float_converter(input_cell):
  strings = [str(integer) for integer in input_cell]
  a_string = "".join(strings)
  a_float = float(a_string)
  return a_float

In [2]:

# Function to display/print only the mean median and mode of the given column
# params: df and the respective columns
def central_tendancy(input_df, input_col):
  print("Stats for {} column: ". format(input_col))
  print("Mean of {} --> ". format(input_col), input_df[input_col].mean())
  print("Median of {} --> ". format(input_col), input_df[input_col].median())
  print("Mode of {} --> ". format(input_col), input_df[input_col].mode())
  print()

Program used for analysis:¶

In [3]:

# Importing the essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:

# Reading the dataset
df = pd.read_csv("https://raw.githubusercontent.com/Praveen271195/Shopify-Analysis/main/2019%20Winter%20Data%20Science%20Intern%20Challenge%20Data%20Set%20-%20Sheet1.csv")
df.head()

Out[4]:

	order_id	shop_id	user_id	order_amount	total_items	payment_method	created_at
0	1	53	746	224	2	cash	2017-03-13 12:36:56
1	2	92	925	90	1	cash	2017-03-03 17:38:52
2	3	44	861	144	1	cash	2017-03-14 4:23:56
3	4	18	935	156	1	credit_card	2017-03-26 12:43:37
4	5	18	883	156	1	credit_card	2017-03-01 4:35:11

In [5]:

# To check the null values and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   order_id        5000 non-null   int64 
 1   shop_id         5000 non-null   int64 
 2   user_id         5000 non-null   int64 
 3   order_amount    5000 non-null   int64 
 4   total_items     5000 non-null   int64 
 5   payment_method  5000 non-null   object
 6   created_at      5000 non-null   object
dtypes: int64(5), object(2)
memory usage: 273.6+ KB

In [6]:

# Observing the important column's uniqueness
cols_arr = ["order_id", "shop_id", "user_id"]
for i in cols_arr:
  print("No. of unique {} in data:".format(i), df[i].nunique())  

No. of unique order_id in data: 5000
No. of unique shop_id in data: 100
No. of unique user_id in data: 301

OBSERVATION FOR ABOVE: There are 100 uniques shops as per the instructions, 5000 different orders made and 301 different customers who have made transactions.

In [7]:

# Since the order values is in focus for calculating the AOV
# We shall describe the order_amount from the dataset
# Summary stats fo the dataset
print("Describe: order_amount")
df.describe()["order_amount"]

Describe: order_amount

Out[7]:

count      5000.000000
mean       3145.128000
std       41282.539349
min          90.000000
25%         163.000000
50%         284.000000
75%         390.000000
max      704000.000000
Name: order_amount, dtype: float64

OBSERVATION FOR ABOVE: There are 100 uniques shops as per the instructions, 5000 different orders, 301 different customers who have made transactions

The problem in focus is correcting the AOV for the dataset we have. The incorrect AOV calculated here is: $3145.13 (It is clear that the mean of the entire order_amount column is calculated. The above mentioned approach is does not make sense as the standard deviation of the order_amount column is large, which means some of the order_amounts are more separated from the mean value ie: there are outlers or anomolies to be investigated

In [8]:

# Visuvalzing the order_amount column for better comprehension
plt.figure(figsize=(5, 7))
df.boxplot(column='order_amount', return_type='axes')
plt.show()

OBSERVATION FOR ABOVE: There are some outliers in the amount column that has to be further investigated. Since information from Question1 says that each of the shops sells only one model of shoe and the AOV has to be affordable, but from the above boxplot we can see that there are purchases that are more than $700,000 which is too high!

1st answer: The outliers result in the skewness in the dataset, therefore calculating the average for the entire order_amount in the input dataset is a wrong approach!

Further investigation: Calculating price of each item purchased (adding column in dataset - Feature extraction). Reason for calculating price of each item --> To know if the price of each item makes sense (we know from the question that shoes are to be affordable)

In [9]:

# Creating a new column that carries the result of dividing each order amount_amount by each total _items 
df["price_of_each_item"] = df["order_amount"]/df["total_items"]

OBSERVATION FOR ABOVE: As per the information provided in the question1, we were able to observe that each shop sells only one item and has it's own price for that item.

In [10]:

# Creating a summary table for each shop_id to know which shop is having a strange total revenue and strange AOV
# Only top 10 (sorted) summary values are shown below
sum_revenue = df.groupby("shop_id")["order_amount"].sum()
sum_orders = df.groupby("shop_id")["total_items"].sum()
sum_revenue.name ='summed_revenue'
sum_orders.name = 'summed_orders'
price_of_each_item = df.groupby("shop_id")["price_of_each_item"].unique()
summary = pd.concat([sum_revenue, sum_orders], axis=1)
aov = summary['summed_revenue']/summary['summed_orders'] 
price_of_each_item.name = "price_per_item" 
aov.name = 'AOV'
summary = pd.concat([price_of_each_item, sum_revenue, sum_orders , aov], axis=1)
summary["price_per_item"] = summary.apply(lambda x: float_converter(x['price_per_item']),axis=1)
summary = summary.sort_values(ascending=False, by="summed_revenue")
summary[:10]

Out[10]:

	price_per_item	summed_revenue	summed_orders	AOV
shop_id
42	352.0	11990176	34063	352.0
78	25725.0	2263800	88	25725.0
89	196.0	23128	118	196.0
81	177.0	22656	128	177.0
6	187.0	22627	121	187.0
13	160.0	21760	136	160.0
59	178.0	21538	121	178.0
71	164.0	21320	130	164.0
19	163.0	20538	126	163.0
70	173.0	20241	117	173.0

OBSERVATION: From the above sorted (w.r.t summed_revenue) summary table, the shop_id 42 and 78 have very high total revenues. This is the reason why the direct calculation of AOV on the order_amount column of the input data is $3145.13 (too high) seems very wrong! From this observation, we can conclude that directly calculating the average of order_amount column present in the original dataframe will be affected by the outliers. These outliers are from the stores with shop_id: 42 and 78 respectively, with very high total revenues.

NEXT STEP: Now, it is the step to check if these high revenues are valid or not. Looking at the price_per_item for shop_id: 42 and 78, we can infer that price of a pair of sneakers is too strange for shop_id: 78. Even though the total revenue of shop_id 42 is the highest of all, it has an acceptable price_per_item. However, this is not the case for shop_id 78 where the price_per_item is $25725.0 (too high!/unacceptable!), this very high price for an item results in a very high AOV for shop_id 78, which directly affects the AOV calculated for the entire dataset.

In [11]:

# Visuvalizing the each column of the above summary table to see how the data is distributed for all 100 shops
# also to observe the where and how the outlers are present in the dataset

# Creating a subplots for multiple graphs
fig, ax = plt.subplots(nrows = 2, ncols = 2, figsize = (10,10))

# Histogram for price_per_item
freq, bins = np.histogram(summary['price_per_item'], bins = 20)
ax[0,0].hist(summary['price_per_item'], ec = 'k', bins = bins)
ax[0,0].set_ylabel('Frequency')
ax[0,0].set_xlabel('price_per_item')
ax[0,0].set_title('Distribution of Price Per Item Sold')

# Histogram for summed_revenue
freq, bins = np.histogram(summary['summed_revenue'], bins = 20)
ax[0,1].hist(summary['price_per_item'], ec = 'k', bins = bins)
ax[0,1].set_ylabel('Frequency')
ax[0,1].set_xlabel('summed_revenue')
ax[0,1].set_title('Distribution of Total Revenue Collected Per Store')

# Histogram for summed_orders
freq, bins = np.histogram(summary['summed_orders'], bins = 20)
ax[1,0].hist(summary['price_per_item'], ec = 'k', bins = bins)
ax[1,0].set_ylabel('Frequency')
ax[1,0].set_xlabel('summed_orders')
ax[1,0].set_title('Distribution of Total Orders Per Store')

# Histogram for AOV
freq, bins = np.histogram(summary['AOV'], bins = 20)
ax[1,1].hist(summary['price_per_item'], ec = 'k', bins = bins)
ax[1,1].set_ylabel('Frequency')
ax[1,1].set_xlabel('price_per_item')
ax[1,1].set_title('Distribution of Average Order Value Per Store')

plt.tight_layout()
plt.show()

OBSERVATION FOR ABOVE: The above subplots shows the distribution of each column in the summary table. It is evident from the "Distribution of Price Per Item Sold" graph that there are few observations where the price is too high and becomes outliers. The same effect can be seen in all other graphs and this visvalization proves the previsous observation from the summary table.

Possible reasons for outliers:

Could be because of a wrong data entry of price per item for shop_id: 78
Could be because of a wrong data entry of number of items sold (34,063 units) for shop_id: 42

Possible solutions:

If we assume that the per_per_item sold by shop_id: 42 is wrong, we could change the amount from $25725.0 to $257.25.
If we assume that the summed_revenue for shop_id: 78 is wrong, we could correct the 340,63 units to 340 units

But, it is not advisable to make assumptions and modify the data that we have received. Since at this stage, we cannot receive any confirmations from anyone, we can proceed the analysis without making any changes to the data.

Exploring new metrics:¶

To think about another metric to solve this issue, we can explore the median or mode of the order value column. Reason for choosing to explore the said metrics because, these are not affected by the outliers or skewness in the input data. We can observe the mean, median and mode of the order_amount column in the input dataframe.

In [12]:

df.head()

Out[12]:

	order_id	shop_id	user_id	order_amount	total_items	payment_method	created_at	price_of_each_item
0	1	53	746	224	2	cash	2017-03-13 12:36:56	112.0
1	2	92	925	90	1	cash	2017-03-03 17:38:52	90.0
2	3	44	861	144	1	cash	2017-03-14 4:23:56	144.0
3	4	18	935	156	1	credit_card	2017-03-26 12:43:37	156.0
4	5	18	883	156	1	credit_card	2017-03-01 4:35:11	156.0

In [13]:

# Displaying the central_tendancy of the input dataframe using function
print(central_tendancy(df, "order_amount"))
print("*"*35)
print(central_tendancy(df, "total_items"))
print("*"*35)
print(central_tendancy(df, "price_of_each_item"))

Stats for order_amount column: 
Mean of order_amount -->  3145.128
Median of order_amount -->  284.0
Mode of order_amount -->  0    153
dtype: int64

None
***********************************
Stats for total_items column: 
Mean of total_items -->  8.7872
Median of total_items -->  2.0
Mode of total_items -->  0    2
dtype: int64

None
***********************************
Stats for price_of_each_item column: 
Mean of price_of_each_item -->  387.7428
Median of price_of_each_item -->  153.0
Mode of price_of_each_item -->  0    153.0
dtype: float64

None

OBSERVATION FOR ABOVE: The above statistical summary tells us that we have two options, excluding the mean of the variable. I think, mode will be the best metric to choose.

In-order to visuvalize the above observation, we can again plot a distibution plot on order_amount of the original dataset, since this visivalization is zoomed out and we are missing to see more details, we can plot another distplot.

In [14]:

# Plotting a displot on order_amount which results in a zoomed out plot missing details
# because of the extreme values at the right hand side
sns.displot(df["order_amount"], bins=29, kde = True, height=10, aspect = 2)
plt.title('Zoomed out Freq. Distribution of order_amount - (Considering shop:78 & 42)', fontsize=20)
plt.xlabel('order_amount', fontsize=25)
plt.ylabel('Frequency', fontsize=25)
plt.show()

Below visuvalization does not include the data that belongs to shop_id:78 and 42. This filtering allows us to look at a better picture at the frequency distibution of the order_amount.

In [15]:

# Filtering out shop_id:42 and 78 and storing into a df to use for detailed visuvalization
df_without_42_78 = df[(df["shop_id"] != 42) & (df["shop_id"] != 78)] 
# Plotting the above filtered dataframe
sns.displot(df_without_42_78["order_amount"], bins=29, kde = True, height=10, aspect = 2)
plt.title('Zoomed in Freq. Distribution of order_amount - (removing shop:78 & 42)', fontsize=20)

median = df_without_42_78["order_amount"].median()
plt.axvline(median, color="black",linestyle="dashed") #Median line

plt.xlabel('order_amount', fontsize=25)
plt.ylabel('Frequency', fontsize=25)
plt.show()

OBSERVATION FOR ABOVE: From the above graph, we can see that the most frequently occuring order_amount after removing the extreme vaues caused by shop_id: 42 and 78 is around 150. Even after removing the extreme values, we can observe that the order_amount distribution plot is skewed to the right.

Since the graph is skewed to the right, the usual central tendancy considered is the median of the data. Here, the median value is $284. Which is way less than the first/naively calculated AOV of 3145.13. Which seems like a good metric to fix.

However, when analysing AOV, we have to be wary of the context. According to the article posted in https://www.shopify.com/blog/average-order-value we have to consider all the three central tendancies when trying to understand/analyse AOV. Since we are analysing a business data, we have to keep marketing context in mind. Let's perform some more exploration to see if the mode can be seen as a metric to start with.

In [16]:

# Importing the Counter from python collections
from collections import Counter
cnt = Counter(df["total_items"])

# Visuvalizing the sorted tota items purchased per order
df2=pd.DataFrame({'Total items':cnt.keys(),'Frequency':cnt.values()})
df2.sort_values('Frequency',inplace=True)
fig = plt.figure(figsize=(20, 10))
df2.plot(kind='barh',y='Frequency',x='Total items',color='b')
plt.title( "Total items purchased per order (Sorted)", fontsize="15")
plt.ylabel('Total items purchased', fontsize=15)
plt.xlabel('Frequency', fontsize=15)
plt.show()

<Figure size 1440x720 with 0 Axes>

OBSERVATION FOR ABOVE: We can infer from the above graph that the most frequent number of items purchased in a single order is 2 followed by 1.

In [17]:

# Displaying the difference between the number of orders with 1 and 2 items
print("Total No. of orders with with 1 item: ",str(len(df[df["total_items"] == 1])))
print("Total No. of orders with with 2 items: ",str(len(df[df["total_items"] == 2])))
print("Difference between no. of orders with with 2 items and 1 item: ",
      len(df[df["total_items"] == 2])-len(df[df["total_items"] == 1]))

Total No. of orders with with 1 item:  1830
Total No. of orders with with 2 items:  1832
Difference between no. of orders with with 2 items and 1 item:  2

From the above horizontal bar graph and the differences between the number of orders with 1 and 2 items, we can infer that orders with 1 and 2 items are almost the same and is significantly higher that the other orders with 3, 4, 5, 6 and 8 as number of items purchased together. Since this is the case, we can now analyse the central tendencies, specifically the median and mode of the dataframe with total_items purchased as 2 and 1

In [18]:

# Looking at the median and mode of order_amount in original dataframe
central_tendancy(df,"order_amount")

Stats for order_amount column: 
Mean of order_amount -->  3145.128
Median of order_amount -->  284.0
Mode of order_amount -->  0    153
dtype: int64

In [19]:

# Filtering out the dataframe with total_items as 2
df3 = df[df["total_items"] == 2]
# Looking at the median and mode of the order_amount, not looking at the mean,
# because of skewness by shop_id: 42 & 78
central_tendancy(df3,"order_amount")

Stats for order_amount column: 
Mean of order_amount -->  750.2150655021834
Median of order_amount -->  306.0
Mode of order_amount -->  0    306
dtype: int64

In [20]:

# Filtering out the dataframe with total_items as 1
df4 = df[df["total_items"] == 1]
# Looking at the median and mode of the order_amount, not looking at the mean, 
# because of skewness by shop_id: 42 & 78
central_tendancy(df4,"order_amount")

Stats for order_amount column: 
Mean of order_amount -->  417.36448087431694
Median of order_amount -->  153.0
Mode of order_amount -->  0    153
dtype: int64

In [21]:

# Looking at the median and mode of the price per item in original dataframe,
# to understand the price per item, not looking at the mean as it is skewed by the presence of 
# shop_id: 42 and 78
central_tendancy(df,"price_of_each_item")

Stats for price_of_each_item column: 
Mean of price_of_each_item -->  387.7428
Median of price_of_each_item -->  153.0
Mode of price_of_each_item -->  0    153.0
dtype: float64

OBSERVATION FOR ABOVE: We can clearly see that the medians and modes of the order_amount column after filtering all data except for data with total_items 1 and 2 are 153 and 306 respectively.

Conclusion:¶

Since the median value of order_amount belonging to the original dataframe is 284 USD and the median value order amount belonging to the filtered df with data of total_items 2 is 306 USD, we can conclude that the best metric in this case is the median. But, it is suggested to start from the mode which is 150 USD. The reason for this is there is no big difference when the total items purchased per order is 1 and 2. And, the median and mode price per pair of sneakers is around 153 USD according to the data and this makes more sense. This observation makes me think that any marketing statergy that is thought to be implemented must start by choosing the mode into consideration and then move on to the median with respect to the marketing context.

References: https://www.shopify.com/blog/average-order-value