Given some sample data, write a program to answer the following:
On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.
Reasons for choosing median as a metric:
Order Value => Amount the customer pays per transaction
Average Order Value => Total revenue of a store/total number of orders for that store
# Function to convert the input cell from an array to a float value
# params: input_cell in a df
def float_converter(input_cell):
strings = [str(integer) for integer in input_cell]
a_string = "".join(strings)
a_float = float(a_string)
return a_float
# Function to display/print only the mean median and mode of the given column
# params: df and the respective columns
def central_tendancy(input_df, input_col):
print("Stats for {} column: ". format(input_col))
print("Mean of {} --> ". format(input_col), input_df[input_col].mean())
print("Median of {} --> ". format(input_col), input_df[input_col].median())
print("Mode of {} --> ". format(input_col), input_df[input_col].mode())
print()
# Importing the essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Reading the dataset
df = pd.read_csv("https://raw.githubusercontent.com/Praveen271195/Shopify-Analysis/main/2019%20Winter%20Data%20Science%20Intern%20Challenge%20Data%20Set%20-%20Sheet1.csv")
df.head()
order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | |
---|---|---|---|---|---|---|---|
0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 |
1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:52 |
2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 4:23:56 |
3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:37 |
4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 4:35:11 |
# To check the null values and dtypes
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 order_id 5000 non-null int64 1 shop_id 5000 non-null int64 2 user_id 5000 non-null int64 3 order_amount 5000 non-null int64 4 total_items 5000 non-null int64 5 payment_method 5000 non-null object 6 created_at 5000 non-null object dtypes: int64(5), object(2) memory usage: 273.6+ KB
# Observing the important column's uniqueness
cols_arr = ["order_id", "shop_id", "user_id"]
for i in cols_arr:
print("No. of unique {} in data:".format(i), df[i].nunique())
No. of unique order_id in data: 5000 No. of unique shop_id in data: 100 No. of unique user_id in data: 301
OBSERVATION FOR ABOVE: There are 100 uniques shops as per the instructions, 5000 different orders made and 301 different customers who have made transactions.
# Since the order values is in focus for calculating the AOV
# We shall describe the order_amount from the dataset
# Summary stats fo the dataset
print("Describe: order_amount")
df.describe()["order_amount"]
Describe: order_amount
count 5000.000000 mean 3145.128000 std 41282.539349 min 90.000000 25% 163.000000 50% 284.000000 75% 390.000000 max 704000.000000 Name: order_amount, dtype: float64
OBSERVATION FOR ABOVE: There are 100 uniques shops as per the instructions, 5000 different orders, 301 different customers who have made transactions
The problem in focus is correcting the AOV for the dataset we have. The incorrect AOV calculated here is: $3145.13 (It is clear that the mean of the entire order_amount column is calculated. The above mentioned approach is does not make sense as the standard deviation of the order_amount column is large, which means some of the order_amounts are more separated from the mean value ie: there are outlers or anomolies to be investigated
# Visuvalzing the order_amount column for better comprehension
plt.figure(figsize=(5, 7))
df.boxplot(column='order_amount', return_type='axes')
plt.show()
OBSERVATION FOR ABOVE: There are some outliers in the amount column that has to be further investigated. Since information from Question1 says that each of the shops sells only one model of shoe and the AOV has to be affordable, but from the above boxplot we can see that there are purchases that are more than $700,000 which is too high!
1st answer: The outliers result in the skewness in the dataset, therefore calculating the average for the entire order_amount in the input dataset is a wrong approach!
Further investigation: Calculating price of each item purchased (adding column in dataset - Feature extraction). Reason for calculating price of each item --> To know if the price of each item makes sense (we know from the question that shoes are to be affordable)
# Creating a new column that carries the result of dividing each order amount_amount by each total _items
df["price_of_each_item"] = df["order_amount"]/df["total_items"]
OBSERVATION FOR ABOVE: As per the information provided in the question1, we were able to observe that each shop sells only one item and has it's own price for that item.
# Creating a summary table for each shop_id to know which shop is having a strange total revenue and strange AOV
# Only top 10 (sorted) summary values are shown below
sum_revenue = df.groupby("shop_id")["order_amount"].sum()
sum_orders = df.groupby("shop_id")["total_items"].sum()
sum_revenue.name ='summed_revenue'
sum_orders.name = 'summed_orders'
price_of_each_item = df.groupby("shop_id")["price_of_each_item"].unique()
summary = pd.concat([sum_revenue, sum_orders], axis=1)
aov = summary['summed_revenue']/summary['summed_orders']
price_of_each_item.name = "price_per_item"
aov.name = 'AOV'
summary = pd.concat([price_of_each_item, sum_revenue, sum_orders , aov], axis=1)
summary["price_per_item"] = summary.apply(lambda x: float_converter(x['price_per_item']),axis=1)
summary = summary.sort_values(ascending=False, by="summed_revenue")
summary[:10]
price_per_item | summed_revenue | summed_orders | AOV | |
---|---|---|---|---|
shop_id | ||||
42 | 352.0 | 11990176 | 34063 | 352.0 |
78 | 25725.0 | 2263800 | 88 | 25725.0 |
89 | 196.0 | 23128 | 118 | 196.0 |
81 | 177.0 | 22656 | 128 | 177.0 |
6 | 187.0 | 22627 | 121 | 187.0 |
13 | 160.0 | 21760 | 136 | 160.0 |
59 | 178.0 | 21538 | 121 | 178.0 |
71 | 164.0 | 21320 | 130 | 164.0 |
19 | 163.0 | 20538 | 126 | 163.0 |
70 | 173.0 | 20241 | 117 | 173.0 |
OBSERVATION: From the above sorted (w.r.t summed_revenue) summary table, the shop_id 42 and 78 have very high total revenues. This is the reason why the direct calculation of AOV on the order_amount column of the input data is $3145.13 (too high) seems very wrong! From this observation, we can conclude that directly calculating the average of order_amount column present in the original dataframe will be affected by the outliers. These outliers are from the stores with shop_id: 42 and 78 respectively, with very high total revenues.
NEXT STEP: Now, it is the step to check if these high revenues are valid or not. Looking at the price_per_item for shop_id: 42 and 78, we can infer that price of a pair of sneakers is too strange for shop_id: 78. Even though the total revenue of shop_id 42 is the highest of all, it has an acceptable price_per_item. However, this is not the case for shop_id 78 where the price_per_item is $25725.0 (too high!/unacceptable!), this very high price for an item results in a very high AOV for shop_id 78, which directly affects the AOV calculated for the entire dataset.
# Visuvalizing the each column of the above summary table to see how the data is distributed for all 100 shops
# also to observe the where and how the outlers are present in the dataset
# Creating a subplots for multiple graphs
fig, ax = plt.subplots(nrows = 2, ncols = 2, figsize = (10,10))
# Histogram for price_per_item
freq, bins = np.histogram(summary['price_per_item'], bins = 20)
ax[0,0].hist(summary['price_per_item'], ec = 'k', bins = bins)
ax[0,0].set_ylabel('Frequency')
ax[0,0].set_xlabel('price_per_item')
ax[0,0].set_title('Distribution of Price Per Item Sold')
# Histogram for summed_revenue
freq, bins = np.histogram(summary['summed_revenue'], bins = 20)
ax[0,1].hist(summary['price_per_item'], ec = 'k', bins = bins)
ax[0,1].set_ylabel('Frequency')
ax[0,1].set_xlabel('summed_revenue')
ax[0,1].set_title('Distribution of Total Revenue Collected Per Store')
# Histogram for summed_orders
freq, bins = np.histogram(summary['summed_orders'], bins = 20)
ax[1,0].hist(summary['price_per_item'], ec = 'k', bins = bins)
ax[1,0].set_ylabel('Frequency')
ax[1,0].set_xlabel('summed_orders')
ax[1,0].set_title('Distribution of Total Orders Per Store')
# Histogram for AOV
freq, bins = np.histogram(summary['AOV'], bins = 20)
ax[1,1].hist(summary['price_per_item'], ec = 'k', bins = bins)
ax[1,1].set_ylabel('Frequency')
ax[1,1].set_xlabel('price_per_item')
ax[1,1].set_title('Distribution of Average Order Value Per Store')
plt.tight_layout()
plt.show()
OBSERVATION FOR ABOVE: The above subplots shows the distribution of each column in the summary table. It is evident from the "Distribution of Price Per Item Sold" graph that there are few observations where the price is too high and becomes outliers. The same effect can be seen in all other graphs and this visvalization proves the previsous observation from the summary table.
Possible reasons for outliers:
Possible solutions:
But, it is not advisable to make assumptions and modify the data that we have received. Since at this stage, we cannot receive any confirmations from anyone, we can proceed the analysis without making any changes to the data.
To think about another metric to solve this issue, we can explore the median or mode of the order value column. Reason for choosing to explore the said metrics because, these are not affected by the outliers or skewness in the input data. We can observe the mean, median and mode of the order_amount column in the input dataframe.
df.head()
order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at | price_of_each_item | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 | 112.0 |
1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:52 | 90.0 |
2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 4:23:56 | 144.0 |
3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:37 | 156.0 |
4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 4:35:11 | 156.0 |
# Displaying the central_tendancy of the input dataframe using function
print(central_tendancy(df, "order_amount"))
print("*"*35)
print(central_tendancy(df, "total_items"))
print("*"*35)
print(central_tendancy(df, "price_of_each_item"))
Stats for order_amount column: Mean of order_amount --> 3145.128 Median of order_amount --> 284.0 Mode of order_amount --> 0 153 dtype: int64 None *********************************** Stats for total_items column: Mean of total_items --> 8.7872 Median of total_items --> 2.0 Mode of total_items --> 0 2 dtype: int64 None *********************************** Stats for price_of_each_item column: Mean of price_of_each_item --> 387.7428 Median of price_of_each_item --> 153.0 Mode of price_of_each_item --> 0 153.0 dtype: float64 None
OBSERVATION FOR ABOVE: The above statistical summary tells us that we have two options, excluding the mean of the variable. I think, mode will be the best metric to choose.
In-order to visuvalize the above observation, we can again plot a distibution plot on order_amount of the original dataset, since this visivalization is zoomed out and we are missing to see more details, we can plot another distplot.
# Plotting a displot on order_amount which results in a zoomed out plot missing details
# because of the extreme values at the right hand side
sns.displot(df["order_amount"], bins=29, kde = True, height=10, aspect = 2)
plt.title('Zoomed out Freq. Distribution of order_amount - (Considering shop:78 & 42)', fontsize=20)
plt.xlabel('order_amount', fontsize=25)
plt.ylabel('Frequency', fontsize=25)
plt.show()
Below visuvalization does not include the data that belongs to shop_id:78 and 42. This filtering allows us to look at a better picture at the frequency distibution of the order_amount.
# Filtering out shop_id:42 and 78 and storing into a df to use for detailed visuvalization
df_without_42_78 = df[(df["shop_id"] != 42) & (df["shop_id"] != 78)]
# Plotting the above filtered dataframe
sns.displot(df_without_42_78["order_amount"], bins=29, kde = True, height=10, aspect = 2)
plt.title('Zoomed in Freq. Distribution of order_amount - (removing shop:78 & 42)', fontsize=20)
median = df_without_42_78["order_amount"].median()
plt.axvline(median, color="black",linestyle="dashed") #Median line
plt.xlabel('order_amount', fontsize=25)
plt.ylabel('Frequency', fontsize=25)
plt.show()
OBSERVATION FOR ABOVE: From the above graph, we can see that the most frequently occuring order_amount after removing the extreme vaues caused by shop_id: 42 and 78 is around 150. Even after removing the extreme values, we can observe that the order_amount distribution plot is skewed to the right.
Since the graph is skewed to the right, the usual central tendancy considered is the median of the data. Here, the median value is $284. Which is way less than the first/naively calculated AOV of 3145.13. Which seems like a good metric to fix.
However, when analysing AOV, we have to be wary of the context. According to the article posted in https://www.shopify.com/blog/average-order-value we have to consider all the three central tendancies when trying to understand/analyse AOV. Since we are analysing a business data, we have to keep marketing context in mind. Let's perform some more exploration to see if the mode can be seen as a metric to start with.
# Importing the Counter from python collections
from collections import Counter
cnt = Counter(df["total_items"])
# Visuvalizing the sorted tota items purchased per order
df2=pd.DataFrame({'Total items':cnt.keys(),'Frequency':cnt.values()})
df2.sort_values('Frequency',inplace=True)
fig = plt.figure(figsize=(20, 10))
df2.plot(kind='barh',y='Frequency',x='Total items',color='b')
plt.title( "Total items purchased per order (Sorted)", fontsize="15")
plt.ylabel('Total items purchased', fontsize=15)
plt.xlabel('Frequency', fontsize=15)
plt.show()
<Figure size 1440x720 with 0 Axes>
OBSERVATION FOR ABOVE: We can infer from the above graph that the most frequent number of items purchased in a single order is 2 followed by 1.
# Displaying the difference between the number of orders with 1 and 2 items
print("Total No. of orders with with 1 item: ",str(len(df[df["total_items"] == 1])))
print("Total No. of orders with with 2 items: ",str(len(df[df["total_items"] == 2])))
print("Difference between no. of orders with with 2 items and 1 item: ",
len(df[df["total_items"] == 2])-len(df[df["total_items"] == 1]))
Total No. of orders with with 1 item: 1830 Total No. of orders with with 2 items: 1832 Difference between no. of orders with with 2 items and 1 item: 2
From the above horizontal bar graph and the differences between the number of orders with 1 and 2 items, we can infer that orders with 1 and 2 items are almost the same and is significantly higher that the other orders with 3, 4, 5, 6 and 8 as number of items purchased together. Since this is the case, we can now analyse the central tendencies, specifically the median and mode of the dataframe with total_items purchased as 2 and 1
# Looking at the median and mode of order_amount in original dataframe
central_tendancy(df,"order_amount")
Stats for order_amount column: Mean of order_amount --> 3145.128 Median of order_amount --> 284.0 Mode of order_amount --> 0 153 dtype: int64
# Filtering out the dataframe with total_items as 2
df3 = df[df["total_items"] == 2]
# Looking at the median and mode of the order_amount, not looking at the mean,
# because of skewness by shop_id: 42 & 78
central_tendancy(df3,"order_amount")
Stats for order_amount column: Mean of order_amount --> 750.2150655021834 Median of order_amount --> 306.0 Mode of order_amount --> 0 306 dtype: int64
# Filtering out the dataframe with total_items as 1
df4 = df[df["total_items"] == 1]
# Looking at the median and mode of the order_amount, not looking at the mean,
# because of skewness by shop_id: 42 & 78
central_tendancy(df4,"order_amount")
Stats for order_amount column: Mean of order_amount --> 417.36448087431694 Median of order_amount --> 153.0 Mode of order_amount --> 0 153 dtype: int64
# Looking at the median and mode of the price per item in original dataframe,
# to understand the price per item, not looking at the mean as it is skewed by the presence of
# shop_id: 42 and 78
central_tendancy(df,"price_of_each_item")
Stats for price_of_each_item column: Mean of price_of_each_item --> 387.7428 Median of price_of_each_item --> 153.0 Mode of price_of_each_item --> 0 153.0 dtype: float64
OBSERVATION FOR ABOVE: We can clearly see that the medians and modes of the order_amount column after filtering all data except for data with total_items 1 and 2 are 153 and 306 respectively.
Since the median value of order_amount belonging to the original dataframe is 284 USD and the median value order amount belonging to the filtered df with data of total_items 2 is 306 USD, we can conclude that the best metric in this case is the median. But, it is suggested to start from the mode which is 150 USD. The reason for this is there is no big difference when the total items purchased per order is 1 and 2. And, the median and mode price per pair of sneakers is around 153 USD according to the data and this makes more sense. This observation makes me think that any marketing statergy that is thought to be implemented must start by choosing the mode into consideration and then move on to the median with respect to the marketing context.
References: https://www.shopify.com/blog/average-order-value