import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy import stats
from scipy.stats import skew, kurtosis, kstest, shapiro
from scipy.stats import levene
from scipy.stats import kruskal
from scikit_posthocs import posthoc_dunn
import matplotlib.pyplot as plt
import seaborn as sns
import ptitprince as pt
plt.style.use('ggplot')
In today's competitive fast-food industry, choosing effective marketing strategies play crucial roles in maximizing the impact of new product launch on sales. Recently, a fast-food chain launched three different campaigns to promote its new menu items in 137 locations, selected randomly, in four weeks for a trial phase. Based on this experiment, the Marketing Manager aims to identify which promotional campaign should be chosen for the larger scale campaign. The ideal result is a campaign which can generate the largest amount of sales during the testing phase.
Due to the criticality of the decision for selecting the most optimal campaign to promote new menu items, the challenge here is to find which campaign generated the highest sale in the experiment phase. For this reason, a systematic approach to evaluate performance of marketing campaign is crucial for a successful product launch. My goal in this project is to evaluate the effectiveness of three marketing campaigns in promoting new menu items by comparing the sales performance of each campaign to determine which promotion received the highest sales.
Additionally, for clarity purposes, below is the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$) in this project.
The dataset was obtained from Kaggle. This dataset contains information about marketing campaign captured via IBM Watson Analytics. It consists of seven columns but two variables will be mainly used for this project, namely Promotion
and SalesInThousands
, to achieve the objective.
MarketingID
: unique identifier for marketMarketSize
: size of market area by salesLocationID
: unique identifier for store locationAgeOfStore
: age of store in yearsPromotion
: one of three promotions that were testedWeek
: one of four weeks when the promotions were runSalesInThousands
: sales amount for a specific location, promotion, and weekThis section deals with data preparation, including data cleaning. This broad term "data cleaning" covers whether there are missing values, how many the duplicates are, and how many unique values are. These will be required because the cleanliness of the dataset will affect the trustworthiness of the conclusions drawn from this A/B test.
# load dataset
df = pd.read_csv("../data/Watson_Marketing_Campaign.csv")
# create function to inspect df
def inspect_dataframe(df):
print(f'The dataframe contains {df.shape[0]} rows and {df.shape[1]} cols.')
print(f"- {len(df.select_dtypes(include='number').columns)} are numeric cols")
print(f"- {len(df.select_dtypes(include='O').columns)} are object cols")
summary = {
'ColumnName': df.columns.values.tolist(),
'Nrow': df.shape[0],
'DataType': df.dtypes.values.tolist(),
'NAPct': (df.isna().mean() * 100).round(2).tolist(),
'DuplicatePct': (df.duplicated().sum()/len(df)*100).round(2),
'UniqueValue': df.nunique().tolist(),
'Sample': [df[col].unique() for col in df.columns]
}
return pd.DataFrame(summary)
# inspect df
inspect_dataframe(df)
The dataframe contains 548 rows and 7 cols. - 6 are numeric cols - 1 are object cols
ColumnName | Nrow | DataType | NAPct | DuplicatePct | UniqueValue | Sample | |
---|---|---|---|---|---|---|---|
0 | MarketID | 548 | int64 | 0.0 | 0.0 | 10 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] |
1 | MarketSize | 548 | object | 0.0 | 0.0 | 3 | [Medium, Small, Large] |
2 | LocationID | 548 | int64 | 0.0 | 0.0 | 137 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 10... |
3 | AgeOfStore | 548 | int64 | 0.0 | 0.0 | 25 | [4, 5, 12, 1, 10, 15, 6, 22, 8, 19, 11, 13, 3,... |
4 | Promotion | 548 | int64 | 0.0 | 0.0 | 3 | [3, 2, 1] |
5 | week | 548 | int64 | 0.0 | 0.0 | 4 | [1, 2, 3, 4] |
6 | SalesInThousands | 548 | float64 | 0.0 | 0.0 | 517 | [33.73, 35.67, 29.03, 39.25, 27.81, 34.67, 27.... |
Note
Promotion
will be the primary interest. This will be the main column for grouping. And as can be seen, this column contains three unique values, namely 1
, 2
, and 3
, representing each campaign.The main focus of this exploratory data analysis is to understand whether variable SalesInThousand
grouped by Promotion
meets the parametric assumptions such as normality of data distribution, independent observation, and equal variances between groups. Before moving forward to testing each parametric assumption, one question might arise: Why do we bother to check parametric assumptions, not directly choose a non-parametric test which does not assume normal distribution?
Unlike their parametric counterparts, non-parametric tests have some limitations such as less power, rank-based calculation, and lack of informativeness. Nonparametric tests usually have less statistical power to detect actual difference between groups when the sample size is small. Dissimilar to parametric tests relying calculations on the values themselves, nonparametric tests normally depend on ranks. As a consequence, a loss of information can occur here. Last but not least, the calculation of non-parametric tests rely on median as the measure of central tendency. While median itself is useful to inform the middle value in the data distribution, median does not take into account the magnitude of every value in the data.
This first question deals with the normality of data distribution which can be tested both visually and numerically. Visually, normality can be examined by exploiting a histogram, a boxplot, or a quantile-quantile plot. Here, I will use a raincloud plot, a modification of a histogram and a boxplot. This plot also displays every single point in the dataset. The advantage of this visual test is that we can get the advantages of three types of data visualization techniques in one plot such as clearer examination on data shape and distribution, including central tendency, skewness, kurtosis, and presence of outlier.
As the best practice of normality test is both providing the visual and numerical tests, we also include the numerical tests such as kurtosis, skewness, Kolmogrov-Smirnov, and Shapiro-Wilk test. Kurtosis and skweness test inform us the shape of the distribution numerically. In the interpretations, kurtosis displays the peakness of a distribution. Positive kurtosis (leptokurtic) indicates a sharp peak, and negative kurtosis (platykurtic) shows the distribution has a flatter peak. Furthermore, both Shapiro-Wilk test and Kolmogorov-Smirnov test are used to tell how well data fits a specific distribution and provide p-values to interpret the results. The combination of the two tests provides a more comprehensive assessment. If both tests failt to reject the null hypothesis at $\alpha$ at 0.05, the data does not fit the normal distribution.
# convert num to string for better readings
df['Promotion'] = df["Promotion"].replace({1: 'Promotion 1', 2: 'Promotion 2', 3: 'Promotion 3'})
df["Promotion"].unique()
array(['Promotion 3', 'Promotion 2', 'Promotion 1'], dtype=object)
Note
Promotion
were converted into strings by adding "Promotion" at the beginning.# create function to apply normality tests
def test_normality(group):
return pd.Series({
'skewness': skew(group),
'kurtosis': kurtosis(group),
'ks_pvalue': kstest(group, 'norm').pvalue,
'sw_pvalue': shapiro(group).pvalue
})
# apply the normality tests
(df.groupby('Promotion')['SalesInThousands']
.apply(test_normality)
.to_frame()
.rename(columns={'SalesInThousands':'Value'})
.round(4))
Value | ||
---|---|---|
Promotion | ||
Promotion 1 | skewness | 0.8585 |
kurtosis | 0.0284 | |
ks_pvalue | 0.0000 | |
sw_pvalue | 0.0000 | |
Promotion 2 | skewness | 0.9205 |
kurtosis | 0.6614 | |
ks_pvalue | 0.0000 | |
sw_pvalue | 0.0000 | |
Promotion 3 | skewness | 0.7642 |
kurtosis | -0.1787 | |
ks_pvalue | 0.0000 | |
sw_pvalue | 0.0000 |
Note
Promotion 1
is significantly different from a normal distribution.# set up the figure
plt.figure(figsize=(12, 5))
# use raincloud to display distribution
pt.RainCloud(x='Promotion', y='SalesInThousands', data=df,
palette='Set2', bw=.2, width_viol=.6, ax=None, orient='h')
plt.title('Distributions of data in all promotions are not normal')
plt.xlabel('Sales in Thousands')
plt.ylabel(None)
plt.show()
Note
# create function to detect outliers using IQR
def detect_outliers(df, column, category):
Q1 = df[column].quantile(0.25)
Q3 = df[column].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df.loc[(df[column] < lower_bound) | (df[column] > upper_bound)]
print(f"The number of outliers in '{category}' is {len(outliers)} or {(len(outliers)/len(df)*100):.2f}% of the total data ({len(df)} rows).")
# apply the function to each group
df.groupby('Promotion').apply(lambda x: detect_outliers(x, 'SalesInThousands', x['Promotion'].iloc[0]))
The number of outliers in 'Promotion 1' is 12 or 6.98% of the total data (172 rows). The number of outliers in 'Promotion 2' is 24 or 12.77% of the total data (188 rows). The number of outliers in 'Promotion 3' is 10 or 5.32% of the total data (188 rows).
Note
Promotion 2
appear to have the highest number of outliers, constituting 24 points (12.77%).Promotion 1
and Promotion 3
, on the other hand, have similar rate of outliers (6.98 vs 5.32%).Homogeneity of variance is tested because some tests, especially a one-way ANOVA test, require us to meet this criteria. To test this assumption, I rely on Levene's test (Levene, 1960) with a null hypothesis of equal variances in the groups (Promotion 1
, Promotion 2
, and Promotion 3
). For this reason, if the p-value is less than 0.05, the decision is to reject the $H_0$ and the assumption of equal variances across promotional campaigns is violated.
But before doing the more formal inspection using Levene's test, I will firstly check the spread of data in each group numerically using variances. This initial inspection will provide a numerical sense of the data spread in each promotional campaign.
# get variance of each group
df.groupby('Promotion')['SalesInThousands'].var().reset_index(name='Variance').round(3)
Promotion | Variance | |
---|---|---|
0 | Promotion 1 | 274.028 |
1 | Promotion 2 | 228.281 |
2 | Promotion 3 | 281.106 |
Note:
Promotion 1
and Promotion 3
are similar, indicating similar variability in these groups. Variance for Promotion 2
, on the other hand, is lower than the other two although the difference is not large. From these variances, it is possible that the assumption of homoscedasticity will be met.# creating separate arrays for each promotion
sales_promo1 = df[df['Promotion'] == 'Promotion 1']['SalesInThousands']
sales_promo2 = df[df['Promotion'] == 'Promotion 2']['SalesInThousands']
sales_promo3 = df[df['Promotion'] == 'Promotion 3']['SalesInThousands']
# performing Levene's test
levene_stat, p_value = levene(sales_promo1, sales_promo2, sales_promo3)
# display output
pd.DataFrame({
'Test': ['Levene\'s Test'],
'Test Statistic': [levene_stat],
'p-value': [p_value],
'Conclusion': ['Variances are significantly different' if p_value < 0.05 else 'Variances are not significantly different']
})
Test | Test Statistic | p-value | Conclusion | |
---|---|---|---|---|
0 | Levene's Test | 1.269679 | 0.281751 | Variances are not significantly different |
Note: From the Levene's test output, the variances across three different marketing campaigns were similar (F = 1.269, p = 0.281).Based on p > 0.05, the assumption of homogeneity of variances is tenable.
This section is concerned with checking whether there are weird values in the dataset. For example, the number of weeks within a month is not possible for either negative integer or greater than 4. However, since our main interest is column Promotion
and SalesInThousand
, other columns will not be investigated further.
df.describe().round()
MarketID | LocationID | AgeOfStore | week | SalesInThousands | |
---|---|---|---|---|---|
count | 548.0 | 548.0 | 548.0 | 548.0 | 548.0 |
mean | 6.0 | 480.0 | 9.0 | 2.0 | 53.0 |
std | 3.0 | 288.0 | 7.0 | 1.0 | 17.0 |
min | 1.0 | 1.0 | 1.0 | 1.0 | 17.0 |
25% | 3.0 | 216.0 | 4.0 | 2.0 | 43.0 |
50% | 6.0 | 504.0 | 7.0 | 2.0 | 50.0 |
75% | 8.0 | 708.0 | 12.0 | 3.0 | 60.0 |
max | 10.0 | 920.0 | 28.0 | 4.0 | 100.0 |
# specify columns to plot
numerical_columns = df.select_dtypes(include='number')
# setup up figure size
plt.figure(figsize=(15, 7))
# create boxplots
for i, col in enumerate(numerical_columns):
plt.subplot(5, 3, i+1)
sns.boxplot(x=df[col])
plt.title(f'{col}')
plt.suptitle('No negative values are found in numerical variables', size=15)
plt.tight_layout()
plt.show()
Note
MarketID
and LocationID
is numeric, examining potential measurement errors in these columns are not done because these columns only serve to indicate unique identifier of each observation, not the measurement itself.AgeOfStore
, week
, and SalesInThousand
do not seem to contain any measurement errors since the values are possible in real-worldd.Before moving forward to the statistical analysis, the central part of this A/B test, I will recapitulate the key findings on the parametric assumption tests in Section 3 to better inform which statistical test is best-suited for the testing. In short, the distributions of all sales by promotions are not normal. Outliers are also present in sales based on promotions. Additionally, it is not possible to move forward with one-way independent ANOVA since the homoscedasticity assumption is also violated based on Lavene's test output. For these reasons, Kruskal-Wallis test will be used.
In this section, Kruskal-Wallis H test will be carried out to check the overall differences between sales by promotions. This test will be proceeded by Dunn's post-hoc test to examine specific group difference if the Kruskal-Wallis test output is significant. Since these tests can only provide a half story of the group difference (promotional campaign difference in terms of sales), effect size ($\eta^2$) for Kruskal-Wallis H will be computed. Last, the analysis will be supplied by computation of confidence intervals for the effect size.
# separate data by groups
sales_promo1 = df[df['Promotion'] == 'Promotion 1']['SalesInThousands']
sales_promo2 = df[df['Promotion'] == 'Promotion 2']['SalesInThousands']
sales_promo3 = df[df['Promotion'] == 'Promotion 3']['SalesInThousands']
# number of groups
num_groups = 3
# Perform Kruskal-Wallis H test
h_stat, p_value = kruskal(sales_promo1, sales_promo2, sales_promo3)
# Degrees of freedom
dof = num_groups - 1
# Print Kruskal-Wallis H Test results
print(f"Kruskal-Wallis H Test statistic:")
results = pd.DataFrame({
'H': [h_stat],
'p-value': [p_value],
'dof': [dof],
'significant': [p_value < 0.05]
})
results.round(4)
Kruskal-Wallis H Test statistic:
H | p-value | dof | significant | |
---|---|---|---|---|
0 | 53.2948 | 0.0 | 2 | True |
plt.figure(figsize=(12, 5))
# create boxplot
sns.boxplot(x='Promotion', y='SalesInThousands', data=df, palette='Set2', width=0.5)
sns.stripplot(x='Promotion', y='SalesInThousands', data=df, color='k', alpha=0.5, jitter=True)
plt.title(f'Kruskal-Wallis test shows a significant difference ($H({dof})={h_stat:.2f}, p={p_value:.4f})$')
plt.xlabel(None)
plt.ylabel('Sales in Thousands')
plt.show()
Note
Dunn's test (Dunn, 1964) here is used as a post-hoc pairwise comparison test between promotional campaigns in terms of sales. Probability (p) value needs to be adjusted to prevent obtaining a difference between a promotional campaign pair by random chance (hereby, reducing the risk of a false positive). Holm's method (Holm, 1979) is adopted to control the family-wise error rate (FWER).
# separate data by groups
sales_promo1 = df[df['Promotion'] == 'Promotion 1']['SalesInThousands']
sales_promo2 = df[df['Promotion'] == 'Promotion 2']['SalesInThousands']
sales_promo3 = df[df['Promotion'] == 'Promotion 3']['SalesInThousands']
# perform Dunn's test
dunn_results = posthoc_dunn([sales_promo1, sales_promo2, sales_promo3], p_adjust='holm').round(4)
display(dunn_results)
1 | 2 | 3 | |
---|---|---|---|
1 | 1.0000 | 0.0 | 0.0486 |
2 | 0.0000 | 1.0 | 0.0000 |
3 | 0.0486 | 0.0 | 1.0000 |
# set up the figure
plt.figure(figsize=(12, 5))
# create the boxplot with data points
sns.boxplot(x='Promotion', y='SalesInThousands', data=df, palette='Set2', width=0.5)
sns.stripplot(x='Promotion', y='SalesInThousands', data=df, color='k', alpha=0.5, jitter=True)
# calculate the vertical offset for annotations
max_y = max(df['SalesInThousands'])
offset = max_y * 0.1 # 10% of the y-axis range as an offset
# extract unique promotions for easy indexing
promotions = df['Promotion'].unique()
promotions_dict = {promo: idx for idx, promo in enumerate(promotions)}
# plot Dunn's test results
for i, promo1 in enumerate(promotions):
for j, promo2 in enumerate(promotions):
if i < j:
p_value = dunn_results.iloc[i, j]
if p_value < 0.05:
# determine positions for the bracket
x1, x2 = promotions_dict[promo1], promotions_dict[promo2]
y = max_y + offset * (i + j + 1) # vertical position of the bracket
h = offset * 0.5 # height of the bracket
col = 'k' # color of the line and text
# add a bracket-style line indicating significant difference
plt.plot([x1, x1, x2, x2], [y, y + h, y + h, y], lw=1.5, c=col)
# add a label for the p-value
plt.text((x1 + x2) * 0.5, y + h, f'$p$ = {p_value:.4f}', ha='center', va='bottom', color=col)
plt.title('Dunn\'s test reveals significant difference across promotions ')
plt.ylim(0, 160)
plt.xlabel(None)
plt.ylabel('Sales in Thousands')
plt.show()
Note
Promotion 1
vs Promotion 2
: p = 0.00Promotion 2
vs Promotion 3
: p = 0.00Promotion 1
vs Promotion 3
: p = 0.04Promotion 3
and Promotion 1
pair is only slightly below the alpha ($\alpha$).While we already obtained Kruskal-Wallis H and Dunn's test outputs to examine whether the differences between promotional campaigns are statistically significant, we have not explicitly test which campaign outperforms others in terms of sales. For this reason, I will use median to indicate the best performing campaign. In its interpretation, the promotion with the highest median is the best in terms of its performance.
median_promotion =( df.groupby('Promotion', as_index=False)
.agg(func={'SalesInThousands':'median'})
.sort_values(by='SalesInThousands', ascending=False))
median_promotion
Promotion | SalesInThousands | |
---|---|---|
0 | Promotion 1 | 55.385 |
2 | Promotion 3 | 51.165 |
1 | Promotion 2 | 45.385 |
plt.figure(figsize=(12, 5), dpi=100)
# create barplot
colors = {'Promotion 1': '#8e9fca', 'Promotion 2': '#fc8d62', 'Promotion 3':'#66c2a5'}
ax = sns.barplot(x='Promotion', y='SalesInThousands', data=median_promotion, palette=colors)
# add data labels
ax.bar_label(ax.containers[0], fmt='%.2f', label_type='edge')
plt.title('While Promotion 1 has the highest median sales, differences in medians are small')
plt.ylim(0, 65)
plt.ylabel('Sales in Thousands')
plt.xlabel(None)
plt.show()
Note
Promotion 1
vs Promotion 2
; USD 10K for Promotion 1
vs Promotion 3
; and USD 5.76K for Promotion 2
vs Promotion 3
.Confidence intervals (CIs) allows us to evaluate a range within which we can be confident that the true mean difference lies. And effect size quantifies the magnitude of difference between groups, not just whether the difference is statistically significant.
The confidence intervals here are used to understand the precision of the estimates and effect sizes to understand the practical significance. So it gives context to the statistical significance in the real-world impact of the difference. Effect size used here is eta-squared ($\eta^2$), measuring the proportion of variance explained by promotions in the data (see rstatix 0.7.2 for the R documentation).
$$\eta^2 = \frac{H - k + 1}{n - k}$$# define grouped data
grouped_data = [sales_promo1, sales_promo2, sales_promo3]
# specify k and n
k = len(grouped_data)
n = sum([len(group) for group in grouped_data])
# calculate eta-squared
eta_squared = (h_stat - k + 1) / (n - k)
print(f"Eta-squared: {eta_squared:.4f}")
Eta-squared: 0.0941
Note: As eta-squared ($\eta^2$) represents the proportion of variance in sales (SalesInThousands
) that is explained by promotional campaigns (Promotion
), the effect size ($\eta^2$) means only 9.4% of the variance in sales is explained by the differences between promotional campaigns.
# bootstrapping for confidence intervals
n_bootstraps = 1000
bootstrapped_eta_squared = []
for _ in range(n_bootstraps):
# resample data with replacement
resampled_data = [np.random.choice(group, size=len(group), replace=True) for group in grouped_data]
resampled_H_stat, _ = kruskal(*resampled_data)
# calculate eta-squared for resampled data
resampled_eta_squared = (resampled_H_stat - k + 1) / (n - k)
bootstrapped_eta_squared.append(resampled_eta_squared)
# calculate the confidence intervals (e.g., 95% CI)
ci_lower = np.percentile(bootstrapped_eta_squared, 2.5)
ci_upper = np.percentile(bootstrapped_eta_squared, 97.5)
print(f"95% Confidence Interval for eta-squared: [{ci_lower:.4f}, {ci_upper:.4f}]")
95% Confidence Interval for eta-squared: [0.0518, 0.1474]
Note: The 95% confidence interval for $\eta^2$, obtained via bootstrapping, was [0.0567, 0.1463]. This interval suggest that the true effect size lies between 5% and 14% of the variance in sales being attributable to the promotional campaigns.
In brief, this analysis seeks to identify which promotional campaign generated largest sales. Since the data did not meet the assumptions for parametric tests, Kruskal-Wallis H test was adopted along with Dunn's post-hoc test. The results revealed that Promotion 1 had the highest median sales among three campaigns. While these three campaigns were found to be statistically different, the practical significance of their differences is small.
Taken altogether, despite finding the significant difference between promotional campaigns, with the Promotion 1 as highest median sales, the practical difference is not big enough. This small impact implies that while the campaigns affect sales, the impact is relatively small to warrant major changes based on these results alone.