#!/usr/bin/env python # coding: utf-8 # # Business Statistics: News Source # # **Marks: 60** # ## Define Problem Statement and Objectives # News Source, an online news website, has redesigned its landing page with the aim of increasing subscriptions. It collected data during an A/B test: some users were served the old landing page and some were served the new one. # # The data consists of: # * user ID # * test group # * landing page version # * time spent on landing page (minutes) # * subscriber conversion flag # * preferred language # # We will determine whether the landing page redesign has affected user behavior. Throughout, the level of significance is fixed at 5%. # ## Import all the necessary libraries # In[ ]: # math libraries import numpy as np import scipy.stats as st # data handling and visualization libraries import pandas as pd import matplotlib.pyplot as plt get_ipython().run_line_magic('matplotlib', 'inline') import seaborn as sns # I will import specific tests from the statsmodels library later if necessary # ## Reading the Data into a DataFrame # In[ ]: usage=pd.read_csv('dataset.csv') # ## Explore the dataset and extract insights using Exploratory Data Analysis # Let's look at the data. We'll start with the head and tail, or the first five rows and last five rows respectively. # In[ ]: usage.head() # In[ ]: usage.tail() # In[ ]: usage.shape # We have 100 rows of user data examining News Source site usage. Since this data is from A/B testing, we are given whether each user was in the control group or the treatment group, that is, whether they were shown the old layout or the new layout. We also get a unique ID for each user, which we can utilize to check for duplicates. Time spent on the page is a continuous numerical variable, and I expect much of our analysis will revolve around time. The other key variable is found in the converted column, since the main business goal is to convert readers to subscribers. 
Lastly, we can study language preference. # # It looks like group and landing_page give the same information in two different ways. We can confirm they are perfectly aligned below. # In[ ]: usage.groupby('group')['landing_page'].value_counts() # We next review the column info. # In[ ]: usage.info() # Thankfully, there are no null entries reported. We see that four of the six columns have 'object' data type, that is, strings. We will convert these columns to the 'categorical' data type later. Time, a continuous variable, is stored as a float, which is to be expected, since we will be looking at decimal numbers. # # We review summary statistics. # In[ ]: usage.describe().T # In[ ]: usage.describe(include=[object]).T # User ID is not an insightful numerical variable, since it is just an identifier. For time_spent_on_the_page, which I will just refer to as *time*, we see that the mean and median are fairly close. This indicates that the time data is likely not very skewed. The minimum is 0.19 minutes and the maximum is 10.71 minutes. # # For the categorical variables, the unique count is a good way to check that the data is intact: We expect only two possibilities for group, landing_page, and converted. Group and landing_page data are split perfectly in half, slightly more than half of the users were converted (yay!), and the most popular language is Spanish. # # Lastly, let's check for missing and duplicated entries. # In[ ]: usage.isnull().sum() # In[ ]: usage.duplicated().sum() # In[ ]: usage['user_id'].duplicated().sum() # There are no null entries and no duplicated rows. Additionally, we checked specifically to ensure that there are no duplicated user IDs, since we only want one result in the sample from each user. # ### Univariate Analysis # We start by converting the categorical variables to the categorical data type. 
# In[ ]: # convert object columns to categorical columns # and create a list of these column names col_types=usage.dtypes cat_col=[] for col in usage.columns: if col_types[col]==object: usage[col]=pd.Categorical(usage[col]) cat_col.append(col) # The ```info()``` method allows us to verify that these columns have been successfully converted to the 'category' data type. # In[ ]: usage.info() # Let's look at what the values are for our categorical variables. # In[ ]: # print a table with value counts for each categorical column print('Value Counts by Categorical Column') print('~'*50) for col in cat_col: print('Column:',col) print(usage[col].value_counts()) print('~'*50) # Recall that our sample size is 100 users. We note immediately that the groups are split perfectly down the middle. As was noted earlier, group and landing_page convey the same data in two different ways, and since we are A/B testing the layout, it is helpful that our two groups have the same size. From the converted column, we find that slightly more than half of the users were converted to subscribers. I'd be interested to see how that conversion is distributed across the test groups! Lastly, we have language preference. The three languages have about equal representation in the data, around 1/3 of the sample preferring English, 1/3 Spanish, and 1/3 French. We will look later at whether there is any difference between these language preferences and the conversion rate. # In[ ]: # use seaborn default plotting theme sns.set_theme() # In[ ]: # create a plot of categorical variables # using a loop and subplots # figure setup plt.figure(figsize=(16,5)) plt.suptitle('Ratio of Values',fontsize=20) # loop to create subplots for i,col in enumerate(cat_col): plt.subplot(1,4,i+1) sns.countplot(data=usage,x=col) plt.show() # We find in the countplots what we found in the table above. The only new insight is perhaps a better visual sense of how close these values are. 
For example, when we look across the entire sample, the number of converted users is quite close to the number of users who weren't converted. # In[ ]: # plot the numerical variable: time time='time_spent_on_the_page' # figure setup fig=plt.figure(figsize=(12,5)) fig.suptitle('Time Spent on the Page',fontsize=20) # histogram subplot plt.subplot(1,2,1) sns.histplot(data=usage,x=time,kde=True) plt.xlabel('Time (min)') # boxplot subplot plt.subplot(1,2,2) sns.boxplot(data=usage,x=time) plt.xlabel('Time (min)') plt.show() # We see from the KDE on the left that time is approximately normally distributed. The peak is between 5 and 6 minutes, as is the median, which can be read from the boxplot. The boxplot further illustrates that the spread of the data is not irregular, that is, there are no outliers: the minimum is just about 0 minutes and the maximum is around 11 minutes, both within bounds for these data. As can be seen in both the histogram and (especially) the boxplot, the middle 50% of the data is concentrated between about 4 minutes and 7 minutes. # # We can combine these two plots into the violin plot below, which encapsulates the analysis above. # In[ ]: # violin plot of time # figure setup plt.figure(figsize=(8,5)) plt.title('Violin Plot of Time', fontsize=16) # violin restricted to data range (cut parameter) sns.violinplot(data=usage,x=time,cut=0) plt.xlabel('Time (min)'); # ### Bivariate Analysis # Let's start with the most pressing question: Does the landing page design have any effect on the conversion rate? # In[ ]: # figure setup plt.figure(figsize=(8,5)) plt.title('Conversion Rate by Landing Page Design',fontsize=14) # countplot sns.countplot(data=usage, x='landing_page',hue='converted') plt.xlabel('Design') plt.ylabel('Count'); # Indeed, it appears that the new design has a positive impact on conversion rate. In particular, more than half the users served the new landing page were converted to subscribers, whereas the old design had a conversion rate less than 50%. 
Now this looks like a big improvement, but we cannot be sure it is statistically significant until we run the test later. # # Next, let's look at whether language preference and conversion rate appear to interact. # In[ ]: # figure setup plt.figure(figsize=(8,5)) plt.title('Conversion Rate by Language Preference',fontsize=14) # countplot sns.countplot(data=usage,x='language_preferred',hue='converted') plt.legend(loc='upper left') plt.xlabel('Language') plt.ylabel('Count'); # We are seeing a difference in conversion rate between the three languages! English-speaking users converted at the highest rate, with Spanish-speaking users somewhat behind. For the French-speaking users, it appears that *fewer* users subscribed than did not. Note that this plot pools both test groups, so it tells us nothing yet about the effect of the redesign itself. # # We should check, though, that all three languages were evenly represented in the two groups. # In[ ]: # figure setup plt.figure(figsize=(8,5)) plt.title('Group Language Breakdown',fontsize=14) # countplot sns.countplot(data=usage,x='group',hue='language_preferred') plt.legend(loc='lower left') plt.ylabel('Count'); # Thankfully, the groups are evenly composed of users speaking the three languages. Turning then to our numerical variable, time, was there any difference between the three languages with respect to how much time users spent on the News Source landing page? # In[ ]: # kde of time based on language sns.displot(data=usage,x=time,hue='language_preferred',kind='kde') plt.title('Time on Landing Page by Language Preference',fontsize=16) plt.xlabel('Time (min)'); # In[ ]: # variance of time by language usage.groupby('language_preferred')[time].var() # No, users preferring all three languages spent essentially the same amount of time, on average, on the landing page. 
Several important features of the plot are worth noting: The time data for all three languages are approximately normally distributed, and Spanish users seem to fall closer to the mean. That is, the density is greater in the middle for Spanish users and the variance is thus less. Our subsequent calculation further supports this observation. # In[ ]: # figure setup plt.figure(figsize=(8,5)) plt.title('Time by Group',fontsize=14) # histogram sns.histplot(data=usage,x=time,hue='group',kde=True) plt.xlabel('Time (min)'); # In[ ]: # precise mean calculation usage.groupby('group')[time].mean() # Looking at time broken down by group, we see (from the KDE) that the mean time for the treatment group is higher than that of the control. This is good! Recall that the goal of the News Source layout redesign was to keep people on the site longer and convert them to subscribers. Time is approximately normally distributed for both groups, with a peak around 6 minutes for the treatment group and around 4.5 minutes for the control. # In[ ]: # time and conversion rate # figure setup plt.figure(figsize=(12,5)) plt.suptitle('Time and Conversion Rate',fontsize=20) # boxplot subplot plt.subplot(1,2,1) sns.boxplot(data=usage,x=time,y='converted') plt.xlabel('Time (min)') # swarmplot subplot plt.subplot(1,2,2) sns.swarmplot(data=usage,x=time,y='converted') plt.xlabel('Time (min)') plt.show() # Now looking at conversion rate, we see that spending more time on the landing page is generally a good indicator of conversion status. Since the boxplots contain outliers, we'd like a better idea of the spread and concentration of the data. We accomplish this with a swarm plot, which shows every data point. The swarm plot illustrates that non-converted users have a wider spread of times. This is supported by the wider IQR that can be observed in the boxplot. Times for converted users are generally greater with less variability. 
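The spread comparison above can also be made numerical. Below is a minimal sketch of a per-group spread summary; the mini DataFrame and the `spread_summary` helper are illustrative stand-ins (the real `usage` frame lives in the notebook session):

```python
import pandas as pd

# illustrative stand-in for the notebook's `usage` frame (NOT the real data)
demo = pd.DataFrame({
    'converted': ['yes', 'yes', 'yes', 'no', 'no', 'no'],
    'time_spent_on_the_page': [6.2, 7.1, 5.8, 2.5, 9.0, 4.0],
})

def spread_summary(df, by='converted', col='time_spent_on_the_page'):
    """Mean, IQR, and standard deviation of `col` within each level of `by`."""
    g = df.groupby(by)[col]
    return pd.DataFrame({
        'mean': g.mean(),
        'iqr': g.quantile(0.75) - g.quantile(0.25),
        'std': g.std(),
    })

print(spread_summary(demo))
```

On the real data, we would expect the non-converted group to show the larger IQR and standard deviation, matching what the boxplot and swarm plot display.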
# In[ ]: # language differences sns.catplot(data=usage, x='language_preferred', y=time, col='converted', kind='box'); # What is apparent from this plot is that time has a greater connection with conversion status than language. Generally, the time is greater for converted users than non-converted users, independent of language preference. # In[ ]: # conversion and group vs time sns.catplot(data=usage, x='converted', y=time, col='group', kind='box'); # Here again, time is more closely connected with conversion status than test group. We see that converted users reliably spend more time on the News Source landing page than non-converted users, independent of the page layout. What is of note is that the medians are higher for all users in the treatment group. In other words, the redesign appears to have had a positive effect on user retention. # In[ ]: # split violin plot of time sns.catplot(data=usage, x=time, y='language_preferred', hue='group', hue_order=['treatment','control'], kind='violin', split=True) plt.title('Time by Language and Group',fontsize=16) plt.xlabel('Time (min)') plt.ylabel('Language'); # Lastly, we can visualize time with respect to language preference and test group. For all three languages, we see that users in the treatment group stay on News Source longer than those in the control group; that is, the peak of the distribution sits further to the right (at longer times) for users in the treatment group. What's more, the split violin plot illuminates the differences between the languages and time, as no two languages have similar-looking violins. # ## 1. Do the users spend more time on the new landing page than the existing landing page? # In[ ]: # initial numerical comparison usage.groupby('landing_page')[time].mean() # It appears that the mean time for the new landing page is greater than the mean time for the old landing page. Now what we must determine is whether the difference is significant. 
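Before moving to formal testing, the size of this difference can also be summarized with an effect size such as Cohen's d (mean difference divided by a pooled standard deviation). A sketch with small illustrative numbers, not the notebook's data; only the column names mirror the dataset:

```python
import numpy as np
import pandas as pd

# illustrative stand-in for the notebook's usage[['landing_page', time]] data
demo = pd.DataFrame({
    'landing_page': ['old'] * 4 + ['new'] * 4,
    'time_spent_on_the_page': [4.0, 5.0, 4.5, 5.5, 6.0, 6.5, 5.5, 7.0],
})

g = demo.groupby('landing_page')['time_spent_on_the_page']
means = g.mean()
diff = means['new'] - means['old']          # observed mean difference

# Cohen's d with a pooled standard deviation
n_old, n_new = g.size()['old'], g.size()['new']
pooled = np.sqrt(((n_old - 1) * g.var()['old'] + (n_new - 1) * g.var()['new'])
                 / (n_old + n_new - 2))
d = diff / pooled
print(round(diff, 2), round(d, 2))
```

A significant p-value tells us the difference is unlikely to be chance; an effect size like this tells us whether it is large enough to matter for the business.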
# ### Perform Visual Analysis # In[ ]: # kde of time for old and new layouts sns.displot(data=usage,x=time,hue='landing_page',kind='kde') plt.title('KDE of Time Spent on Landing Page', fontsize=16) plt.xlabel('Time (min)'); # We have two time variables now: time for the new layout and time for the old layout. From the Kernel Density Estimate, we find that both are approximately normally distributed. Moreover, the peak of the new layout data is higher and further to the right, indicating a high concentration of data above the mean of the old layout data. # ### Step 1: Define the null and alternate hypotheses # To test if the new layout does indeed make a statistically significant difference, we must assume at the outset that there is no population-level difference. That is, our null hypothesis is that the mean time for the old layout ($\mu_{\text{old}}$) is the same as the mean time for the new layout ($\mu_{\text{new}}$): # $$ H_0: \mu_{\text{old}}=\mu_{\text{new}}.$$ # The alternative hypothesis then addresses Question 1 above, namely, do users really spend more time on News Source now that the layout has been redesigned? We formulate this as: # $$ H_a: \mu_{\text{old}}<\mu_{\text{new}}.$$ # ### Step 2: Select Appropriate test # Since the samples are independent, simply randomly sampled, and time is approximately normally distributed (as the visual analysis above suggests), the two-sample independent t-test is the best test to study the effect (if any) of the layout change. Since we will not assume equal population variances, we use Welch's version of the test. # ### Step 3: Decide the significance level # Per the problem statement, we will be assuming a significance level of 5%. # ### Step 4: Collect and prepare data # In[ ]: # separate out the time data old_time=usage[usage['landing_page']=='old'][time] new_time=usage[usage['landing_page']=='new'][time] # We create two pandas Series with the time data for the old layout and the new layout. 
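As a sanity check on the normality assumption from Step 2, one could run a Shapiro-Wilk test on each series before computing the t-test p-value. A sketch with simulated stand-ins for `old_time` and `new_time` (the real Series live in the notebook session):

```python
import numpy as np
import scipy.stats as st

# simulated stand-ins for old_time / new_time (illustrative, not the real data)
rng = np.random.default_rng(42)
old_demo = rng.normal(4.5, 2.0, 50)
new_demo = rng.normal(6.2, 2.0, 50)

# Shapiro-Wilk: H0 = the sample was drawn from a normal distribution
for name, sample in [('old', old_demo), ('new', new_demo)]:
    stat, p = st.shapiro(sample)
    verdict = 'no evidence against normality' if p > 0.05 else 'normality rejected'
    print(name, 'p =', round(p, 3), '->', verdict)
```

Failing to reject here does not prove normality, but it supports using the t-test; with n = 50 per group, the test is also fairly robust to mild departures.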
# ### Step 5: Calculate the p-value # In[ ]: # apply the t-test test_stat, p_val=st.ttest_ind(old_time, new_time, equal_var=False, alternative='less') print('The p-value is',p_val) # ### Step 6: Compare the p-value with $\alpha$ # In[ ]: p_val<0.05 # In[ ]: 0.05/p_val # The p-value is indeed less than the level of significance $\alpha$. In fact, the p-value is about 359 times smaller than $\alpha$! # ### Step 7: Draw inference # As the p-value is less than the level of significance, we reject the null hypothesis in favor of the alternative. Thus, the new layout does keep users on News Source for longer than the old layout. # **A similar approach can be followed to answer the other questions.** # ## 2. Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page? # In[ ]: # numerical comparison usage.groupby('landing_page')['converted'].value_counts(sort=False) # A first look at our data shows a greater conversion rate for the new layout: 66% versus 42%. # ### Perform Visual Analysis # In[ ]: # visual analysis # prepare figure plt.figure(figsize=(8,5)) plt.title('Ratios of Landing Page Conversions',fontsize=16) # countplot sns.countplot(data=usage,x='landing_page',hue='converted') plt.xlabel('Landing Page Design') plt.ylabel('Count'); # Indeed, the contrast noted numerically above is illustrated by the count plot. The orange bar shows converted users. The new landing page has about double the converted users as those that weren't converted, while the old landing page converted less than half the users. # ### Step 1: Define the null and alternate hypotheses # Since we are interested in whether the new layout had any effect on the conversion rate, we must assume at the outset that there is no difference. 
That is, we assume the conversion rate for users shown the new page ($r_{\text{new}}$) is equal to the conversion rate for the old page ($r_{\text{old}}$): # $$ H_0: r_{\text{new}}=r_{\text{old}}.$$ # The alternate hypothesis addresses our inquiry, namely that the new rate is greater than the old rate: # $$ H_a: r_{\text{new}}>r_{\text{old}}.$$ # ### Step 2: Select appropriate test # We will use the two-proportion z-test. Conversion is a binary (converted / not converted) outcome, and the two samples are independent and simply randomly sampled. The last condition to satisfy is the sample size inequalities: $nr\geq10$ and $n(1-r)\geq10$. # * Old: $nr_{\text{old}}=21$, $n(1-r_{\text{old}})=29$. # * New: $nr_{\text{new}}=33$, $n(1-r_{\text{new}})=17$. # # Since all four values are greater than 10, we can confidently use this test. # ### Step 3: Decide the significance level # Again, the problem statement fixes the level of significance at $\alpha=0.05$. # ### Step 4: Collect and prepare data # In[ ]: # prepare the conversion data yes=np.array([21,33]) size=np.array([50,50]) # ### Step 5: Calculate the p-value # In[ ]: # two proportion independent z-test # import the required function from statsmodels from statsmodels.stats.proportion import proportions_ztest # test test_stat,p_val = proportions_ztest(yes,size) print('The p-value is',p_val) # ### Step 6: Compare the p-value with $\alpha$ # In[ ]: p_val<0.05 # The p-value is less than the level of significance. # ### Step 7: Draw inference # Since the p-value is less than $\alpha$, we reject the null hypothesis. Therefore the population conversion rates are genuinely different. In particular, the conversion rate for users served the new landing page is greater than the rate for the old page. Put another way, the News Source redesign has had a positive impact on subscriptions. # ## 3. Are conversion and preferred language independent or related? 
# In[ ]: # contingency table WITH margins (totals) pd.crosstab(usage['language_preferred'], usage['converted'], margins=True, margins_name='Totals') # Here is a contingency table of the data with totals in the final row and column. # ### Perform visual analysis # In[ ]: # figure setup plt.figure(figsize=(8,5)) plt.title('Heatmap of Contingency Table',fontsize=16) # heatmap sns.heatmap(pd.crosstab(usage['language_preferred'], usage['converted']), annot=True, vmin=0, vmax=25) plt.xlabel('Converted') plt.ylabel('Language'); # Above is a heatmap of the contingency table data. A darker color represents fewer observations in the data. The highest value (lightest color) is 21 and the lowest value (darkest color) is 11, both associated with English data. # ### Step 1: Define the null and alternate hypotheses # We are interested in whether conversion status and language preference are related variables. We will assume at the outset that they are not: # $$ H_0: \text{conversion status and language are independent}.$$ # # Otherwise, they must be related: # $$ H_a: \text{conversion status depends on language}.$$ # ### Step 2: Select appropriate test # We are studying the connection of two randomly sampled categorical variables. Since the expected frequency in each cell of the contingency table is at least 5, the best test is the chi-squared test of independence. # ### Step 3: Decide the significance level # We assume a significance level of $\alpha=0.05$. # ### Step 4: Collect and prepare data # In[ ]: # generate the contingency table contingency=pd.crosstab(usage['language_preferred'],usage['converted']) # ### Step 5: Calculate the p-value # In[ ]: # chi squared test for independence chi2_stat, p_val, dof, expected = st.chi2_contingency(contingency) print('The p-value is',p_val) # ### Step 6: Compare the p-value with $\alpha$ # In[ ]: p_val<0.05 # The p-value is greater than $\alpha$. # ### Step 7: Draw inference # Since the p-value is greater than $\alpha$, we fail to reject the null hypothesis. 
Therefore we cannot conclude that conversion status and language preference are related. # ## 4. Is the time spent on the new page the same for the different language users? # In[ ]: usage.groupby(['landing_page','language_preferred'])[time].mean()['new'] # There is a difference, albeit small, between the sample mean times for the three languages on the new News Source landing page. # ### Perform visual analysis # In[ ]: # figure setup plt.figure(figsize=(8,5)) plt.title('Time on New Landing Page by Language',fontsize=16) # boxplot sns.boxplot(data=usage.loc[usage['landing_page']=='new'], x=time, y='language_preferred') plt.xlabel('Time (min)') plt.ylabel('Language'); # In[ ]: # kde of time for different languages sns.displot(data=usage.loc[usage['landing_page']=='new'], x=time, hue='language_preferred', kind='kde') plt.title('Time on New Landing Page by Language',fontsize=16) plt.xlabel('Time (min)'); # There is some visual separation of the time data for the three languages. What is striking, though, is the contrast between the boxplot and the KDE plot: while the medians (in the boxplot) are quite variable, the means (in the KDE plot) are more concentrated. The boxplot also makes the data look far more variable, with differing ranges and outliers. # ### Step 1: Define the null and alternate hypotheses # We suppose at the outset that there is no population-level difference in the time spent on the new landing page for all three languages: # $$ H_0: \mu_E=\mu_F=\mu_S,$$ # where the quantities are the mean time for English, French, and Spanish respectively. 
The alternate hypothesis is that there is some difference: # $$ H_a: \text{at least one mean differs}.$$ # ### Step 2: Collect and prepare data # In[ ]: # define a new DataFrame for just new landing page data usage_new=usage.loc[usage['landing_page']=='new'] # define pandas Series of time for each language en_time=usage_new[usage_new['language_preferred']=='English'][time] fr_time=usage_new[usage_new['language_preferred']=='French'][time] sp_time=usage_new[usage_new['language_preferred']=='Spanish'][time] # In this case, as we will need these collected and prepared data to run the Shapiro-Wilk test and the Levene test, we complete this step first. # ### Step 3: Select appropriate test # In the case of comparing more than two means, we use the one-way ANOVA test. To invoke this test, however, a number of assumptions must be satisfied: # * independent simple random samples, # * normally distributed populations, and # * common population variances. # # The first assumption is satisfied since the entire data set was simply randomly sampled and users are independent. # # For the second assumption, we use the Shapiro-Wilk test. We assume a level of significance of 5% for this test. Our hypotheses are: # $$ H_0: \text{time data for the new landing page is normally distributed};$$ # $$ H_a: \text{time data for the new landing page is NOT normally distributed}.$$ # In[ ]: # shapiro-wilk test a, p_val=st.shapiro(usage_new[time]) print('The p-value is',p_val) # With a p-value of 0.8, we fail to reject the null hypothesis. Thus, we may treat the time data for the new News Source landing page as normally distributed. # # Now we must test that the populations (different languages) have equal variances. We accomplish this with the Levene test. 
Our hypotheses: # $$ H_0: \text{all three languages have equal variance};$$ # $$ H_a: \text{at least one population has a different variance}.$$ # In[ ]: # levene test a, p_val=st.levene(en_time,fr_time,sp_time) print('The p-value is',p_val) # With a p-value of 0.47, we fail to reject the null hypothesis, and can therefore assume that the time data for all three languages have equal variance. # ### Step 4: Decide the significance level # From the problem statement, we assume a significance level of $\alpha=0.05$. # ### Step 5: Calculate the p-value # In[ ]: # one-way anova test a, p_val=st.f_oneway(en_time,fr_time,sp_time) print('The p-value is',p_val) # ### Step 6: Compare the p-value with $\alpha$ # In[ ]: p_val<0.05 # Our p-value of 0.43 is greater than $\alpha$. # ### Step 7: Draw inference # We fail to reject the null hypothesis. We do not have strong evidence to conclude there is a difference between the time spent on the new landing page for the different language users. # ## Conclusion and Business Recommendations # * News Source users shown the new landing page stay on the site for longer. # * The redesign had the intended effect: The new landing page has a higher conversion rate for subscribers than the old page. # * We found no evidence that language preference influences conversion rate. # * The time spent on the new page is likely the same for all three languages. More precisely, our evidence is not strong enough to conclude that different languages lead to (statistically significant) different times. # * I recommend News Source conduct additional testing to quantify the relationship between time on landing page and conversion rate. What percent increase in conversion rate is observed for each additional minute a user spends on the landing page? # * News Source could do an analysis of the types of stories that get the greatest conversion rate. Some genres could include breaking news, gossip, opinion pieces, or politics. 
If the data science team determined that, for example, political stories garnered more subscribers, News Source could focus more on these stories. This facet could even differ by language or region of users. # * News Source could next look at subscription pricing. There is likely an optimal price: high enough to make a profit but low enough to encourage the greatest number of users to subscribe. An analysis could be performed, based on survey data, to determine the ideal subscription price. # * The data science team could offer a sign-up discount for the first month of subscriptions and measure whether this leads to an increase in the conversion rate. This could be accomplished through A/B testing, where some users are shown a pop-up offering the discount if they sign up now, while others do not get this pop-up. Further, more detailed analyses could examine what percent discount is optimal. # ___