In this project, let's analyze a small business's historical marketing promotion data. Each row corresponds to an independent marketing promotion in which the business used TV, social media, radio, and influencer promotions to increase sales. The company is now trying to identify which variables can help it target its marketing efforts.
We will conduct a multiple linear regression analysis to estimate sales from a combination of independent variables.
Import relevant Python libraries and modules.
# Import libraries and modules.
### YOUR CODE HERE ###
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols
Load the dataset `marketing_sales_data.csv` as `data` and display the first five rows. The variables in the dataset have been adjusted to suit the objectives of this lab.
# Load the data.
### YOUR CODE HERE ###
data = pd.read_csv('marketing_sales_data.csv')
# Display the first five rows.
### YOUR CODE HERE ###
data.head()
| | TV | Radio | Social Media | Influencer | Sales |
|---|---|---|---|---|---|
| 0 | Low | 3.518070 | 2.293790 | Micro | 55.261284 |
| 1 | Low | 7.756876 | 2.572287 | Mega | 67.574904 |
| 2 | High | 20.348988 | 1.227180 | Micro | 272.250108 |
| 3 | Medium | 20.108487 | 2.728374 | Mega | 195.102176 |
| 4 | High | 31.653200 | 7.776978 | Nano | 273.960377 |
The features in the data are `TV`, `Radio`, `Social Media`, `Influencer`, and `Sales`.
Create a pairplot to visualize the relationships between the continuous variables in `data`.
# Create a pairplot of the data.
### YOUR CODE HERE ###
sns.pairplot(data);
Observation:

* `Radio` and `Social Media` both appear to have linear relationships with `Sales`. Given this, `Radio` and `Social Media` may be useful as independent variables in a multiple linear regression model estimating `Sales`.
* `TV` and `Influencer` are excluded from the pairplot because they are not numeric.
There are two categorical variables: `TV` and `Influencer`. To characterize the relationship between the categorical variables and `Sales`, find the mean `Sales` for each category in `TV` and the mean `Sales` for each category in `Influencer`.
# Calculate the mean sales for each TV category.
### YOUR CODE HERE ###
print(data.groupby('TV')['Sales'].mean())
print('')
# Calculate the mean sales for each Influencer category.
### YOUR CODE HERE ###
print(data.groupby('Influencer')['Sales'].mean())
TV
High      300.853195
Low        90.984101
Medium    195.358032
Name: Sales, dtype: float64

Influencer
Macro    181.670070
Mega     194.487941
Micro    188.321846
Nano     191.874432
Name: Sales, dtype: float64
Observation:

* The average `Sales` for `High` `TV` promotions is considerably higher than for `Medium` and `Low` `TV` promotions. `TV` may be a strong predictor of `Sales`.
* The categories for `Influencer` have different average `Sales`, but the variation is not substantial. `Influencer` may be a weak predictor of `Sales`.
These results can be investigated further when fitting the multiple linear regression model.
This dataset contains rows with missing values. To correct this, drop all rows that contain missing data.
# Drop rows that contain missing data and update the DataFrame.
### YOUR CODE HERE ###
data = data.dropna(axis=0)
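Before dropping rows, it can help to confirm how many values are actually missing in each column. A minimal sketch on a hypothetical miniature DataFrame (the column names mirror this dataset, but the values are made up):

```python
import pandas as pd
import numpy as np

# Hypothetical miniature version of the marketing data with some gaps.
demo = pd.DataFrame({
    'TV': ['Low', 'High', None, 'Medium'],
    'Radio': [3.5, 20.3, 15.1, np.nan],
    'Sales': [55.3, 272.3, 180.0, 195.1],
})

# Count missing values per column before dropping anything.
print(demo.isna().sum())

# Drop every row that contains at least one missing value.
cleaned = demo.dropna(axis=0)
print(len(cleaned))  # only the fully populated rows remain
```

Checking `isna().sum()` first makes it clear how much data `dropna()` will discard.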
The `ols()` function doesn't run when variable names contain a space. Check that the column names in `data` do not contain spaces and fix them, if needed.
# Rename all columns in data that contain a space.
### YOUR CODE HERE ###
data = data.rename(columns={'Social Media': 'Social_Media'})
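Renaming the one offending column works here; if several columns contained spaces, a vectorized replacement over all column names is a common alternative (a sketch on a hypothetical frame, not part of the lab solution):

```python
import pandas as pd

# Hypothetical frame whose column names mirror this dataset.
demo = pd.DataFrame(columns=['Social Media', 'TV', 'Sales'])

# Replace spaces in every column name with underscores so ols() formulas parse.
demo.columns = demo.columns.str.replace(' ', '_')
print(list(demo.columns))  # ['Social_Media', 'TV', 'Sales']
```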
Fit a multiple linear regression model that predicts `Sales` using two or more independent variables from `data`; here, use `TV` and `Radio` as the independent variables.
# Define the OLS formula.
### YOUR CODE HERE ###
ols_formula = 'Sales ~ C(TV) + Radio'
# Create an OLS model.
### YOUR CODE HERE ###
OLS = ols(formula = ols_formula, data = data)
# Fit the model.
### YOUR CODE HERE ###
model = OLS.fit()
# Save the results summary.
### YOUR CODE HERE ###
model_results = model.summary()
# Display the model results.
### YOUR CODE HERE ###
model_results
| Dep. Variable: | Sales | R-squared: | 0.904 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.904 |
| Method: | Least Squares | F-statistic: | 1783. |
| Date: | Fri, 20 Oct 2023 | Prob (F-statistic): | 1.63e-288 |
| Time: | 10:59:04 | Log-Likelihood: | -2714.0 |
| No. Observations: | 572 | AIC: | 5436. |
| Df Residuals: | 568 | BIC: | 5453. |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 218.5261 | 6.261 | 34.902 | 0.000 | 206.228 | 230.824 |
| C(TV)[T.Low] | -154.2971 | 4.929 | -31.303 | 0.000 | -163.979 | -144.616 |
| C(TV)[T.Medium] | -75.3120 | 3.624 | -20.780 | 0.000 | -82.431 | -68.193 |
| Radio | 2.9669 | 0.212 | 14.015 | 0.000 | 2.551 | 3.383 |

| Omnibus: | 61.244 | Durbin-Watson: | 1.870 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 18.077 |
| Skew: | 0.046 | Prob(JB): | 0.000119 |
| Kurtosis: | 2.134 | Cond. No. | 142. |
Question: Which independent variables did you choose for the model, and why?
* `TV` was selected, as the preceding analysis showed a strong relationship between the `TV` promotional budget and the average `Sales`.
* `Radio` was selected because the pairplot showed a strong linear relationship between `Radio` and `Sales`.
* `Social Media` was not selected because it did not increase model performance and it was later determined to be correlated with another independent variable: `Radio`.
* `Influencer` was not selected because it did not show a strong relationship to `Sales` in the preceding analysis.

For multiple linear regression, there is an additional assumption added to the four simple linear regression assumptions: no multicollinearity.
Check that all five multiple linear regression assumptions are upheld for your model.
Create scatterplots comparing the continuous independent variable(s) selected previously with `Sales` to check the linearity assumption. You can either use the pairplot created earlier or create new scatterplots comparing the variables of interest.
# Create a scatterplot for each independent variable and the dependent variable.
### YOUR CODE HERE ###
# Create a 1x2 plot figure.
fig, axes = plt.subplots(1, 2, figsize = (8,4))
# Create a scatterplot between Radio and Sales.
sns.scatterplot(x = data['Radio'], y = data['Sales'],ax=axes[0])
# Set the title of the first plot.
axes[0].set_title("Radio and Sales")
# Create a scatterplot between Social Media and Sales.
sns.scatterplot(x = data['Social_Media'], y = data['Sales'],ax=axes[1])
# Set the title of the second plot.
axes[1].set_title("Social Media and Sales")
# Set the xlabel of the second plot.
axes[1].set_xlabel("Social Media")
# Use matplotlib's tight_layout() function to add space between plots for a cleaner appearance.
plt.tight_layout()
Question: Is the linearity assumption met?
The linearity assumption holds for `Radio`, as there is a clear linear relationship in the scatterplot between `Radio` and `Sales`. `Social Media` was not included in the preceding multiple linear regression model, but it does appear to have a linear relationship with `Sales`.
The independent observation assumption states that each observation in the dataset is independent. As each marketing promotion (i.e., row) is independent from one another, the independence assumption is not violated.
Create the following plots to check the normality assumption:

* A histogram of the residuals
* A Q-Q plot of the residuals
# Calculate the residuals.
### YOUR CODE HERE ###
residuals = model.resid
# Create a 1x2 plot figure.
fig, axes = plt.subplots(1, 2, figsize = (8,4))
# Create a histogram with the residuals.
### YOUR CODE HERE ###
sns.histplot(residuals, ax=axes[0])
# Set the x label of the residual plot.
axes[0].set_xlabel("Residual Value")
# Set the title of the residual plot.
axes[0].set_title("Histogram of Residuals")
# Create a Q-Q plot of the residuals.
### YOUR CODE HERE ###
sm.qqplot(residuals, line='s',ax = axes[1])
# Set the title of the Q-Q plot.
axes[1].set_title("Normal QQ Plot")
# Use matplotlib's tight_layout() function to add space between plots for a cleaner appearance.
plt.tight_layout()
# Show the plot.
plt.show()
Question: Is the normality assumption met?
The histogram shows that the residuals are approximately normally distributed, which supports that the normality assumption is met for this model. The points in the Q-Q plot form a straight line, further supporting that this assumption is met.
Check that the constant variance assumption is not violated by creating a scatterplot with the fitted values and residuals. Add a line at $y = 0$ to visualize the variance of residuals above and below $y = 0$.
# Create a scatterplot with the fitted values from the model and the residuals.
### YOUR CODE HERE ###
fig = sns.scatterplot(x = model.fittedvalues, y = model.resid)
# Set the x axis label.
fig.set_xlabel("Fitted Values")
# Set the y axis label.
fig.set_ylabel("Residuals")
# Set the title.
fig.set_title("Fitted Values v. Residuals")
# Add a line at y = 0 to visualize the variance of residuals above and below 0.
### YOUR CODE HERE ###
fig.axhline(0)
# Show the plot.
plt.show()
Question: Is the constant variance assumption met?
The fitted values fall into three groups because the categorical variable `TV` dominates this model; that is, `TV` is the largest factor in determining `Sales`.

However, the variance of the residuals is similarly distributed across the fitted values, supporting that the constant variance assumption is met.
The no multicollinearity assumption states that no two independent variables ($X_i$ and $X_j$) can be highly correlated with each other.
Two common ways to check for multicollinearity are to:

* Create scatterplots (e.g., a pairplot) to show the relationships between pairs of independent variables
* Calculate the variance inflation factor (VIF) for each independent variable
Use one of these two methods to check your model's no multicollinearity assumption.
# Create a pairplot of the data.
### YOUR CODE HERE ###
sns.pairplot(data);
# Calculate the variance inflation factor (optional).
### YOUR CODE HERE ###
# Import variance_inflation_factor from statsmodels.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Create a subset of the data with the continuous independent variables.
X = data[['Radio','Social_Media']]
# Calculate the variance inflation factor for each variable.
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
# Create a DataFrame with the VIF results for the column names in X.
df_vif = pd.DataFrame(vif, index=X.columns, columns = ['VIF'])
# Display the VIF results.
df_vif
| | VIF |
|---|---|
| Radio | 5.170922 |
| Social_Media | 5.170922 |
Question: Is the no multicollinearity assumption met?
The preceding model only has one continuous independent variable, meaning there are no multicollinearity issues.

If a model used both `Radio` and `Social_Media` as predictors, the moderate linear relationship between `Radio` and `Social_Media` would violate the no multicollinearity assumption. Furthermore, the variance inflation factor when both `Radio` and `Social_Media` are included in the model is 5.17 for each variable, indicating high multicollinearity.
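VIF values this size correspond to a strong pairwise correlation between the two predictors. On a small synthetic example (values are made up purely for illustration), a correlation close to 1 signals the same collinearity problem:

```python
import pandas as pd

# Two synthetic predictors where y is nearly a linear function of x.
X = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                  'y': [2.1, 3.9, 6.0, 8.1, 9.9]})

# Pearson correlation between the two candidate predictors.
corr = X.corr().loc['x', 'y']
print(corr)  # close to 1, i.e. the predictors are nearly collinear
```

A quick `corr()` check like this is a lighter-weight first pass than computing VIFs, though VIF generalizes better to more than two predictors.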
# Display the model results summary.
### YOUR CODE HERE ###
model_results
| Dep. Variable: | Sales | R-squared: | 0.904 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.904 |
| Method: | Least Squares | F-statistic: | 1783. |
| Date: | Fri, 20 Oct 2023 | Prob (F-statistic): | 1.63e-288 |
| Time: | 10:59:04 | Log-Likelihood: | -2714.0 |
| No. Observations: | 572 | AIC: | 5436. |
| Df Residuals: | 568 | BIC: | 5453. |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 218.5261 | 6.261 | 34.902 | 0.000 | 206.228 | 230.824 |
| C(TV)[T.Low] | -154.2971 | 4.929 | -31.303 | 0.000 | -163.979 | -144.616 |
| C(TV)[T.Medium] | -75.3120 | 3.624 | -20.780 | 0.000 | -82.431 | -68.193 |
| Radio | 2.9669 | 0.212 | 14.015 | 0.000 | 2.551 | 3.383 |

| Omnibus: | 61.244 | Durbin-Watson: | 1.870 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 18.077 |
| Skew: | 0.046 | Prob(JB): | 0.000119 |
| Kurtosis: | 2.134 | Cond. No. | 142. |
Question: What is the interpretation of the model's R-squared?
Using `TV` and `Radio` as the independent variables results in a multiple linear regression model with $R^{2} = 0.904$. In other words, the model explains $90.4\%$ of the variation in `Sales`. This makes the model an excellent predictor of `Sales`.
With the model fit evaluated, you can look at the coefficient estimates and the uncertainty of these estimates.
Again, display the OLS regression results.
# Display the model results summary.
### YOUR CODE HERE ###
model_results
| Dep. Variable: | Sales | R-squared: | 0.904 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.904 |
| Method: | Least Squares | F-statistic: | 1783. |
| Date: | Thu, 18 May 2023 | Prob (F-statistic): | 1.63e-288 |
| Time: | 21:06:21 | Log-Likelihood: | -2714.0 |
| No. Observations: | 572 | AIC: | 5436. |
| Df Residuals: | 568 | BIC: | 5453. |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 218.5261 | 6.261 | 34.902 | 0.000 | 206.228 | 230.824 |
| C(TV)[T.Low] | -154.2971 | 4.929 | -31.303 | 0.000 | -163.979 | -144.616 |
| C(TV)[T.Medium] | -75.3120 | 3.624 | -20.780 | 0.000 | -82.431 | -68.193 |
| Radio | 2.9669 | 0.212 | 14.015 | 0.000 | 2.551 | 3.383 |

| Omnibus: | 61.244 | Durbin-Watson: | 1.870 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 18.077 |
| Skew: | 0.046 | Prob(JB): | 0.000119 |
| Kurtosis: | 2.134 | Cond. No. | 142. |
Question: What are the model coefficients?
When `TV` and `Radio` are used to predict `Sales`, the model coefficients are:

* $\beta_{0} = 218.5261$
* $\beta_{TVLow} = -154.2971$
* $\beta_{TVMedium} = -75.3120$
* $\beta_{Radio} = 2.9669$
Question: How would you write the relationship between `Sales` and the independent variables as a linear equation?
$\text{Sales} = \beta_{0} + \beta_{1}*X_{1}+ \beta_{2}*X_{2}+ \beta_{3}*X_{3}$
$\text{Sales} = \beta_{0} + \beta_{TVLow}*X_{TVLow}+ \beta_{TVMedium}*X_{TVMedium}+ \beta_{Radio}*X_{Radio}$
$\text{Sales} = 218.5261 - 154.2971*X_{TVLow} - 75.3120*X_{TVMedium}+ 2.9669 *X_{Radio}$
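Plugging values into the fitted equation gives a point prediction. As a sketch, the estimated `Sales` for a `Medium` TV promotion with a `Radio` budget of 10 (the input values here are illustrative, not from the dataset):

```python
# Coefficients taken from the fitted model summary above.
intercept   = 218.5261
b_tv_low    = -154.2971
b_tv_medium = -75.3120
b_radio     = 2.9669

# Medium TV promotion: the Medium indicator is 1, the Low indicator is 0.
# High TV is the reference category, so it has no indicator term.
sales = intercept + b_tv_low * 0 + b_tv_medium * 1 + b_radio * 10
print(round(sales, 4))  # 172.8831
```

Working through one prediction by hand like this is a useful sanity check that the indicator coding (with `High` as the reference level) is understood correctly.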
Question: What is your interpretation of the coefficient estimates? Are the coefficients statistically significant?
The default `TV` category for the model is `High`, since there are coefficients for the other two `TV` categories, `Medium` and `Low`. Because the coefficients for the `Medium` and `Low` `TV` categories are negative, the average `Sales` is lower for the `Medium` or `Low` `TV` categories than for the `High` `TV` category when `Radio` is at the same level.

For example, the model predicts that a `Low` `TV` promotion is 154.2971 lower on average than a `High` `TV` promotion given the same `Radio` promotion.

The coefficient for `Radio` is positive, confirming the positive linear relationship shown earlier during the exploratory data analysis.
The p-value for all coefficients is $0.000$, meaning all coefficients are statistically significant at $p=0.05$. The 95% confidence intervals for each coefficient should be reported when presenting results to stakeholders.
For example, there is a $95\%$ chance that the interval $[-163.979, -144.616]$ contains the true value of $\beta_{TVLow}$, which is the estimated difference in promotion sales when a `Low` `TV` promotion is chosen instead of a `High` `TV` promotion.
Question: Why is it important to interpret the beta coefficients?
Beta coefficients allow you to estimate the magnitude and direction (positive or negative) of the effect of each independent variable on the dependent variable. The coefficient estimates can be converted to explainable insights, such as the connection between an increase in TV promotional budgets and sales mentioned previously.
Question: What are you interested in exploring based on your model?
Potential areas to explore include:

* Using seaborn's `regplot()` to plot the data with a best fit regression line

Question: Do you think your model could be improved? Why or why not? How?
Given how accurate `TV` was as a predictor, the model could be improved by getting a more granular view of the `TV` promotions, such as by considering more categories or the actual `TV` promotional budgets.
Furthermore, additional variables, such as the location of the marketing campaign or the time of year, could increase model accuracy.
Key Takeaways
Findings
According to the model, high TV promotional budgets result in significantly more sales than medium and low TV promotional budgets. For example, the model predicts that a `Low` `TV` promotion is 154.2971 lower on average than a `High` `TV` promotion given the same `Radio` promotion.
The coefficient for radio is positive, confirming the positive linear relationship shown earlier during the exploratory data analysis.
The p-value for all coefficients is $0.000$, meaning all coefficients are statistically significant at $p=0.05$. The 95% confidence intervals for each coefficient should be reported when presenting results to stakeholders.
For example, there is a $95\%$ chance the interval $[-163.979,-144.616]$ contains the true parameter of the slope of $\beta_{TVLow}$, which is the estimated difference in promotion sales when a low TV promotional budget is chosen instead of a high TV promotion budget.
Recommendations
High TV promotional budgets have a substantial positive influence on sales. The model estimates that switching from a high to medium TV promotional budget reduces sales by $\$75.3120$ million (95% CI $[-82.431,-68.193])$, and switching from a high to low TV promotional budget reduces sales by $\$154.297$ million (95% CI $[-163.979,-144.616])$. The model also estimates that an increase of $\$1$ million in the radio promotional budget will yield a $\$2.9669$ million increase in sales (95% CI $[2.551,3.383]$).
Thus, it is recommended that the business allot a high promotional budget to TV when possible and invest in radio promotions to increase sales.
Saragih, H.S. (2020). Dummy Marketing and Sales Data.