Dataset: https://www.superdatascience.com/training/
Author: Filipa C. S. Rodrigues (filipacsrodrigues@gmail.com)
%matplotlib inline
import pandas
import csv
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
df = pandas.read_csv('50-Startups.csv')
df[:5]
| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.20 | 136897.80 | 471784.10 | New York | 192261.83 |
| 1 | 162597.70 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | California | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | California | 166187.94 |
df_y = df['Profit']
df_x = df.drop(['Profit'], axis = 1)
Create dummy variables for the categorical variable:
dummy = pandas.get_dummies(df_x['State'])
df_x = dummy.join(df_x)
df_x[:5]
| | California | New York | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 165349.20 | 136897.80 | 471784.10 | New York |
| 1 | 1 | 0 | 162597.70 | 151377.59 | 443898.53 | California |
| 2 | 1 | 0 | 153441.51 | 101145.55 | 407934.54 | California |
| 3 | 0 | 1 | 144372.41 | 118671.85 | 383199.62 | New York |
| 4 | 1 | 0 | 142107.34 | 91391.77 | 366168.42 | California |
Only one dummy variable should be kept, to avoid the "dummy variable trap":
df_x = df_x.drop(['California', 'State'], axis=1)
df_x[:5]
| | New York | R&D Spend | Administration | Marketing Spend |
|---|---|---|---|---|
| 0 | 1 | 165349.20 | 136897.80 | 471784.10 |
| 1 | 0 | 162597.70 | 151377.59 | 443898.53 |
| 2 | 0 | 153441.51 | 101145.55 | 407934.54 |
| 3 | 1 | 144372.41 | 118671.85 | 383199.62 |
| 4 | 0 | 142107.34 | 91391.77 | 366168.42 |
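As an aside, the two-step dummy handling above (build the dummies, then drop one column) can be collapsed into a single call with `get_dummies(drop_first=True)`. The miniature DataFrame below is an illustrative stand-in for the real dataset, not the actual file:

```python
import pandas

# illustrative subset of the dataset, not the real CSV
df = pandas.DataFrame({
    'State': ['New York', 'California', 'California'],
    'R&D Spend': [165349.20, 162597.70, 153441.51],
})

# drop_first=True drops the alphabetically first category ('California'),
# sidestepping the dummy variable trap in one step
dummies = pandas.get_dummies(df['State'], drop_first=True)
df_x = dummies.join(df.drop(['State'], axis=1))
print(df_x.columns.tolist())
```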
Use backward elimination to keep only the most significant features.
Steps:
1. Select a significance level (here, SL = 0.05).
2. Fit the model with all possible predictors.
3. Find the predictor with the highest p-value; if it is above SL, remove it.
4. Refit the model without that predictor and repeat step 3 until every remaining p-value is below SL.
Add a constant $b_0$ to the model:
df_x = sm.add_constant(df_x)
df_x[:2]
| | const | New York | R&D Spend | Administration | Marketing Spend |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 165349.2 | 136897.80 | 471784.10 |
| 1 | 1 | 0 | 162597.7 | 151377.59 | 443898.53 |
Create a model with all variables:
model1 = sm.OLS(df_y, df_x).fit()
print(model1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Profit   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.947
Method:                 Least Squares   F-statistic:                     218.4
Date:                Thu, 22 Dec 2016   Prob (F-statistic):           7.53e-29
Time:                        11:28:30   Log-Likelihood:                -525.25
No. Observations:                  50   AIC:                             1060.
Df Residuals:                      45   BIC:                             1070.
Df Model:                           4
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|  [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const            5.042e+04   6653.545      7.577      0.000    3.7e+04  6.38e+04
New York        -1332.0930   2690.180     -0.495      0.623  -6750.393  4086.207
R&D Spend           0.8080      0.046     17.662      0.000      0.716     0.900
Administration     -0.0236      0.052     -0.455      0.651     -0.128     0.081
Marketing Spend     0.0264      0.017      1.581      0.121     -0.007     0.060
==============================================================================
Omnibus:                       15.851   Durbin-Watson:                   1.306
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               25.631
Skew:                          -0.951   Prob(JB):                     2.72e-06
Kurtosis:                       5.947   Cond. No.                     1.41e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.41e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
As one can observe, the variable with the largest p-value is "Administration", with p = 0.651, far above the 0.05 significance level.
Let's see the relationship of this variable with the "Profit":
fig = sm.graphics.plot_fit(model1, 3)
There does not appear to be a relationship between Profit and Administration.
Let's discard this variable:
model2 = sm.OLS(df_y, df_x.drop(['Administration'], axis = 1)).fit()
print(model2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Profit   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     296.2
Date:                Thu, 22 Dec 2016   Prob (F-statistic):           4.44e-30
Time:                        11:28:34   Log-Likelihood:                -525.36
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      46   BIC:                             1066.
Df Model:                           3
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|  [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const            4.772e+04   3018.340     15.811      0.000   4.16e+04  5.38e+04
New York        -1484.6097   2646.166     -0.561      0.577  -6811.065  3841.846
R&D Spend           0.8003      0.042     18.976      0.000      0.715     0.885
Marketing Spend     0.0286      0.016      1.809      0.077     -0.003     0.060
==============================================================================
Omnibus:                       15.844   Durbin-Watson:                   1.291
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               25.957
Skew:                          -0.943   Prob(JB):                     2.31e-06
Kurtosis:                       5.984   Cond. No.                     6.73e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.73e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Now the variable with the highest p-value is "New York", at 0.577, which is much greater than the 0.05 significance level.
Let's see the relationship of this variable with the Profit:
fig = sm.graphics.plot_fit(model2, 1)
Whether a startup is in New York (1) or California (0) does not seem to affect the Profit. We can conclude that the "New York" variable brings no value to the model, so it can be discarded.
model3 = sm.OLS(df_y, df_x.drop(['Administration', 'New York'], axis = 1)).fit()
print(model3.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Profit   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     450.8
Date:                Thu, 22 Dec 2016   Prob (F-statistic):           2.16e-31
Time:                        11:28:44   Log-Likelihood:                -525.54
No. Observations:                  50   AIC:                             1057.
Df Residuals:                      47   BIC:                             1063.
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|  [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const            4.698e+04   2689.933     17.464      0.000   4.16e+04  5.24e+04
R&D Spend           0.7966      0.041     19.266      0.000      0.713     0.880
Marketing Spend     0.0299      0.016      1.927      0.060     -0.001     0.061
==============================================================================
Omnibus:                       14.677   Durbin-Watson:                   1.257
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               21.161
Skew:                          -0.939   Prob(JB):                     2.54e-05
Kurtosis:                       5.575   Cond. No.                     5.32e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.32e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Now the variable with the highest p-value is "Marketing Spend", but at 0.060 it is only marginally above the 0.05 significance level.
Let's see the relationship with "Profit":
fig = sm.graphics.plot_fit(model3, 2)
There does appear to be some relationship between the two variables.
Let's see what happens if this variable is not included in the model:
model4 = sm.OLS(df_y, df_x.drop(['Administration', 'New York', 'Marketing Spend'], axis = 1)).fit()
print(model4.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Profit   R-squared:                       0.947
Model:                            OLS   Adj. R-squared:                  0.945
Method:                 Least Squares   F-statistic:                     849.8
Date:                Thu, 22 Dec 2016   Prob (F-statistic):           3.50e-32
Time:                        11:28:56   Log-Likelihood:                -527.44
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      48   BIC:                             1063.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       4.903e+04   2537.897     19.320      0.000   4.39e+04  5.41e+04
R&D Spend      0.8543      0.029     29.151      0.000      0.795     0.913
==============================================================================
Omnibus:                       13.727   Durbin-Watson:                   1.116
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               18.536
Skew:                          -0.911   Prob(JB):                     9.44e-05
Kurtosis:                       5.361   Cond. No.                     1.65e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.65e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The Adjusted R-squared of model3, which contains both "R&D Spend" and "Marketing Spend", is 0.948.
The Adjusted R-squared of model4, which contains only "R&D Spend", is 0.945.
Since the Adjusted R-squared is higher when both variables are included, the model with both is better, so "Marketing Spend" should be kept in the model.
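This comparison can be sanity-checked with the adjusted R-squared formula. The inputs below are the rounded R-squared values reported in the summaries above, so the last digit can differ slightly from the printed Adj. R-squared:

```python
def adjusted_r2(r2, n, p):
    # n: number of observations, p: number of predictors (constant excluded)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj3 = adjusted_r2(0.950, 50, 2)  # model3: R&D Spend + Marketing Spend
adj4 = adjusted_r2(0.947, 50, 1)  # model4: R&D Spend only
print(round(adj3, 3), round(adj4, 3))
```

Unlike plain R-squared, the adjusted version penalizes each extra predictor, so it only increases when the added variable explains enough variance to justify its cost; here model3 still comes out ahead.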