Dataset: https://www.superdatascience.com/training/
Author: Filipa C. S. Rodrigues (filipacsrodrigues@gmail.com)
%matplotlib inline
import pandas
import csv
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
df = pandas.read_csv('50-Startups.csv')
df[:5]
| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.20 | 136897.80 | 471784.10 | New York | 192261.83 |
| 1 | 162597.70 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | California | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | California | 166187.94 |
df_y = df['Profit']
df_x = df.drop(['Profit'], axis = 1)
Create dummy variables for the categorical variable:
dummy = pandas.get_dummies(df_x['State'])
df_x = dummy.join(df_x)
df_x[:5]
| | California | New York | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 165349.20 | 136897.80 | 471784.10 | New York |
| 1 | 1 | 0 | 162597.70 | 151377.59 | 443898.53 | California |
| 2 | 1 | 0 | 153441.51 | 101145.55 | 407934.54 | California |
| 3 | 0 | 1 | 144372.41 | 118671.85 | 383199.62 | New York |
| 4 | 1 | 0 | 142107.34 | 91391.77 | 366168.42 | California |
Only one dummy variable should be kept, to avoid the "dummy variable trap":
df_x = df_x.drop(['California', 'State'], axis=1)
df_x[:5]
| | New York | R&D Spend | Administration | Marketing Spend |
|---|---|---|---|---|
| 0 | 1 | 165349.20 | 136897.80 | 471784.10 |
| 1 | 0 | 162597.70 | 151377.59 | 443898.53 |
| 2 | 0 | 153441.51 | 101145.55 | 407934.54 |
| 3 | 1 | 144372.41 | 118671.85 | 383199.62 |
| 4 | 0 | 142107.34 | 91391.77 | 366168.42 |
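As an aside, the two-step dummy handling above (build the dummies, then drop one column) can be collapsed into a single call with `get_dummies(drop_first=True)`. The miniature DataFrame below is an illustrative stand-in for the real dataset, not the actual file:

```python
import pandas

# illustrative subset of the dataset, not the real CSV
df = pandas.DataFrame({
    'State': ['New York', 'California', 'California'],
    'R&D Spend': [165349.20, 162597.70, 153441.51],
})

# drop_first=True drops the alphabetically first category ('California'),
# sidestepping the dummy variable trap in one step
dummies = pandas.get_dummies(df['State'], drop_first=True)
df_x = dummies.join(df.drop(['State'], axis=1))
print(df_x.columns.tolist())
```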
Use backward elimination to keep only the most significant features.
Steps:
1. Select a significance level (here, SL = 0.05).
2. Fit the model with all possible predictors.
3. Find the predictor with the highest p-value; if it is above SL, remove it.
4. Refit the model without that predictor and repeat step 3 until every remaining p-value is below SL.
Add a constant $b_0$ to the model:
df_x = sm.add_constant(df_x)
df_x[:2]
| | const | New York | R&D Spend | Administration | Marketing Spend |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 165349.2 | 136897.80 | 471784.10 |
| 1 | 1 | 0 | 162597.7 | 151377.59 | 443898.53 |
Create a model with all variables:
model1 = sm.OLS(df_y, df_x).fit()
print(model1.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Profit   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.947
Method:                 Least Squares   F-statistic:                     218.4
Date:                Thu, 22 Dec 2016   Prob (F-statistic):           7.53e-29
Time:                        11:28:30   Log-Likelihood:                -525.25
No. Observations:                  50   AIC:                             1060.
Df Residuals:                      45   BIC:                             1070.
Df Model:                           4
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|  [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const            5.042e+04   6653.545      7.577      0.000    3.7e+04  6.38e+04
New York        -1332.0930   2690.180     -0.495      0.623  -6750.393  4086.207
R&D Spend           0.8080      0.046     17.662      0.000      0.716     0.900
Administration     -0.0236      0.052     -0.455      0.651     -0.128     0.081
Marketing Spend     0.0264      0.017      1.581      0.121     -0.007     0.060
==============================================================================
Omnibus:                       15.851   Durbin-Watson:                   1.306
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               25.631
Skew:                          -0.951   Prob(JB):                     2.72e-06
Kurtosis:                       5.947   Cond. No.                     1.41e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.41e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
As one can observe, the variable with the largest p-value is "Administration", with p = 0.651, far above the 0.05 significance level.
Let's see the relationship of this variable with the "Profit":
fig = sm.graphics.plot_fit(model1, 3)
There does not appear to be a relationship between Profit and Administration.
Let's discard this variable:
model2 = sm.OLS(df_y, df_x.drop(['Administration'], axis = 1)).fit()
print(model2.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Profit   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     296.2
Date:                Thu, 22 Dec 2016   Prob (F-statistic):           4.44e-30
Time:                        11:28:34   Log-Likelihood:                -525.36
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      46   BIC:                             1066.
Df Model:                           3
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|  [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const            4.772e+04   3018.340     15.811      0.000   4.16e+04  5.38e+04
New York        -1484.6097   2646.166     -0.561      0.577  -6811.065  3841.846
R&D Spend           0.8003      0.042     18.976      0.000      0.715     0.885
Marketing Spend     0.0286      0.016      1.809      0.077     -0.003     0.060
==============================================================================
Omnibus:                       15.844   Durbin-Watson:                   1.291
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               25.957
Skew:                          -0.943   Prob(JB):                     2.31e-06
Kurtosis:                       5.984   Cond. No.                     6.73e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.73e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Now the variable with the highest p-value is "New York", at 0.577, which is much greater than the 0.05 significance level.
Let's see the relationship of this variable with the Profit:
fig = sm.graphics.plot_fit(model2, 1)
Whether a startup is in New York (1) or California (0) does not seem to affect the Profit. We can conclude that the "New York" variable brings no value to the model, so it can be discarded.
model3 = sm.OLS(df_y, df_x.drop(['Administration', 'New York'], axis = 1)).fit()
print(model3.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Profit   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     450.8
Date:                Thu, 22 Dec 2016   Prob (F-statistic):           2.16e-31
Time:                        11:28:44   Log-Likelihood:                -525.54
No. Observations:                  50   AIC:                             1057.
Df Residuals:                      47   BIC:                             1063.
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|  [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
const            4.698e+04   2689.933     17.464      0.000   4.16e+04  5.24e+04
R&D Spend           0.7966      0.041     19.266      0.000      0.713     0.880
Marketing Spend     0.0299      0.016      1.927      0.060     -0.001     0.061
==============================================================================
Omnibus:                       14.677   Durbin-Watson:                   1.257
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               21.161
Skew:                          -0.939   Prob(JB):                     2.54e-05
Kurtosis:                       5.575   Cond. No.                     5.32e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.32e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Now the variable with the highest p-value is "Marketing Spend", but at 0.060 it is only marginally above the 0.05 significance level.
Let's see the relationship with "Profit":
fig = sm.graphics.plot_fit(model3, 2)
There does appear to be some relationship between the two variables.
Let's see what happens if this variable is not included in the model:
model4 = sm.OLS(df_y, df_x.drop(['Administration', 'New York', 'Marketing Spend'], axis = 1)).fit()
print(model4.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Profit   R-squared:                       0.947
Model:                            OLS   Adj. R-squared:                  0.945
Method:                 Least Squares   F-statistic:                     849.8
Date:                Thu, 22 Dec 2016   Prob (F-statistic):           3.50e-32
Time:                        11:28:56   Log-Likelihood:                -527.44
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      48   BIC:                             1063.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|  [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       4.903e+04   2537.897     19.320      0.000   4.39e+04  5.41e+04
R&D Spend      0.8543      0.029     29.151      0.000      0.795     0.913
==============================================================================
Omnibus:                       13.727   Durbin-Watson:                   1.116
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               18.536
Skew:                          -0.911   Prob(JB):                     9.44e-05
Kurtosis:                       5.361   Cond. No.                     1.65e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.65e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
The Adjusted R-squared of model3, which contains both "R&D Spend" and "Marketing Spend", is 0.948.
The Adjusted R-squared of model4, which contains only "R&D Spend", is 0.945.
Since the Adjusted R-squared is higher when both variables are included, the model with both is better, so "Marketing Spend" should be kept in the model.
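This comparison can be sanity-checked with the adjusted R-squared formula. The inputs below are the rounded R-squared values reported in the summaries above, so the last digit can differ slightly from the printed Adj. R-squared:

```python
def adjusted_r2(r2, n, p):
    # n: number of observations, p: number of predictors (constant excluded)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj3 = adjusted_r2(0.950, 50, 2)  # model3: R&D Spend + Marketing Spend
adj4 = adjusted_r2(0.947, 50, 1)  # model4: R&D Spend only
print(round(adj3, 3), round(adj4, 3))
```

Unlike plain R-squared, the adjusted version penalizes each extra predictor, so it only increases when the added variable explains enough variance to justify its cost; here model3 still comes out ahead.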