Multiple Regression

Let's grab a data set of car values:

In [2]:

import pandas as pd

df = pd.read_excel('https://admintuts.tech/wp-content/downloads/xls/cars.xls')
df.head()

Out[2]:

	Price	Mileage	Make	Model	Trim	Type	Cylinder	Liter	Doors	Cruise	Sound	Leather
0	17314.103129	8221	Buick	Century	Sedan 4D	Sedan	6	3.1	4	1	1	1
1	17542.036083	9135	Buick	Century	Sedan 4D	Sedan	6	3.1	4	1	1	0
2	16218.847862	13196	Buick	Century	Sedan 4D	Sedan	6	3.1	4	1	1	0
3	16336.913140	16342	Buick	Century	Sedan 4D	Sedan	6	3.1	4	1	0	0
4	16339.170324	19832	Buick	Century	Sedan 4D	Sedan	6	3.1	4	1	0	1

In [5]:

%matplotlib inline
import numpy as np

df1 = df[['Mileage','Price']]
bins =  np.arange(0,50000,10000)
groups = df1.groupby(pd.cut(df1['Mileage'],bins)).mean()

print(groups.head())
groups['Price'].plot.line()

                     Mileage         Price
Mileage                                   
(0, 10000]       5588.629630  24096.714451
(10000, 20000]  15898.496183  21955.979607
(20000, 30000]  24114.407104  20278.606252
(30000, 40000]  33610.338710  19463.670267

Out[5]:

[Figure: line plot of mean Price for each Mileage bin]

We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.

Note that we leave out the make and model: they are categorical labels with no natural order, and a regression can only use them if you first convert them into numbers in a way that makes sense.
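One common way to turn a label such as the make into numbers is one-hot encoding. The notebook itself does not do this; the following is only a minimal sketch on made-up rows, using `pd.get_dummies` to create one 0/1 column per distinct make.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the car data set)
cars = pd.DataFrame({
    'Make': ['Buick', 'Chevrolet', 'Buick'],
    'Mileage': [8221, 15000, 13196],
})

# Replace 'Make' with one 0/1 indicator column per distinct make
encoded = pd.get_dummies(cars, columns=['Make'], dtype=int)
print(encoded)
```

Each indicator column could then be passed to the regression like any other numeric feature.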

Let's scale our feature data into the same range so we can easily compare the coefficients we end up with.
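As a rough sketch of what this scaling step does, assuming StandardScaler's default settings: each column has its mean subtracted and is then divided by its population standard deviation. The values below are made up purely for illustration.

```python
import numpy as np

# Toy feature matrix (made-up values): columns are Mileage, Cylinder, Doors
X_raw = np.array([
    [8000.0, 6.0, 4.0],
    [20000.0, 4.0, 2.0],
    [35000.0, 8.0, 4.0],
    [12000.0, 4.0, 4.0],
])

# Standardize column by column: subtract the mean, then divide by the
# population standard deviation (ddof=0, NumPy's default for std)
col_mean = X_raw.mean(axis=0)
col_std = X_raw.std(axis=0)
X_scaled = (X_raw - col_mean) / col_std

print(X_scaled.mean(axis=0))  # every column mean is approximately 0
print(X_scaled.std(axis=0))   # every column standard deviation is approximately 1
```

After this transformation a coefficient describes the effect of a one-standard-deviation change in its feature, which is what makes the coefficients comparable.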

In [8]:

import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()

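# Copy the feature columns so the assignment below modifies X itself,
# not a slice of df (avoids SettingWithCopyWarning)
X = df[['Mileage', 'Cylinder', 'Doors']].copy()
y = df['Price']

# Standardize each feature to zero mean and unit variance
# (.values replaces the deprecated .as_matrix())
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)

print(X)

# Note: sm.OLS does not add an intercept term on its own
est = sm.OLS(y, X).fit()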

est.summary()

      Mileage  Cylinder     Doors
0   -1.417485  0.527410  0.556279
1   -1.305902  0.527410  0.556279
2   -0.810128  0.527410  0.556279
3   -0.426058  0.527410  0.556279
4    0.000008  0.527410  0.556279
5    0.293493  0.527410  0.556279
6    0.335001  0.527410  0.556279
7    0.382369  0.527410  0.556279
8    0.511409  0.527410  0.556279
9    0.914768  0.527410  0.556279
10  -1.171368  0.527410  0.556279
11  -0.581834  0.527410  0.556279
12  -0.390532  0.527410  0.556279
13  -0.003899  0.527410  0.556279
14   0.430591  0.527410  0.556279
15   0.480156  0.527410  0.556279
16   0.509822  0.527410  0.556279
17   0.757160  0.527410  0.556279
18   1.594886  0.527410  0.556279
19   1.810849  0.527410  0.556279
20  -1.326046  0.527410  0.556279
21  -1.129860  0.527410  0.556279
22  -0.667658  0.527410  0.556279
23  -0.405792  0.527410  0.556279
24  -0.112796  0.527410  0.556279
25  -0.044552  0.527410  0.556279
26   0.190700  0.527410  0.556279
27   0.337442  0.527410  0.556279
28   0.566102  0.527410  0.556279
29   0.660837  0.527410  0.556279
..        ...       ...       ...
774 -0.161262 -0.914896  0.556279
775 -0.089234 -0.914896  0.556279
776 -0.040523 -0.914896  0.556279
777  0.002572 -0.914896  0.556279
778  0.236603 -0.914896  0.556279
779  0.249666 -0.914896  0.556279
780  0.357220 -0.914896  0.556279
781  0.365521 -0.914896  0.556279
782  0.434131 -0.914896  0.556279
783  0.517269 -0.914896  0.556279
784  0.589908 -0.914896  0.556279
785  0.599186 -0.914896  0.556279
786  0.793052 -0.914896  0.556279
787  1.033554 -0.914896  0.556279
788  1.045762 -0.914896  0.556279
789  1.205567 -0.914896  0.556279
790  1.541414 -0.914896  0.556279
791  1.561070 -0.914896  0.556279
792  1.725026 -0.914896  0.556279
793  1.851502 -0.914896  0.556279
794 -1.709871  0.527410  0.556279
795 -1.474375  0.527410  0.556279
796 -1.187849  0.527410  0.556279
797 -1.079929  0.527410  0.556279
798 -0.682430  0.527410  0.556279
799 -0.439853  0.527410  0.556279
800 -0.089966  0.527410  0.556279
801  0.079605  0.527410  0.556279
802  0.750446  0.527410  0.556279
803  1.932565  0.527410  0.556279

[804 rows x 3 columns]

Out[8]:

OLS Regression Results
Dep. Variable:	Price	R-squared (uncentered):	0.064
Model:	OLS	Adj. R-squared (uncentered):	0.060
Method:	Least Squares	F-statistic:	18.11
Date:	Sun, 01 Sep 2019	Prob (F-statistic):	2.23e-11
Time:	03:30:21	Log-Likelihood:	-9207.1
No. Observations:	804	AIC:	1.842e+04
Df Residuals:	801	BIC:	1.843e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Mileage	-1272.3412	804.623	-1.581	0.114	-2851.759	307.077
Cylinder	5587.4472	804.509	6.945	0.000	4008.252	7166.642
Doors	-1404.5513	804.275	-1.746	0.081	-2983.288	174.185

Omnibus:	157.913	Durbin-Watson:	0.008
Prob(Omnibus):	0.000	Jarque-Bera (JB):	257.529
Skew:	1.278	Prob(JB):	1.20e-56
Kurtosis:	4.074	Cond. No.	1.03

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

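The table of coefficients above gives us the values to plug into an equation of the form: B0 + B1 * Mileage + B2 * Cylinder + B3 * Doors, where the features are the scaled values. Note that sm.OLS does not add the intercept B0 unless you supply a constant column yourself, so in this fit B0 is effectively 0; that is why the summary reports an uncentered R-squared.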
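To see what the intercept term contributes, here is a minimal sketch on synthetic data (not the car data), using plain NumPy least squares rather than statsmodels. With standardized features, adding a column of ones leaves the slope coefficients unchanged and sets the intercept to the mean of the target. In statsmodels, the counterpart of the ones column would be `sm.add_constant(X)`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: two features and a target with a large baseline
X = rng.normal(size=(200, 2))
y = 20000 + 3000 * X[:, 0] - 1000 * X[:, 1] + rng.normal(scale=500, size=200)

# Standardize the features (zero mean, unit variance per column)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Least-squares fit without an intercept
coef_no_const, *_ = np.linalg.lstsq(Xs, y, rcond=None)

# Least-squares fit with an intercept: prepend a column of ones
Xc = np.column_stack([np.ones(len(Xs)), Xs])
coef_const, *_ = np.linalg.lstsq(Xc, y, rcond=None)

print("slopes without intercept:", coef_no_const)
print("intercept and slopes:   ", coef_const)
print("mean of y:              ", y.mean())
```

Without the intercept, every prediction is therefore too low by roughly the mean of the target, even though the slopes themselves are the same.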

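In this example, the coefficients make it pretty clear that the number of cylinders matters more than anything else: because the features were scaled to the same range, the coefficients are directly comparable, and Cylinder has by far the largest one and is the only feature with a p-value below 0.05.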

Could we have figured that out earlier?

In [4]:

y.groupby(df.Doors).mean()

Out[4]:

Doors
2    23807.135520
4    20580.670749
Name: Price, dtype: float64

Surprisingly, more doors do not mean a higher price: two-door cars average about 23,807 versus about 20,581 for four-door cars. (Maybe two doors implies a sports car in some cases?) So it's no surprise that the door count is a weak predictor here. This is a very small data set, however, so we can't read much meaning into it.

In [29]:

scaled = scale.transform([[20000, 8, 4]])
print(scaled)
predicted = est.predict(scaled[0])
print(predicted)

[[0.02051781 1.96971667 0.55627894]]
[10198.25991671]
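The prediction is just the scaled feature values multiplied by the fitted coefficients and summed. As a quick check, the sketch below redoes that arithmetic by hand with the numbers copied from the output and the coefficient table above; the variable names are only illustrative.

```python
import numpy as np

# Scaled values for Mileage=20000, Cylinder=8, Doors=4 (from the output above)
scaled_features = np.array([0.02051781, 1.96971667, 0.55627894])

# Coefficients for Mileage, Cylinder, Doors (from the OLS summary table)
coefficients = np.array([-1272.3412, 5587.4472, -1404.5513])

# No intercept was fitted, so the prediction is a plain dot product
predicted = float(np.dot(scaled_features, coefficients))
print(round(predicted, 2))
```

The result should agree with the value returned by est.predict up to the rounding of the printed coefficients.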