Let's grab a data set of car values:
import pandas as pd
df = pd.read_excel('https://admintuts.tech/wp-content/downloads/xls/cars.xls')
df.head()
| | Price | Mileage | Make | Model | Trim | Type | Cylinder | Liter | Doors | Cruise | Sound | Leather |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17314.103129 | 8221 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 1 |
1 | 17542.036083 | 9135 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
2 | 16218.847862 | 13196 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
3 | 16336.913140 | 16342 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 0 |
4 | 16339.170324 | 19832 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 1 |
%matplotlib inline
import numpy as np
df1 = df[['Mileage','Price']]
bins = np.arange(0,50000,10000)
groups = df1.groupby(pd.cut(df1['Mileage'],bins)).mean()
print(groups.head())
groups['Price'].plot.line()
Mileage bin | Mean Mileage | Mean Price |
---|---|---|
(0, 10000] | 5588.629630 | 24096.714451 |
(10000, 20000] | 15898.496183 | 21955.979607 |
(20000, 30000] | 24114.407104 | 20278.606252 |
(30000, 40000] | 33610.338710 | 19463.670267 |
[Figure: line plot of mean Price for each Mileage bin]
We can use pandas to split up this matrix into the feature vectors we're interested in, and the value we're trying to predict.
Note how we are avoiding the make and model; regressions don't work well with categorical values like these, which have no inherent order, unless you can convert them into numbers in some way that makes sense.
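If you did want to use make or model, one common option is one-hot encoding: each category becomes its own 0/1 column, so no artificial order is implied. Here is a minimal sketch in plain Python. The `one_hot` helper and the sample values are hypothetical and are not taken from the data set.

```python
def one_hot(values):
    """Return (categories, rows): one 0/1 indicator per category for each value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical sample of a categorical column (illustration only)
makes = ["Buick", "MakeB", "Buick", "MakeC"]
categories, encoded = one_hot(makes)
print(categories)
for make, row in zip(makes, encoded):
    print(make, row)
```

On a DataFrame column, pandas provides `pd.get_dummies` for the same purpose.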
Let's standardize our feature data (zero mean and unit variance for each column) so that all features are on the same scale and we can easily compare the coefficients we end up with.
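As a rough sketch of what `StandardScaler` does to each column: it subtracts the column mean and divides by the population standard deviation. The plain-Python version below applies that to only the five mileage values shown in `df.head()`, so the numbers will differ from those computed on the full data set. The `standardize` helper is a hypothetical name.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Mileage of the first five rows shown by df.head()
mileage = [8221, 9135, 13196, 16342, 19832]
z = standardize(mileage)
print([round(v, 3) for v in z])
# After standardizing, the mean is ~0 and the standard deviation is ~1
print(abs(fmean(z)) < 1e-9, abs(pstdev(z) - 1.0) < 1e-9)
```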
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
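# Copy the columns so the assignment below modifies X, not a view of df
X = df[['Mileage', 'Cylinder', 'Doors']].copy()
y = df['Price']
# .as_matrix() has been removed from pandas; .values returns the same array
X[['Mileage', 'Cylinder', 'Doors']] = scale.fit_transform(X[['Mileage', 'Cylinder', 'Doors']].values)
print(X)
# No constant column is added, so the model is fit without an intercept
est = sm.OLS(y, X).fit()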
est.summary()
| | Mileage | Cylinder | Doors |
---|---|---|---|
0 | -1.417485 | 0.527410 | 0.556279 |
1 | -1.305902 | 0.527410 | 0.556279 |
2 | -0.810128 | 0.527410 | 0.556279 |
3 | -0.426058 | 0.527410 | 0.556279 |
4 | 0.000008 | 0.527410 | 0.556279 |
5 | 0.293493 | 0.527410 | 0.556279 |
6 | 0.335001 | 0.527410 | 0.556279 |
7 | 0.382369 | 0.527410 | 0.556279 |
8 | 0.511409 | 0.527410 | 0.556279 |
9 | 0.914768 | 0.527410 | 0.556279 |
10 | -1.171368 | 0.527410 | 0.556279 |
11 | -0.581834 | 0.527410 | 0.556279 |
12 | -0.390532 | 0.527410 | 0.556279 |
13 | -0.003899 | 0.527410 | 0.556279 |
14 | 0.430591 | 0.527410 | 0.556279 |
15 | 0.480156 | 0.527410 | 0.556279 |
16 | 0.509822 | 0.527410 | 0.556279 |
17 | 0.757160 | 0.527410 | 0.556279 |
18 | 1.594886 | 0.527410 | 0.556279 |
19 | 1.810849 | 0.527410 | 0.556279 |
20 | -1.326046 | 0.527410 | 0.556279 |
21 | -1.129860 | 0.527410 | 0.556279 |
22 | -0.667658 | 0.527410 | 0.556279 |
23 | -0.405792 | 0.527410 | 0.556279 |
24 | -0.112796 | 0.527410 | 0.556279 |
25 | -0.044552 | 0.527410 | 0.556279 |
26 | 0.190700 | 0.527410 | 0.556279 |
27 | 0.337442 | 0.527410 | 0.556279 |
28 | 0.566102 | 0.527410 | 0.556279 |
29 | 0.660837 | 0.527410 | 0.556279 |
... | ... | ... | ... |
774 | -0.161262 | -0.914896 | 0.556279 |
775 | -0.089234 | -0.914896 | 0.556279 |
776 | -0.040523 | -0.914896 | 0.556279 |
777 | 0.002572 | -0.914896 | 0.556279 |
778 | 0.236603 | -0.914896 | 0.556279 |
779 | 0.249666 | -0.914896 | 0.556279 |
780 | 0.357220 | -0.914896 | 0.556279 |
781 | 0.365521 | -0.914896 | 0.556279 |
782 | 0.434131 | -0.914896 | 0.556279 |
783 | 0.517269 | -0.914896 | 0.556279 |
784 | 0.589908 | -0.914896 | 0.556279 |
785 | 0.599186 | -0.914896 | 0.556279 |
786 | 0.793052 | -0.914896 | 0.556279 |
787 | 1.033554 | -0.914896 | 0.556279 |
788 | 1.045762 | -0.914896 | 0.556279 |
789 | 1.205567 | -0.914896 | 0.556279 |
790 | 1.541414 | -0.914896 | 0.556279 |
791 | 1.561070 | -0.914896 | 0.556279 |
792 | 1.725026 | -0.914896 | 0.556279 |
793 | 1.851502 | -0.914896 | 0.556279 |
794 | -1.709871 | 0.527410 | 0.556279 |
795 | -1.474375 | 0.527410 | 0.556279 |
796 | -1.187849 | 0.527410 | 0.556279 |
797 | -1.079929 | 0.527410 | 0.556279 |
798 | -0.682430 | 0.527410 | 0.556279 |
799 | -0.439853 | 0.527410 | 0.556279 |
800 | -0.089966 | 0.527410 | 0.556279 |
801 | 0.079605 | 0.527410 | 0.556279 |
802 | 0.750446 | 0.527410 | 0.556279 |
803 | 1.932565 | 0.527410 | 0.556279 |

[804 rows x 3 columns]
Dep. Variable: | Price | R-squared (uncentered): | 0.064 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.060 |
Method: | Least Squares | F-statistic: | 18.11 |
Date: | Sun, 01 Sep 2019 | Prob (F-statistic): | 2.23e-11 |
Time: | 03:30:21 | Log-Likelihood: | -9207.1 |
No. Observations: | 804 | AIC: | 1.842e+04 |
Df Residuals: | 801 | BIC: | 1.843e+04 |
Df Model: | 3 | | |
Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Mileage | -1272.3412 | 804.623 | -1.581 | 0.114 | -2851.759 | 307.077 |
Cylinder | 5587.4472 | 804.509 | 6.945 | 0.000 | 4008.252 | 7166.642 |
Doors | -1404.5513 | 804.275 | -1.746 | 0.081 | -2983.288 | 174.185 |
Omnibus: | 157.913 | Durbin-Watson: | 0.008 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 257.529 |
Skew: | 1.278 | Prob(JB): | 1.20e-56 |
Kurtosis: | 4.074 | Cond. No. | 1.03 |
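The table of coefficients above gives us the values to plug into an equation of the form: B1 * Mileage + B2 * Cylinder + B3 * Doors, where each feature is in its scaled form. There is no B0 term here: sm.OLS does not add an intercept unless you add a constant column yourself (for example with sm.add_constant), which is also why the summary reports an uncentered R-squared.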
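One detail worth checking against the summary: it lists only three coefficients and reports an uncentered R-squared, which is what statsmodels shows when the design matrix has no constant column. The sketch below illustrates what leaving out the constant means. It is plain Python on just the five `df.head()` rows, with mileage as the only feature, so the numbers are not the model's. The variable names are hypothetical.

```python
from statistics import fmean, pstdev

# First five rows shown by df.head(): mileage and price
mileage = [8221, 9135, 13196, 16342, 19832]
price = [17314.103129, 17542.036083, 16218.847862, 16336.913140, 16339.170324]

# Standardize the single feature (zero mean, unit population std)
mu, sigma = fmean(mileage), pstdev(mileage)
z = [(m - mu) / sigma for m in mileage]
sum_zz = sum(zi * zi for zi in z)

# Least squares through the origin: price ~ b1 * z
b1_no_const = sum(zi * p for zi, p in zip(z, price)) / sum_zz

# Least squares with an intercept: price ~ b0 + b1 * z
p_bar = fmean(price)
b1 = sum(zi * (p - p_bar) for zi, p in zip(z, price)) / sum_zz
b0 = p_bar - b1 * fmean(z)

print("slope without constant:", round(b1_no_const, 2))
print("slope with constant:   ", round(b1, 2))
print("intercept b0:          ", round(b0, 2))
```

Because a standardized column has zero mean, the slope is the same either way and the intercept equals the mean price. A model fit without the constant therefore predicts values that are too low by that mean. With statsmodels, `sm.add_constant(X)` adds the constant column before fitting.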
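Because the features were standardized, the coefficient magnitudes are directly comparable. In this example, it's pretty clear that the number of cylinders matters most: its coefficient (about 5587) dwarfs those of Mileage (about -1272) and Doors (about -1405), and it is the only one with a p-value below 0.05.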
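The t column in the summary can be reproduced from the two columns before it, since each t value is the coefficient divided by its standard error. A quick check in plain Python, with the numbers copied from the table above:

```python
# (coef, std err) copied from the regression summary
summary = {
    "Mileage": (-1272.3412, 804.623),
    "Cylinder": (5587.4472, 804.509),
    "Doors": (-1404.5513, 804.275),
}

# t statistic = coefficient / standard error
t_values = {name: coef / se for name, (coef, se) in summary.items()}
for name, t in t_values.items():
    print(f"{name:<9}{t:7.3f}")
```

Only Cylinder has a t value far from zero, which matches its p-value of 0.000 in the table.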
Could we have figured that out earlier?
y.groupby(df.Doors).mean()
Doors | Mean Price |
---|---|
2 | 23807.135520 |
4 | 20580.670749 |
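Surprisingly, more doors does not mean a higher price! (Maybe two doors implies a sports car in some cases?) So it's not surprising that the door count is pretty useless as a predictor here. This is a very small data set, however, so we can't really read much meaning into it.
Finally, let's predict the price of a car with 20,000 miles, 8 cylinders, and 4 doors, scaling the inputs with the same scaler first: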
scaled = scale.transform([[20000, 8, 4]])
print(scaled)
predicted = est.predict(scaled[0])
print(predicted)
[[0.02051781 1.96971667 0.55627894]]
[10198.25991671]
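The predicted value can be checked by hand: the summary lists no constant term, so the prediction is just the sum of each coefficient times the matching scaled feature. The coefficients and scaled values below are copied from the outputs above.

```python
# Coefficients from the regression summary
coef = {"Mileage": -1272.3412, "Cylinder": 5587.4472, "Doors": -1404.5513}

# Scaled features for 20,000 miles, 8 cylinders, 4 doors (from print(scaled))
scaled = {"Mileage": 0.02051781, "Cylinder": 1.96971667, "Doors": 0.55627894}

# Per-feature contribution and their sum
contributions = {name: coef[name] * scaled[name] for name in coef}
predicted = sum(contributions.values())
for name, value in contributions.items():
    print(f"{name:<9}{value:12.2f}")
print(f"{'Total':<9}{predicted:12.2f}")
```

The total agrees with `est.predict` to within the rounding of the printed coefficients, and almost all of it comes from the Cylinder term.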