Let's look at a dataset pertaining to housing. This data contains house sale prices in King County for houses sold between May 2014 and May 2015. In addition to the price of each house, the dataset also contains 11 attributes, or features, of these houses.
In this case study, we will use Multiple Regression to predict the prices of houses given the feature information.
import pandas as pd
import numpy as np
df = pd.read_csv('housing_data.csv')
df.head()
| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | condition | grade | sqft_basement | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 3 | 7 | 0 | 47.5112 | -122.257 |
| 1 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 3 | 7 | 400 | 47.7210 | -122.319 |
| 2 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 3 | 6 | 0 | 47.7379 | -122.233 |
| 3 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 5 | 7 | 910 | 47.5208 | -122.393 |
| 4 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 3 | 8 | 0 | 47.6168 | -122.045 |
df.shape
(21613, 12)
As we see from the data, in addition to the price information we have 11 features for each of the 21613 houses. Examples of features are the number of bedrooms, the number of bathrooms, the size of the living area, the size of the lot, the condition of the property, and so on.
Let's start by using just one feature and implementing a Simple Linear Regression model.
We know that the size of the living area can be a potential indicator of the price. First, let's visualize how living-area size relates to price.
ax = df.plot.scatter(x='sqft_living', y='price', title='Price vs Living-Area size (in sqft)')
We're going to be doing a lot of regressions, so why not make a function?
from sklearn.linear_model import LinearRegression
def do_regression(X, y):
    # Fit a linear regression using the entire dataset
    lin_model = LinearRegression()
    lin_model.fit(X, y)
    # Print the intercept, coefficients, and R^2 value
    print("Intercept:", lin_model.intercept_)
    print("Coefficients:", lin_model.coef_)
    print("R-squared value:", lin_model.score(X, y))
    return lin_model
Regression using a single feature
# Fitting a simple linear regression model on the "whole" dataset
X = df[['sqft_living']]
y = df['price']
lin_model = do_regression( X, y )
Intercept: -43580.74309447361
Coefficients: [280.6235679]
R-squared value: 0.4928532179037931
This implies that, based on just the one feature "sqft_living", we have the formula
price = -43580.74 + 280.62 * (sq.ft. of living area)
i.e., each additional square foot of living area adds about $280.62 to the predicted price. However, we see that the R-squared is only 0.49, indicating that this feature alone may not be able to give us a model that fits our data well.
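As a quick sanity check, we can plug a living-area size into the printed coefficients by hand (the 2,000 sq.ft. figure below is just an illustrative value, not a house from the dataset):

```python
# Coefficients copied from the regression output above
intercept = -43580.74
slope = 280.62            # dollars per square foot of living area

sqft = 2000               # a hypothetical house size
predicted_price = intercept + slope * sqft
print(round(predicted_price, 2))   # 517659.26
```

This matches what `lin_model.predict([[2000]])` would return, up to the rounding of the copied coefficients.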
Let's visualize how well our model fit the data!
y_pred = lin_model.predict(X)
df_pred = pd.DataFrame({'sqft_living': X.sqft_living, 'orig_price': y, 'pred_price': y_pred})
ax = df_pred.plot.scatter(x='sqft_living', y='orig_price');
df_pred.plot.scatter(x='sqft_living', y='pred_price', color='red', ax=ax);
ax.set_ylabel("price");
Let's also calculate the error on the training data (a.k.a. the residuals) and see how it is distributed. Note that we are using the whole dataset for training here. We'll compute the residuals and then plot a histogram.
residuals = y_pred - y
residuals.plot.hist(bins=100, xlim=(-1e6,1e6), figsize=(10,5), title='distribution of the residuals');
Remember: by definition, least-squares regression picks the line whose residuals have zero mean and whose residual sum of squares (RSS) is as small as possible. Since the residuals have zero mean, the RSS divided by the number of data points is simply their variance, so the standard deviation of the residuals should also be as small as possible. Let's verify this:
residuals.describe()
count    2.161300e+04
mean     1.182063e-10
std      2.614468e+05
min     -4.362067e+06
25%     -1.061821e+05
50%      2.404282e+04
75%      1.474860e+05
max      1.476062e+06
Name: price, dtype: float64
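The relationship between the RSS and the standard deviation of zero-mean residuals is easy to check on synthetic numbers (a standalone sketch; it does not use the housing data):

```python
import numpy as np

rng = np.random.default_rng(0)
res = rng.normal(0, 5, size=1000)
res = res - res.mean()            # force zero mean, as least squares guarantees

rss = np.sum(res ** 2)
print(np.sqrt(rss / len(res)))    # sqrt(RSS / n) ...
print(res.std())                  # ... equals the (population) std of the residuals
```

(`pandas`'s `describe()` uses the sample std with `ddof=1`, so on 21,613 points the two values differ only negligibly.)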
Using a simple linear regression model to predict the price of houses using 'sqft_living' as a feature doesn't seem to be doing too well. Let's use more features and see if we get a better fit to our data.
We will now use more features to feed into a regression model and see if our R-squared improves.
But how do we select our features? Let's look at how the rest of our features behave when plotted against the price.
# Looking at what other features we have
df.columns
Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'condition', 'grade', 'sqft_basement', 'lat', 'long'], dtype='object')
# Price is our target variable, and the other columns are our features
target = 'price'
features = df.columns[1:]
for feat in features:
    df.plot.scatter(x=feat, y=target);
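Besides eyeballing scatter plots, one quick way to shortlist candidates is to rank features by their absolute correlation with the target. This is sketched below on a small synthetic frame (the column names and numbers are illustrative, not from the housing data); with the real data you would compute `df.corr()['price']` directly:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
demo = pd.DataFrame({'sqft_living': rng.normal(2000, 500, size=200)})
demo['price'] = 300 * demo['sqft_living'] + rng.normal(0, 50_000, size=200)
demo['bedrooms'] = rng.integers(1, 6, size=200)   # deliberately unrelated to price

# Rank features by |correlation| with the target
corr = demo.corr()['price'].drop('price').abs().sort_values(ascending=False)
print(corr)   # sqft_living should rank first here
```

Note that correlation only captures linear, one-feature-at-a-time relationships, so it is a rough screen, not a substitute for fitting the model.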
Let's try using four features
X = df[['sqft_living', 'grade', 'bathrooms', 'bedrooms']]
y = df['price']
do_regression(X,y);
Intercept: -478406.7764424477
Coefficients: [   226.6532061   96138.03615975 -26738.29100485 -39305.5485805 ]
R-squared value: 0.5433509785456612
We see that our R-squared value improved, but can we do even better? Let's use all the features and see.
X = df.iloc[:,1:]
y = df['price']
lin_model = do_regression(X,y);
Intercept: -65054852.06699576
Coefficients: [-3.44331820e+04  4.71488040e+03  2.19351888e+02  5.37560606e-02
 -2.91610129e+04  7.82429391e+05  5.42505582e+04  9.03388726e+04
 -2.00914407e+01  6.33136069e+05 -2.80646205e+05]
R-squared value: 0.6619237624051806
Let's make a histogram like we did before
y_pred = lin_model.predict(X)
residuals2 = y_pred - y
ax = residuals.plot.hist(bins=100, xlim=(-1e6,1e6), figsize=(10,5), title='distribution of the residuals', alpha=0.5);
residuals2.plot.hist(bins=100, xlim=(-1e6,1e6), figsize=(10,5), title='distribution of the residuals', ax=ax, alpha=0.5);
ax.legend(['using one feature','using all features']);
residuals2.describe()
count    2.161300e+04
mean    -7.480227e-09
std      2.134636e+05
min     -4.544841e+06
25%     -8.019286e+04
50%      1.473850e+04
75%      1.105164e+05
max      1.151944e+06
Name: price, dtype: float64
We see that in this case, using all the features gives us a better Multiple Regression fit to the data.
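One caveat: every R-squared value above is measured on the same data the model was trained on, so it can be optimistic. A common check is to hold out a test set and score the model there. Here is a minimal sketch on synthetic data (the housing CSV is not assumed to be on hand, and the exact scores will vary with the data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=500)

# Hold out 25% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("train R-squared:", model.score(X_train, y_train))
print("test R-squared: ", model.score(X_test, y_test))
```

If the test R-squared is much lower than the training R-squared, the model is overfitting; for plain linear regression with few features the two are usually close.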
Import the `housing_data.csv` and fit a linear regression model to the data using just the `sqft_living` feature and `price` as the target. Predict the price of the house when `sqft_living=2750`.
# Code provided already
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_csv('housing_data.csv')
X = df[['sqft_living']]
y = df['price']
X_test = [[2750]]
# Your code here
Import the `housing_data.csv`. First fit a Multiple Regression model with `grade`, `bathrooms`, and `bedrooms` as features and `price` as the target. Find the R-squared value of the fit using the data you trained with.
# Your code here