Let's look at a dataset pertaining to housing. This data contains house sale prices in King County for houses sold between May 2014 and May 2015. In addition to price of houses, the dataset also contains 14 attributes or features of these houses.

In this case study, we will use **Multiple Regression** to predict the prices of houses given the feature information.

In [1]:

```
import pandas as pd
import numpy as np
df = pd.read_csv('housing_data.csv')
df.head()
```

Out[1]:

price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | condition | grade | sqft_basement | lat | long | |
---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 3 | 7 | 0 | 47.5112 | -122.257 |

1 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 3 | 7 | 400 | 47.7210 | -122.319 |

2 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 3 | 6 | 0 | 47.7379 | -122.233 |

3 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 5 | 7 | 910 | 47.5208 | -122.393 |

4 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 3 | 8 | 0 | 47.6168 | -122.045 |

In [2]:

```
df.shape
```

Out[2]:

(21613, 12)

We know that size of the living area can be a potential indicator of the price. First, visualize the how the living area size can affect price.

In [3]:

```
ax = df.plot.scatter(x='sqft_living', y='price', title='Price vs Living-Area size (in sqft)')
```

We're going to be doing a lot of regressions, so why not make a function?

In [4]:

```
from sklearn.linear_model import LinearRegression
def do_regression( X, y ):
# Linear regression using the entire dataset
lin_model = LinearRegression()
lin_model.fit(X, y)
# Print the intercept, coefficients, and R^2 value
print("Intercept:", lin_model.intercept_)
print("Coefficients:", lin_model.coef_)
print("R-squared value:", lin_model.score(X, y))
return lin_model
```

Regression using a single feature

In [5]:

```
# Fitting a simple linear regression model on the "whole" dataset
X = df[['sqft_living']]
y = df['price']
lin_model = do_regression( X, y )
```

Intercept: -43580.74309447361 Coefficients: [280.6235679] R-squared value: 0.4928532179037931

This implies that based on just the one feature "sqft_living" we have the formula for
price = -43580.74 + (280.62 / sq.ft) * (sq.ft. living)

However, We see that the R-squared is 0.49 indicating this feature alone may not able to give us a model that fits our data well.

Let's visualize how well our model fit the data!

In [6]:

```
y_pred = lin_model.predict(X)
df_pred = pd.DataFrame({'sqft_living': X.sqft_living, 'orig_price': y, 'pred_price': y_pred})
ax = df_pred.plot.scatter(x='sqft_living', y='orig_price');
df_pred.plot.scatter(x='sqft_living', y='pred_price', color='red', ax=ax);
ax.set_ylabel("price");
```

In [7]:

```
residuals = y_pred - y
residuals.plot.hist(bins=100, xlim=(-1e6,1e6), figsize=(10,5), title='distribution of the residuals');
```

**Remember:** By definition, regression picks the line so that the residuals have zero-mean and have a "sum of squares", or RSS, that is as small as possible. The RSS is the square of the standard deviation. So the standard deviation of the residuals should also be as small as possible. Let's verify this:

In [8]:

```
residuals.describe()
```

Out[8]:

count 2.161300e+04 mean 1.182063e-10 std 2.614468e+05 min -4.362067e+06 25% -1.061821e+05 50% 2.404282e+04 75% 1.474860e+05 max 1.476062e+06 Name: price, dtype: float64

But how do we select our features? Let's look at how the rest of our features look like when plotted against the price.

In [9]:

```
# Looking at what other features we have
df.columns
```

Out[9]:

Index(['price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'condition', 'grade', 'sqft_basement', 'lat', 'long'], dtype='object')

In [10]:

```
# Price is our target variable, and the other columns are our features
target = 'price'
features = df.columns[1:]
for feat in features:
df.plot.scatter(x=feat, y=target);
```

Let's try using four features

In [11]:

```
X = df[['sqft_living', 'grade', 'bathrooms', 'bedrooms']]
y = df['price']
do_regression(X,y);
```

We see that our R-squared value improved, but can we do even better? Let's use all features and see

In [12]:

```
X = df.iloc[:,1:]
y = df['price']
lin_model = do_regression(X,y);
```

Let's make a histogram like we did before

In [13]:

```
y_pred = lin_model.predict(X)
residuals2 = y_pred - y
ax = residuals.plot.hist(bins=100, xlim=(-1e6,1e6), figsize=(10,5), title='distribution of the residuals', alpha=0.5);
residuals2.plot.hist(bins=100, xlim=(-1e6,1e6), figsize=(10,5), title='distribution of the residuals', ax=ax, alpha=0.5);
ax.legend(['using one feature','using all features']);
residuals2.describe()
```

Out[13]:

count 2.161300e+04 mean -7.480227e-09 std 2.134636e+05 min -4.544841e+06 25% -8.019286e+04 50% 1.473850e+04 75% 1.105164e+05 max 1.151944e+06 Name: price, dtype: float64

Import the

`housing_data.csv`

and fit a linear regression model to the data using just the`sqft_living`

feature, and the`price`

as target. Predict the price of the house when`sqft_living=2750`

.

In [ ]:

```
# Code provided already
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
df = pd.read_csv('housing_data.csv')
X = df[['sqft_living']]
y = df['price']
X_test = [[2750]]
```

In [ ]:

```
# Your code here
```

Import the

`housing_data.csv`

. First fit a Multiple Regression model with`grade`

,`bathrooms`

and`bedrooms`

as features and`price`

as the target. Find the R-squared value of the fit using the data you trained with.

In [ ]:

```
# Your code here
```