import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import pickle
%matplotlib inline
california_data = fetch_california_housing()
print(california_data.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the
U.S. Census Bureau publishes sample data (a block group typically has a
population of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297
df = pd.DataFrame(california_data.data, columns=california_data.feature_names)
df['Target'] = california_data.target
df
|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Target |
|---|---|---|---|---|---|---|---|---|---|
0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |
20639 | 2.3886 | 16.0 | 5.254717 | 1.162264 | 1387.0 | 2.616981 | 39.37 | -121.24 | 0.894 |
20640 rows × 9 columns
df.isnull().sum()
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64
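The DESCR above warns that the per-household averages can take surprisingly large values in block groups with few households. As an illustrative check (not in the original notebook), the column maxima confirm this:

# Illustrative: the per-household averages flagged in DESCR do reach
# extreme values in a few block groups
print(df[['AveRooms', 'AveBedrms', 'AveOccup', 'Population']].max())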
plt.figure(figsize=(10,6))
sns.kdeplot(df['Target'], fill=True)
plt.show()
sns.pairplot(df)
plt.figure(figsize=(10,6))
corr_matrix = df.corr().round(2)
sns.heatmap(data=corr_matrix, annot=True)
plt.show()
To fit a linear regression model, we select features that correlate strongly with the target variable 'Target'. Looking at the correlation matrix, 'MedInc' has the strongest positive correlation with 'Target' (0.69), whereas 'Latitude' has the strongest negative correlation (-0.14), though even that is fairly weak.
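To make this concrete, the correlations with the target can be ranked directly from the matrix computed above (an illustrative one-liner, not in the original notebook):

# Rank features by absolute correlation with 'Target' (illustrative)
print(corr_matrix['Target'].drop('Target').sort_values(key=abs, ascending=False))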
Another important point when selecting features for a linear regression model is to check for multicollinearity. The features 'AveRooms' and 'AveBedrms' have a correlation of 0.85, so they are strongly correlated with each other and should not both be selected for training the model. The same goes for 'Latitude' and 'Longitude', which have a correlation of -0.92. A standard diagnostic is sketched below.
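One common diagnostic for multicollinearity is the variance inflation factor (VIF). Here is a minimal sketch, assuming statsmodels is installed (it is not used elsewhere in this notebook); VIF values well above ~5-10 flag problematic features:

# Variance inflation factor per feature (illustrative; requires statsmodels)
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = df.drop(columns='Target')
vif = pd.Series(
    [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
    index=features.columns,
)
print(vif.sort_values(ascending=False))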
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
sns.regplot(data=df, x='Latitude', y='Target', line_kws={'color': 'red'})
plt.subplot(1, 2, 2)
sns.regplot(data=df, x='MedInc', y='Target', line_kws={'color': 'red'})
<AxesSubplot: xlabel='MedInc', ylabel='Target'>
# Alternative: keep only the two features discussed above
# X = pd.DataFrame(np.c_[df['Latitude'], df['MedInc']], columns=['Latitude', 'MedInc'])
X = df.iloc[:, :-1]  # here we train on all eight features instead
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(16512, 8)
(4128, 8)
(16512,)
(4128,)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
The motivation for scaling: gradient-descent-style solvers aim to reach the global minimum of the loss, and convergence is far faster when all independent features are on the same scale. (scikit-learn's LinearRegression actually uses a closed-form least-squares solution, but standardizing is still worthwhile: it makes the learned coefficients directly comparable across features, and the fitted scaler can be saved and reused at inference time.)
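As a quick sanity check (illustrative, not in the original notebook), every column of the standardized training matrix should now have mean ≈ 0 and standard deviation ≈ 1:

# After StandardScaler, each training column has mean ~0 and std ~1
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))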
pickle.dump(sc, open('scaler.pkl','wb'))  # persist the fitted scaler so inference applies the same transformation
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
print(model.coef_)
[ 0.82789366 0.11540032 -0.2811675 0.32431215 -0.00344465 -0.04502135 -0.89616073 -0.86808269]
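The raw coefficient array is hard to read; pairing each coefficient with its feature name (an illustrative addition, relying on the column order being preserved through scaling) makes the signs easier to interpret:

# Map coefficients back to feature names (illustrative)
coefs = pd.Series(model.coef_, index=california_data.feature_names)
print(coefs.sort_values())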
y_pred = model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
r2 = r2_score(y_train, y_pred)
print(f"RMSE Score: {format(rmse)}")
print(f"R2 Score: {format(r2)}")
RMSE Score: 0.7221082719887488
R2 Score: 0.6047922425924859
sample = pd.DataFrame(california_data.data[:1], columns=california_data.feature_names)
model.predict(sc.transform(sample))[0]
4.123924882410245
y_pred_test = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2 = r2_score(y_test, y_pred_test)
print(f"RMSE Score: {format(rmse)}")
print(f"R2 Score: {format(r2)}")
RMSE Score: 0.7323542382277793
R2 Score: 0.6112568432827638
NOTE:
RMSE: A metric that tells us how far apart the predicted values are from the observed values in a dataset, on average. The lower the RMSE, the better a model fits a dataset.
R2: A metric that tells us the proportion of the variance in the response variable of a regression model that can be explained by the predictor variables. This value typically ranges from 0 to 1 (it can even be negative for a model that fits worse than simply predicting the mean). The higher the R2 value, the better a model fits a dataset.
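To connect these definitions to the numbers above, both metrics can be recomputed by hand (an illustrative check, not in the original notebook); the results should match sklearn's to floating-point precision:

# Recompute RMSE and R2 from their definitions (illustrative)
errors = y_test - y_pred_test
rmse_manual = np.sqrt((errors ** 2).mean())
ss_res = (errors ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot
print(rmse_manual, r2_manual)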
plt.figure(figsize=(15, 8))
plt.scatter(y_test, y_pred_test)
plt.xlabel('Y Test')
plt.ylabel('Y Prediction')
plt.show()
# residuals/errors
error = y_test - y_pred_test
sns.displot(error, kind="kde")
<seaborn.axisgrid.FacetGrid at 0x12b8bde20>
plt.scatter(y_pred_test, error)
<matplotlib.collections.PathCollection at 0x12b972670>
pickle.dump(model, open('regression_model.pkl','wb'))
pickle_model = pickle.load(open('./regression_model.pkl', 'rb'))
sample = pd.DataFrame(california_data.data[1:2], columns=california_data.feature_names)
pickle_model.predict(sc.transform(sample))
array([3.97772666])
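Putting it together, an inference-time consumer would load both pickled artifacts and feed a raw sample through scaler and model. A minimal sketch, using the file names saved above:

# End-to-end inference sketch: load persisted scaler + model, predict one sample
loaded_sc = pickle.load(open('scaler.pkl', 'rb'))
loaded_model = pickle.load(open('regression_model.pkl', 'rb'))
sample = pd.DataFrame(california_data.data[1:2], columns=california_data.feature_names)
print(loaded_model.predict(loaded_sc.transform(sample))[0])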