import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.preprocessing import StandardScaler
import pickle
%matplotlib inline
california_data = fetch_california_housing()
print(california_data.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the
U.S. Census Bureau publishes sample data (a block group typically has a
population of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297
df = pd.DataFrame(california_data.data, columns=california_data.feature_names)
df['Target'] = california_data.target
df
|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Target |
|---|---|---|---|---|---|---|---|---|---|
0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |
20639 | 2.3886 | 16.0 | 5.254717 | 1.162264 | 1387.0 | 2.616981 | 39.37 | -121.24 | 0.894 |
20640 rows × 9 columns
df.isnull().sum()
MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
Target        0
dtype: int64
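The DESCR above warns that the per-household averages can take surprisingly large values in block groups with few households. As an illustrative check (not in the original notebook), the column maxima confirm this:

# Illustrative: the per-household averages flagged in DESCR do reach
# extreme values in a few block groups
print(df[['AveRooms', 'AveBedrms', 'AveOccup', 'Population']].max())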
plt.figure(figsize=(10,6))
sns.kdeplot(df['Target'], fill=True)
plt.show()
sns.pairplot(df)
plt.figure(figsize=(10,6))
corr_matrix = df.corr().round(2)
sns.heatmap(data=corr_matrix, annot=True)
plt.show()
To fit a linear regression model, we select features that correlate strongly with the target variable 'Target'. Looking at the correlation matrix, 'MedInc' has the strongest positive correlation with 'Target' (0.69), whereas 'Latitude' has the strongest negative correlation (-0.14), though even that is fairly weak.
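To make this concrete, the correlations with the target can be ranked directly from the matrix computed above (an illustrative one-liner, not in the original notebook):

# Rank features by absolute correlation with 'Target' (illustrative)
print(corr_matrix['Target'].drop('Target').sort_values(key=abs, ascending=False))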
Another important point when selecting features for a linear regression model is to check for multicollinearity. The features 'AveRooms' and 'AveBedrms' have a correlation of 0.85, so they are strongly correlated with each other and should not both be selected for training the model. The same goes for 'Latitude' and 'Longitude', which have a correlation of -0.92. A standard diagnostic is sketched below.
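One common diagnostic for multicollinearity is the variance inflation factor (VIF). Here is a minimal sketch, assuming statsmodels is installed (it is not used elsewhere in this notebook); VIF values well above ~5-10 flag problematic features:

# Variance inflation factor per feature (illustrative; requires statsmodels)
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = df.drop(columns='Target')
vif = pd.Series(
    [variance_inflation_factor(features.values, i) for i in range(features.shape[1])],
    index=features.columns,
)
print(vif.sort_values(ascending=False))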
plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
sns.regplot(data=df, x='Latitude', y='Target', line_kws={'color': 'red'})
plt.subplot(1, 2, 2)
sns.regplot(data=df, x='MedInc', y='Target', line_kws={'color': 'red'})
<AxesSubplot: xlabel='MedInc', ylabel='Target'>
# Alternative: keep only the two features discussed above
# X = pd.DataFrame(np.c_[df['Latitude'], df['MedInc']], columns=['Latitude', 'MedInc'])
X = df.iloc[:, :-1]  # here we train on all eight features instead
y = df['Target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(16512, 8)
(4128, 8)
(16512,)
(4128,)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
The motivation for scaling: gradient-descent-style solvers aim to reach the global minimum of the loss, and convergence is far faster when all independent features are on the same scale. (scikit-learn's LinearRegression actually uses a closed-form least-squares solution, but standardizing is still worthwhile: it makes the learned coefficients directly comparable across features, and the fitted scaler can be saved and reused at inference time.)
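As a quick sanity check (illustrative, not in the original notebook), every column of the standardized training matrix should now have mean ≈ 0 and standard deviation ≈ 1:

# After StandardScaler, each training column has mean ~0 and std ~1
print(X_train.mean(axis=0).round(3))
print(X_train.std(axis=0).round(3))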
pickle.dump(sc, open('scaler.pkl','wb'))  # persist the fitted scaler so inference applies the same transformation
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
print(model.coef_)
[ 0.82789366 0.11540032 -0.2811675 0.32431215 -0.00344465 -0.04502135 -0.89616073 -0.86808269]
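The raw coefficient array is hard to read; pairing each coefficient with its feature name (an illustrative addition, relying on the column order being preserved through scaling) makes the signs easier to interpret:

# Map coefficients back to feature names (illustrative)
coefs = pd.Series(model.coef_, index=california_data.feature_names)
print(coefs.sort_values())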
y_pred = model.predict(X_train)
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
r2 = r2_score(y_train, y_pred)
print(f"RMSE Score: {format(rmse)}")
print(f"R2 Score: {format(r2)}")
RMSE Score: 0.7221082719887488
R2 Score: 0.6047922425924859
sample = pd.DataFrame(california_data.data[:1], columns=california_data.feature_names)
model.predict(sc.transform(sample))[0]
4.123924882410245
y_pred_test = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
r2 = r2_score(y_test, y_pred_test)
print(f"RMSE Score: {format(rmse)}")
print(f"R2 Score: {format(r2)}")
RMSE Score: 0.7323542382277793
R2 Score: 0.6112568432827638
NOTE:
RMSE: A metric that tells us how far apart the predicted values are from the observed values in a dataset, on average. The lower the RMSE, the better a model fits a dataset.
R2: A metric that tells us the proportion of the variance in the response variable of a regression model that can be explained by the predictor variables. This value typically ranges from 0 to 1 (it can even be negative for a model that fits worse than simply predicting the mean). The higher the R2 value, the better a model fits a dataset.
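To connect these definitions to the numbers above, both metrics can be recomputed by hand (an illustrative check, not in the original notebook); the results should match sklearn's to floating-point precision:

# Recompute RMSE and R2 from their definitions (illustrative)
errors = y_test - y_pred_test
rmse_manual = np.sqrt((errors ** 2).mean())
ss_res = (errors ** 2).sum()
ss_tot = ((y_test - y_test.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot
print(rmse_manual, r2_manual)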
plt.figure(figsize=(15, 8))
plt.scatter(y_test, y_pred_test)
plt.xlabel('Y Test')
plt.ylabel('Y Prediction')
plt.show()
# residuals/errors
error = y_test - y_pred_test
sns.displot(error, kind="kde")
<seaborn.axisgrid.FacetGrid at 0x12b8bde20>
plt.scatter(y_pred_test, error)
<matplotlib.collections.PathCollection at 0x12b972670>
pickle.dump(model, open('regression_model.pkl','wb'))
pickle_model = pickle.load(open('./regression_model.pkl', 'rb'))
sample = pd.DataFrame(california_data.data[1:2], columns=california_data.feature_names)
pickle_model.predict(sc.transform(sample))
array([3.97772666])
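Putting it together, an inference-time consumer would load both pickled artifacts and feed a raw sample through scaler and model. A minimal sketch, using the file names saved above:

# End-to-end inference sketch: load persisted scaler + model, predict one sample
loaded_sc = pickle.load(open('scaler.pkl', 'rb'))
loaded_model = pickle.load(open('regression_model.pkl', 'rb'))
sample = pd.DataFrame(california_data.data[1:2], columns=california_data.feature_names)
print(loaded_model.predict(loaded_sc.transform(sample))[0])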