Modified from: https://www.kaggle.com/shreayan98c/boston-house-price-prediction
The problem we are going to solve here: given a set of features that describe a house in Boston, our machine learning model must predict the house price. To train the model we will use the Boston housing dataset that ships with scikit-learn.
In this dataset, each row describes a Boston town or suburb. There are 506 rows and 13 attributes (features) plus a target column (price). https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names
# Importing the libraries
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Importing the Boston Housing dataset
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.keys()) #print keys
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
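Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell above only runs on older versions. On a newer release, a commonly suggested fallback (a sketch only; the rest of this notebook still assumes the boston Bunch object returned by load_boston) is to rebuild the arrays from the original StatLib source:
# Fallback sketch for scikit-learn >= 1.2, where load_boston is no longer available.
# The StatLib file stores each record across two lines, hence the slicing below.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
features = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 feature columns
prices = raw_df.values[1::2, 2]                                         # MEDV / PRICE target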
# Initializing the dataframe
data = pd.DataFrame(boston.data)
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of the UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
# See head of the dataset
data.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
#Adding the feature names to the dataframe
data.columns = boston.feature_names
data.head()
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
Each record in the database describes a Boston suburb or town.
#Adding target variable to dataframe
data['PRICE'] = boston.target
# Median value of owner-occupied homes in $1000s
#Check the shape of dataframe
data.shape
(506, 14)
data.columns
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'PRICE'], dtype='object')
data.dtypes
CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
PRICE      float64
dtype: object
# Identifying the unique number of values in the dataset
data.nunique()
CRIM       504
ZN          26
INDUS       76
CHAS         2
NOX         81
RM         446
AGE        356
DIS        412
RAD          9
TAX         66
PTRATIO     46
B          357
LSTAT      455
PRICE      229
dtype: int64
# Check for missing values
data.isnull().sum()
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
PRICE      0
dtype: int64
# See rows with missing values
data[data.isnull().any(axis=1)]
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE
---|---|---|---|---|---|---|---|---|---|---|---|---|---
(empty DataFrame: there are no rows with missing values)
# Viewing the data statistics
print(data.describe())
             CRIM          ZN       INDUS        CHAS         NOX          RM  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.613524   11.363636   11.136779    0.069170    0.554695    6.284634
std      8.601545   23.322453    6.860353    0.253994    0.115878    0.702617
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500
75%      3.677083   12.500000   18.100000    0.000000    0.624000    6.623500
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000

              AGE         DIS         RAD         TAX     PTRATIO           B  \
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean    68.574901    3.795043    9.549407  408.237154   18.455534  356.674032
std     28.148861    2.105710    8.707259  168.537116    2.164946   91.294864
min      2.900000    1.129600    1.000000  187.000000   12.600000    0.320000
25%     45.025000    2.100175    4.000000  279.000000   17.400000  375.377500
50%     77.500000    3.207450    5.000000  330.000000   19.050000  391.440000
75%     94.075000    5.188425   24.000000  666.000000   20.200000  396.225000
max    100.000000   12.126500   24.000000  711.000000   22.000000  396.900000

            LSTAT       PRICE
count  506.000000  506.000000
mean    12.653063   22.532806
std      7.141062    9.197104
min      1.730000    5.000000
25%      6.950000   17.025000
50%     11.360000   21.200000
75%     16.955000   25.000000
max     37.970000   50.000000
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
axs = axs.flatten()
# Boxplot of every column in its own subplot
for k, v in data.items():
    sns.boxplot(y=k, data=data, ax=axs[index])
    index += 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)
Columns like CRIM, ZN, RM and B seem to have outliers. Let's check the percentage of outliers in every column using the IQR rule.
# Flag values at or beyond 1.5 * IQR from the quartiles
for k, v in data.items():
    q1 = v.quantile(0.25)
    q3 = v.quantile(0.75)
    iqr = q3 - q1
    v_col = v[(v <= q1 - 1.5 * iqr) | (v >= q3 + 1.5 * iqr)]
    perc = np.shape(v_col)[0] * 100.0 / np.shape(data)[0]
    print("Column %s outliers = %.2f%%" % (k, perc))
Column CRIM outliers = 13.04%
Column ZN outliers = 13.44%
Column INDUS outliers = 0.00%
Column CHAS outliers = 100.00%
Column NOX outliers = 0.00%
Column RM outliers = 5.93%
Column AGE outliers = 0.00%
Column DIS outliers = 0.99%
Column RAD outliers = 0.00%
Column TAX outliers = 0.00%
Column PTRATIO outliers = 2.96%
Column B outliers = 15.22%
Column LSTAT outliers = 1.38%
Column PRICE outliers = 7.91%
CHAS shows up as 100% only because it is a binary dummy variable: its 25th and 75th percentiles are both 0, so the IQR is 0 and the rule above flags every value. Let's see what the feature distributions look like.
#import warnings
#warnings.filterwarnings('ignore')
fig, axs = plt.subplots(ncols=7, nrows=2, figsize=(20, 10))
index = 0
axs = axs.flatten()
# Histogram with a KDE curve for every column
for k, v in data.items():
    sns.histplot(v, kde=True, ax=axs[index])
    index += 1
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)
# Finding out the correlation between the features
corr = data.corr().abs()
corr.shape
(14, 14)
# Plotting the heatmap of correlation between features
plt.figure(figsize=(20,20))
sns.heatmap(corr, cbar=True, square= True, fmt='.1f', annot=True, annot_kws={'size':15})
#sns.heatmap(corr, cbar=True, square= True, fmt='.1f', annot=True, annot_kws={'size':15}, cmap='Greens')
Plot a subset of the features against PRICE, focusing on those with comparatively high correlation with PRICE.
from sklearn import preprocessing
# Let's scale the selected columns before plotting them against PRICE
min_max_scaler = preprocessing.MinMaxScaler()
column_sels = ['LSTAT', 'INDUS', 'NOX', 'PTRATIO', 'RM', 'TAX', 'DIS', 'AGE']
x = data.loc[:,column_sels]
y = data['PRICE']
x = pd.DataFrame(data=min_max_scaler.fit_transform(x), columns=column_sels)
fig, axs = plt.subplots(ncols=4, nrows=2, figsize=(20, 10))
index = 0
axs = axs.flatten()
# Regression plot of each scaled feature against PRICE
for i, k in enumerate(column_sels):
    sns.regplot(y=y, x=x[k], ax=axs[i])
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=5.0)
# Splitting target variable and independent variables
X = data.drop(['PRICE'], axis = 1)
y = data['PRICE']
# Splitting to training and testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 4)
# Import library for Linear Regression
from sklearn.linear_model import LinearRegression
# Create a Linear regressor
lm = LinearRegression()
# Train the model using the training sets
lm.fit(X_train, y_train)
LinearRegression()
# Value of y intercept
lm.intercept_
36.35704137659492
#Converting the coefficient values to a dataframe
coeffs = pd.DataFrame([X_train.columns,lm.coef_]).T
coeffs = coeffs.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coeffs
|   | Attribute | Coefficients |
---|---|---|
0 | CRIM | -0.12257 |
1 | ZN | 0.055678 |
2 | INDUS | -0.008834 |
3 | CHAS | 4.693448 |
4 | NOX | -14.435783 |
5 | RM | 3.28008 |
6 | AGE | -0.003448 |
7 | DIS | -1.552144 |
8 | RAD | 0.32625 |
9 | TAX | -0.014067 |
10 | PTRATIO | -0.803275 |
11 | B | 0.009354 |
12 | LSTAT | -0.523478 |
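As a quick sanity check (not in the original notebook), a linear model's prediction is just the intercept plus the dot product of the coefficient vector with a row of features, so the manual computation should match lm.predict:
# Sanity check: prediction = intercept + sum(coef_j * x_j) for every row
manual_pred = lm.intercept_ + X_train.values @ lm.coef_
print(np.allclose(manual_pred, lm.predict(X_train)))  # expected: True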
# Model prediction on train data
y_pred = lm.predict(X_train)
# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
R^2: 0.7465991966746854
Adjusted R^2: 0.736910342429894
MAE: 3.08986109497113
MSE: 19.07368870346903
RMSE: 4.367343437774162
R^2: the proportion of the variance in the dependent variable that is predictable from the independent variables; it measures how well the linear model captures the relationship between X and y.
Adjusted R^2: compares the explanatory power of regression models that contain different numbers of predictors by penalizing R^2 for predictors that do not improve the model.
MAE: the mean of the absolute values of the errors, i.e. the average absolute difference between the actual and predicted values of y.
MSE: like the MAE, but the differences are squared before averaging instead of taking absolute values, so larger errors are penalized more heavily.
RMSE: the square root of the MSE; unlike the MSE, it is on the same scale as the target variable.
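For reference, with $n$ observations, $p$ predictors, actual values $y_i$, predictions $\hat{y}_i$ and mean $\bar{y}$, these metrics are:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}, \qquad R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
$$MAE = \frac{1}{n}\sum_i |y_i - \hat{y}_i|, \qquad MSE = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2, \qquad RMSE = \sqrt{MSE}$$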
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs Predicted prices")
plt.show()
# Checking residuals
plt.scatter(y_pred,y_train-y_pred)
plt.title("Predicted vs residuals")
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.show()
There is no clear pattern in this plot and the residuals are distributed roughly evenly around zero, so the linearity assumption is satisfied.
# Checking Normality of errors
sns.histplot(y_train-y_pred, kde=True)
plt.title("Histogram of Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()
Here the residuals are approximately normally distributed, so the normality assumption is satisfied.
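As a complementary check (not part of the original notebook), a Q-Q plot of the residuals against a normal distribution can be drawn with the scipy.stats module imported earlier; points lying close to the reference line support the normality assumption.
# Q-Q plot of the training residuals against a theoretical normal distribution
stats.probplot(y_train - y_pred, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()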
# Predicting Test data with the model
y_test_pred = lm.predict(X_test)
# Model Evaluation
acc_linreg = metrics.r2_score(y_test, y_test_pred)
print('R^2:', acc_linreg)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
R^2: 0.7121818377409196
Adjusted R^2: 0.6850685326005714
MAE: 3.859005592370741
MSE: 30.05399330712412
RMSE: 5.482152251362973
The evaluation scores on the test data are close to those on the training data, so the model is not overfitting.
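As an optional extra check (not part of the original flow), k-fold cross-validation estimates generalization performance without relying on a single train/test split; a minimal sketch with a shuffled 5-fold split:
# 5-fold cross-validated R^2 for the linear model on the unscaled features
from sklearn.model_selection import cross_val_score, KFold
cv = KFold(n_splits=5, shuffle=True, random_state=4)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')
print('Cross-validated R^2 scores:', cv_scores)
print('Mean CV R^2:', cv_scores.mean())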
# Import Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor
# Create a Random Forest Regressor
reg = RandomForestRegressor()
# Train the model using the training sets
reg.fit(X_train, y_train)
RandomForestRegressor()
# Model prediction on train data
y_pred = reg.predict(X_train)
# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
R^2: 0.9780387443633498
Adjusted R^2: 0.9771990492948897
MAE: 0.8612994350282495
MSE: 1.6530419322033914
RMSE: 1.2857067831365716
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs Predicted prices")
plt.show()
# Checking residuals
plt.scatter(y_pred,y_train-y_pred)
plt.title("Predicted vs residuals")
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.show()
# Predicting Test data with the model
y_test_pred = reg.predict(X_test)
# Model Evaluation
acc_rf = metrics.r2_score(y_test, y_test_pred)
print('R^2:', acc_rf)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
R^2: 0.8287390519698405
Adjusted R^2: 0.8126057742568545
MAE: 2.5514407894736846
MSE: 17.88308057236843
RMSE: 4.228839151867618
# Creating scaled set to be used in model to improve our results
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Import SVM Regressor
from sklearn import svm
# Create a SVM Regressor
reg = svm.SVR()
# Train the model using the training sets
reg.fit(X_train, y_train)
SVR()
The main SVR parameters (a brief usage sketch follows this list):
C : float, default=1.0. Regularization parameter; it controls the trade-off between a smooth (flat) regression function and fitting the training points closely.
kernel : string, default='rbf'. Selects the kernel used to transform the data. It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable.
degree : int, default=3. Degree of the polynomial kernel function ('poly'). Ignored by all other kernels.
gamma : {'scale', 'auto'} or float, default='scale'. Kernel coefficient for 'rbf', 'poly' and 'sigmoid'; higher values fit the training set more closely. 'scale' uses 1 / (n_features * X.var()).
coef0 : float, default=0.0. Independent term in the kernel function. It is only significant for 'poly' and 'sigmoid'.
shrinking : boolean, default=True. Whether to use the shrinking heuristic.
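Below is a minimal sketch (hypothetical parameter values, not tuned for this dataset) showing how these arguments are passed explicitly; the model trained above simply used the defaults, i.e. svm.SVR().
# A usage sketch with explicit (hypothetical, untuned) parameter values
reg_tuned = svm.SVR(
    kernel='rbf',   # radial basis function kernel
    C=10.0,         # larger C fits the training points more closely
    gamma='scale',  # kernel coefficient: 1 / (n_features * X.var())
    epsilon=0.1,    # width of the epsilon-insensitive tube around the predictions
)
reg_tuned.fit(X_train, y_train)
print('Train R^2:', reg_tuned.score(X_train, y_train))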
# Model prediction on train data
y_pred = reg.predict(X_train)
# Model Evaluation
print('R^2:',metrics.r2_score(y_train, y_pred))
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_train, y_pred))*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_train, y_pred))
print('MSE:',metrics.mean_squared_error(y_train, y_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_train, y_pred)))
R^2: 0.6419097248941195
Adjusted R^2: 0.628218037904777
MAE: 2.9361501059460293
MSE: 26.953752101332935
RMSE: 5.191700309275655
# Visualizing the differences between actual prices and predicted values
plt.scatter(y_train, y_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("Prices vs Predicted prices")
plt.show()
# Checking residuals
plt.scatter(y_pred,y_train-y_pred)
plt.title("Predicted vs residuals")
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.show()
# Predicting Test data with the model
y_test_pred = reg.predict(X_test)
# Model Evaluation
acc_svm = metrics.r2_score(y_test, y_test_pred)
print('R^2:', acc_svm)
print('Adjusted R^2:',1 - (1-metrics.r2_score(y_test, y_test_pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print('MAE:',metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:',metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))
R^2: 0.5900158460478174
Adjusted R^2: 0.5513941503856553
MAE: 3.7561453553021686
MSE: 42.81057499010247
RMSE: 6.542979060802691
models = pd.DataFrame({
'Model': ['Linear Regression', 'Random Forest', 'Support Vector Machines'],
'R-squared Score': [acc_linreg*100, acc_rf*100, acc_svm*100]})
models.sort_values(by='R-squared Score', ascending=False)
|   | Model | R-squared Score |
---|---|---|
1 | Random Forest | 82.873905 |
0 | Linear Regression | 71.218184 |
2 | Support Vector Machines | 59.001585 |