The purpose of this project is to deal with a real world regression problem and approach it with a business concern: from explaining the insights we can get just from the data, find a ML model that would fit on the regression problem and finally put into production the model and the insights we can get from it.
For this project I've chosen this Kaggle dataset. This dataset contains information of different houses in Iowa and the goal of this project is to find a model that able to predict the sale price of a house. Predicting the market price of an element (in this case a house) gives to any business and important and powerful information to take advantage on. In the case of the real state business, one of these huge advantages could be improve to the profit margin and increase the turnover.
So, in the following sections we will put ourselves in the shoes of a real state company, which want to understand which useful insights we can get from their historical data, train a model able to predict the price of a house and take advantage of this to increase the turnover.
In any ML project, the first step that should be done is an exploratory data analysis, not only because doing this will help us to understand better the data we're working with, but also let us know if there are problems in the data we're working on prior trying to model it, let us know that the model we've trained is reliable and finally is a first step to bring useful insights without even have to train a model.
First of all, we'll import (and install) all necessary libraries we'll need in this notebook.
#pip install keras-tuner
#pip install xgboost
#pip install pycaret
#pip install catboost
#pip install lightgbm
#pip install tensorflow
#pip install keras
#pip install scikeras
#pip install imbalanced-learn==0.11.0
#pip install shap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from keras_tuner.tuners import RandomSearch, Hyperband
import keras_tuner as kt
from scipy.stats import zscore
import shap
import pickle
import warnings
warnings.filterwarnings('ignore') #For a cleaner lecture
%matplotlib inline
WARNING:tensorflow:From C:\Users\rojol\anaconda3\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.
Now, we can import the data and take a look to the first rows:
df = pd.read_csv('train.csv')
df.head(10)
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 | 6 | 50 | RL | 85.0 | 14115 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
6 | 7 | 20 | RL | 75.0 | 10084 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 307000 |
7 | 8 | 60 | RL | NaN | 10382 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
8 | 9 | 50 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2008 | WD | Abnorml | 129900 |
9 | 10 | 190 | RL | 50.0 | 7420 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | WD | Normal | 118000 |
10 rows × 81 columns
Let's take a look to the features of our dataset, the type of each features inferred from pandas and the count of not null values:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1201 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 Alley 91 non-null object 7 LotShape 1460 non-null object 8 LandContour 1460 non-null object 9 Utilities 1460 non-null object 10 LotConfig 1460 non-null object 11 LandSlope 1460 non-null object 12 Neighborhood 1460 non-null object 13 Condition1 1460 non-null object 14 Condition2 1460 non-null object 15 BldgType 1460 non-null object 16 HouseStyle 1460 non-null object 17 OverallQual 1460 non-null int64 18 OverallCond 1460 non-null int64 19 YearBuilt 1460 non-null int64 20 YearRemodAdd 1460 non-null int64 21 RoofStyle 1460 non-null object 22 RoofMatl 1460 non-null object 23 Exterior1st 1460 non-null object 24 Exterior2nd 1460 non-null object 25 MasVnrType 1452 non-null object 26 MasVnrArea 1452 non-null float64 27 ExterQual 1460 non-null object 28 ExterCond 1460 non-null object 29 Foundation 1460 non-null object 30 BsmtQual 1423 non-null object 31 BsmtCond 1423 non-null object 32 BsmtExposure 1422 non-null object 33 BsmtFinType1 1423 non-null object 34 BsmtFinSF1 1460 non-null int64 35 BsmtFinType2 1422 non-null object 36 BsmtFinSF2 1460 non-null int64 37 BsmtUnfSF 1460 non-null int64 38 TotalBsmtSF 1460 non-null int64 39 Heating 1460 non-null object 40 HeatingQC 1460 non-null object 41 CentralAir 1460 non-null object 42 Electrical 1459 non-null object 43 1stFlrSF 1460 non-null int64 44 2ndFlrSF 1460 non-null int64 45 LowQualFinSF 1460 non-null int64 46 GrLivArea 1460 non-null int64 47 BsmtFullBath 1460 non-null int64 48 BsmtHalfBath 1460 non-null int64 49 FullBath 1460 non-null int64 50 HalfBath 1460 non-null int64 51 BedroomAbvGr 1460 non-null int64 52 KitchenAbvGr 1460 non-null int64 53 KitchenQual 1460 non-null object 54 TotRmsAbvGrd 1460 non-null int64 55 Functional 1460 non-null object 56 Fireplaces 1460 non-null int64 57 FireplaceQu 770 non-null object 58 GarageType 1379 non-null object 59 GarageYrBlt 1379 non-null float64 60 GarageFinish 1379 non-null object 61 GarageCars 1460 non-null int64 62 GarageArea 1460 non-null int64 63 GarageQual 1379 non-null object 64 GarageCond 1379 non-null object 65 PavedDrive 1460 non-null object 66 WoodDeckSF 1460 non-null int64 67 OpenPorchSF 1460 non-null int64 68 EnclosedPorch 1460 non-null int64 69 3SsnPorch 1460 non-null int64 70 ScreenPorch 1460 non-null int64 71 PoolArea 1460 non-null int64 72 PoolQC 7 non-null object 73 Fence 281 non-null object 74 MiscFeature 54 non-null object 75 MiscVal 1460 non-null int64 76 MoSold 1460 non-null int64 77 YrSold 1460 non-null int64 78 SaleType 1460 non-null object 79 SaleCondition 1460 non-null object 80 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(43) memory usage: 924.0+ KB
As we can see above we have a dataset with 1460 rows and 81 columns. Here you can see a description of each feature, get from the kaggle dataset.
The feature id is a unique identifier for each house, so it doesn't have any relevance to get any insight from it or for training the model, that's why we'll delete this from the dataset:
df.drop(columns = ['Id'], inplace = True)
Another important thing is to check how many null data we have on the dataset. In case we only have a few nulls or to much nulls, we just can delete those records/feature, but otherwise we'll have to infer the missing data. Here you can see all the features with the total of null values:
null_count = df.isna().sum()
null_count = null_count.where(null_count > 0).dropna().sort_values(ascending = False)
total_null_values = null_count.to_frame(name="count_nulls")
perc_null_values = (round(100*null_count/len(df),2)).to_frame(name="perc_null")
df_null_values = pd.concat([total_null_values, perc_null_values], axis=1)
df_null_values
count_nulls | perc_null | |
---|---|---|
PoolQC | 1453.0 | 99.52 |
MiscFeature | 1406.0 | 96.30 |
Alley | 1369.0 | 93.77 |
Fence | 1179.0 | 80.75 |
FireplaceQu | 690.0 | 47.26 |
LotFrontage | 259.0 | 17.74 |
GarageType | 81.0 | 5.55 |
GarageYrBlt | 81.0 | 5.55 |
GarageFinish | 81.0 | 5.55 |
GarageQual | 81.0 | 5.55 |
GarageCond | 81.0 | 5.55 |
BsmtExposure | 38.0 | 2.60 |
BsmtFinType2 | 38.0 | 2.60 |
BsmtFinType1 | 37.0 | 2.53 |
BsmtCond | 37.0 | 2.53 |
BsmtQual | 37.0 | 2.53 |
MasVnrArea | 8.0 | 0.55 |
MasVnrType | 8.0 | 0.55 |
Electrical | 1.0 | 0.07 |
There are 4 features with more than 80% missing data: PoolQC, MiscFeature, Alley and Fence. It's way to much null data to try estimate its value so we'll delete these columns from the dataset. Anyway, if we could have access to the original data could be good to know why there's missing so much data for these columns and try to fill those values from the source.
There is no consensus on how many null values a feature should have to eliminate it, but you can find out in some sources (like this one) that from 50% of null in a feature, you can remove the whole feature from the dataset.
df.drop(columns = ['PoolQC', 'MiscFeature', 'Alley', 'Fence'], inplace = True)
Taking a look info the description of each feature, we can realise that there are some ordinal features that are defined as object. Let's do a numerical transformation to these features in order to work in an easier way with them:
# Making Dictionaries of ordinal features
ExterQual_map = { #Evaluates the quality of the material on the exterior
'Po' : 1,
'Fa' : 2,
'TA' : 3,
'Gd' : 4,
'Ex' : 5
}
ExterCond_map = ExterQual_map # Evaluates the present condition of the material on the exterior
BsmtQual_map = ExterQual_map #Evaluates the height of the basement
BsmtQual_map['NA'] = -1
BsmtCond_map = BsmtQual_map #Evaluates the general condition of the basement
BsmtExposure_map = { #Refers to walkout or garden level walls
'NA' : -1,
'No' : 0,
'Mn' : 1,
'Av' : 2,
'Gd' : 3
}
BsmtFinType1_map = { #Rating of basement finished area
'NA' : -1,
'Unf' : 0,
'LwQ' : 1,
'Rec' : 2,
'BLQ' : 3,
'ALQ' : 4,
'GLQ' : 5
}
BsmtFinType2_map = BsmtFinType1_map #Rating of basement finished area (if multiple types)
HeatingQC_map = ExterQual_map #Heating quality and condition
CentralAir_map = { #Central air conditioning
'Y': 1,
'N': 0
}
KitchenQual_map = ExterQual_map #Kitchen quality
FireplaceQu_map = BsmtQual_map #Fireplace quality
GarageFinish_map = { #Interior finish of the garage
'NA': -1,
'Unf': 0,
'RFn': 1,
'Fin': 2
}
GarageQual_map = BsmtQual_map #Garage quality
GarageCond_map = BsmtQual_map #Garage condition
PavedDrive_map = { #Paved driveway
'N': 0,
'P': 1,
'Y': 2
}
# Transforming Categorical features into numerical features
df_first_transform = df.copy()
df_first_transform.loc[:,'ExterQual'] = df_first_transform['ExterQual'].map(ExterQual_map)
df_first_transform.loc[:,'ExterCond'] = df_first_transform['ExterCond'].map(ExterCond_map)
df_first_transform.loc[:,'BsmtQual'] = df_first_transform['BsmtQual'].map(BsmtQual_map)
df_first_transform.loc[:,'BsmtCond'] = df_first_transform['BsmtCond'].map(BsmtCond_map)
df_first_transform.loc[:,'BsmtExposure'] = df_first_transform['BsmtExposure'].map(BsmtExposure_map)
df_first_transform.loc[:,'BsmtFinType1'] = df_first_transform['BsmtFinType1'].map(BsmtFinType1_map)
df_first_transform.loc[:,'BsmtFinType2'] = df_first_transform['BsmtFinType2'].map(BsmtFinType2_map)
df_first_transform.loc[:,'HeatingQC'] = df_first_transform['HeatingQC'].map(HeatingQC_map)
df_first_transform.loc[:,'CentralAir'] = df_first_transform['CentralAir'].map(CentralAir_map)
df_first_transform.loc[:,'KitchenQual'] = df_first_transform['KitchenQual'].map(KitchenQual_map)
df_first_transform.loc[:,'FireplaceQu'] = df_first_transform['FireplaceQu'].map(FireplaceQu_map)
df_first_transform.loc[:,'GarageFinish'] = df_first_transform['GarageFinish'].map(GarageFinish_map)
df_first_transform.loc[:,'GarageQual'] = df_first_transform['GarageQual'].map(GarageQual_map)
df_first_transform.loc[:,'GarageCond'] = df_first_transform['GarageCond'].map(GarageCond_map)
df_first_transform.loc[:,'PavedDrive'] = df_first_transform['PavedDrive'].map(PavedDrive_map)
df_first_transform.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 76 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MSSubClass 1460 non-null int64 1 MSZoning 1460 non-null object 2 LotFrontage 1201 non-null float64 3 LotArea 1460 non-null int64 4 Street 1460 non-null object 5 LotShape 1460 non-null object 6 LandContour 1460 non-null object 7 Utilities 1460 non-null object 8 LotConfig 1460 non-null object 9 LandSlope 1460 non-null object 10 Neighborhood 1460 non-null object 11 Condition1 1460 non-null object 12 Condition2 1460 non-null object 13 BldgType 1460 non-null object 14 HouseStyle 1460 non-null object 15 OverallQual 1460 non-null int64 16 OverallCond 1460 non-null int64 17 YearBuilt 1460 non-null int64 18 YearRemodAdd 1460 non-null int64 19 RoofStyle 1460 non-null object 20 RoofMatl 1460 non-null object 21 Exterior1st 1460 non-null object 22 Exterior2nd 1460 non-null object 23 MasVnrType 1452 non-null object 24 MasVnrArea 1452 non-null float64 25 ExterQual 1460 non-null int64 26 ExterCond 1460 non-null int64 27 Foundation 1460 non-null object 28 BsmtQual 1423 non-null float64 29 BsmtCond 1423 non-null float64 30 BsmtExposure 1422 non-null float64 31 BsmtFinType1 1423 non-null float64 32 BsmtFinSF1 1460 non-null int64 33 BsmtFinType2 1422 non-null float64 34 BsmtFinSF2 1460 non-null int64 35 BsmtUnfSF 1460 non-null int64 36 TotalBsmtSF 1460 non-null int64 37 Heating 1460 non-null object 38 HeatingQC 1460 non-null int64 39 CentralAir 1460 non-null int64 40 Electrical 1459 non-null object 41 1stFlrSF 1460 non-null int64 42 2ndFlrSF 1460 non-null int64 43 LowQualFinSF 1460 non-null int64 44 GrLivArea 1460 non-null int64 45 BsmtFullBath 1460 non-null int64 46 BsmtHalfBath 1460 non-null int64 47 FullBath 1460 non-null int64 48 HalfBath 1460 non-null int64 49 BedroomAbvGr 1460 non-null int64 50 KitchenAbvGr 1460 non-null int64 51 KitchenQual 1460 non-null int64 52 TotRmsAbvGrd 1460 non-null int64 53 Functional 1460 non-null object 54 Fireplaces 1460 non-null int64 55 FireplaceQu 770 non-null float64 56 GarageType 1379 non-null object 57 GarageYrBlt 1379 non-null float64 58 GarageFinish 1379 non-null float64 59 GarageCars 1460 non-null int64 60 GarageArea 1460 non-null int64 61 GarageQual 1379 non-null float64 62 GarageCond 1379 non-null float64 63 PavedDrive 1460 non-null int64 64 WoodDeckSF 1460 non-null int64 65 OpenPorchSF 1460 non-null int64 66 EnclosedPorch 1460 non-null int64 67 3SsnPorch 1460 non-null int64 68 ScreenPorch 1460 non-null int64 69 PoolArea 1460 non-null int64 70 MiscVal 1460 non-null int64 71 MoSold 1460 non-null int64 72 YrSold 1460 non-null int64 73 SaleType 1460 non-null object 74 SaleCondition 1460 non-null object 75 SalePrice 1460 non-null int64 dtypes: float64(12), int64(40), object(24) memory usage: 867.0+ KB
As we can see above, the ordinal features has been changed to a float64 type (due there are also null values). Finally, let's add two more important features: total square feet of the house and the total of bathrooms:
df_first_transform['TotalSF'] = df_first_transform['TotalBsmtSF'] + df_first_transform['GrLivArea']
df_first_transform['Total_Bathrooms'] = df_first_transform['FullBath'] + df_first_transform['HalfBath'] + df_first_transform['BsmtFullBath'] + df_first_transform['BsmtHalfBath']
# Code for transforming the dataset. We'll be using it later.
def feature_engineering_sale_price_dataset(df):
df_first_transform = df.copy()
df_first_transform.drop(columns = ['PoolQC', 'MiscFeature', 'Alley', 'Fence'], inplace = True)
df_first_transform.loc[:,'ExterQual'] = df_first_transform['ExterQual'].map(ExterQual_map)
df_first_transform.loc[:,'ExterCond'] = df_first_transform['ExterCond'].map(ExterCond_map)
df_first_transform.loc[:,'BsmtQual'] = df_first_transform['BsmtQual'].map(BsmtQual_map)
df_first_transform.loc[:,'BsmtCond'] = df_first_transform['BsmtCond'].map(BsmtCond_map)
df_first_transform.loc[:,'BsmtExposure'] = df_first_transform['BsmtExposure'].map(BsmtExposure_map)
df_first_transform.loc[:,'BsmtFinType1'] = df_first_transform['BsmtFinType1'].map(BsmtFinType1_map)
df_first_transform.loc[:,'BsmtFinType2'] = df_first_transform['BsmtFinType2'].map(BsmtFinType2_map)
df_first_transform.loc[:,'HeatingQC'] = df_first_transform['HeatingQC'].map(HeatingQC_map)
df_first_transform.loc[:,'CentralAir'] = df_first_transform['CentralAir'].map(CentralAir_map)
df_first_transform.loc[:,'KitchenQual'] = df_first_transform['KitchenQual'].map(KitchenQual_map)
df_first_transform.loc[:,'FireplaceQu'] = df_first_transform['FireplaceQu'].map(FireplaceQu_map)
df_first_transform.loc[:,'GarageFinish'] = df_first_transform['GarageFinish'].map(GarageFinish_map)
df_first_transform.loc[:,'GarageQual'] = df_first_transform['GarageQual'].map(GarageQual_map)
df_first_transform.loc[:,'GarageCond'] = df_first_transform['GarageCond'].map(GarageCond_map)
df_first_transform.loc[:,'PavedDrive'] = df_first_transform['PavedDrive'].map(PavedDrive_map)
df_first_transform['TotalSF'] = df_first_transform['TotalBsmtSF'] + df_first_transform['GrLivArea']
df_first_transform['Total_Bathrooms'] = df_first_transform['FullBath'] + df_first_transform['HalfBath'] + df_first_transform['BsmtFullBath'] + df_first_transform['BsmtHalfBath']
return df_first_transform
This dataset contains too many features to explain them by themselves and their relationship to each other to extract information one at a time. That's why we'll try to filter the important features from our predictive feature SalePrice and then we could extract more detailed analysis only from them and get useful insights.
This first feature filter we'll do it using the spearman correlation (because we're dealing with ordinal and continous variables). We'll only do the exploratory data analysis with features with the greatest correlation with our SalePrice feature.
corrmat = df_first_transform.corr(method='spearman')
f, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corrmat, vmax=.8, square=True)
<AxesSubplot: >
This first filter, only contains numerical features. We'll deal later with the categorical features of the dataset. For now, we'll keep those features with a correlation at least 0.6 (positive or negative), from that value we can assume that exists a strong correlation.
Also, we'll sort the correlation matrix in order that the most correlated features will appear on the top of the heatmap:
top_corr_features = corrmat.index[abs(corrmat["SalePrice"])>=0.6]
plt.figure(figsize=(3,6))
g = sns.heatmap(df_first_transform[top_corr_features].corr(method='spearman')[['SalePrice']].sort_values('SalePrice', ascending=False),annot=True)
So, there are 2 features very strongly correlated with SalesPrice:
and 10 features strongly correlated with SalesPrice:
In addition, in the correlation matrix doesn't appear any high negative correlation, so this means that a high value of any of these features also has a high value for sale price and low value of any of these features also has a low value for sale price.
Finally, we can get the firsts insights from this correlation matrix:
Note that in the previous correlation doesn't appear any feature about the condition of the house (overall, material of the exterior, basement) and that's interesting because it doesn't seem that the actual condition of the house is highly related to the sale price, but it does the qualities (as we've seen above).
Let's check this plotting the relation of these features with the Sales Price feature:
top_corr_features_excl_salesprice = top_corr_features.drop('SalePrice')
fig, ax = plt.subplots(round(len(top_corr_features_excl_salesprice) / 3), 3, figsize = (18, 18))
for i, ax in enumerate(fig.axes):
if i < len(top_corr_features_excl_salesprice) - 1:
sns.regplot(x=top_corr_features_excl_salesprice[i],y='SalePrice', data=df_first_transform[top_corr_features], ax=ax, line_kws={'color': '#000000'})
As we can see above, each scatterplot has a positive slope in the linear regression (black line), so, as we said before, for greater values of each feature, the greater value for the sale price.
Let's get some more insights. Let's plot the histogram of these previous features and the main statistical metrics:
df_first_transform[top_corr_features].hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)
array([[<AxesSubplot: title={'center': 'OverallQual'}>, <AxesSubplot: title={'center': 'YearBuilt'}>, <AxesSubplot: title={'center': 'ExterQual'}>, <AxesSubplot: title={'center': 'BsmtQual'}>], [<AxesSubplot: title={'center': 'TotalBsmtSF'}>, <AxesSubplot: title={'center': 'GrLivArea'}>, <AxesSubplot: title={'center': 'FullBath'}>, <AxesSubplot: title={'center': 'KitchenQual'}>], [<AxesSubplot: title={'center': 'GarageCars'}>, <AxesSubplot: title={'center': 'GarageArea'}>, <AxesSubplot: title={'center': 'SalePrice'}>, <AxesSubplot: title={'center': 'TotalSF'}>], [<AxesSubplot: title={'center': 'Total_Bathrooms'}>, <AxesSubplot: >, <AxesSubplot: >, <AxesSubplot: >]], dtype=object)
df_first_transform[top_corr_features].describe()
OverallQual | YearBuilt | ExterQual | BsmtQual | TotalBsmtSF | GrLivArea | FullBath | KitchenQual | GarageCars | GarageArea | SalePrice | TotalSF | Total_Bathrooms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1460.000000 | 1460.000000 | 1460.00000 | 1423.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 | 1460.000000 |
mean | 6.099315 | 1971.267808 | 3.39589 | 3.579761 | 1057.429452 | 1515.463699 | 1.565068 | 3.511644 | 1.767123 | 472.980137 | 180921.195890 | 2572.893151 | 2.430822 |
std | 1.382997 | 30.202904 | 0.57428 | 0.680602 | 438.705324 | 525.480383 | 0.550916 | 0.663760 | 0.747315 | 213.804841 | 79442.502883 | 823.598492 | 0.922647 |
min | 1.000000 | 1872.000000 | 2.00000 | 2.000000 | 0.000000 | 334.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 34900.000000 | 334.000000 | 1.000000 |
25% | 5.000000 | 1954.000000 | 3.00000 | 3.000000 | 795.750000 | 1129.500000 | 1.000000 | 3.000000 | 1.000000 | 334.500000 | 129975.000000 | 2014.000000 | 2.000000 |
50% | 6.000000 | 1973.000000 | 3.00000 | 4.000000 | 991.500000 | 1464.000000 | 2.000000 | 3.000000 | 2.000000 | 480.000000 | 163000.000000 | 2479.000000 | 2.000000 |
75% | 7.000000 | 2000.000000 | 4.00000 | 4.000000 | 1298.250000 | 1776.750000 | 2.000000 | 4.000000 | 2.000000 | 576.000000 | 214000.000000 | 3008.500000 | 3.000000 |
max | 10.000000 | 2010.000000 | 5.00000 | 5.000000 | 6110.000000 | 5642.000000 | 3.000000 | 5.000000 | 4.000000 | 1418.000000 | 755000.000000 | 11752.000000 | 6.000000 |
With the histogram and the principal statistical metrics of the main numerical features, we can tell the big picture of the houses that have been sold:
As we've done before, let's pick a few important categorical features for the price sale and let's try to get some insights of them. This time the criterion I have chosen has been my own knowledge of what characteristics might be important in predicting the value of a house. From the data_description file I've picked these 5 features:
Let's get a detailed view of each one:
Acronym and values:
Let's start doing a histogram to check the main category:
plt.figure(figsize=(4,4))
sns.histplot(df_first_transform["MSZoning"])
<AxesSubplot: xlabel='MSZoning', ylabel='Count'>
Let's see if there are price difference between them using a boxplot:
sns.boxplot(data=df_first_transform, y="MSZoning", x="SalePrice")
<AxesSubplot: xlabel='SalePrice', ylabel='MSZoning'>
As we can see above, the majority of houses sold are in a residential zone with low density. It seems that from more expensive to less expensive we have the following zones: floating village, residential zone with low density, then very similar we have residential zone with medium and high density and finally we have commercial zones. Although the floating village is, in general, more expensive that the low density residential, we can notice than there are quite a lot outliers for high prices in this type of residential.
As we've done previously, let's plot a histogram to check if there are top neighborhood:
df_first_transform["Neighborhood"].hist(figsize=(20, 5), bins=50, xlabelsize=8, ylabelsize=8)
<AxesSubplot: >
plt.figure(figsize=(15,8))
sns.boxplot(data=df_first_transform, y="Neighborhood", x="SalePrice")
<AxesSubplot: xlabel='SalePrice', ylabel='Neighborhood'>
As we could imagine, the neighborhood is a feature with a direct impact on the price sale. As we can see above, the boxplots are very different from one neighborhood to another. The most expensive neighborhood seems to be Northridge Heights (NridHt) and Northridge Heights (StoneBr) and the cheaper ones are Briardale (BrDale), Meadow Village (MeadowV) and Iowa DOT and Rail Road (IDOTRR). The top bought neighborhoods are: North Ames (NAmes) and College Creek (CollgCr).
Acronym and values:
plt.figure(figsize=(6,3))
sns.histplot(df_first_transform["BldgType"])
<AxesSubplot: xlabel='BldgType', ylabel='Count'>
sns.boxplot(data=df_first_transform, y="BldgType", x="SalePrice")
<AxesSubplot: xlabel='SalePrice', ylabel='BldgType'>
The single-family detached home is the type of home most bought and also is the most expensive in general.
Acronym and values:
plt.figure(figsize=(8,3))
sns.histplot(df_first_transform["Foundation"])
<AxesSubplot: xlabel='Foundation', ylabel='Count'>
sns.boxplot(data=df_first_transform, y="Foundation", x="SalePrice")
<AxesSubplot: xlabel='SalePrice', ylabel='Foundation'>
The main type of foundations are Brick & Tile and Cinder Block, the first one is way more expensive.
Let's check if we have recent data sales, otherwise, due the inflation the prices we'll try to predict may be not accurate.
df_first_transform[["YrSold"]].agg(['min','max'])
YrSold | |
---|---|
min | 2006 |
max | 2010 |
The dataset only contains data about 5 years of difference, from 2006 to 2010. There isn't a huge difference in years, so we don't have to delete old data that could affect to our price prediction.
From the previous analysis of more important features correlated to the price, let's group by those features and let's check how different is the price in each group. We're going to create the groups using this features:
To check how different can be the price in each group, we'll see how dispersed the data is in relation to the mean, so we'll use the Standard Desvitation (std) metric.
In the following chunk, you we'll see the data group by the previous features and ordered by the std:
length_bins = 200
bins_total_SF = pd.cut(df_first_transform.TotalSF, np.arange(df_first_transform.TotalSF.min(),df_first_transform.TotalSF.max(),length_bins))
groupby = df_first_transform.groupby(["OverallQual", bins_total_SF, "Total_Bathrooms", "GarageCars", "Neighborhood", "MSZoning", "BldgType", "Foundation"])
groupby_metrics = groupby[["SalePrice"]].describe().sort_values(by=('SalePrice','std'), ascending = False)
groupby_metrics.head(5)
SalePrice | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | ||||||||
OverallQual | TotalSF | Total_Bathrooms | GarageCars | Neighborhood | MSZoning | BldgType | Foundation | ||||||||
8 | (3934, 4134] | 3 | 3 | NridgHt | RL | 1Fam | PConc | 2.0 | 346646.500000 | 132021.785795 | 253293.0 | 299969.75 | 346646.5 | 393323.25 | 440000.0 |
9 | (4534, 4734] | 4 | 3 | NridgHt | RL | 1Fam | PConc | 3.0 | 543914.666667 | 93566.270655 | 437154.0 | 510043.50 | 582933.0 | 597295.00 | 611657.0 |
8 | (3134, 3334] | 4 | 2 | Crawfor | RM | TwnhsE | PConc | 3.0 | 300833.333333 | 81866.252713 | 235000.0 | 255000.00 | 275000.0 | 333750.00 | 392500.0 |
(3934, 4134] | 4 | 3 | NoRidge | RL | 1Fam | PConc | 2.0 | 358500.000000 | 61518.289963 | 315000.0 | 336750.00 | 358500.0 | 380250.00 | 402000.0 | |
7 | (2734, 2934] | 2 | 2 | NWAmes | RL | 1Fam | CBlock | 2.0 | 119750.000000 | 52679.455198 | 82500.0 | 101125.00 | 119750.0 | 138375.00 | 157000.0 |
Let's plot the value of the std (for those non null values) and we'll see how is distributed:
groupby_metrics.reset_index(drop=True, inplace=True)
plt.figure(figsize=(5,5))
sns.histplot(groupby_metrics["SalePrice"]['std'].dropna(), bins=100)
<AxesSubplot: xlabel='std', ylabel='Count'>
As we can see above, for a large number of groups the Standard Desviation is near 0 and then you can see a few large values of the Std. This could suggest that for most of the houses with the same characteristics, the price is consistent and we could try to predict with good results. On the other hand, there are a few others houses that the price is not consistent for the same characteristics, this could due because some outliers. We'll check for outliers in a later section.
In this section we've said many things about this dataset. Let's just summarize all above in the following take aways:
In the previous part of this project, we've had examined a dataset with many features that describe the houses of the Iowa region. We explored and discovered interesting insights about the real state market and how the features of a house can influence about the selling price of the house. Now, we're moving on to the Machine Learning part of this project.
Our main goal here is to find the best model that could accurately predict the selling price of a house, so its market price. Also, in this section we'll clean and transform the data to ease that the data could fit in several models.
Let's check if there are outliers in our data that we should remove before finding a model that fits our data.
First of all, let's plot the histogram of our target feature SalePrice:
plt.figure(figsize=(6,4))
sns.histplot(df_first_transform['SalePrice'], bins=100)
<AxesSubplot: xlabel='SalePrice', ylabel='Count'>
As we can see above, the distribution of the SalePrice feature seems to be a gaussian distribution positively skewed. Let's find out the skew value:
df_first_transform['SalePrice'].skew()
1.8828757597682129
Because the skew value > 0.5, we can numerically confirm the positively skewed distribution. On the other hand, also in the right side of the distribution, seems to have a long tail. That would confirm the presence of outliers in the right side of the distribution. The kurtosis is the statistical value that measure how long or short are the tails of the distribution. Let's find out its value:
df_first_transform['SalePrice'].kurtosis()
6.536281860064529
For values >3 of kurtosis we can say that we have a heavy-tailed distribution (Leptokurtic) an this confirm the presence of outliers. Here for more information. We are going to eliminate the outliers to prevent those affecting the results of the model we are going to build. The method we're going to use is the z-score:
df_remove_outliers = df_first_transform.copy()
# Define a Z-score threshold (the usual one is 3, we'll set it as 4 to delete only the highest outliers)
z_threshold = 4
# Calculate Z-scores for each column
z_scores = df[['SalePrice']].apply(zscore)
# Differenciate outliers from not outliers
outliers_index = (z_scores.abs() > z_threshold)["SalePrice"]
# Delete the outliers from the dataset
df_remove_outliers = df_remove_outliers[~outliers_index]
print("The outliers removed are the following:\n")
df_first_transform[outliers_index].sort_values(by="SalePrice", ascending = False)
The outliers removed are the following:
MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | ... | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | TotalSF | Total_Bathrooms | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
691 | 60 | RL | 104.0 | 21535 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | ... | 0 | 0 | 0 | 1 | 2007 | WD | Normal | 755000 | 6760 | 5 |
1182 | 60 | RL | 160.0 | 15623 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | ... | 0 | 555 | 0 | 7 | 2007 | WD | Abnorml | 745000 | 6872 | 5 |
1169 | 60 | RL | 118.0 | 35760 | Pave | IR1 | Lvl | AllPub | CulDSac | Gtl | ... | 0 | 0 | 0 | 7 | 2006 | WD | Normal | 625000 | 5557 | 5 |
898 | 20 | RL | 100.0 | 12919 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | ... | 0 | 0 | 0 | 3 | 2010 | New | Partial | 611657 | 4694 | 4 |
803 | 60 | RL | 107.0 | 13891 | Pave | Reg | Lvl | AllPub | Inside | Gtl | ... | 192 | 0 | 0 | 1 | 2009 | New | Partial | 582933 | 4556 | 4 |
1046 | 60 | RL | 85.0 | 16056 | Pave | IR1 | Lvl | AllPub | Inside | Gtl | ... | 0 | 0 | 0 | 7 | 2006 | New | Partial | 556581 | 4860 | 4 |
440 | 20 | RL | 105.0 | 15431 | Pave | Reg | Lvl | AllPub | Inside | Gtl | ... | 170 | 0 | 0 | 4 | 2009 | WD | Normal | 555000 | 5496 | 3 |
769 | 60 | RL | 47.0 | 53504 | Pave | IR2 | HLS | AllPub | CulDSac | Mod | ... | 210 | 0 | 0 | 6 | 2010 | WD | Normal | 538000 | 4929 | 5 |
178 | 20 | RL | 63.0 | 17423 | Pave | IR1 | Lvl | AllPub | CulDSac | Gtl | ... | 0 | 0 | 0 | 7 | 2009 | New | Partial | 501837 | 4450 | 3 |
9 rows × 78 columns
Above, the deleted 9 outliers from the right tail, now the distribution of the SalePrice feature is the following:
plt.figure(figsize=(6,4))
sns.histplot(df_remove_outliers['SalePrice'], bins=100)
<AxesSubplot: xlabel='SalePrice', ylabel='Count'>
There are ml algorithms that only work with numerical data, for that reason we need to transform our category features into numerical features. We'll use numerical imputation with ordinal features, where exist an order in the possible values of the feature, for example ratings, size classifications, etc. For nominal variables we'll use one hot encoding. For more info here.
Let's check the nominals features we have:
df_numerical = df_remove_outliers.copy()
df_numerical.select_dtypes('object').columns
Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 'Electrical', 'Functional', 'GarageType', 'SaleType', 'SaleCondition'], dtype='object')
Let's use dummy encoding to encode the rest of categorical features. First, let's input an 'unknown' value for those nan values:
#Input 'unknown' value to the features we want to apply one hot encoding
df_numerical[df_numerical.select_dtypes('object').columns] = df_numerical[df_numerical.select_dtypes('object').columns].fillna('unknown')
#Apply dummy encoding
df_numerical = pd.get_dummies(df_numerical, columns=df_numerical.select_dtypes('object').columns )
df_numerical.head(5)
MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | ExterQual | ExterCond | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 4 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 3 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 4 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 3 | 3 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
4 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 4 | 3 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 231 columns
Let's refresh how much null values we have:
total_null_values = df_numerical.isna().sum().sort_values(ascending = False)
perc_null_values = round(df_numerical.isna().sum().sort_values(ascending = False)/len(df_numerical)*100,2)
total_null_values = total_null_values.to_frame(name="count_nulls")
perc_null_values = perc_null_values.to_frame(name="perc_null")
df_null_values = pd.concat([total_null_values, perc_null_values], axis=1)
df_null_values.head(15)
count_nulls | perc_null | |
---|---|---|
FireplaceQu | 690 | 47.55 |
LotFrontage | 259 | 17.85 |
GarageQual | 81 | 5.58 |
GarageFinish | 81 | 5.58 |
GarageYrBlt | 81 | 5.58 |
GarageCond | 81 | 5.58 |
BsmtFinType2 | 38 | 2.62 |
BsmtExposure | 38 | 2.62 |
BsmtQual | 37 | 2.55 |
BsmtFinType1 | 37 | 2.55 |
BsmtCond | 37 | 2.55 |
MasVnrArea | 8 | 0.55 |
Exterior1st_VinylSd | 0 | 0.00 |
Exterior1st_CBlock | 0 | 0.00 |
Exterior1st_BrkFace | 0 | 0.00 |
We have 12 features with null values. Because there are libraries like sckitlearn that don't allow null values, let's input a value to this nulls. This can be approached by simple methods. For categorical features, we can input the mode and for numerical features, the mean or the median. This time we'll use a ML approach using K Nearest Neighbours.
from sklearn.impute import KNNImputer
df_inputed_nulls = df_numerical.copy()
knn_imputer = KNNImputer()
X = knn_imputer.fit_transform(df_inputed_nulls)
df_inputed_nulls = pd.DataFrame(X, columns = df_inputed_nulls.columns)
df_inputed_nulls.head()
MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | ExterQual | ExterCond | ... | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60.0 | 65.0 | 8450.0 | 7.0 | 5.0 | 2003.0 | 2003.0 | 196.0 | 4.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 20.0 | 80.0 | 9600.0 | 6.0 | 8.0 | 1976.0 | 1976.0 | 0.0 | 3.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 60.0 | 68.0 | 11250.0 | 7.0 | 5.0 | 2001.0 | 2002.0 | 162.0 | 4.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 70.0 | 60.0 | 9550.0 | 7.0 | 5.0 | 1915.0 | 1970.0 | 0.0 | 3.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 60.0 | 84.0 | 14260.0 | 8.0 | 5.0 | 2000.0 | 2000.0 | 350.0 | 4.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 231 columns
#Check we don't have any null in the dataframe
df_inputed_nulls.isnull().values.any()
False
There's no nulls left and all features are numerical, so we could use this dataset in the case that the model/library we'll try on demands a numerical and non null data.
df_numerical_no_nulls = df_inputed_nulls.copy()
X_numerical_no_nulls = df_numerical_no_nulls.dropna().drop(columns=['SalePrice']).values
y_numerical_no_nulls = df_numerical_no_nulls.dropna()['SalePrice'].values
X_train, X_test, y_train, y_test = train_test_split(X_numerical_no_nulls, y_numerical_no_nulls, test_size=0.3, random_state=1)
df_results = pd.DataFrame() # Here we'll store the performance metrics of each model
Next, you can check how many instances and features we have on our training data:
X_train.shape
(1015, 230)
It's important to check if train and test data are similar in order to guarantee that we can rely on our metrics performance in the test and training set. The goal in every ML problem is being able to generalize with good results the model we've trained with unseen data (the test set). That's why we have to check that training and test data are similar, because otherwise if they are different we can't expect them to have good and similar metrics performance, even the training data has good performance metrics.
First of all, let's compare that in both datasets we have the same proportion of the positive target class:
abs(y_train.mean()-y_test.mean())
712.9703597415064
As we can see above, the difference between the two mean prices are very low (below 1000$), so we can be sure that for the target feature, the training and the test dataset are similar.
Secondly, compare all the principal statistic metrics for the other features:
df_X_train = pd.DataFrame(data=X_train, columns=df_numerical_no_nulls.columns.drop('SalePrice'))
df_X_test = pd.DataFrame(data=X_test, columns=df_numerical_no_nulls.columns.drop('SalePrice'))
describe_X_train = df_X_train.describe().T
describe_X_test = df_X_test.describe().T
abs(describe_X_train.sub(describe_X_test).iloc[:,1:]).sort_values(by='mean', ascending = False).head(25) # Substract the metrics for each feature
mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|
LotArea | 94.921173 | 613.270709 | 191.0 | 59.50 | 383.0 | 427.50 | 56245.0 |
GrLivArea | 15.746138 | 15.273485 | 146.0 | 14.00 | 33.5 | 44.50 | 966.0 |
BsmtFinSF1 | 14.476667 | 12.376460 | 0.0 | 0.00 | 26.0 | 26.25 | 3384.0 |
MasVnrArea | 13.198321 | 12.592640 | 0.0 | 0.00 | 0.0 | 9.25 | 471.0 |
TotalBsmtSF | 10.667422 | 4.286352 | 0.0 | 9.00 | 8.5 | 63.75 | 2910.0 |
2ndFlrSF | 10.517804 | 7.238467 | 0.0 | 0.00 | 0.0 | 8.50 | 207.0 |
GarageArea | 9.861420 | 7.257330 | 0.0 | 9.50 | 8.5 | 11.75 | 28.0 |
MiscVal | 5.820423 | 62.829722 | 0.0 | 0.00 | 0.0 | 0.00 | 7200.0 |
EnclosedPorch | 5.602278 | 5.416604 | 0.0 | 0.00 | 0.0 | 0.00 | 259.0 |
TotalSF | 5.078716 | 35.829846 | 386.0 | 42.25 | 20.0 | 13.50 | 3938.0 |
ScreenPorch | 3.913470 | 7.763547 | 0.0 | 0.00 | 0.0 | 0.00 | 40.0 |
PoolArea | 3.421675 | 44.923181 | 0.0 | 0.00 | 0.0 | 0.00 | 738.0 |
LowQualFinSF | 3.357545 | 17.080358 | 0.0 | 0.00 | 0.0 | 0.00 | 59.0 |
MSSubClass | 2.606228 | 1.062093 | 0.0 | 0.00 | 0.0 | 10.00 | 0.0 |
WoodDeckSF | 2.571338 | 9.430302 | 0.0 | 0.00 | 0.0 | 0.00 | 121.0 |
BsmtFinSF2 | 2.290001 | 1.712194 | 0.0 | 0.00 | 0.0 | 0.00 | 347.0 |
3SsnPorch | 2.213841 | 10.665330 | 0.0 | 0.00 | 0.0 | 0.00 | 101.0 |
LotFrontage | 2.063559 | 3.711030 | 0.0 | 0.10 | 0.3 | 1.10 | 160.0 |
1stFlrSF | 1.870789 | 10.598426 | 146.0 | 11.00 | 22.0 | 8.25 | 1464.0 |
BsmtUnfSF | 1.519243 | 4.222609 | 0.0 | 18.25 | 9.5 | 21.00 | 183.0 |
OpenPorchSF | 1.378345 | 11.221274 | 0.0 | 0.00 | 2.5 | 1.25 | 129.0 |
YearRemodAdd | 0.868636 | 0.670508 | 0.0 | 2.00 | 1.0 | 1.00 | 1.0 |
YearBuilt | 0.620513 | 1.071262 | 8.0 | 2.00 | 0.0 | 0.00 | 1.0 |
MoSold | 0.162196 | 0.058724 | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
TotRmsAbvGrd | 0.084740 | 0.010956 | 1.0 | 0.00 | 0.0 | 0.00 | 2.0 |
The difference between the principals statistic metrics of training and test dataset are low or near or equally 0, so we can be sure that the training and the test dataset are similar enough to generalize the results of the model in unseen data.
For this regression problem we'll be using these following two performance metrics:
We'll be using both metrics to check how the trained models are performing.
# Define error metrics
def RMSE(y, y_pred):
return np.sqrt(mean_squared_error(y, y_pred))
In this section, we'll try to find a model able to predict the house price sale based on all the features. To do so, we'll use a hyperparameter tuning and a cross validation.
We'll use 10 folds for cross validation and we'll use the GridSearchCV for the hyperparameter tunning.
# Setup cross validation folds
kf = KFold(n_splits=10, random_state=42, shuffle=True)
Support vector machine model is commonly used for classification problems, but also can be used for regression and its called SVR. The main idea of SVR is to fit our data with a hyperplane. It could be very possible that our data is not linear, so SVM can also transform the data into a greater dimension (with a kernel function) in order to find a hyperplane that is able to fit data in this new dimension.
Similar to SVM for classification, to find this hyperplane this model uses support vectors, which are those instances that fall in an epsilon "tube". This epsilon-tube (which is a parameter) measures how much we can tolerate the errors in the prediction, because for the instances that fall in this tube there is no error, for that reason, it's also called epsilon-insensitive tube (this tube can be more or less width depending on the problem error tolerance). The model is trained to minimize the differences between the actual and predicted value within the specified epsilon-insensitive tube
Another important parameter for this model is the 'C' parameter, which is a regularization parameter that controls the trade-off between training error and model complexity: for a smaller 'C' we can get more training error, but we generalise better, for greater 'C' we get the opposite.
This model is effective when we have a high number of features, that is because the curse of dimensionality works in our favor, because when the dimensionality grows also grows the sparsity and the distance between our data, so it could be easier to find a hyperplane that fit our data.
Support vector machine works better with standardized data, that's why we'll create a pipeline that includes the standardization of the data as a first step of a grid search hiperparameter tunning.
%%time
pipeline_svm = Pipeline([ ( "scaler" , StandardScaler()),
("svc",SVC())])
param_grid_svc = {'svc__C': [1,10,100], #Regularization parameter.
'svc__gamma': [0.01,0.001,0.0001,0.00005], #Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’
'svc__kernel': ['rbf', 'poly', 'sigmoid','linear']}
grid_pipeline_svc = GridSearchCV(pipeline_svm,param_grid_svc, n_jobs=-1, cv = kf, verbose = 10)
grid_pipeline_svc.fit(X_train,y_train)
print("Best params:\n")
print(grid_pipeline_svc.best_params_)
Fitting 10 folds for each of 48 candidates, totalling 480 fits Best params: {'svc__C': 1, 'svc__gamma': 0.01, 'svc__kernel': 'poly'} CPU times: total: 4.33 s Wall time: 2min 53s
pred_grid_pipeline_svc = grid_pipeline_svc.predict(X_test)
svc_df = pd.DataFrame(data=[RMSE(y_test, pred_grid_pipeline_svc),r2_score(y_test, pred_grid_pipeline_svc)],
columns=['SVC Score'],
index=["RMSE", "R2"])
print(svc_df)
df_results = pd.concat([df_results,svc_df.T])
SVC Score RMSE 55511.432899 R2 0.452671
Decision forests are a type of model made of multiple decision trees. There are different ways to ensemble multiple decision trees and there are two types of decision forest:
In tree algorithms, the value of the feature does not matter, but the order of the values, that's why in these models the standardization of the data is not necessary and in this section we won't standardize the data.
Finally, these models can be used for classification and regression and they are robust to noisy data, have interpretable properties and are well suited for training on small datasets, or on datasets where the ratio of number of features / number of examples is high.
Here for more information about trees and decision forests.
params_grid_xgb = {
'n_estimators' : [550,600,650,700],
'max_depth': [2,3,4,5,6,7]}
xgb = XGBRegressor(n_jobs= -1)
%%time
grid_xgb = GridSearchCV(xgb, params_grid_xgb, n_jobs=-1, cv=kf, verbose=1)
grid_xgb.fit(X_train, y_train)
print("Best params:\n")
print(grid_xgb.best_params_)
Fitting 10 folds for each of 24 candidates, totalling 240 fits Best params: {'max_depth': 5, 'n_estimators': 600} CPU times: total: 12.6 s Wall time: 2min 37s
pred_grid_pipeline_xgb = grid_xgb.predict(X_test)
xgb_df = pd.DataFrame(data=[RMSE(y_test, pred_grid_pipeline_xgb),r2_score(y_test, pred_grid_pipeline_xgb)],
columns=['XGB Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(xgb_df)
df_results = pd.concat([df_results,xgb_df.T])
Performance metrics: XGB Score RMSE 22903.288331 R2 0.906829
params_grid_lightgmb = {
'boosting_type':['gbdt','dart','rf'],
'max_depth':[4,5,6,7,8,9],
'num_leaves':[8,10,11,12,13,14]}
lightgbm = LGBMRegressor(n_jobs= -1,verbose=0)
%%time
grid_lightgmb = GridSearchCV(lightgbm, params_grid_lightgmb, n_jobs=-1, cv=kf, verbose=0)
grid_lightgmb.fit(X_train, y_train)
print("Best params:\n")
print(grid_lightgmb.best_params_)
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf [LightGBM] [Warning] No further splits with positive gain, best gain: -inf Best params: {'boosting_type': 'gbdt', 'max_depth': 8, 'num_leaves': 11} CPU times: total: 6.62 s Wall time: 1min 21s
pred_grid_pipeline_lightgmb = grid_lightgmb.predict(X_test)
lightgmb_df = pd.DataFrame(data=[RMSE(y_test, pred_grid_pipeline_lightgmb),r2_score(y_test, pred_grid_pipeline_lightgmb)],
columns=['Lightgmb Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(lightgmb_df)
df_results = pd.concat([df_results,lightgmb_df.T])
Performance metrics: Lightgmb Score RMSE 21965.389512 R2 0.914304
This model allows us to train with categorical and missing data, so for this model we'll use the version of our dataset prior to transforming out categorical features and inputting them a numerical value to the nulls:
df_modeling_categorical = df_remove_outliers.copy()
categorical_columns = df_modeling_categorical.select_dtypes(include=['object']).columns.tolist()
df_modeling_categorical[categorical_columns] = df_modeling_categorical[categorical_columns].fillna('NaN')
cat_indexes = df_modeling_categorical.columns.get_indexer(categorical_columns)
X_cat = df_modeling_categorical.drop(columns=['SalePrice']).values
y_cat = df_modeling_categorical['SalePrice'].values
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(X_cat, y_cat, test_size=0.3, random_state=81)
params_catboost = {
'iterations': [250,300,350,400]
}
catBoost = CatBoostRegressor(random_state=42, verbose = 0, cat_features=cat_indexes)
%%time
grid_catBoost = GridSearchCV(catBoost, params_catboost, n_jobs=-1, cv=kf, verbose=1)
grid_catBoost.fit(X_train_cat, y_train_cat)
print("Best params:\n")
print(grid_catBoost.best_params_)
Fitting 10 folds for each of 4 candidates, totalling 40 fits Best params: {'iterations': 350} CPU times: total: 45.2 s Wall time: 4min 21s
pred_grid_pipeline_catBoost = grid_catBoost.predict(X_test_cat)
catBoost_df = pd.DataFrame(data=[RMSE(y_test_cat, pred_grid_pipeline_catBoost),r2_score(y_test_cat, pred_grid_pipeline_catBoost)],
columns=['catBoost Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(catBoost_df)
df_results = pd.concat([df_results,catBoost_df.T])
Performance metrics: catBoost Score RMSE 19111.318220 R2 0.919822
params_random_forest = {
'n_estimators': [100,120,200],
'max_depth': [15,17,19],
'min_samples_leaf': [2,3,4,5,6]
}
rfc = RandomForestRegressor(n_jobs=-1)
%%time
grid_random_forest = GridSearchCV(rfc, params_random_forest, n_jobs=-1, cv=kf, verbose=1)
grid_random_forest.fit(X_train, y_train)
print("Best params:\n")
print(grid_random_forest.best_params_)
Fitting 10 folds for each of 45 candidates, totalling 450 fits Best params: {'max_depth': 17, 'min_samples_leaf': 4, 'n_estimators': 100} CPU times: total: 10 s Wall time: 10min 7s
pred_grid_pipeline_random_forest = grid_random_forest.predict(X_test)
random_forest_df = pd.DataFrame(data=[RMSE(y_test, pred_grid_pipeline_random_forest),r2_score(y_test, pred_grid_pipeline_random_forest)],
columns=['Random Forest Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(random_forest_df)
df_results = pd.concat([df_results,random_forest_df.T])
Performance metrics: Random Forest Score RMSE 25188.925280 R2 0.887305
The simple idea behind any linear regression, is trying to fit the data with a hyperplane. In linear regression, we can expect overfitting, especially when the number of observations is low respect the number of features. In that case, exists linear regressions that add regularization terms to the fit in order to make the model simpler and deal with the overfitting. Make the model simpler, is done by penalizing the values of the weight of each feature. Let's see which types of penalizations and linear regressions we'll deal in this project:
Here for more information about L2 regularization. Here for more information about L1 regularization.
Let's try to fit our data to each of these linear regressions:
%%time
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
CPU times: total: 219 ms Wall time: 132 ms
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
pred_linear_model = linear_model.predict(X_test)
linear_model_df = pd.DataFrame(data=[RMSE(y_test, pred_linear_model),r2_score(y_test, pred_linear_model)],
columns=['linear model Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(linear_model_df)
df_results = pd.concat([df_results,linear_model_df.T])
Performance metrics: linear model Score RMSE 30611.436295 R2 0.833563
%%time
pipeline_ridge = Pipeline([( "scaler" , StandardScaler()),
("ridge",Ridge())])
ridge_params = {'ridge__alpha': [0.1, 1, 10,50,100,150,200,250,350,400,450]}
ridge_grid = GridSearchCV(pipeline_ridge, ridge_params, cv=kf)
ridge_grid.fit(X_train, y_train)
print("Best params:\n")
print(ridge_grid.best_params_)
Best params: {'ridge__alpha': 250} CPU times: total: 6.97 s Wall time: 2.97 s
pred_ridge = ridge_grid.predict(X_test)
ridge_df = pd.DataFrame(data=[RMSE(y_test, pred_ridge),r2_score(y_test, pred_ridge)],
columns=['Ridge Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(ridge_df)
df_results = pd.concat([df_results,ridge_df.T])
Performance metrics: Ridge Score RMSE 26606.641233 R2 0.874263
%%time
pipeline_lasso = Pipeline([
("lasso",Lasso())])
lasso_params = {'lasso__alpha': [0.1, 1, 10,100,150,200,250,300]}
lasso_grid = GridSearchCV(pipeline_lasso, lasso_params, cv=kf)
lasso_grid.fit(X_train, y_train)
print("Best params:\n")
print(lasso_grid.best_params_)
Best params: {'lasso__alpha': 100} CPU times: total: 53.8 s Wall time: 13.9 s
pred_lasso = lasso_grid.predict(X_test)
lasso_df = pd.DataFrame(data=[RMSE(y_test, pred_lasso),r2_score(y_test, pred_lasso)],
columns=['Lasso Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(lasso_df)
df_results = pd.concat([df_results,lasso_df.T])
Performance metrics: Lasso Score RMSE 26290.517656 R2 0.877233
%%time
pipeline_elastic = Pipeline([( "scaler" , StandardScaler()),
("elastic",ElasticNet())])
elasticnet_params = {'elastic__alpha': [0.1, 1, 10], 'elastic__l1_ratio': [0.01, 0.1, 0.5, 0.9]}
elasticnet_grid = GridSearchCV(pipeline_elastic, elasticnet_params, cv=kf)
elasticnet_grid.fit(X_train, y_train)
print("Best params:\n")
print(elasticnet_grid.best_params_)
Best params: {'elastic__alpha': 1, 'elastic__l1_ratio': 0.5} CPU times: total: 22.7 s Wall time: 5.7 s
pred_elasticnet = elasticnet_grid.predict(X_test)
elasticnet_df = pd.DataFrame(data=[RMSE(y_test, pred_elasticnet),r2_score(y_test, pred_elasticnet)],
columns=['elasticnet Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(elasticnet_df)
df_results = pd.concat([df_results,elasticnet_df.T])
Performance metrics: elasticnet Score RMSE 26519.716080 R2 0.875083
This model can be used for both classification and regression problems. The main idea of this model is to find the closest 'k' neighbours of an instance (here you can considerate different metrics distance and different values of 'k'). For classification problems, the output of the instance would be the class with more frequency in their k-neighbours. For regression problems, the output would be the mean of their k-neighbours.
This model is easy and intuitive and it can capture non linearity in a simple way. Despite of that, this model is sensitive to outliers and is not very efficient when dealing with large datasets. Finally, because of the curse of dimensionality, when greater the number of features, the distance between instances is also greater and less meaningful.
Finally, in Knn is also important to scale the features. Because the algorithm calculates the distances between instances, if the features we're dealing have very different scale, this could imply that the features with larger scales contribute in larger distances between instances. On the other hand, features with shorter scales contribute in shorter distances. That's why we should our data in the same (standard) scale, in order that all features contribute with the same weight to the distance between instances.
Here for more information.
grid_param_knn = {
'knn__n_neighbors':[5,8,9,10,11,12,15,20,25,30,35,40,45,50],
'knn__weights': ['uniform', 'distance']}
knn = KNeighborsRegressor(n_jobs=-1)
pipeline_knn = Pipeline([( "scaler" , StandardScaler()),
("knn",knn)])
%%time
grid_knn = GridSearchCV(pipeline_knn, grid_param_knn, n_jobs=-1, cv=kf, verbose=1)
grid_knn.fit(X_train, y_train)
print("Best params:\n")
print(grid_knn.best_params_)
Fitting 10 folds for each of 28 candidates, totalling 280 fits Best params: {'knn__n_neighbors': 10, 'knn__weights': 'distance'} CPU times: total: 2.23 s Wall time: 9.66 s
pred_grid_pipeline_knn = grid_knn.predict(X_test)
knn_df = pd.DataFrame(data=[RMSE(y_test, pred_grid_pipeline_knn),r2_score(y_test, pred_grid_pipeline_knn)],
columns=['Knn Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(knn_df)
df_results = pd.concat([df_results,knn_df.T])
Performance metrics: Knn Score RMSE 36966.075330 R2 0.757289
A neural network is a model that is able to learn non linear data itself, without the necessity of having to specify manually the non-linearity. This model is represented by different layers of neurons connected to each other, which try to imitate the neurons of our brain. Neural networks comprise an input layer, one or more hidden layers, and an output layer. Each layer contains neurons with connections having associated weights. During the training, the model adjusts the weights between the neurons. The non-linearity comes from an activation function in each one of the neurons, this activation function is a non-linear function and some examples of that are the Relu, Sigmoid, Tanh, etc.
In this case, we won't use the package sklearn form training the neural network, instead we'll use a more extended library as keras. For the hyperparameter tunning will use a method of the keras library, which the hyperparameters to test are defined meanwhile you're creating the NN model. As in the sklearn library, the search of hyperparameters can be performed in several ways: can be performed all the combinations (Gridsearch), can be performed by a random pick of combinations (RandomSearch). Also, there are other methods that improve the efficiency, like HyperBand or Bayesian Optimization. Hyperband tries to improve the RandomSearch by picking randomly hyperparameters using just a few epochs of training, and then it does a full training with the best combinations. On the other hand, in Bayesian Optimization, only the first combinations of hyperparameters are picked randomly, then the algorithm, based on the best results, chooses the next best hyperparameters combinations. In this notebook we'll be using the Hyperband approach with keras.
Here for more information.
%%time
num_features = X_train.shape[1]
# Define the TensorFlow model for regression with optimizer, activation function,
# number or neurons per layer and learning rate as a hyperparameters
def build_model(hp):
model = tf.keras.Sequential()
#Input Layer
model.add(tf.keras.layers.InputLayer(input_shape=(num_features,)))
# Hidden Layers
# First hidden layer
activation_choice_layer1 = hp.Choice('activation_layer1', values=['relu', 'sigmoid', 'selu','tanh', 'softsign','softplus','softmax'])
model.add(tf.keras.layers.Dense(units=hp.Int('units_hidden_layer1', min_value=8, max_value=256, step=16),
activation=activation_choice_layer1))
# Second hidden layer
activation_choice_layer2 = hp.Choice('activation_layer2', values=['relu', 'sigmoid', 'selu','tanh','softsign','softplus','softmax'])
model.add(tf.keras.layers.Dense(units=hp.Int('units_hidden_layer2', min_value=8, max_value=256, step=16),
activation=activation_choice_layer2))
# Add output layer
model.add(tf.keras.layers.Dense(1))
# Choice optimizer
optimizer_choice = hp.Choice('optimizer', values=['adam', 'sgd', 'rmsprop'])
if optimizer_choice == 'adam':
optimizer = tf.keras.optimizers.Adam(learning_rate=hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log'))
elif optimizer_choice == 'sgd':
optimizer = tf.keras.optimizers.SGD(learning_rate=hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log'))
else:
optimizer = tf.keras.optimizers.RMSprop(learning_rate=hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log'))
# Create the NN
model.compile(optimizer=optimizer, loss='mean_squared_error')
return model
# Hyperparameter tunning
tuner = Hyperband(
build_model,
objective='val_loss',
executions_per_trial=1,
directory='test19',
seed = 123
)
# Perform hyperparameter tuning
tuner.search(X_train, y_train, epochs=50,validation_data=(X_test, y_test))
Reloading Tuner from test19\untitled_project\tuner0.json CPU times: total: 1.56 s Wall time: 2.42 s
best_hyperparameters = tuner.get_best_hyperparameters(1)[0]
print("Best hyperparameters are:\n")
print(best_hyperparameters.values)
print("\n")
best_model = tuner.get_best_models(num_models=1)
print("Summary best model:\n")
print(best_model[0].summary())
print("\n")
Best hyperparameters are: {'activation_layer1': 'selu', 'units_hidden_layer1': 184, 'activation_layer2': 'selu', 'units_hidden_layer2': 72, 'optimizer': 'rmsprop', 'learning_rate': 0.0008204040606605068, 'tuner/epochs': 100, 'tuner/initial_epoch': 34, 'tuner/bracket': 1, 'tuner/round': 1, 'tuner/trial_id': '0197'} WARNING:tensorflow:From C:\Users\rojol\anaconda3\Lib\site-packages\keras\src\backend.py:277: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead. WARNING:tensorflow:From C:\Users\rojol\anaconda3\Lib\site-packages\keras\src\saving\legacy\save.py:538: The name tf.train.NewCheckpointReader is deprecated. Please use tf.compat.v1.train.NewCheckpointReader instead. Summary best model: Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 184) 42504 dense_1 (Dense) (None, 72) 13320 dense_2 (Dense) (None, 1) 73 ================================================================= Total params: 55897 (218.35 KB) Trainable params: 55897 (218.35 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________ None
%%time
# Tune the model with the best hyperparameters
tf.random.set_seed(123) ## to get initial same weights
model = tuner.hypermodel.build(best_hyperparameters)
print("Model summary:\n")
print(model.summary())
history = model.fit(X_train, y_train, epochs=500, validation_data=(X_test, y_test), verbose=0)
print("Model evaluation:\n")
print(model.evaluate(X_test, y_test, verbose = 0))
Model summary: Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_3 (Dense) (None, 184) 42504 dense_4 (Dense) (None, 72) 13320 dense_5 (Dense) (None, 1) 73 ================================================================= Total params: 55897 (218.35 KB) Trainable params: 55897 (218.35 KB) Non-trainable params: 0 (0.00 Byte) _________________________________________________________________ None WARNING:tensorflow:From C:\Users\rojol\anaconda3\Lib\site-packages\keras\src\utils\tf_utils.py:492: The name tf.ragged.RaggedTensorValue is deprecated. Please use tf.compat.v1.ragged.RaggedTensorValue instead. Model evaluation: 996563136.0 CPU times: total: 1min 27s Wall time: 1min 5s
y_pred = model.predict(X_test)
nn_df = pd.DataFrame(data=[RMSE(y_test, y_pred),r2_score(y_test, y_pred)],
columns=['nn Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(nn_df)
df_results = pd.concat([df_results,nn_df.T])
14/14 [==============================] - 0s 2ms/step Performance metrics: nn Score RMSE 31568.388096 R2 0.822994
As you can see in the next cell, there are all the performance results for each model we've tried to fit to our data. Our main metrics have been the RMSE and R2, so we'll take the Top 3 performing models taking account their values. For this models, we'll try to increase the model performance using the PCA technique, because we're dealing with a high number of features.
df_results.sort_values(by='R2', ascending=False)
RMSE | R2 | |
---|---|---|
catBoost Score | 19111.318220 | 0.919822 |
Lightgmb Score | 21965.389512 | 0.914304 |
XGB Score | 22903.288331 | 0.906829 |
Random Forest Score | 25188.925280 | 0.887305 |
Lasso Score | 26290.517656 | 0.877233 |
elasticnet Score | 26519.716080 | 0.875083 |
Ridge Score | 26606.641233 | 0.874263 |
linear model Score | 30611.436295 | 0.833563 |
nn Score | 31568.388096 | 0.822994 |
Knn Score | 36966.075330 | 0.757289 |
SVC Score | 55511.432899 | 0.452671 |
Principal Components Analysis (PCA) is a method of reduction dimensionality. Shouldn't be understood as a feature selection method because the result of PCA is a linear combination (principal components) of the original features. The principal idea of this is getting the linear combinations with the maximum variability of the original features, with the first principal component explaining the most variance, the second explaining the second most, and so on. This is done by transforming the original problem in finding the eigen vector and eigen values of the covariance matrix, because the eigen vectors are the components and the eigen values are the weight of how these vectors explain the variability of the original features. Is useful to use PCA especially when we're dealing with highly correlated features.
An important parameter for PCA is how many principal components we pick. This can be done by using cross-validation, but also there are several rules of thumb to know in advance which could be a good number.
In this problem, we have a large number of features, so let's try to improve the performance of the top models by reducing the overfitting using PCA.
In this section, we'll take a look to two methods in order to know with how many principal components we should pick in our cross validation process in order to reduce as much as possible the dimensionality of the model and still having high variability explained with these principal components.
The methods will use are: observing the ratio of explained variance and the Elbow method. Here for more information
With this method, we can see how much variance is explained by adding the principal components one by one:
#Apply PCA method to the training data
pca = PCA()
pca.fit(X_train)
# EXPLAINED VARIANCE
# Plot cumulative explained variance against number of components
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Zoom Number of Components')
plt.ylabel('Zoom Cumulative Explained Variance')
plt.xlim(0,15)
plt.grid(True)
plt.show()
As we can see in the previous plot, with 10 components we can explain almost the 100% of the data, and we get to reduce the dimensionality from 230 features to 10 components.
Another method is to plot (sorted decreasing) the eigen values of each component, i.e., how much variability explains each component. To know the ideal number of components we should look for an "elbow" in the plot of the eigenvalues.
# ELBOW METHOD Plot explained variance against number of components
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_)
plt.xlabel('Number of Components')
plt.ylabel('Eigen values')
plt.grid(True)
plt.xlim(0,10)
plt.show()
As we can see above, the "elbow" is when the number of components is 2.
For the two previous methods, we can guess that the best number of components in between 2 and 10, so we'll be using this range of values for the grid search CV:
param_grid_pca = {'pca__n_components': [2,3,4,5,6,7,8,9,10,11]}
Let's see how decreasing the number of inputs in the model (keeping those components who explain more variance in the data) will affect in the performance of the top previous models.
CatBoost has been the top performing algorithm, so let's try this one first. For this testing, because PCA needs numerical features, we won't able to do the test with the same dataset (with categorical features) that we've used before for CatBoost, so we will use the full numerical dataset that we've used in all others models.
%%time
pipeline_catBoost_pca = Pipeline([
('scaler', StandardScaler()),
('pca', PCA()),
('catBoost',CatBoostRegressor(verbose = 0))])
params_catBoost_pca = {'catBoost__iterations': [400,450,500]} | param_grid_pca # union the params for each step in a unique set.
catBoost_pca_grid = GridSearchCV(pipeline_catBoost_pca, params_catBoost_pca, cv=kf, n_jobs = -1)
catBoost_pca_grid.fit(X_train, y_train)
# Get the best parameters value
print("Best params are:\n")
print(catBoost_pca_grid.best_params_)
Best params are: {'catBoost__iterations': 500, 'pca__n_components': 10} CPU times: total: 10.8 s Wall time: 3min 15s
pred_grid_pipeline_catBoost_pca = catBoost_pca_grid.predict(X_test)
catBoost_pca_df = pd.DataFrame(data=[RMSE(y_test, pred_grid_pipeline_catBoost_pca),r2_score(y_test, pred_grid_pipeline_catBoost_pca)],
columns=['catBoost Pca Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(catBoost_pca_df)
df_results = pd.concat([df_results,catBoost_pca_df.T])
Performance metrics: catBoost Pca Score RMSE 25335.329688 R2 0.885992
%%time
pipeline_lightgbm_pca = Pipeline([
('scaler', StandardScaler()),
('pca', PCA()),
('lightgbm',LGBMRegressor(n_jobs= -1,verbose=0))])
params_lightgbm_pca = {
'lightgbm__boosting_type':['gbdt','dart','rf'],
'lightgbm__max_depth':[4,5,6],
'lightgbm__num_leaves':[8,9,10]} | param_grid_pca # union the params for each step in a unique set.
lightgbm_pca_grid = GridSearchCV(pipeline_lightgbm_pca, params_lightgbm_pca, cv=kf, n_jobs = -1)
lightgbm_pca_grid.fit(X_train, y_train)
# Get the best parameters value
print("Best params are:\n")
print(lightgbm_pca_grid.best_params_)
Best params are: {'lightgbm__boosting_type': 'gbdt', 'lightgbm__max_depth': 6, 'lightgbm__num_leaves': 8, 'pca__n_components': 7} CPU times: total: 19.7 s Wall time: 2min 23s
pred_grid_pipeline_lightgbm_pca = lightgbm_pca_grid.predict(X_test)
lightgbm_pca_df = pd.DataFrame(data=[RMSE(y_test, pred_grid_pipeline_lightgbm_pca),r2_score(y_test, pred_grid_pipeline_lightgbm_pca)],
columns=['Lightgbm Pca Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(lightgbm_pca_df)
df_results = pd.concat([df_results,lightgbm_pca_df.T])
Performance metrics: Lightgbm Pca Score RMSE 26499.870835 R2 0.875270
%%time
pipeline_xgb_pca = Pipeline([
('scaler', StandardScaler()),
('pca', PCA()),
('xgb',XGBRegressor(n_jobs= -1,verbose=0))])
params_xgb_pca = {
'xgb__n_estimators' : [600,650,700],
'xgb__max_depth': [-1,2,3,4,5]} | param_grid_pca # union the params for each step in a unique set.
xgb_pca_grid = GridSearchCV(pipeline_xgb_pca, params_xgb_pca, cv=kf, n_jobs = -1)
xgb_pca_grid.fit(X_train, y_train)
# Get the best parameters value
print("Best params are:\n")
print(xgb_pca_grid.best_params_)
Best params are: {'pca__n_components': 10, 'xgb__max_depth': 5, 'xgb__n_estimators': 600} CPU times: total: 21.1 s Wall time: 4min 3s
pred_grid_pipeline_xgb_pca = xgb_pca_grid.predict(X_test)
Xgb_pca_df = pd.DataFrame(data=[RMSE(y_test, pred_grid_pipeline_xgb_pca),r2_score(y_test, pred_grid_pipeline_xgb_pca)],
columns=['Xgb Pca Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(Xgb_pca_df)
df_results = pd.concat([df_results,Xgb_pca_df.T])
Performance metrics: Xgb Pca Score RMSE 29708.378194 R2 0.843238
%%time
pipeline_lasso_pca = Pipeline([
('scaler', StandardScaler()),
('pca', PCA()),
('lasso',Lasso())])
params_lasso_pca = lasso_params | param_grid_pca # union the params for each step in a unique set.
lasso_pca_grid = GridSearchCV(pipeline_lasso_pca, params_lasso_pca, cv=kf)
lasso_pca_grid.fit(X_train, y_train)
# Get the best parameters value
print("Best params are:\n")
print(lasso_pca_grid.best_params_)
Best params are: {'lasso__alpha': 150, 'pca__n_components': 10} CPU times: total: 1min 28s Wall time: 24.6 s
pred_lasso_pca = lasso_pca_grid.predict(X_test)
lasso_pca_df = pd.DataFrame(data=[RMSE(y_test, pred_lasso_pca),r2_score(y_test, pred_lasso_pca)],
columns=['Lasso Pca model Score'],
index=["RMSE", "R2"])
print("Performance metrics:\n")
print(lasso_pca_df)
df_results = pd.concat([df_results,lasso_pca_df.T])
Performance metrics: Lasso Pca model Score RMSE 30708.029664 R2 0.832511
As you can see in the next cell, using the PCA method has had little effect on the performance and the best model is still the CatBoost.
df_results.sort_values(by='R2', ascending=False)
RMSE | R2 | |
---|---|---|
catBoost Score | 19111.318220 | 0.919822 |
Lightgmb Score | 21965.389512 | 0.914304 |
XGB Score | 22903.288331 | 0.906829 |
Random Forest Score | 25188.925280 | 0.887305 |
catBoost Pca Score | 25335.329688 | 0.885992 |
Lasso Score | 26290.517656 | 0.877233 |
Lightgbm Pca Score | 26499.870835 | 0.875270 |
elasticnet Score | 26519.716080 | 0.875083 |
Ridge Score | 26606.641233 | 0.874263 |
Xgb Pca Score | 29708.378194 | 0.843238 |
linear model Score | 30611.436295 | 0.833563 |
Lasso Pca model Score | 30708.029664 | 0.832511 |
nn Score | 31568.388096 | 0.822994 |
Knn Score | 36966.075330 | 0.757289 |
SVC Score | 55511.432899 | 0.452671 |
Finally, the R2 of our best model is above 0.92, so we could say that our model performs well. Also, this value of R2 means that our model explains the 92% of the variability of the data. Its value of RMSE tells us that the average difference between the predicted values from the model and the actual values in the dataset is around 19K$.
Now we have find a good model, we'll save it in order to use it in the future:
filename = "regressor_sale_house_price_catBoost.pickle"
pickle.dump(grid_catBoost, open(filename, "wb"))
Let's find out which features the model considers more important in order to predict the price of a house:
# Access feature importances
importances = grid_catBoost.best_estimator_.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': df_modeling_categorical.columns.drop("SalePrice"), 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=True)
plt.figure(figsize=(20, 15))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
As we can see above, the most important feature is TotalSF, i.e., the total surface of the house, including the basement. Secondly, the most important feature is OverallQual, i.e., the rate of the overall material and finish of the house. With a large difference importance, the following more important features are: total of bathrooms, the total surface of the living zone, the neighborhood and the rate of the exterior qualities and the capacity of the cars in the garage among others.
PyCaret is a Python library for simplified machine learning workflows, offering automated model training, hyperparameter tuning, and comprehensive visualizations. We'll use it to compare its results with ours.
# check installed version
import pycaret
pycaret.__version__
'3.2.0'
# import pycaret regression and init setup
from pycaret.regression import *
s = setup(df_inputed_nulls, target = 'SalePrice', session_id = 123)
Description | Value | |
---|---|---|
0 | Session id | 123 |
1 | Target | SalePrice |
2 | Target type | Regression |
3 | Original data shape | (1451, 231) |
4 | Transformed data shape | (1451, 231) |
5 | Transformed train set shape | (1015, 231) |
6 | Transformed test set shape | (436, 231) |
7 | Numeric features | 230 |
8 | Preprocess | True |
9 | Imputation type | simple |
10 | Numeric imputation | mean |
11 | Categorical imputation | mode |
12 | Fold Generator | KFold |
13 | Fold Number | 10 |
14 | CPU Jobs | -1 |
15 | Use GPU | False |
16 | Log Experiment | False |
17 | Experiment Name | reg-default-name |
18 | USI | 9c57 |
# compare baseline models
best = compare_models()
Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
---|---|---|---|---|---|---|---|---|
catboost | CatBoost Regressor | 13936.2905 | 508768359.6845 | 21938.0727 | 0.8993 | 0.1170 | 0.0823 | 4.5270 |
lightgbm | Light Gradient Boosting Machine | 15547.6534 | 584812851.2968 | 23695.2510 | 0.8816 | 0.1281 | 0.0918 | 0.4170 |
et | Extra Trees Regressor | 15895.0145 | 611611904.8429 | 24135.7852 | 0.8787 | 0.1285 | 0.0938 | 1.8250 |
gbr | Gradient Boosting Regressor | 15596.6724 | 607080630.3706 | 24061.3562 | 0.8781 | 0.1297 | 0.0931 | 0.7680 |
rf | Random Forest Regressor | 16542.8376 | 652120559.3498 | 25026.3396 | 0.8697 | 0.1347 | 0.0982 | 1.9910 |
br | Bayesian Ridge | 16261.0548 | 659567059.2000 | 24413.7656 | 0.8688 | 0.1455 | 0.0988 | 0.0620 |
xgboost | Extreme Gradient Boosting | 17092.7906 | 691612886.4000 | 25918.4770 | 0.8598 | 0.1363 | 0.0995 | 0.4210 |
ridge | Ridge Regression | 17064.6634 | 738229705.6000 | 26015.5662 | 0.8552 | 0.1784 | 0.1043 | 0.0250 |
llar | Lasso Least Angle Regression | 16539.2279 | 765245331.2000 | 26123.5359 | 0.8518 | 0.1607 | 0.1001 | 0.0290 |
en | Elastic Net | 17726.1774 | 783889942.4000 | 26842.7387 | 0.8451 | 0.1517 | 0.1059 | 0.0690 |
lasso | Lasso Regression | 17687.4437 | 817440176.0000 | 27266.5982 | 0.8395 | 0.1844 | 0.1085 | 0.0590 |
lr | Linear Regression | 18503.2964 | 843757708.8000 | 27747.9467 | 0.8349 | 0.2393 | 0.1149 | 0.3550 |
omp | Orthogonal Matching Pursuit | 19587.2015 | 873386400.0000 | 28615.6967 | 0.8254 | 0.1784 | 0.1210 | 0.0300 |
ada | AdaBoost Regressor | 21144.3587 | 866664378.0229 | 29188.4413 | 0.8240 | 0.1746 | 0.1372 | 0.2480 |
knn | K Neighbors Regressor | 26505.5283 | 1396843801.6000 | 37053.9914 | 0.7170 | 0.2087 | 0.1637 | 0.0590 |
huber | Huber Regressor | 25524.1098 | 1435446893.7169 | 37397.4639 | 0.7084 | 0.2052 | 0.1538 | 0.3240 |
dt | Decision Tree Regressor | 25313.1354 | 1408980684.2620 | 37201.8405 | 0.7060 | 0.2021 | 0.1479 | 0.0530 |
par | Passive Aggressive Regressor | 35359.6257 | 4331790874.5136 | 54541.2186 | 0.2677 | 0.2595 | 0.2166 | 0.0260 |
dummy | Dummy Regressor | 53769.6633 | 4992875545.6000 | 70136.6227 | -0.0076 | 0.3854 | 0.3451 | 0.0220 |
lar | Least Angle Regression | 1428130440357468342059008.0000 | inf | inf | -inf | 29.6543 | 10848298145755928576.0000 | 0.0520 |
evaluate_model(best)
interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…
tuned_best_model = tune_model(best)
MAE | MSE | RMSE | R2 | RMSLE | MAPE | |
---|---|---|---|---|---|---|
Fold | ||||||
0 | 15389.4346 | 628195629.4495 | 25063.8311 | 0.8785 | 0.1222 | 0.0868 |
1 | 15301.7831 | 505125240.0008 | 22474.9914 | 0.8560 | 0.1236 | 0.0884 |
2 | 13205.9407 | 333561528.5047 | 18263.6669 | 0.9129 | 0.1248 | 0.0878 |
3 | 15740.0056 | 460878132.6264 | 21468.0724 | 0.9256 | 0.1205 | 0.0918 |
4 | 16817.0621 | 1073548190.6676 | 32765.0453 | 0.8037 | 0.1536 | 0.1059 |
5 | 14746.5690 | 407079416.5315 | 20176.2092 | 0.8926 | 0.1322 | 0.0914 |
6 | 18433.4601 | 748159086.7817 | 27352.4969 | 0.8894 | 0.1305 | 0.1007 |
7 | 14347.6615 | 405071080.6799 | 20126.3777 | 0.9174 | 0.1004 | 0.0787 |
8 | 17306.2942 | 914167154.5253 | 30235.1973 | 0.8589 | 0.1711 | 0.1057 |
9 | 12517.2711 | 286480835.2491 | 16925.7447 | 0.9169 | 0.0970 | 0.0746 |
Mean | 15380.5482 | 576226629.5016 | 23485.1633 | 0.8852 | 0.1276 | 0.0912 |
Std | 1721.1592 | 247567918.3321 | 4967.2663 | 0.0357 | 0.0209 | 0.0099 |
Fitting 10 folds for each of 10 candidates, totalling 100 fits Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
tuned_best_model
<catboost.core.CatBoostRegressor at 0x2152e1f7b50>
As we can see above, we've get better results than pycaret library.
To recap, what we have done so far is analyze a dataset on houses in the United States and we have created a model capable of predicting the selling price based on the features of the dataset. Now, we are going to leverage this model and truly see the value that this work will bring us.
First of all, let's load the new data, the model we've just created and apply the feature engineering necessary to being able to apply the model:
df_test = pd.read_csv('test.csv')
df_test.head(3)
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
3 rows × 80 columns
Let's apply the feature engineering:
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
## Apply feature engineering
df_test_copy = df_test.copy()
ids = df_test_copy.pop('Id')
df_test_transformed = feature_engineering_sale_price_dataset(df_test_copy)
categorical_columns = df_test_transformed.select_dtypes(include=['object']).columns.tolist()
df_test_transformed[categorical_columns] = df_test_transformed[categorical_columns].fillna('NaN')
## Apply the model to get the price prediction
prediction_house_price = loaded_model.predict(df_test_transformed)
# Join all features in the same dataframe
df_final_results = df_test.copy()
#Add useful features
df_final_results['TotalSF'] = df_final_results['TotalBsmtSF'] + df_final_results['GrLivArea']
df_final_results['Total_Bathrooms'] = df_final_results['FullBath'] + df_final_results['HalfBath'] + df_final_results['BsmtFullBath'] + df_final_results['BsmtHalfBath']
df_final_results["SalePricePredicted"] = prediction_house_price.round().astype(int)
df_final_results.head(3)
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | TotalSF | Total_Bathrooms | SalePricePredicted | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal | 1778.0 | 1.0 | 117847 |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal | 2658.0 | 2.0 | 165324 |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal | 2557.0 | 3.0 | 183397 |
3 rows × 83 columns
Write the results in a csv file:
df_final_results.to_csv('house_dataset_w_predicted_price.csv', index=False)
Let's imagine, that we work in a real statement business. With the historical data of the houses sold by the last 5 years, we've built a model that is able to predict the price of the house that will be sold, so, we can assume that actually we're predicting the market price.
Now, we have a portfolio of houses to sell and we want to know the market price for each house to help the seller owner to pick the right price and also to guide new buyers what they can afford.
In the next sections, we'll be answering those questions.
Knowing the market price of the houses, we can share with new buyers which neighborhood they could afford, telling them the mean of the house prices and the mean of the price per sq.
df_final_results["PricePerSq"] = df_final_results["SalePricePredicted"]/df_final_results["TotalSF"]
price_per_neighborhood = df_final_results.groupby(["Neighborhood"])["SalePricePredicted"].mean()
price_sq_per_neighborhood = df_final_results.groupby(["Neighborhood"])["PricePerSq"].mean()
df_price_per_neighborhood = price_per_neighborhood.to_frame(name="AVG Price")
df_price_sq_per_neighborhood = price_sq_per_neighborhood.to_frame(name="AVG Price/SQ")
pd.merge(df_price_per_neighborhood, df_price_sq_per_neighborhood, on="Neighborhood").round(2).sort_values(by="AVG Price")
AVG Price | AVG Price/SQ | |
---|---|---|
Neighborhood | ||
MeadowV | 93594.15 | 56.88 |
BrDale | 102318.14 | 63.48 |
IDOTRR | 107452.14 | 57.74 |
BrkSide | 121976.88 | 61.12 |
OldTown | 122156.74 | 58.47 |
SWISU | 125466.61 | 55.45 |
Sawyer | 134047.95 | 66.31 |
Edwards | 134369.89 | 64.80 |
Blueste | 140079.50 | 76.20 |
NAmes | 143540.99 | 63.90 |
NPkVill | 151577.64 | 70.88 |
Mitchel | 164527.77 | 69.55 |
SawyerW | 180853.71 | 70.83 |
Gilbert | 189034.81 | 77.07 |
NWAmes | 192181.24 | 68.01 |
ClearCr | 194509.50 | 68.51 |
Crawfor | 196065.35 | 73.34 |
Blmngtn | 196103.82 | 73.90 |
CollgCr | 204511.39 | 76.43 |
Somerst | 225556.00 | 79.74 |
Timber | 247913.82 | 80.04 |
Veenker | 276132.31 | 77.89 |
NoRidge | 305682.87 | 80.72 |
NridgHt | 311091.75 | 85.90 |
StoneBr | 312845.54 | 85.88 |
As we can see above, for low budget clients we can suggest neighborhoods as MeadowV or BrDale and so on. For high budget clients we, can suggest NridgHT and StoneBr neighborhoods.
As we've done before, let's show the average price per type of building and its average price/sq.
price_per_BldgType = df_final_results.groupby(["BldgType"])["SalePricePredicted"].mean()
price_sq_per_BldgType = df_final_results.groupby(["BldgType"])["PricePerSq"].mean()
df_price_per_BldgType = price_per_BldgType.to_frame(name="AVG Price")
df_price_sq_per_BldgType = price_sq_per_BldgType.to_frame(name="AVG Price/SQ")
pd.merge(df_price_per_BldgType, df_price_sq_per_BldgType, on="BldgType").round(2).sort_values(by="AVG Price")
AVG Price | AVG Price/SQ | |
---|---|---|
BldgType | ||
2fmCon | 125862.29 | 53.47 |
Twnhs | 132762.96 | 67.56 |
Duplex | 147466.70 | 55.53 |
1Fam | 180446.33 | 70.34 |
TwnhsE | 199061.81 | 76.63 |
The cheapest building types are: the Two-family Conversion (2fmCon) and Townhouse Inside Unit (TwnhsI) and the most expensive building types are: Single-family Detached (1Fam) and Townhouse End Unit (TwnhsE).
The fact that we could know in advance an estimation of the market price of the house, this doesn't mean that this finally will be the sale price. Maybe some sellers prefer to set a lower selling price to speed up the selling or some sellers don't want to sell below a certain price. So let's imagine we have all the selling prices for the portfolio we're working on (for that purpose will add some random noise to our predicted price and we'll imagine that this is the selling price).
# Set the seed for reproducibility
np.random.seed(42)
random_vector = np.random.normal(0, np.sqrt(100000000), df_final_results.shape[0]).round().astype(int)
df_final_results['SellingPrice'] = df_final_results['SalePricePredicted'] + random_vector
For those sellers, that are willing to sell the house below the market price, we can set a larger profit for the company until reaching (or almost reaching) the market price for which we believe the property will be sold. These properties are:
df_final_results['Price Difference'] = df_final_results['SalePricePredicted'] - df_final_results['SellingPrice']
df_final_results[['Id','Price Difference']].sort_values(by='Price Difference').head(10)
Id | Price Difference | |
---|---|---|
209 | 1670 | -38527 |
478 | 1939 | -30789 |
179 | 1640 | -27202 |
755 | 2216 | -26324 |
1453 | 2914 | -26017 |
1233 | 2694 | -25797 |
654 | 2115 | -25734 |
762 | 2223 | -25601 |
880 | 2341 | -25269 |
1291 | 2752 | -24930 |
Above, we can see the identification of those properties with lower selling prices respect to the market price.
Also, let's check which properties are being sold with higher selling prices:
df_final_results[['Id','Price Difference']].sort_values(by='Price Difference', ascending=False).head(10)
Id | Price Difference | |
---|---|---|
262 | 1723 | 32413 |
1101 | 2562 | 28963 |
1061 | 2522 | 28485 |
646 | 2107 | 26969 |
668 | 2129 | 26510 |
74 | 1535 | 26197 |
1355 | 2816 | 25910 |
1346 | 2807 | 25539 |
1160 | 2621 | 24994 |
544 | 2005 | 24716 |
For properties above, we should set a lower profit and also suggest a lower selling price to the seller.
An important part of an ML project is being able to make the trained model available for other teams or colleagues of the company to take advantage of the insights they can get of it. One technical way could be creating an API that our colleagues could make their own predictions. Another approach it could be to present the results in an interactive report and enable anyone with the link to get their own insights.
That's why I've created an example of a report that we could create with the results of the model:
from IPython.display import IFrame, HTML
ulr= 'https://lookerstudio.google.com/embed/reporting/69313d85-34c5-4552-b44d-cefe48c3a18f/page/8ZboD'
IFrame(ulr, width='100%', height=500)