Predicting House Sale Prices

predicting_house_price.jpg

The goal of this project is to predict house sale prices using Linear Regression and Ridge regression (linear least squares with L2 regularization) from scikit-learn.
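For reference, Ridge regression minimizes the ordinary least-squares loss plus an L2 penalty on the coefficient vector, as described in the scikit-learn documentation, with alpha controlling the penalty strength:

$$\min_w \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2$$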

In this project I use the original data from the article "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project" by Dean De Cock: the dataset AmesHousing.txt and the attribute description file DataDocumentation.txt, which you can download here and here.

I will not add much narrative here; comments on the steps performed are inside the code cells.

Caution: the cross-validation used to determine the optimal alpha values for Ridge regression takes a very long time.
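As a rough illustration only (a minimal sketch, not the search performed later in this notebook; X_train and y_train are placeholders for an already prepared feature matrix and target), alpha can be selected with RidgeCV over a grid of candidate values, and the run time grows with the size of the grid and the number of folds:

# Minimal sketch of alpha selection with RidgeCV.
# X_train and y_train are placeholders for a prepared feature matrix and target.
import numpy as np
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-3, 3, 13)          # candidate regularization strengths
ridge_cv = RidgeCV(alphas=alphas, cv=5)  # 5-fold cross-validation over the grid
ridge_cv.fit(X_train, y_train)
print("Best alpha:", ridge_cv.alpha_)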

1. Introduction

Load the required modules and explore the dataset. Print the names of the columns that contain 5% or more NaN values, drop those columns, and then drop the remaining rows that contain NaN values.

In [1]:
from IPython.display import display
from IPython.display import HTML
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import warnings
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')  
pd.set_option('display.max_columns', None)
# pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)

# Inspect dataset
# Open
ames = pd.read_csv("data/AmesHousing.txt",delimiter="\t")

# Count NaN values in each column
ames_nan_count = ames.isnull().sum()

# Report the number of rows and columns
display("ames contains {:,} rows and {:} columns.\n".format(
                                            ames.shape[0],
                                            ames.shape[1]))
# Print the names of columns with 5% or more NaN values, then drop them

drop_columns = []

for i, feature in enumerate(ames_nan_count.index):
    nan = ames_nan_count[feature]
    nan_perc = nan/ames.shape[0]
    if nan_perc >= 0.05:
        drop_columns.append(feature)
        print ("row No ",  i+1 , ", row name ", feature,  
               f", NaN = {nan:,} or  % {nan_perc:.2%} from {ames.shape[0]:,} values.")


# Drop columns        
ames.drop(axis=1, columns = drop_columns, inplace = True)

# Check NaN values
display(ames.isnull().sum().sort_values(ascending = False).head(17))

# Drop NaN values and reset index
ames.dropna(axis = 0, inplace = True)
ames.reset_index(drop = True, inplace = True)

# Display numeric and object columns
number_columns = list(ames.select_dtypes(include='number'))
display(f"Number of numeric columns = {len(number_columns):}")
display(number_columns)

object_columns = list(ames.select_dtypes(include='object'))
display(f"Number of object columns = {len(object_columns):}")
display(object_columns)
'ames contains 2,930 rows and 82 columns.\n'
column No  5 , column name  Lot Frontage , NaN = 490 or 16.72% of 2,930 values.
column No  8 , column name  Alley , NaN = 2,732 or 93.24% of 2,930 values.
column No  59 , column name  Fireplace Qu , NaN = 1,422 or 48.53% of 2,930 values.
column No  60 , column name  Garage Type , NaN = 157 or 5.36% of 2,930 values.
column No  61 , column name  Garage Yr Blt , NaN = 159 or 5.43% of 2,930 values.
column No  62 , column name  Garage Finish , NaN = 159 or 5.43% of 2,930 values.
column No  65 , column name  Garage Qual , NaN = 159 or 5.43% of 2,930 values.
column No  66 , column name  Garage Cond , NaN = 159 or 5.43% of 2,930 values.
column No  74 , column name  Pool QC , NaN = 2,917 or 99.56% of 2,930 values.
column No  75 , column name  Fence , NaN = 2,358 or 80.48% of 2,930 values.
column No  76 , column name  Misc Feature , NaN = 2,824 or 96.38% of 2,930 values.
Bsmt Exposure     83
BsmtFin Type 2    81
Bsmt Cond         80
Bsmt Qual         80
BsmtFin Type 1    80
Mas Vnr Type      23
Mas Vnr Area      23
Bsmt Half Bath     2
Bsmt Full Bath     2
Garage Cars        1
Garage Area        1
BsmtFin SF 2       1
BsmtFin SF 1       1
Bsmt Unf SF        1
Total Bsmt SF      1
Electrical         1
Gr Liv Area        0
dtype: int64
'Number of numeric columns = 37'
['Order',
 'PID',
 'MS SubClass',
 'Lot Area',
 'Overall Qual',
 'Overall Cond',
 'Year Built',
 'Year Remod/Add',
 'Mas Vnr Area',
 'BsmtFin SF 1',
 'BsmtFin SF 2',
 'Bsmt Unf SF',
 'Total Bsmt SF',
 '1st Flr SF',
 '2nd Flr SF',
 'Low Qual Fin SF',
 'Gr Liv Area',
 'Bsmt Full Bath',
 'Bsmt Half Bath',
 'Full Bath',
 'Half Bath',
 'Bedroom AbvGr',
 'Kitchen AbvGr',
 'TotRms AbvGrd',
 'Fireplaces',
 'Garage Cars',
 'Garage Area',
 'Wood Deck SF',
 'Open Porch SF',
 'Enclosed Porch',
 '3Ssn Porch',
 'Screen Porch',
 'Pool Area',
 'Misc Val',
 'Mo Sold',
 'Yr Sold',
 'SalePrice']
'Number of object columns = 34'
['MS Zoning',
 'Street',
 'Lot Shape',
 'Land Contour',
 'Utilities',
 'Lot Config',
 'Land Slope',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Mas Vnr Type',
 'Exter Qual',
 'Exter Cond',
 'Foundation',
 'Bsmt Qual',
 'Bsmt Cond',
 'Bsmt Exposure',
 'BsmtFin Type 1',
 'BsmtFin Type 2',
 'Heating',
 'Heating QC',
 'Central Air',
 'Electrical',
 'Kitchen Qual',
 'Functional',
 'Paved Drive',
 'Sale Type',
 'Sale Condition']

2. Feature engineering

In [2]:
# Compute the absolute correlation of each numeric column with SalePrice, sorted descending
sale_price_corr = ames.corr()["SalePrice"].\
    abs().sort_values(ascending=False)

# Display the 15 most correlated columns
print("Correlation with the SalePrice column")
display(sale_price_corr.head(15))

# Build the list of numeric columns to drop (correlation with SalePrice below 0.4)
drop_numeric_col = list((sale_price_corr[sale_price_corr<