Predicting house sale prices

In this project we will fine-tune a linear regression model to predict house sale prices, using a large dataset on houses in Ames, Iowa, United States, sold from 2006 to 2010. You can read more about why the data was collected here. You can also read about the different columns in the data here.

In [102]:
#Importing the necessary libraries and algorithms
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999
import warnings 
warnings.filterwarnings('ignore')
from sklearn.model_selection import KFold
In [103]:
#reading in the data
data = pd.read_csv('AmesHousing.tsv',sep='\t')

#Keeping a copy of the raw data for later
raw_data = data.copy()

The Pipeline

To be able to tweak the model continuously, we will define three functions: transform_features(), select_features() and train_test(). These are the functions we will work with throughout this project, updating them every time. They will therefore be kept together in one block, so we can copy them and highlight the differences from their previous versions with in-line comments.

In [3]:
#This will be the default function block, or pipeline.
#From now on I will remove the earlier inline comments so I can highlight the new functionality every time.
#Here the basic functionality of the model is laid out.
def transform_features(df):
    
    #We will leave this for now.
    return df

def select_features(df):
    
    #The sole purpose of this function is to select features
    return df[['Gr Liv Area','SalePrice']]

def train_test(df):
    df = select_features(df)
    #Splitting the data in order to train and test our model.
    train = df[:1460]
    test = df[1460:]
    
    #Here we select the numeric columns and drop the price column from the features.
    #numeric_test is kept for symmetry; sometimes we will use it and sometimes we will not.
    numeric_train = train.select_dtypes(include = ['integer','float'])
    numeric_test = test.select_dtypes(include = ['integer','float'])
    features = numeric_train.columns.drop('SalePrice')
    
    #Training the model.
    model = LinearRegression()
    model.fit(train[features],train['SalePrice'])
    
    #Testing the model.
    prediction = model.predict(test[features])
    
    #Finding the error metric root mean squared error
    mse = mean_squared_error(test['SalePrice'],prediction)
    rmse = mse**0.5
    
    return rmse

transform_df = transform_features(data)
select_df = select_features(transform_df)
rmse = train_test(select_df)

print('rmse:',rmse)
rmse: 57088.25161263909

Feature Engineering

The transform_features() function serves, as you may guess, to transform the values before we select them. These are the general considerations for transformation:

  • Remove features. These decisions can be made on several grounds: missing values, leaking information about the final sale (like giving away the year of sale), usefulness of the data, etc.
  • Transform features into the proper format: numerical to categorical (see the sketch right after this list), scaling numerical values, filling in missing values, etc.
  • Create new features by combining other features.
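For example, MS SubClass is a numeric code for the dwelling type (20, 30, 60, ...), so its magnitude carries no meaning. A minimal sketch of the numerical-to-categorical conversion, which we will not actually apply in this project, could look like this:

#Casting the codes to strings makes get_dummies treat them as categories rather than quantities.
ms_subclass_dummies = pd.get_dummies(data['MS SubClass'].astype(str), prefix='MS SubClass')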

Exploring the data

Right now we will explore the data in the following order:

  • Eliminate columns with more than 5% missing values
  • Fill the missing values in the numerical columns with the mode
  • Create new features by combining the columns that are left
  • Drop columns for other reasons
In [4]:
#Inspecting missing value percentages
per_missing = data.isnull().sum()/len(data)
per_missing
Out[4]:
Order              0.000000
PID                0.000000
MS SubClass        0.000000
MS Zoning          0.000000
Lot Frontage       0.167235
Lot Area           0.000000
Street             0.000000
Alley              0.932423
Lot Shape          0.000000
Land Contour       0.000000
Utilities          0.000000
Lot Config         0.000000
Land Slope         0.000000
Neighborhood       0.000000
Condition 1        0.000000
Condition 2        0.000000
Bldg Type          0.000000
House Style        0.000000
Overall Qual       0.000000
Overall Cond       0.000000
Year Built         0.000000
Year Remod/Add     0.000000
Roof Style         0.000000
Roof Matl          0.000000
Exterior 1st       0.000000
Exterior 2nd       0.000000
Mas Vnr Type       0.007850
Mas Vnr Area       0.007850
Exter Qual         0.000000
Exter Cond         0.000000
Foundation         0.000000
Bsmt Qual          0.027304
Bsmt Cond          0.027304
Bsmt Exposure      0.028328
BsmtFin Type 1     0.027304
BsmtFin SF 1       0.000341
BsmtFin Type 2     0.027645
BsmtFin SF 2       0.000341
Bsmt Unf SF        0.000341
Total Bsmt SF      0.000341
Heating            0.000000
Heating QC         0.000000
Central Air        0.000000
Electrical         0.000341
1st Flr SF         0.000000
2nd Flr SF         0.000000
Low Qual Fin SF    0.000000
Gr Liv Area        0.000000
Bsmt Full Bath     0.000683
Bsmt Half Bath     0.000683
Full Bath          0.000000
Half Bath          0.000000
Bedroom AbvGr      0.000000
Kitchen AbvGr      0.000000
Kitchen Qual       0.000000
TotRms AbvGrd      0.000000
Functional         0.000000
Fireplaces         0.000000
Fireplace Qu       0.485324
Garage Type        0.053584
Garage Yr Blt      0.054266
Garage Finish      0.054266
Garage Cars        0.000341
Garage Area        0.000341
Garage Qual        0.054266
Garage Cond        0.054266
Paved Drive        0.000000
Wood Deck SF       0.000000
Open Porch SF      0.000000
Enclosed Porch     0.000000
3Ssn Porch         0.000000
Screen Porch       0.000000
Pool Area          0.000000
Pool QC            0.995563
Fence              0.804778
Misc Feature       0.963823
Misc Val           0.000000
Mo Sold            0.000000
Yr Sold            0.000000
Sale Type          0.000000
Sale Condition     0.000000
SalePrice          0.000000
dtype: float64
In [5]:
#Eliminating the columns with more than 5% of missing values.
per_missing = per_missing[per_missing<0.05].index
per_missing
Out[5]:
Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Area', 'Street',
       'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope',
       'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type',
       'House Style', 'Overall Qual', 'Overall Cond', 'Year Built',
       'Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st',
       'Exterior 2nd', 'Mas Vnr Type', 'Mas Vnr Area', 'Exter Qual',
       'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin SF 1', 'BsmtFin Type 2', 'BsmtFin SF 2',
       'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air',
       'Electrical', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF',
       'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath',
       'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual',
       'TotRms AbvGrd', 'Functional', 'Fireplaces', 'Garage Cars',
       'Garage Area', 'Paved Drive', 'Wood Deck SF', 'Open Porch SF',
       'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Misc Val',
       'Mo Sold', 'Yr Sold', 'Sale Type', 'Sale Condition', 'SalePrice'],
      dtype='object')
In [6]:
#Dropping the first batch of columns
data = data[per_missing]
data
Out[6]:
Order PID MS SubClass MS Zoning Lot Area Street Lot Shape Land Contour Utilities Lot Config Land Slope Neighborhood Condition 1 Condition 2 Bldg Type House Style Overall Qual Overall Cond Year Built Year Remod/Add Roof Style Roof Matl Exterior 1st Exterior 2nd Mas Vnr Type Mas Vnr Area Exter Qual Exter Cond Foundation Bsmt Qual Bsmt Cond Bsmt Exposure BsmtFin Type 1 BsmtFin SF 1 BsmtFin Type 2 BsmtFin SF 2 Bsmt Unf SF Total Bsmt SF Heating Heating QC Central Air Electrical 1st Flr SF 2nd Flr SF Low Qual Fin SF Gr Liv Area Bsmt Full Bath Bsmt Half Bath Full Bath Half Bath Bedroom AbvGr Kitchen AbvGr Kitchen Qual TotRms AbvGrd Functional Fireplaces Garage Cars Garage Area Paved Drive Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice
0 1 526301100 20 RL 31770 Pave IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 5 1960 1960 Hip CompShg BrkFace Plywood Stone 112.0 TA TA CBlock TA Gd Gd BLQ 639.0 Unf 0.0 441.0 1080.0 GasA Fa Y SBrkr 1656 0 0 1656 1.0 0.0 1 0 3 1 TA 7 Typ 2 2.0 528.0 P 210 62 0 0 0 0 0 5 2010 WD Normal 215000
1 2 526350040 20 RH 11622 Pave Reg Lvl AllPub Inside Gtl NAmes Feedr Norm 1Fam 1Story 5 6 1961 1961 Gable CompShg VinylSd VinylSd None 0.0 TA TA CBlock TA TA No Rec 468.0 LwQ 144.0 270.0 882.0 GasA TA Y SBrkr 896 0 0 896 0.0 0.0 1 0 2 1 TA 5 Typ 0 1.0 730.0 Y 140 0 0 0 120 0 0 6 2010 WD Normal 105000
2 3 526351010 20 RL 14267 Pave IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 6 1958 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace 108.0 TA TA CBlock TA TA No ALQ 923.0 Unf 0.0 406.0 1329.0 GasA TA Y SBrkr 1329 0 0 1329 0.0 0.0 1 1 3 1 Gd 6 Typ 0 1.0 312.0 Y 393 36 0 0 0 0 12500 6 2010 WD Normal 172000
3 4 526353030 20 RL 11160 Pave Reg Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 7 5 1968 1968 Hip CompShg BrkFace BrkFace None 0.0 Gd TA CBlock TA TA No ALQ 1065.0 Unf 0.0 1045.0 2110.0 GasA Ex Y SBrkr 2110 0 0 2110 1.0 0.0 2 1 3 1 Ex 8 Typ 2 2.0 522.0 Y 0 0 0 0 0 0 0 4 2010 WD Normal 244000
4 5 527105010 60 RL 13830 Pave IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 5 5 1997 1998 Gable CompShg VinylSd VinylSd None 0.0 TA TA PConc Gd TA No GLQ 791.0 Unf 0.0 137.0 928.0 GasA Gd Y SBrkr 928 701 0 1629 0.0 0.0 2 1 3 1 TA 6 Typ 1 2.0 482.0 Y 212 34 0 0 0 0 0 3 2010 WD Normal 189900
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2925 2926 923275080 80 RL 7937 Pave IR1 Lvl AllPub CulDSac Gtl Mitchel Norm Norm 1Fam SLvl 6 6 1984 1984 Gable CompShg HdBoard HdBoard None 0.0 TA TA CBlock TA TA Av GLQ 819.0 Unf 0.0 184.0 1003.0 GasA TA Y SBrkr 1003 0 0 1003 1.0 0.0 1 0 3 1 TA 6 Typ 0 2.0 588.0 Y 120 0 0 0 0 0 0 3 2006 WD Normal 142500
2926 2927 923276100 20 RL 8885 Pave IR1 Low AllPub Inside Mod Mitchel Norm Norm 1Fam 1Story 5 5 1983 1983 Gable CompShg HdBoard HdBoard None 0.0 TA TA CBlock Gd TA Av BLQ 301.0 ALQ 324.0 239.0 864.0 GasA TA Y SBrkr 902 0 0 902 1.0 0.0 1 0 2 1 TA 5 Typ 0 2.0 484.0 Y 164 0 0 0 0 0 0 6 2006 WD Normal 131000
2927 2928 923400125 85 RL 10441 Pave Reg Lvl AllPub Inside Gtl Mitchel Norm Norm 1Fam SFoyer 5 5 1992 1992 Gable CompShg HdBoard Wd Shng None 0.0 TA TA PConc Gd TA Av GLQ 337.0 Unf 0.0 575.0 912.0 GasA TA Y SBrkr 970 0 0 970 0.0 1.0 1 0 3 1 TA 6 Typ 0 0.0 0.0 Y 80 32 0 0 0 0 700 7 2006 WD Normal 132000
2928 2929 924100070 20 RL 10010 Pave Reg Lvl AllPub Inside Mod Mitchel Norm Norm 1Fam 1Story 5 5 1974 1975 Gable CompShg HdBoard HdBoard None 0.0 TA TA CBlock Gd TA Av ALQ 1071.0 LwQ 123.0 195.0 1389.0 GasA Gd Y SBrkr 1389 0 0 1389 1.0 0.0 1 0 2 1 TA 6 Typ 1 2.0 418.0 Y 240 38 0 0 0 0 0 4 2006 WD Normal 170000
2929 2930 924151050 60 RL 9627 Pave Reg Lvl AllPub Inside Mod Mitchel Norm Norm 1Fam 2Story 7 5 1993 1994 Gable CompShg HdBoard HdBoard BrkFace 94.0 TA TA PConc Gd TA Av LwQ 758.0 Unf 0.0 238.0 996.0 GasA Ex Y SBrkr 996 1004 0 2000 0.0 0.0 2 1 3 1 TA 9 Typ 1 3.0 650.0 Y 190 48 0 0 0 0 0 11 2006 WD Normal 188000

2930 rows × 71 columns

For the text columns, we will drop every column that still has missing values.

For the numeric columns, we will fill the missing values with the mode.

In [7]:
text_data_cols = data.select_dtypes(include= ['object']).isnull().sum()
text_drop_cols = text_data_cols[text_data_cols>0]
data = data.drop(text_drop_cols.index, axis=1)
In [8]:
numeric_data = data.select_dtypes(include= ['integer','float'])
cols = numeric_data.columns
mode = data.filter(cols).mode()
data[cols] = data[cols].fillna(mode.iloc[0])
data.isnull().sum().value_counts() #They are all filled now
Out[8]:
0    64
dtype: int64

New features to create:

  • 'Year Until Remod': years between construction and remodeling
    • Year Remod/Add: remodel date (same as construction date if no remodeling or additions)
    • Year Built: original construction date
  • 'Years Sold': years between construction and sale
    • Yr Sold: year sold
    • Year Built: original construction date
In [9]:
years_sold = data['Yr Sold'] - data['Year Built']
years_sold[years_sold < 0] #Only one unexpected (negative) value; we leave it in.
Out[9]:
2180   -1
dtype: int64
In [10]:
year_until_remod = data['Year Remod/Add'] - data['Year Built']
year_until_remod[year_until_remod<0]
Out[10]:
850   -1
dtype: int64
In [11]:
data['Years Sold'] = years_sold
data['Year Until Remod'] = year_until_remod

Columns that leak data about the sale:

  • Mo Sold: Month Sold (MM)
  • Yr Sold: Year Sold (YYYY)
  • Sale Type: Type of sale
  • Sale Condition: Condition of sale

Columns that are not useful for machine learning, as they are mere row identifiers:

  • PID
  • Order
In [12]:
data = data.drop(['Mo Sold','Yr Sold','Sale Type','Sale Condition','PID','Order'], axis=1)
In [13]:
def transform_features(data):
    #These are all the lines of code we have created above but in a compact fashion.
    per_missing = data.isnull().sum()/len(data)
    per_missing = per_missing[per_missing<0.05].index
    data = data[per_missing]
    text_data_cols = data.select_dtypes(include= ['object']).isnull().sum()
    text_drop_cols = text_data_cols[text_data_cols>0]
    data = data.drop(text_drop_cols.index, axis=1)
    numeric_data = data.select_dtypes(include= ['integer','float'])
    cols = numeric_data.columns
    mode = data.filter(cols).mode()
    data[cols] = data[cols].fillna(mode.iloc[0])
    data.isnull().sum().value_counts()
    years_sold = data['Yr Sold'] - data['Year Built']
    year_until_remod = data['Year Remod/Add'] - data['Year Built']
    data['Years Sold'] = years_sold
    data['Year Until Remod'] = year_until_remod
    data = data.drop(['Mo Sold','Yr Sold','Sale Type','Sale Condition','PID','Order'], axis=1)
    return data

def select_features(df):
    
    #The sole purpose of this function is to select features
    return df[['Gr Liv Area','SalePrice']]

def train_test(df):
    df = select_features(df)
    train = df[:1460]
    test = df[1460:]
    numeric_train = train.select_dtypes(include = ['integer','float'])
    numeric_test = test.select_dtypes(include = ['integer','float'])
    features = numeric_train.columns.drop('SalePrice')
    model = LinearRegression()
    model.fit(train[features],train['SalePrice'])
    prediction = model.predict(test[features])
    mse = mean_squared_error(test['SalePrice'],prediction)
    rmse = mse**0.5
    
    return rmse

#The transform_features() call is skipped here because data was already transformed above; dropping the same columns twice raises an error.
select_df = select_features(data)
rmse = train_test(select_df)

print('rmse:',rmse)
rmse: 57088.25161263909

Selecting the features

In this block we will select the features we will ultimately use from the data.

  • First we will select the best features from the numerical columns, using correlation (sketched right below) and univariate testing.
  • Second we will need to decide what to do with the categorical columns; e.g. a column with many values will generate a lot of dummy columns, and the dataframe will be populated with a lot of zeros.
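As a sketch of the first step, the numeric features can be ranked by the strength of their linear relationship with the target in a few lines; the full colored correlation matrix follows below.

#Rank the numeric features by absolute correlation with SalePrice.
numeric = data.select_dtypes(include=['integer','float'])
corr_with_price = numeric.corr()['SalePrice'].abs().sort_values(ascending=False)
print(corr_with_price.head(10))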

Numerical values

In [14]:
#For the numeric columns we will render the correlation matrix as a colored heatmap
cols = data.select_dtypes(include=['integer','float']).columns
corr = data[cols].corr()
corr.style.background_gradient(cmap='viridis')
Out[14]:
MS SubClass Lot Area Overall Qual Overall Cond Year Built Year Remod/Add Mas Vnr Area BsmtFin SF 1 BsmtFin SF 2 Bsmt Unf SF Total Bsmt SF 1st Flr SF 2nd Flr SF Low Qual Fin SF Gr Liv Area Bsmt Full Bath Bsmt Half Bath Full Bath Half Bath Bedroom AbvGr Kitchen AbvGr TotRms AbvGrd Fireplaces Garage Cars Garage Area Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area Misc Val SalePrice Years Sold Year Until Remod
MS SubClass 1.000000 -0.204613 0.039419 -0.067349 0.036579 0.043397 0.003634 -0.059767 -0.070847 -0.129987 -0.218489 -0.247828 0.304237 0.025765 0.068061 0.014187 -0.003179 0.134631 0.175879 -0.019208 0.257698 0.031898 -0.049955 -0.045851 -0.103374 -0.017310 -0.014823 -0.022866 -0.037956 -0.050614 -0.003434 -0.029254 -0.085092 -0.037302 -0.008367
Lot Area -0.204613 1.000000 0.097188 -0.034759 0.023258 0.021682 0.125950 0.191692 0.083198 0.023883 0.253765 0.332235 0.032996 0.000812 0.285599 0.124208 0.026448 0.127433 0.035497 0.136569 -0.020301 0.216597 0.256989 0.179494 0.212749 0.157212 0.103760 0.021868 0.016243 0.055044 0.093775 0.069188 0.266549 -0.024227 -0.010451
Overall Qual 0.039419 0.097188 1.000000 -0.094812 0.597027 0.569609 0.423089 0.284458 -0.041122 0.270527 0.547766 0.477837 0.241402 -0.048680 0.570556 0.168373 -0.041433 0.522263 0.268853 0.063291 -0.159744 0.380693 0.393007 0.599392 0.563562 0.255663 0.298412 -0.140332 0.018240 0.041615 0.030399 0.005179 0.799262 -0.597021 -0.256948
Overall Cond -0.067349 -0.034759 -0.094812 1.000000 -0.368773 0.047680 -0.132472 -0.051341 0.040992 -0.137303 -0.174179 -0.157052 0.006218 0.009175 -0.115643 -0.043458 0.084185 -0.214316 -0.088127 -0.006137 -0.086386 -0.089816 -0.031702 -0.181508 -0.153918 0.020344 -0.068934 0.071459 0.043852 0.044055 -0.016787 0.034056 -0.101697 0.369570 0.505496
Year Built 0.036579 0.023258 0.597027 -0.368773 1.000000 0.612095 0.306797 0.280069 -0.027328 0.129311 0.407764 0.310463 0.016828 -0.144282 0.241726 0.212119 -0.030515 0.469406 0.269268 -0.055093 -0.137852 0.111919 0.170672 0.537033 0.480726 0.228964 0.198365 -0.374364 0.015803 -0.041436 0.002213 -0.011011 0.558426 -0.999055 -0.727220
Year Remod/Add 0.043397 0.021682 0.569609 0.047680 0.612095 1.000000 0.191740 0.152240 -0.061934 0.165397 0.298393 0.242108 0.158939 -0.060365 0.316855 0.134590 -0.046175 0.457266 0.211771 -0.021536 -0.142404 0.197528 0.133322 0.425434 0.375566 0.217857 0.241748 -0.220383 0.037412 -0.046888 -0.011410 -0.003132 0.532974 -0.609747 0.097670
Mas Vnr Area 0.003634 0.125950 0.423089 -0.132472 0.306797 0.191740 1.000000 0.299951 -0.014955 0.089662 0.393530 0.392280 0.120159 -0.057173 0.400703 0.139684 0.016609 0.254471 0.188703 0.080590 -0.050733 0.278418 0.270051 0.356646 0.370479 0.165875 0.140003 -0.109915 0.014222 0.066817 0.004893 0.045276 0.502196 -0.307065 -0.219680
BsmtFin SF 1 -0.059767 0.191692 0.284458 -0.051341 0.280069 0.152240 0.299951 1.000000 -0.054022 -0.477244 0.536731 0.457586 -0.163710 -0.066129 0.209944 0.640212 0.077687 0.078086 -0.008205 -0.118575 -0.086654 0.048118 0.296098 0.255750 0.309649 0.224199 0.125150 -0.100318 0.050567 0.095952 0.084147 0.092900 0.433147 -0.278673 -0.220328
BsmtFin SF 2 -0.070847 0.083198 -0.041122 0.040992 -0.027328 -0.061934 -0.014955 -0.054022 1.000000 -0.238420 0.090109 0.084710 -0.098188 -0.004989 -0.017996 0.163634 0.099147 -0.076262 -0.033004 -0.033033 -0.037864 -0.049288 0.067180 -0.014395 0.003625 0.098591 -0.005516 0.032415 -0.023314 0.062978 0.044403 -0.005195 0.006018 0.027594 -0.019367
Bsmt Unf SF -0.129987 0.023883 0.270527 -0.137303 0.129311 0.165397 0.089662 -0.477244 -0.238420 1.000000 0.412248 0.296139 0.002320 0.047207 0.235870 -0.398629 -0.105784 0.274901 -0.033530 0.188508 0.065651 0.251633 0.001790 0.179736 0.164837 -0.039285 0.119143 0.006383 -0.005399 -0.047945 -0.031963 -0.010125 0.183308 -0.130699 -0.019176
Total Bsmt SF -0.218489 0.253765 0.547766 -0.174179 0.407764 0.298393 0.393530 0.536731 0.090109 0.412248 1.000000 0.800429 -0.204487 -0.023180 0.445108 0.326861 0.012850 0.325434 -0.054557 0.052721 -0.038612 0.281627 0.333467 0.437861 0.485608 0.230290 0.245941 -0.084848 0.037918 0.075499 0.072109 0.083895 0.632529 -0.407603 -0.254174
1st Flr SF -0.247828 0.332235 0.477837 -0.157052 0.310463 0.242108 0.392280 0.457586 0.084710 0.296139 0.800429 1.000000 -0.250057 -0.012689 0.562166 0.257836 0.010469 0.371584 -0.104203 0.106648 0.076025 0.390162 0.406345 0.439370 0.491208 0.227131 0.238041 -0.065713 0.044061 0.098316 0.121821 0.093003 0.621676 -0.310585 -0.180573
2nd Flr SF 0.304237 0.032996 0.241402 0.006218 0.016828 0.158939 0.120159 -0.163710 -0.098188 0.002320 -0.204487 -0.250057 1.000000 0.018024 0.655251 -0.163068 -0.059458 0.404143 0.611634 0.504651 0.069444 0.585214 0.165844 0.182199 0.127508 0.089097 0.184538 0.055429 -0.032172 0.011741 0.044602 -0.005078 0.269373 -0.017607 0.116782
Low Qual Fin SF 0.025765 0.000812 -0.048680 0.009175 -0.144282 -0.060365 -0.057173 -0.066129 -0.004989 0.047207 -0.023180 -0.012689 0.018024 1.000000 0.097050 -0.047215 -0.013385 -0.002606 -0.039303 0.070531 0.000517 0.102181 -0.006946 -0.067336 -0.053390 -0.015646 -0.000761 0.087326 -0.004505 0.006943 0.035200 -0.005939 -0.037660 0.143973 0.129186
Gr Liv Area 0.068061 0.285599 0.570556 -0.115643 0.241726 0.316855 0.400703 0.209944 -0.017996 0.235870 0.445108 0.562166 0.655251 0.097050 1.000000 0.057367 -0.043498 0.630321 0.433949 0.516808 0.117836 0.807772 0.454924 0.488854 0.483971 0.250153 0.340857 0.004030 0.006481 0.086804 0.135463 0.067252 0.706780 -0.242510 -0.029186
Bsmt Full Bath 0.014187 0.124208 0.168373 -0.043458 0.212119 0.134590 0.139684 0.640212 0.163634 -0.398629 0.326861 0.257836 -0.163068 -0.047215 0.057367 1.000000 -0.149027 -0.023011 -0.035385 -0.161084 -0.018349 -0.043642 0.172775 0.162053 0.184903 0.187283 0.080765 -0.069014 0.027086 0.052355 0.043730 -0.004817 0.275823 -0.209849 -0.150131
Bsmt Half Bath -0.003179 0.026448 -0.041433 0.084185 -0.030515 -0.046175 0.016609 0.077687 0.099147 -0.105784 0.012850 0.010469 -0.059458 -0.013385 -0.043498 -0.149027 1.000000 -0.046371 -0.057999 0.020333 -0.064671 -0.048069 0.038089 -0.033463 -0.021147 0.051546 -0.035276 -0.009272 0.026971 0.042372 0.066911 0.036996 -0.035817 0.029619 -0.001676
Full Bath 0.134631 0.127433 0.522263 -0.214316 0.469406 0.457266 0.254471 0.078086 -0.076262 0.274901 0.325434 0.371584 0.404143 -0.002606 0.630321 -0.023011 -0.046371 1.000000 0.159689 0.359489 0.171175 0.528599 0.229793 0.478206 0.406497 0.179574 0.258675 -0.117795 0.015435 -0.015130 0.028205 -0.009771 0.545604 -0.468900 -0.193849
Half Bath 0.175879 0.035497 0.268853 -0.088127 0.269268 0.211771 0.188703 -0.008205 -0.033004 -0.033530 -0.054557 -0.104203 0.611634 -0.039303 0.433949 -0.035385 -0.057999 0.159689 1.000000 0.247053 -0.042321 0.346493 0.201871 0.233324 0.178735 0.115212 0.180704 -0.081312 -0.023231 0.035990 0.001515 0.026648 0.285056 -0.268792 -0.155062
Bedroom AbvGr -0.019208 0.136569 0.063291 -0.006137 -0.055093 -0.021536 0.080590 -0.118575 -0.033033 0.188508 0.052721 0.106648 0.504651 0.070531 0.516808 -0.161084 0.020333 0.359489 0.247053 1.000000 0.240737 0.672647 0.076971 0.091375 0.073417 0.029711 0.083650 0.052115 -0.047151 0.009250 0.036707 0.000887 0.143913 0.054227 0.050642
Kitchen AbvGr 0.257698 -0.020301 -0.159744 -0.086386 -0.137852 -0.142404 -0.050733 -0.086654 -0.037864 0.065651 -0.038612 0.076025 0.069444 0.000517 0.117836 -0.018349 -0.064671 0.171175 -0.042321 0.240737 1.000000 0.294445 -0.108085 -0.037113 -0.057576 -0.087410 -0.068283 0.027911 -0.021379 -0.056337 -0.013066 0.025145 -0.119814 0.139182 0.049882
TotRms AbvGrd 0.031898 0.216597 0.380693 -0.089816 0.111919 0.197528 0.278418 0.048118 -0.049288 0.251633 0.281627 0.390162 0.585214 0.102181 0.807772 -0.043642 -0.048069 0.528599 0.346493 0.672647 0.294445 1.000000 0.302865 0.355517 0.327217 0.154735 0.235684 0.017221 -0.025097 0.033731 0.072103 0.061134 0.495474 -0.113074 0.030604
Fireplaces -0.049955 0.256989 0.393007 -0.031702 0.170672 0.133322 0.270051 0.296098 0.067180 0.001790 0.333467 0.406345 0.165844 -0.006946 0.454924 0.172775 0.038089 0.229793 0.201871 0.076971 -0.108085 0.302865 1.000000 0.321057 0.294262 0.228064 0.159637 -0.000250 0.018414 0.168004 0.098449 0.008192 0.474558 -0.170744 -0.099070
Garage Cars -0.045851 0.179494 0.599392 -0.181508 0.537033 0.425434 0.356646 0.255750 -0.014395 0.179736 0.437861 0.439370 0.182199 -0.067336 0.488854 0.162053 -0.033463 0.478206 0.233324 0.091375 -0.037113 0.355517 0.321057 1.000000 0.888699 0.241272 0.204088 -0.132332 0.023334 0.042980 0.030386 -0.016957 0.647812 -0.537197 -0.306588
Garage Area -0.103374 0.212749 0.563562 -0.153918 0.480726 0.375566 0.370479 0.309649 0.003625 0.164837 0.485608 0.491208 0.127508 -0.053390 0.483971 0.184903 -0.021147 0.406497 0.178735 0.073417 -0.057576 0.327217 0.294262 0.888699 1.000000 0.237682 0.233229 -0.108237 0.029511 0.062598 0.053055 0.008526 0.640138 -0.480541 -0.279010
Wood Deck SF -0.017310 0.157212 0.255663 0.020344 0.228964 0.217857 0.165875 0.224199 0.098591 -0.039285 0.230290 0.227131 0.089097 -0.015646 0.250153 0.187283 0.051546 0.179574 0.115212 0.029711 -0.087410 0.154735 0.228064 0.241272 0.237682 1.000000 0.039243 -0.119136 -0.003967 -0.052191 0.094156 0.056820 0.327143 -0.228578 -0.099055
Open Porch SF -0.014823 0.103760 0.298412 -0.068934 0.198365 0.241748 0.140003 0.125150 -0.005516 0.119143 0.245941 0.238041 0.184538 -0.000761 0.340857 0.080765 -0.035276 0.258675 0.180704 0.083650 -0.068283 0.235684 0.159637 0.204088 0.233229 0.039243 1.000000 -0.059875 -0.009458 0.047548 0.064135 0.077254 0.312951 -0.199692 -0.039808
Enclosed Porch -0.022866 0.021868 -0.140332 0.071459 -0.374364 -0.220383 -0.109915 -0.100318 0.032415 0.006383 -0.084848 -0.065713 0.055429 0.087326 0.004030 -0.069014 -0.009272 -0.117795 -0.081312 0.052115 0.027911 0.017221 -0.000250 -0.132332 -0.108237 -0.119136 -0.059875 1.000000 -0.032674 -0.063965 0.092596 0.008773 -0.128787 0.373774 0.279852
3Ssn Porch -0.037956 0.016243 0.018240 0.043852 0.015803 0.037412 0.014222 0.050567 -0.023314 -0.005399 0.037918 0.044061 -0.032172 -0.004505 0.006481 0.027086 0.026971 0.015435 -0.023231 -0.047151 -0.021379 -0.025097 0.018414 0.023334 0.029511 -0.003967 -0.009458 -0.032674 1.000000 -0.029430 -0.006501 -0.000753 0.032225 -0.014794 0.012585
Screen Porch -0.050614 0.055044 0.041615 0.044055 -0.041436 -0.046888 0.066817 0.095952 0.062978 -0.047945 0.075499 0.098316 0.011741 0.006943 0.086804 0.052355 0.042372 -0.015130 0.035990 0.009250 -0.056337 0.033731 0.168004 0.042980 0.062598 -0.052191 0.047548 -0.063965 -0.029430 1.000000 0.026383 0.007162 0.112151 0.041107 0.011449
Pool Area -0.003434 0.093775 0.030399 -0.016787 0.002213 -0.011410 0.004893 0.084147 0.044403 -0.031963 0.072109 0.121821 0.044602 0.035200 0.135463 0.043730 0.066911 0.028205 0.001515 0.036707 -0.013066 0.072103 0.098449 0.030386 0.053055 0.094156 0.064135 0.092596 -0.006501 0.026383 1.000000 0.011942 0.068403 -0.004493 -0.012689
Misc Val -0.029254 0.069188 0.005179 0.034056 -0.011011 -0.003132 0.045276 0.092900 -0.005195 -0.010125 0.083895 0.093003 -0.005078 -0.005939 0.067252 -0.004817 0.036996 -0.009771 0.026648 0.000887 0.025145 0.061134 0.008192 -0.016957 0.008526 0.056820 0.077254 0.008773 -0.000753 0.007162 0.011942 1.000000 -0.015691 0.011367 0.011139
SalePrice -0.085092 0.266549 0.799262 -0.101697 0.558426 0.532974 0.502196 0.433147 0.006018 0.183308 0.632529 0.621676 0.269373 -0.037660 0.706780 0.275823 -0.035817 0.545604 0.285056 0.143913 -0.119814 0.495474 0.474558 0.647812 0.640138 0.327143 0.312951 -0.128787 0.032225 0.112151 0.068403 -0.015691 1.000000 -0.558907 -0.240168
Years Sold -0.037302 -0.024227 -0.597021 0.369570 -0.999055 -0.609747 -0.307065 -0.278673 0.027594 -0.130699 -0.407603 -0.310585 -0.017607 0.143973 -0.242510 -0.209849 0.029619 -0.468900 -0.268792 0.054227 0.139182 -0.113074 -0.170744 -0.537197 -0.480541 -0.228578 -0.199692 0.373774 -0.014794 0.041107 -0.004493 0.011367 -0.558907 1.000000 0.728069
Year Until Remod -0.008367 -0.010451 -0.256948 0.505496 -0.727220 0.097670 -0.219680 -0.220328 -0.019367 -0.019176 -0.254174 -0.180573 0.116782 0.129186 -0.029186 -0.150131 -0.001676 -0.193849 -0.155062 0.050642 0.049882 0.030604 -0.099070 -0.306588 -0.279010 -0.099055 -0.039808 0.279852 0.012585 0.011449 -0.012689 0.011139 -0.240168 0.728069 1.000000
In [15]:
#Selecting the columns that have a correlation above 0.50 or below -0.50
corr_price = corr['SalePrice']
corr_price = (corr_price >= 0.50) | (corr_price <= -0.50)
corr = corr[corr_price]
corr = corr.loc[:,corr.index]
corr.style.background_gradient(cmap='viridis')
Out[15]:
Overall Qual Year Built Year Remod/Add Mas Vnr Area Total Bsmt SF 1st Flr SF Gr Liv Area Full Bath Garage Cars Garage Area SalePrice Years Sold
Overall Qual 1.000000 0.597027 0.569609 0.423089 0.547766 0.477837 0.570556 0.522263 0.599392 0.563562 0.799262 -0.597021
Year Built 0.597027 1.000000 0.612095 0.306797 0.407764 0.310463 0.241726 0.469406 0.537033 0.480726 0.558426 -0.999055
Year Remod/Add 0.569609 0.612095 1.000000 0.191740 0.298393 0.242108 0.316855 0.457266 0.425434 0.375566 0.532974 -0.609747
Mas Vnr Area 0.423089 0.306797 0.191740 1.000000 0.393530 0.392280 0.400703 0.254471 0.356646 0.370479 0.502196 -0.307065
Total Bsmt SF 0.547766 0.407764 0.298393 0.393530 1.000000 0.800429 0.445108 0.325434 0.437861 0.485608 0.632529 -0.407603
1st Flr SF 0.477837 0.310463 0.242108 0.392280 0.800429 1.000000 0.562166 0.371584 0.439370 0.491208 0.621676 -0.310585
Gr Liv Area 0.570556 0.241726 0.316855 0.400703 0.445108 0.562166 1.000000 0.630321 0.488854 0.483971 0.706780 -0.242510
Full Bath 0.522263 0.469406 0.457266 0.254471 0.325434 0.371584 0.630321 1.000000 0.478206 0.406497 0.545604 -0.468900
Garage Cars 0.599392 0.537033 0.425434 0.356646 0.437861 0.439370 0.488854 0.478206 1.000000 0.888699 0.647812 -0.537197
Garage Area 0.563562 0.480726 0.375566 0.370479 0.485608 0.491208 0.483971 0.406497 0.888699 1.000000 0.640138 -0.480541
SalePrice 0.799262 0.558426 0.532974 0.502196 0.632529 0.621676 0.706780 0.545604 0.647812 0.640138 1.000000 -0.558907
Years Sold -0.597021 -0.999055 -0.609747 -0.307065 -0.407603 -0.310585 -0.242510 -0.468900 -0.537197 -0.480541 -0.558907 1.000000

The above numeric features are the chosen ones. However, a few columns still need to be removed because they are too strongly correlated with each other (collinearity):

  • Garage Cars and Garage Area: drop Garage Area
  • Year Built and Year Remod/Add: drop Year Remod/Add
  • Total Bsmt SF and 1st Flr SF: drop Total Bsmt SF

Those will be dropped for now.
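The pairs above were spotted by eye. A small sketch for flagging them programmatically (the 0.8 cutoff is an assumption, not a rule) could be:

#Keep only the strict upper triangle so each pair is reported once,
#then flag the pairs whose absolute correlation exceeds the cutoff.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8])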

In [16]:
corr = corr.drop(['Garage Area','Year Remod/Add','Total Bsmt SF'],axis=1)
In [17]:
#corr now has all the right numeric columns. Let's store them in a final variable,
#leaving out the target column SalePrice.
final_numeric_cols = ['Overall Qual', 'Year Built', 'Mas Vnr Area', '1st Flr SF',
                      'Gr Liv Area', 'Full Bath', 'Garage Cars', 'Years Sold']

Short experiment

In [18]:
# Now we can experiment with the different numeric columns one at a time.
# Note that a multivariate test is not meaningful at this point, before standardization.
univariate_test = {}
for col in final_numeric_cols:
    #Redefine select_features so it picks a single feature plus the target.
    def select_features(df):
        return df[[col,'SalePrice']]

    select_df = select_features(data)
    rmse = train_test(select_df)

    univariate_test[col] = rmse

univariate_test
Out[18]:
{'Overall Qual': 49492.23698000454,
 'Year Built': 69220.60962857504,
 'Mas Vnr Area': 70899.7027183288,
 '1st Flr SF': 66145.85396434217,
 'Gr Liv Area': 57088.25161263909,
 'Full Bath': 66967.82905766825,
 'Garage Cars': 63016.29294369743,
 'Years Sold': 69229.55176985388}

Categorical values

Now it is time to do some research on the categorical values and see which ones are best suited to become dummy columns. To be selected, a column should be:

  • Informative:
    • It has a spread of different values. If most values in the column are the same, it can hardly tell us anything.
    • It has a coherent scale, and is not random.
  • Workable, which means it should not contain too many distinct values for our model. For example, if one column resulted in 20 dummy columns, there would be a lot of zero values, which would not improve our model (see the toy sketch below).
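To make the 'workable' point concrete, here is a toy sketch of how dummy columns multiply; the toy values are made up for illustration, while Neighborhood alone has 28 distinct values in this dataset.

#Each distinct value becomes its own 0/1 dummy column.
toy = pd.DataFrame({'Neighborhood': ['NAmes', 'Gilbert', 'Mitchel', 'NAmes', 'StoneBr']})
print(pd.get_dummies(toy['Neighborhood']))
print(data['Neighborhood'].nunique()) #28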
In [19]:
text_cols = ['MS Zoning', 'Street', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl',
       'Exterior 1st', 'Exterior 2nd', 'Exter Qual', 'Exter Cond',
       'Foundation', 'Heating', 'Heating QC', 'Central Air', 'Kitchen Qual',
       'Functional', 'Paved Drive']

Right now I will plot each column's value distribution and record the keep-or-drop verdict in a comment.

In [20]:
#Let's judge the values one by one
data['MS Zoning'].value_counts(normalize=True).plot.bar()
#This does not seem useful; almost 80% of the rows share the same value.
In [21]:
data['Street'].value_counts(normalize=True).plot.bar()
#This one can be very easily dropped
In [22]:
data['Lot Shape'].value_counts(normalize=True).plot.bar()
#This one is informative, telling us whether the lot shape is irregular.
In [23]:
data['Land Contour'].value_counts(normalize=True).plot.bar()
#This one could be useful, let's keep it.
In [24]:
data['Utilities'].value_counts(normalize=True).plot.bar()
#Easy drop
In [25]:
data['Land Slope'].value_counts(normalize=True).plot.bar()
#Many have the same value, so it is better to drop it. 
In [26]:
data['Neighborhood'].value_counts(normalize=True).plot.bar()
#This one has many values, but it is very informative. Hence I will keep it at first and maybe take it out later.
In [27]:
data['Condition 1'].value_counts(normalize=True).plot.bar()
#nope
In [28]:
data['Condition 2'].value_counts(normalize=True).plot.bar()
#nope
In [29]:
data['Bldg Type'].value_counts(normalize=True).plot.bar()
#Very informative, it's a keeper for now.
#(Dropped again on later inspection.)
In [30]:
data['House Style'].value_counts(normalize=True).plot.bar()
#Informative, keeper
In [31]:
data['Roof Style'].value_counts(normalize=True).plot.bar()
#Drop
In [32]:
data['Roof Matl'].value_counts(normalize=True).plot.bar()
#Drop
In [33]:
data['Exterior 1st'].value_counts(normalize=True).plot.bar()
#Drop
In [34]:
data['Exterior 2nd'].value_counts(normalize=True).plot.bar()
#Drop
In [35]:
data['Exter Qual'].value_counts(normalize=True).plot.bar()
#Keep
In [36]:
data['Exter Cond'].value_counts(normalize=True).plot.bar()
#Drop
In [37]:
data['Foundation'].value_counts(normalize=True).plot.bar()
#Keep
In [38]:
data['Heating'].value_counts(normalize=True).plot.bar()
#Drop
In [39]:
data['Heating QC'].value_counts(normalize=True).plot.bar()
#Keep
In [40]:
data['Central Air'].value_counts(normalize=True).plot.bar()
#This is a valuable attribute. Keep.
In [41]:
data['Kitchen Qual'].value_counts(normalize=True).plot.bar()
#Keep!
In [42]:
data['Functional'].value_counts(normalize=True).plot.bar()
#Keep for now; the column is quite informative.
In [43]:
data['Paved Drive'].value_counts(normalize=True).plot.bar()
#Keep
In [44]:
#After this manual selection I've kept:
text_cols = ['Lot Shape', 'Land Contour',
       'Lot Config', 'Neighborhood', 'Bldg Type', 'House Style',
       'Exter Qual','Foundation', 'Heating QC', 'Central Air', 'Kitchen Qual',
       'Functional', 'Paved Drive']

Now we can standardize the numeric columns and transform the categorical columns into dummy columns.

In [45]:
#Now we have all the right data in three objects: dummy columns, standardized numeric columns, and the target.
text_cols_dummies = pd.get_dummies(data[text_cols])
normalized_numeric = (data[final_numeric_cols]-data[final_numeric_cols].mean())/data[final_numeric_cols].std()
target = data['SalePrice']
In [46]:
#After concatenating, everything will be one happy data family.
data_final = pd.concat([text_cols_dummies,normalized_numeric,target],axis=1)
data_final
Out[46]:
Lot Shape_IR1 Lot Shape_IR2 Lot Shape_IR3 Lot Shape_Reg Land Contour_Bnk Land Contour_HLS Land Contour_Low Land Contour_Lvl Lot Config_Corner Lot Config_CulDSac Lot Config_FR2 Lot Config_FR3 Lot Config_Inside Neighborhood_Blmngtn Neighborhood_Blueste Neighborhood_BrDale Neighborhood_BrkSide Neighborhood_ClearCr Neighborhood_CollgCr Neighborhood_Crawfor Neighborhood_Edwards Neighborhood_Gilbert Neighborhood_Greens Neighborhood_GrnHill Neighborhood_IDOTRR Neighborhood_Landmrk Neighborhood_MeadowV Neighborhood_Mitchel Neighborhood_NAmes Neighborhood_NPkVill Neighborhood_NWAmes Neighborhood_NoRidge Neighborhood_NridgHt Neighborhood_OldTown Neighborhood_SWISU Neighborhood_Sawyer Neighborhood_SawyerW Neighborhood_Somerst Neighborhood_StoneBr Neighborhood_Timber Neighborhood_Veenker Bldg Type_1Fam Bldg Type_2fmCon Bldg Type_Duplex Bldg Type_Twnhs Bldg Type_TwnhsE House Style_1.5Fin House Style_1.5Unf House Style_1Story House Style_2.5Fin House Style_2.5Unf House Style_2Story House Style_SFoyer House Style_SLvl Exter Qual_Ex Exter Qual_Fa Exter Qual_Gd Exter Qual_TA Foundation_BrkTil Foundation_CBlock Foundation_PConc Foundation_Slab Foundation_Stone Foundation_Wood Heating QC_Ex Heating QC_Fa Heating QC_Gd Heating QC_Po Heating QC_TA Central Air_N Central Air_Y Kitchen Qual_Ex Kitchen Qual_Fa Kitchen Qual_Gd Kitchen Qual_Po Kitchen Qual_TA Functional_Maj1 Functional_Maj2 Functional_Min1 Functional_Min2 Functional_Mod Functional_Sal Functional_Sev Functional_Typ Paved Drive_N Paved Drive_P Paved Drive_Y Overall Qual Year Built Mas Vnr Area 1st Flr SF Gr Liv Area Full Bath Garage Cars Years Sold SalePrice
0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 -0.067242 -0.375473 0.061036 1.266787 0.309212 -1.024618 0.306537 0.447846 215000
1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 -0.775946 -0.342410 -0.565943 -0.672528 -1.194223 -1.024618 -1.008476 0.414834 105000
2 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 -0.067242 -0.441599 0.038644 0.432371 -0.337661 -1.024618 -1.008476 0.513872 172000
3 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0.641462 -0.110970 -0.565943 2.425273 1.207317 0.783894 0.306537 0.183744 244000
4 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 -0.775946 0.847855 -0.565943 -0.590873 0.255801 0.783894 0.306537 -0.773624 189900
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2925 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 -0.067242 0.418037 -0.565943 -0.399493 -0.982555 -1.024618 0.306537 -0.476510 142500
2926 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 -0.775946 0.384974 -0.565943 -0.657218 -1.182354 -1.024618 0.306537 -0.443497 131000
2927 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 -0.775946 0.682541 -0.565943 -0.483700 -1.047836 -1.024618 -2.323489 -0.740612 132000
2928 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 -0.775946 0.087408 -0.565943 0.585475 -0.218968 -1.024618 0.306537 -0.146383 170000
2929 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0.641462 0.715604 -0.039729 -0.417355 0.989715 0.783894 1.621550 -0.773624 188000

2930 rows × 96 columns

Now we are able to update our data pipeline.

In [63]:
def transform_features(data):
    #These are all the lines of code we have created above but in a compact fashion.
    per_missing = data.isnull().sum()/len(data)
    per_missing = per_missing[per_missing<0.05].index
    data = data[per_missing]
    text_data_cols = data.select_dtypes(include= ['object']).isnull().sum()
    text_drop_cols = text_data_cols[text_data_cols>0]
    data = data.drop(text_drop_cols.index, axis=1)
    numeric_data = data.select_dtypes(include= ['integer','float'])
    cols = numeric_data.columns
    mode = data.filter(cols).mode()
    data[cols] = data[cols].fillna(mode.iloc[0])
    data.isnull().sum().value_counts()
    years_sold = data['Yr Sold'] - data['Year Built']
    year_until_remod = data['Year Remod/Add'] - data['Year Built']
    data['Years Sold'] = years_sold
    data['Year Until Remod'] = year_until_remod
    data = data.drop(['Mo Sold','Yr Sold','Sale Type','Sale Condition','PID','Order'], axis=1)
    return data

def select_features(data):
    final_numeric_cols = ['Overall Qual', 'Year Built', 'Mas Vnr Area', '1st Flr SF',
                      'Gr Liv Area', 'Full Bath', 'Garage Cars', 'Years Sold']
    text_cols = ['Lot Shape', 'Land Contour',
       'Lot Config', 'Neighborhood', 'Bldg Type', 'House Style',
       'Exter Qual','Foundation', 'Heating QC', 'Central Air', 'Kitchen Qual',
       'Functional', 'Paved Drive']
    text_cols_dummies = pd.get_dummies(data[text_cols])
    normalized_numeric = (data[final_numeric_cols]-data[final_numeric_cols].mean())/data[final_numeric_cols].std()
    target = data['SalePrice']
    data_final = pd.concat([text_cols_dummies,normalized_numeric,target],axis=1)
    
    return data_final

def train_test(df):
    train = df[:1460]
    test = df[1460:]
    numeric_train = train.select_dtypes(include = ['integer','float'])
    numeric_test = test.select_dtypes(include = ['integer','float'])
    features = numeric_train.columns.drop('SalePrice')
    model = LinearRegression()
    model.fit(train[features],train['SalePrice'])
    prediction = model.predict(test[features])
    mse = mean_squared_error(test['SalePrice'],prediction)
    rmse = mse**0.5
    
    return rmse

selection = select_features(data)
rmse = train_test(selection)
print('rmse:',rmse)
rmse: 407005942187829.56

Now that is VERY poor performance. Let's be stricter about the categorical columns and see which ones we should drop. This kind of explosion typically happens with sparse dummy columns on a fixed split: a category that is rare (or absent) in the training half can receive an extreme coefficient, which then blows up the predictions on the test half.

After a lot of playing around and experimenting, the features below performed best. It's quite a big decrease!

By the way, I have even removed some of the numerical columns. They might come back from the dead later: this tweaking only served to tune the model on the current split, and cross-validation may tell us to welcome some columns back.
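One way to be stricter in a systematic fashion (a heuristic sketch; the selection below was in fact done by hand) is to flag categorical columns whose most common value dominates, since their dummy columns end up nearly constant:

#Drop candidates: columns where a single value covers more than 95% of the rows.
#The 0.95 cutoff is an arbitrary assumption.
dominated = [col for col in text_cols
             if data[col].value_counts(normalize=True).iloc[0] > 0.95]
print(dominated)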

In [101]:
def select_features(data):
    final_numeric_cols = ['Overall Qual', 'Mas Vnr Area',
                      'Gr Liv Area', 'Full Bath', 'Garage Cars', 'Years Sold']
    text_cols = ['Lot Shape', 'Land Contour',
       'Lot Config', 'Neighborhood', 'House Style',
       'Exter Qual', 'Heating QC', 'Central Air', 'Kitchen Qual',
       'Functional']
    text_cols_dummies = pd.get_dummies(data[text_cols])
    normalized_numeric = (data[final_numeric_cols]-data[final_numeric_cols].mean())/data[final_numeric_cols].std()
    target = data['SalePrice']
    data_final = pd.concat([text_cols_dummies,normalized_numeric,target],axis=1)
    
    return data_final

def train_test(df):
    train = df[:1460]
    test = df[1460:]
    numeric_train = train.select_dtypes(include = ['integer','float'])
    numeric_test = test.select_dtypes(include = ['integer','float'])
    features = numeric_train.columns.drop('SalePrice')
    model = LinearRegression()
    model.fit(train[features],train['SalePrice'])
    prediction = model.predict(test[features])
    mse = mean_squared_error(test['SalePrice'],prediction)
    rmse = mse**0.5
    
    return rmse
    
selection = select_features(data)
rmse = train_test(selection)
print('rmse:',rmse)
rmse: 35271.21165732622

Cross-Validation

Up to now we have only applied holdout validation. Here we will change the last part of the pipeline so we can apply several kinds of validation: the fixed holdout split (k=0), a two-way train/test swap (k=1), and k-fold cross-validation (k>1).
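For reference, scikit-learn can also run the k-fold loop for us. A minimal sketch equivalent to the k>1 branch below (using neg_mean_squared_error, since scikit-learn scorers are maximized) would be:

from sklearn.model_selection import cross_val_score, KFold

def kfold_rmse(df, k=10):
    features = df.columns.drop('SalePrice')
    kf = KFold(k, shuffle=True, random_state=5)
    #cross_val_score returns one negated MSE per fold; flip the sign and take roots.
    mses = cross_val_score(LinearRegression(), df[features], df['SalePrice'],
                           scoring='neg_mean_squared_error', cv=kf)
    return np.mean(np.sqrt(-mses))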

In [112]:
def transform_features(data):
    #These are all the lines of code we have created above but in a compact fashion.
    per_missing = data.isnull().sum()/len(data)
    per_missing = per_missing[per_missing<0.05].index
    data = data[per_missing]
    text_data_cols = data.select_dtypes(include= ['object']).isnull().sum()
    text_drop_cols = text_data_cols[text_data_cols>0]
    data = data.drop(text_drop_cols.index, axis=1)
    numeric_data = data.select_dtypes(include= ['integer','float'])
    cols = numeric_data.columns
    mode = data.filter(cols).mode()
    data[cols] = data[cols].fillna(mode.iloc[0])
    data.isnull().sum().value_counts()
    years_sold = data['Yr Sold'] - data['Year Built']
    year_until_remod = data['Year Remod/Add'] - data['Year Built']
    data['Years Sold'] = years_sold
    data['Year Until Remod'] = year_until_remod
    data = data.drop(['Mo Sold','Yr Sold','Sale Type','Sale Condition','PID','Order'], axis=1)
    return data

def select_features(data):
    final_numeric_cols = ['Overall Qual', 'Year Built', 'Mas Vnr Area', '1st Flr SF',
                      'Gr Liv Area', 'Full Bath', 'Garage Cars', 'Years Sold']
    text_cols = ['Lot Shape', 'Land Contour',
       'Lot Config', 'Neighborhood', 'Bldg Type', 'House Style',
       'Exter Qual','Foundation', 'Heating QC', 'Central Air', 'Kitchen Qual',
       'Functional', 'Paved Drive']
    text_cols_dummies = pd.get_dummies(data[text_cols])
    normalized_numeric = (data[final_numeric_cols]-data[final_numeric_cols].mean())/data[final_numeric_cols].std()
    target = data['SalePrice']
    data_final = pd.concat([text_cols_dummies,normalized_numeric,target],axis=1)
    
    return data_final

def train_test(df,k=0):
    model = LinearRegression()
    
    if k==0:
        train = df[:1460]
        test = df[1460:]
        numeric_train = train.select_dtypes(include = ['integer','float'])
        numeric_test = test.select_dtypes(include = ['integer','float'])
        features = numeric_train.columns.drop('SalePrice')
        model.fit(train[features],train['SalePrice'])
        prediction = model.predict(test[features])
        mse = mean_squared_error(test['SalePrice'],prediction)
        rmse = mse**0.5
        
        return rmse
    
    elif k==1:
        #sample(frac=1) shuffles the whole dataframe.
        #Here we do a two-way swap: the model is trained and tested on both halves,
        #and the two RMSEs are averaged.
        shuffled_df = df.sample(frac=1)
        
        train = shuffled_df[:1460]
        test = shuffled_df[1460:]
        
        train_no_price = train.drop('SalePrice', axis=1)
        test_no_price = test.drop('SalePrice',axis=1)
        
        model.fit(train_no_price,train['SalePrice'])
        prediction_one = model.predict(test_no_price)
        
        mse_one = mean_squared_error(test['SalePrice'],prediction_one)
        rmse_one = mse_one**0.5
        
        model.fit(test_no_price,test['SalePrice'])
        prediction_two = model.predict(train_no_price)
        
        mse_two = mean_squared_error(train['SalePrice'],prediction_two)
        rmse_two = mse_two**0.5

        print(rmse_one,rmse_two)
        avg_rmse = (rmse_one + rmse_two) / 2
        return avg_rmse
    
    else:
        kf = KFold(k,shuffle=True,random_state = 5)
        rmse_values = []
        for train_index, test_index in kf.split(df):
            #KFold yields positional indices; .loc is safe here because the
            #index is still the default RangeIndex (no rows were dropped).
            train = df.loc[train_index]
            test = df.loc[test_index]
            
            train_no_price = train.drop('SalePrice',axis=1)
            test_no_price = test.drop('SalePrice',axis=1)
            
            model.fit(train_no_price,train['SalePrice'])
            prediction = model.predict(test_no_price)
            
            mse = mean_squared_error(test['SalePrice'],prediction)
            rmse = mse**0.5
            
            rmse_values.append(rmse)
        
        print(rmse_values)
        avg_rmse = sum(rmse_values)/len(rmse_values)
        return avg_rmse

transformation = transform_features(raw_data)
selection = select_features(transformation)
rmse = train_test(selection, 10)
print('rmse:',rmse)
[22091309167529.234, 26325.783488498048, 23962.404978954568, 36133.86801129016, 209124838710047.94, 25351.71847370442, 40736.19352619256, 29633.301767522484, 27069.687985545977, 39692.875356692304]
rmse: 23121614812648.3
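Two of the ten fold RMSEs explode, presumably for the same sparse-dummy reason as the earlier holdout result, and they dominate the printed average; plugging the stricter select_features() from above back in should tame them. For completeness, the validation modes can now be compared from a single call site (the k values are arbitrary):

#k=0 reproduces the fixed holdout split, k=1 the two-way swap,
#and any k>1 runs k-fold cross validation.
for k in [0, 1, 4]:
    print('k =', k, '->', train_test(selection, k))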