Predicting House Sale Prices¶

*In this project, we will predict a House's Sale Price using its attributes. The data set we will be working with contains housing data for the city of Ames, Iowa, United States from 2006 to 2010. For each house, we have information such as Lot Area, Garage Area, Year Built, and more.*

Download
Documentation

*We will be using Linear Regression algorithm to predict a House's price accurately.*

In [1]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import matplotlib.style as style
style.use('fivethirtyeight')

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
%matplotlib inline

pd.set_option('display.max_columns', 500)
data = pd.read_csv('AmesHousing.txt', delimiter='\t')


data.head()

Out[1]:

	Order	PID	MS SubClass	MS Zoning	Lot Frontage	Lot Area	Street	Alley	Lot Shape	Land Contour	Utilities	Lot Config	Land Slope	Neighborhood	Condition 1	Condition 2	Bldg Type	House Style	Overall Qual	Overall Cond	Year Built	Year Remod/Add	Roof Style	Roof Matl	Exterior 1st	Exterior 2nd	Mas Vnr Type	Mas Vnr Area	Exter Qual	Exter Cond	Foundation	Bsmt Qual	Bsmt Cond	Bsmt Exposure	BsmtFin Type 1	BsmtFin SF 1	BsmtFin Type 2	BsmtFin SF 2	Bsmt Unf SF	Total Bsmt SF	Heating	Heating QC	Central Air	Electrical	1st Flr SF	2nd Flr SF	Gr Liv Area	Bsmt Full Bath	Full Bath	Half Bath	Bedroom AbvGr	Kitchen AbvGr	Kitchen Qual	TotRms AbvGrd	Functional	Fireplaces	Fireplace Qu	Garage Type	Garage Yr Blt	Garage Finish	Garage Cars	Garage Area	Garage Qual	Garage Cond	Paved Drive	Wood Deck SF	Open Porch SF	Screen Porch	Pool QC	Fence	Misc Feature	Misc Val	Mo Sold	Yr Sold	Sale Type	Sale Condition	SalePrice
0	1	526301100	20	RL	141.0	31770	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	6	5	1960	1960	Hip	CompShg	BrkFace	Plywood	Stone	112.0	TA	TA	CBlock	TA	Gd	Gd	BLQ	639.0	Unf	0.0	441.0	1080.0	GasA	Fa	Y	SBrkr	1656	0	1656	1.0	1	0	3	1	TA	7	Typ	2	Gd	Attchd	1960.0	Fin	2.0	528.0	TA	TA	P	210	62	0	NaN	NaN	NaN	0	5	2010	WD	Normal	215000
1	2	526350040	20	RH	80.0	11622	Pave	NaN	Reg	Lvl	AllPub	Inside	Gtl	NAmes	Feedr	Norm	1Fam	1Story	5	6	1961	1961	Gable	CompShg	VinylSd	VinylSd	None	0.0	TA	TA	CBlock	TA	TA	No	Rec	468.0	LwQ	144.0	270.0	882.0	GasA	TA	Y	SBrkr	896	0	896	0.0	1	0	2	1	TA	5	Typ	0	NaN	Attchd	1961.0	Unf	1.0	730.0	TA	TA	Y	140	0	120	NaN	MnPrv	NaN	0	6	2010	WD	Normal	105000
2	3	526351010	20	RL	81.0	14267	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	6	6	1958	1958	Hip	CompShg	Wd Sdng	Wd Sdng	BrkFace	108.0	TA	TA	CBlock	TA	TA	No	ALQ	923.0	Unf	0.0	406.0	1329.0	GasA	TA	Y	SBrkr	1329	0	1329	0.0	1	1	3	1	Gd	6	Typ	0	NaN	Attchd	1958.0	Unf	1.0	312.0	TA	TA	Y	393	36	0	NaN	NaN	Gar2	12500	6	2010	WD	Normal	172000
3	4	526353030	20	RL	93.0	11160	Pave	NaN	Reg	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	7	5	1968	1968	Hip	CompShg	BrkFace	BrkFace	None	0.0	Gd	TA	CBlock	TA	TA	No	ALQ	1065.0	Unf	0.0	1045.0	2110.0	GasA	Ex	Y	SBrkr	2110	0	2110	1.0	2	1	3	1	Ex	8	Typ	2	TA	Attchd	1968.0	Fin	2.0	522.0	TA	TA	Y	0	0	0	NaN	NaN	NaN	0	4	2010	WD	Normal	244000
4	5	527105010	60	RL	74.0	13830	Pave	NaN	IR1	Lvl	AllPub	Inside	Gtl	Gilbert	Norm	Norm	1Fam	2Story	5	5	1997	1998	Gable	CompShg	VinylSd	VinylSd	None	0.0	TA	TA	PConc	Gd	TA	No	GLQ	791.0	Unf	0.0	137.0	928.0	GasA	Gd	Y	SBrkr	928	701	1629	0.0	2	1	3	1	TA	6	Typ	1	TA	Attchd	1997.0	Fin	2.0	482.0	TA	TA	Y	212	34	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal	189900

Method to quickly get info related to a dataframe

In [2]:

def data_shape(data):
    print('Shape')
    print(data.shape)
    print(len(str(data.shape)) * "-")
    
    print('\nData Types')
    print(data.dtypes.value_counts())
    print(len(str(data.shape)) * "-")
    
    print('\nColumns : Null Counts')
    null_counts = data.isnull().sum()
    print(null_counts[null_counts > 0])
    

data_shape(data)

Shape
(2930, 82)
----------

Data Types
object     43
int64      28
float64    11
dtype: int64
----------

Columns : Null Counts
Lot Frontage       490
Alley             2732
Mas Vnr Type        23
Mas Vnr Area        23
Bsmt Qual           80
Bsmt Cond           80
Bsmt Exposure       83
BsmtFin Type 1      80
BsmtFin SF 1         1
BsmtFin Type 2      81
BsmtFin SF 2         1
Bsmt Unf SF          1
Total Bsmt SF        1
Electrical           1
Bsmt Full Bath       2
Bsmt Half Bath       2
Fireplace Qu      1422
Garage Type        157
Garage Yr Blt      159
Garage Finish      159
Garage Cars          1
Garage Area          1
Garage Qual        159
Garage Cond        159
Pool QC           2917
Fence             2358
Misc Feature      2824
dtype: int64

Transforming Features¶

Creating more useful columns from existing Columns
Dropping columns which are not useful or leak information about the sale
Dropping Columns which have more then 5% missing Values

In [3]:

# Creating more useful columns from existing Columns
df = data.copy()
df['Years_since_remodelled'] = df['Yr Sold'] - df['Year Remod/Add']
df['Years_since_sold'] = df['Yr Sold'] - df['Year Built']
df = df[(df['Years_since_remodelled'] >= 0)]
df = df[(df['Years_since_sold'] >= 0)]

# Dropping columns which are not useful or leak information about the sale
df.drop(['Sale Condition', 'Sale Type', 'Mo Sold', 'Yr Sold', 'PID', 'Order', 
           'Year Remod/Add', 'Year Built'], axis=1, inplace=True)

# Dropping Columns which have more then 5% missing Values
null_counts = df.isnull().sum()
df.drop(null_counts[null_counts > df.shape[0]*0.05].index, axis=1, inplace=True)

data_shape(df)
df.head()

Shape
(2927, 65)
----------

Data Types
object     32
int64      24
float64     9
dtype: int64
----------

Columns : Null Counts
Mas Vnr Type      23
Mas Vnr Area      23
Bsmt Qual         80
Bsmt Cond         80
Bsmt Exposure     83
BsmtFin Type 1    80
BsmtFin SF 1       1
BsmtFin Type 2    81
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Electrical         1
Bsmt Full Bath     2
Bsmt Half Bath     2
Garage Cars        1
Garage Area        1
dtype: int64

Out[3]:

	MS SubClass	MS Zoning	Lot Area	Street	Lot Shape	Land Contour	Utilities	Lot Config	Land Slope	Neighborhood	Condition 1	Condition 2	Bldg Type	House Style	Overall Qual	Overall Cond	Roof Style	Roof Matl	Exterior 1st	Exterior 2nd	Mas Vnr Type	Mas Vnr Area	Exter Qual	Exter Cond	Foundation	Bsmt Qual	Bsmt Cond	Bsmt Exposure	BsmtFin Type 1	BsmtFin SF 1	BsmtFin Type 2	BsmtFin SF 2	Bsmt Unf SF	Total Bsmt SF	Heating	Heating QC	Central Air	Electrical	1st Flr SF	2nd Flr SF	Gr Liv Area	Bsmt Full Bath	Full Bath	Half Bath	Bedroom AbvGr	Kitchen AbvGr	Kitchen Qual	TotRms AbvGrd	Functional	Fireplaces	Garage Cars	Garage Area	Paved Drive	Wood Deck SF	Open Porch SF	Screen Porch	Misc Val	SalePrice	Years_since_remodelled	Years_since_sold
0	20	RL	31770	Pave	IR1	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	6	5	Hip	CompShg	BrkFace	Plywood	Stone	112.0	TA	TA	CBlock	TA	Gd	Gd	BLQ	639.0	Unf	0.0	441.0	1080.0	GasA	Fa	Y	SBrkr	1656	0	1656	1.0	1	0	3	1	TA	7	Typ	2	2.0	528.0	P	210	62	0	0	215000	50	50
1	20	RH	11622	Pave	Reg	Lvl	AllPub	Inside	Gtl	NAmes	Feedr	Norm	1Fam	1Story	5	6	Gable	CompShg	VinylSd	VinylSd	None	0.0	TA	TA	CBlock	TA	TA	No	Rec	468.0	LwQ	144.0	270.0	882.0	GasA	TA	Y	SBrkr	896	0	896	0.0	1	0	2	1	TA	5	Typ	0	1.0	730.0	Y	140	0	120	0	105000	49	49
2	20	RL	14267	Pave	IR1	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	6	6	Hip	CompShg	Wd Sdng	Wd Sdng	BrkFace	108.0	TA	TA	CBlock	TA	TA	No	ALQ	923.0	Unf	0.0	406.0	1329.0	GasA	TA	Y	SBrkr	1329	0	1329	0.0	1	1	3	1	Gd	6	Typ	0	1.0	312.0	Y	393	36	0	12500	172000	52	52
3	20	RL	11160	Pave	Reg	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	7	5	Hip	CompShg	BrkFace	BrkFace	None	0.0	Gd	TA	CBlock	TA	TA	No	ALQ	1065.0	Unf	0.0	1045.0	2110.0	GasA	Ex	Y	SBrkr	2110	0	2110	1.0	2	1	3	1	Ex	8	Typ	2	2.0	522.0	Y	0	0	0	0	244000	42	42
4	60	RL	13830	Pave	IR1	Lvl	AllPub	Inside	Gtl	Gilbert	Norm	Norm	1Fam	2Story	5	5	Gable	CompShg	VinylSd	VinylSd	None	0.0	TA	TA	PConc	Gd	TA	No	GLQ	791.0	Unf	0.0	137.0	928.0	GasA	Gd	Y	SBrkr	928	701	1629	0.0	2	1	3	1	TA	6	Typ	1	2.0	482.0	Y	212	34	0	0	189900	12	13

Feature Engineering¶

Imputing missing values with mode for numerical columns
Finding Highly correlated columns to target column
Dropping numerical columns with high collinearity
Dropping Text Columns with missing Values
Converting nominal columns into dummy variables

Numerical columns

In [4]:

numerical_cols = df[df.dtypes[df.dtypes != 'object'].index].copy()
data_shape(numerical_cols)

Shape
(2927, 33)
----------

Data Types
int64      24
float64     9
dtype: int64
----------

Columns : Null Counts
Mas Vnr Area      23
BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Bsmt Full Bath     2
Bsmt Half Bath     2
Garage Cars        1
Garage Area        1
dtype: int64

Imputing missing values with mode

In [5]:

numerical_cols.fillna(numerical_cols.mode().iloc[0], inplace=True)
data_shape(numerical_cols)

Shape
(2927, 33)
----------

Data Types
int64      24
float64     9
dtype: int64
----------

Columns : Null Counts
Series([], dtype: int64)

Finding Highly correlated columns to target Column
Dropping columns with high collinearilty

In [6]:

correlation_matrix = numerical_cols.corr().abs()
sorted_corrs = correlation_matrix.SalePrice.sort_values()
strong_corrs = sorted_corrs[sorted_corrs > 0.3]

print(len(strong_corrs))
print(strong_corrs)

sns.heatmap(numerical_cols[strong_corrs.index].corr())
plt.show()

16
Open Porch SF             0.316262
Wood Deck SF              0.328183
BsmtFin SF 1              0.439284
Fireplaces                0.474831
TotRms AbvGrd             0.498574
Mas Vnr Area              0.506983
Years_since_remodelled    0.534985
Full Bath                 0.546118
Years_since_sold          0.558979
1st Flr SF                0.635185
Garage Area               0.641425
Total Bsmt SF             0.644012
Garage Cars               0.648361
Gr Liv Area               0.717596
Overall Qual              0.801206
SalePrice                 1.000000
Name: SalePrice, dtype: float64

In [7]:

strong_corrs.drop(['Garage Cars', 'Total Bsmt SF', 'TotRms AbvGrd'], inplace=True)
high_corr_num_cols = numerical_cols[strong_corrs.index]

data_shape(high_corr_num_cols)
high_corr_num_cols.head()

Shape
(2927, 13)
----------

Data Types
int64      10
float64     3
dtype: int64
----------

Columns : Null Counts
Series([], dtype: int64)

Out[7]:

	Open Porch SF	Wood Deck SF	BsmtFin SF 1	Fireplaces	Mas Vnr Area	Years_since_remodelled	Full Bath	Years_since_sold	1st Flr SF	Garage Area	Gr Liv Area	Overall Qual	SalePrice
0	62	210	639.0	2	112.0	50	1	50	1656	528.0	1656	6	215000
1	0	140	468.0	0	0.0	49	1	49	896	730.0	896	5	105000
2	36	393	923.0	0	108.0	52	1	52	1329	312.0	1329	6	172000
3	0	0	1065.0	2	0.0	42	2	42	2110	522.0	2110	7	244000
4	34	212	791.0	1	0.0	12	2	13	928	482.0	1629	5	189900

Text Columns

Dropping Columns with missing values

In [8]:

text_cols = df[df.dtypes[df.dtypes == 'object'].index].copy()

# Dropping Text Columns with missing values
null_cols = text_cols.isnull().sum()
text_cols.drop(null_cols[null_cols > 0].index, axis=1, inplace=True)
data_shape(text_cols)

Shape
(2927, 25)
----------

Data Types
object    25
dtype: int64
----------

Columns : Null Counts
Series([], dtype: int64)

Converting nominal columns to dummy variables

In [9]:

## Nominal columns from the documentation
nominal_cols = ['MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 
                'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual', 'Roof Style', 'Roof Mat1', 
                'Exterior 1st',  'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air'] 

nominal_num_col = ['MS SubClass']

## Categorical Text Data
common_columns = [x for x in text_cols if x in nominal_cols]
categories_per_col = {x:len(text_cols[x].value_counts()) for x in common_columns}
col_less_then_thresh = [x for x in categories_per_col if categories_per_col[x] <= 10]

text_cols = text_cols[col_less_then_thresh].astype('category') 
categorical_text_data = pd.get_dummies(text_cols)

## Categorical Numerical Data
common_columns = [x for x in numerical_cols if x in nominal_num_col]
numerical_cols[common_columns] = numerical_cols[common_columns].astype('category')      
categorical_num_data = pd.get_dummies(numerical_cols.select_dtypes(include=['category']))

Concatenating all the dataframes together

In [10]:

final_data = pd.concat([high_corr_num_cols, categorical_text_data, categorical_num_data], axis=1)
data_shape(final_data)
final_data.head()

Shape
(2927, 97)
----------

Data Types
uint8      84
int64      10
float64     3
dtype: int64
----------

Columns : Null Counts
Series([], dtype: int64)

Out[10]:

	Open Porch SF	Wood Deck SF	BsmtFin SF 1	Fireplaces	Mas Vnr Area	Years_since_remodelled	Full Bath	Years_since_sold	1st Flr SF	Garage Area	Gr Liv Area	Overall Qual	SalePrice	MS Zoning_RH	MS Zoning_RL	Street_Pave	Land Contour_Lvl	Lot Config_Corner	Lot Config_Inside	Condition 1_Feedr	Condition 1_Norm	Condition 2_Norm	Bldg Type_1Fam	House Style_1Story	House Style_2Story	Roof Style_Gable	Roof Style_Hip	Foundation_CBlock	Foundation_PConc	Heating_GasA	Central Air_Y	MS SubClass_20	MS SubClass_60
0	62	210	639.0	2	112.0	50	1	50	1656	528.0	1656	6	215000	0	1	1	1	1	0	0	1	1	1	1	0	0	1	1	0	1	1	1	0
1	0	140	468.0	0	0.0	49	1	49	896	730.0	896	5	105000	1	0	1	1	0	1	1	0	1	1	1	0	1	0	1	0	1	1	1	0
2	36	393	923.0	0	108.0	52	1	52	1329	312.0	1329	6	172000	0	1	1	1	1	0	0	1	1	1	1	0	0	1	1	0	1	1	1	0
3	0	0	1065.0	2	0.0	42	2	42	2110	522.0	2110	7	244000	0	1	1	1	1	0	0	1	1	1	1	0	0	1	1	0	1	1	1	0
4	34	212	791.0	1	0.0	12	2	13	928	482.0	1629	5	189900	0	1	1	1	0	1	0	1	1	1	0	1	1	0	0	1	1	1	0	1

Creating a Pipeline with adjustable parameters¶

When we did our data cleaning, we decided to remove columns that had more than 5% missing values. We can incorporate our this into a function as an adjustable parameter. In addition, this function will perform all the data cleaning operations done above.

In [11]:

def transform_features(df, percent_missing=0.05):
    
    # Adding relevant features
    df['Years_since_remodelled'] = df['Yr Sold'] - df['Year Remod/Add']
    df['Years_since_sold'] = df['Yr Sold'] - df['Year Built']
    df = df[(df['Years_since_remodelled'] >= 0)]
    df = df[(df['Years_since_sold'] >= 0)]

    # Dropping columns which are not useful or leak information about the sale
    df.drop(['Sale Condition', 'Sale Type', 'Mo Sold', 'Yr Sold', 'PID', 'Order', 
               'Year Remod/Add', 'Year Built'], axis=1, inplace=True)

    # Dropping Columns which have more then given percentage missing Values
    null_counts = df.isnull().sum()
    df.drop(null_counts[null_counts > df.shape[0]*percent_missing].index, axis=1, inplace=True)
    
    return df

For the feature engineering, we chose columns that had more than 0.4 correlation with 'SalePrice' and removed any columns with more than 10 categories.

Once again, we will combine all the work done previously into a function with adjustable parameters.

In [12]:

def select_features(df, correlation_threshold = 0.4, cat_threshold = 10):
    ## Imputing missing values with mode
    numerical_cols = df[df.dtypes[df.dtypes != 'object'].index].copy()
    numerical_cols.fillna(numerical_cols.mode().iloc[0], inplace=True)
    
    ## Dropping columns less then given correlation threshold
    correlation_matrix = numerical_cols.corr().abs()
    sorted_corrs = correlation_matrix.SalePrice.sort_values()
    strong_corrs = sorted_corrs[sorted_corrs > correlation_threshold]
    
    ## Comment below line if correlation threshold > 0.4
#     strong_corrs.drop(['Garage Cars', 'Total Bsmt SF', 'TotRms AbvGrd'], inplace=True)
    high_corr_num_cols = numerical_cols[strong_corrs.index]
    
    
    ## Text Columns
    text_cols = df[df.dtypes[df.dtypes == 'object'].index].copy()
    null_cols = text_cols.isnull().sum()
    text_cols.drop(null_cols[null_cols > 0].index, axis=1, inplace=True)
    
    #Nominal columns from the documentation
    nominal_cols = ['MS Zoning', 'Street', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 
                    'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual', 'Roof Style', 'Roof Mat1', 
                    'Exterior 1st',  'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Central Air'] 
    
    nominal_num_col = ['MS SubClass']
    
    ## Categorical Text Data
    common_columns = [x for x in text_cols if x in nominal_cols]
    categories_per_col = {x:len(text_cols[x].value_counts()) for x in common_columns}
    col_less_then_thresh = [x for x in categories_per_col if categories_per_col[x] <= cat_threshold]
    
    text_cols = text_cols[col_less_then_thresh].astype('category') 
    categorical_text_data = pd.get_dummies(text_cols)
    
    # Categorical Numerical Data
    common_columns = [x for x in numerical_cols if x in nominal_num_col]
    numerical_cols[common_columns] = numerical_cols[common_columns].astype('category')      
    categorical_num_data = pd.get_dummies(numerical_cols.select_dtypes(include=['category']))
    
    ## Concatenating all the dataframes together
    final_data = pd.concat([high_corr_num_cols, categorical_text_data, categorical_num_data], axis=1)
    return final_data

Applying Linear Regression¶

Now we are ready to apply machine learning, we'll use the linear regression model from scikit-learn. Linear regression should work well here since our target column 'SalePrice' is a continuous value. We'll evaluate this model with RMSE as an error metric.

In [13]:

def train_and_test(data):
    train = data[0:1460]
    test = data[1460:]
    features = data.columns.drop(['SalePrice'])
    
    #train
    lr = LinearRegression()
    lr.fit(train[features], train['SalePrice'])
    #predict
    
    predictions = lr.predict(test[features])
    rmse = np.sqrt(mean_squared_error(test['SalePrice'], predictions))
    return rmse

data = pd.read_csv('AmesHousing.txt', delimiter='\t')

transformed_data = transform_features(data)
final_data = select_features(transformed_data)
result = train_and_test(final_data)

print(result)

final_data.head()

34917.70149125682

Out[13]:

	BsmtFin SF 1	Fireplaces	TotRms AbvGrd	Mas Vnr Area	Years_since_remodelled	Full Bath	Years_since_sold	1st Flr SF	Garage Area	Total Bsmt SF	Garage Cars	Gr Liv Area	Overall Qual	SalePrice	MS Zoning_RH	MS Zoning_RL	Street_Pave	Land Contour_Lvl	Lot Config_Corner	Lot Config_Inside	Condition 1_Feedr	Condition 1_Norm	Condition 2_Norm	Bldg Type_1Fam	House Style_1Story	House Style_2Story	Roof Style_Gable	Roof Style_Hip	Foundation_CBlock	Foundation_PConc	Heating_GasA	Central Air_Y	MS SubClass_20	MS SubClass_60
0	639.0	2	7	112.0	50	1	50	1656	528.0	1080.0	2.0	1656	6	215000	0	1	1	1	1	0	0	1	1	1	1	0	0	1	1	0	1	1	1	0
1	468.0	0	5	0.0	49	1	49	896	730.0	882.0	1.0	896	5	105000	1	0	1	1	0	1	1	0	1	1	1	0	1	0	1	0	1	1	1	0
2	923.0	0	6	108.0	52	1	52	1329	312.0	1329.0	1.0	1329	6	172000	0	1	1	1	1	0	0	1	1	1	1	0	0	1	1	0	1	1	1	0
3	1065.0	2	8	0.0	42	2	42	2110	522.0	2110.0	2.0	2110	7	244000	0	1	1	1	1	0	0	1	1	1	1	0	0	1	1	0	1	1	1	0
4	791.0	1	6	0.0	12	2	13	928	482.0	928.0	2.0	1629	5	189900	0	1	1	1	0	1	0	1	1	1	0	1	1	0	0	1	1	1	0	1

Cross Validation¶

We've selected the first 1460 rows as the training set, and the remaining data as the testing set. This is not really a good way to evaluate a model's performance because the error will change as soon as we shuffle the data.

We can use KFold cross validation to split the data in K number of folds. Using the KFold function from scikit learn, we can get the indices for the testing and training sets.

In [14]:

from sklearn.model_selection import KFold

def train_and_test2(data, k=2):  
    rf = LinearRegression()
    if k == 0:
        train = data[0:1460]
        test = data[1460:]
        features = data.columns.drop(['SalePrice'])
    
        #train
        rf.fit(train[features], train['SalePrice'])
        
        #predict    
        predictions = rf.predict(test[features])
        rmse = mean_squared_error(test['SalePrice'], predictions)**0.5
        return rmse
    
    elif k == 1:
        train = data[:1460]
        test = data[1460:]
        features = data.columns.drop(['SalePrice'])
        
        rf.fit(train[features], train["SalePrice"])
        predictions_one = rf.predict(test[features])        
        
        mse_one = mean_squared_error(test["SalePrice"], predictions_one)
        rmse_one = np.sqrt(mse_one)
        
        rf.fit(test[features], test["SalePrice"])
        predictions_two = rf.predict(train[features])        
       
        mse_two = mean_squared_error(train["SalePrice"], predictions_two)
        rmse_two = np.sqrt(mse_two)
        return np.mean([rmse_one, rmse_two])   
    
    else:
        kf = KFold(n_splits=k, shuffle=True, random_state = 2)
        rmse_list = []
        for train_index, test_index in kf.split(data):
            train = data.iloc[train_index]
            test = data.iloc[test_index]
            features = data.columns.drop(['SalePrice'])
    
            #train
            rf.fit(train[features], train['SalePrice'])
        
            #predict    
            predictions = rf.predict(test[features])
        
            rmse = mean_squared_error(test['SalePrice'], predictions)**0.5
            rmse_list.append(rmse)
        return np.mean(rmse_list)

In [17]:

data = pd.read_csv("AmesHousing.txt", delimiter='\t')

transformed_data = transform_features(data, percent_missing=0.05)
final_data = select_features(transformed_data, 0.4, 10)

results = []
for i in range(100):
    result = train_and_test2(final_data, k=i)
    results.append(result)
    
x = [i for i in range(100)]
y = results 
plt.plot(x, y)
plt.xlabel('Kfolds')
plt.ylabel('RMSE')
plt.title('RMSE vs KFolds')
plt.savefig('rmse.png')

min_rmse = min(results)

print(results.index(min_rmse), min_rmse)

99 29215.144646883306

Our error is actually the lowest, when k = 99. We can see the average RMSE goes down as we increase the number of folds. This makes sense as the RMSE shown on the graph above is an average of the cross validation tests. A larger K means we have less bias towards overestimating the model's true error. As a trade off, this requires a lot more computation time.

*That is it for now though, the goal of this project is to explore the fundamentals of Linear Regression.*