Predicting House Sale Price¶

We'll work with housing data for the city of Ames, Iowa, United States from 2006 to 2010. You can read more about why the data was collected here. You can also read about the different columns in the data here.

Our task is to predict the house sale price by using Linear Regression Model.

Importing Dependencies¶

In [106]:

# Import all the libraries and modules necessary for the project.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,accuracy_score
from sklearn import preprocessing
from sklearn.model_selection import train_test_split,KFold

Exploring Data Set¶

In [107]:

# creating pandas DataFrame.
house_df = pd.read_csv("AmesHousing.tsv",delimiter = '\t')

In [108]:

house_df.head(3)

Out[108]:

	Order	PID	MS SubClass	MS Zoning	Lot Frontage	Lot Area	Street	Alley	Lot Shape	Land Contour	...	Pool QC	Fence	Misc Feature	Misc Val	Mo Sold	Yr Sold	Sale Type	Sale Condition	SalePrice
0	1	526301100	20	RL	141.0	31770	Pave	NaN	IR1	Lvl	...	NaN	NaN	NaN	0	5	2010	WD	Normal	215000
1	2	526350040	20	RH	80.0	11622	Pave	NaN	Reg	Lvl	...	NaN	MnPrv	NaN	0	6	2010	WD	Normal	105000
2	3	526351010	20	RL	81.0	14267	Pave	NaN	IR1	Lvl	...	NaN	NaN	Gar2	12500	6	2010	WD	Normal	172000

3 rows × 82 columns

In [109]:

house_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
Order              2930 non-null int64
PID                2930 non-null int64
MS SubClass        2930 non-null int64
MS Zoning          2930 non-null object
Lot Frontage       2440 non-null float64
Lot Area           2930 non-null int64
Street             2930 non-null object
Alley              198 non-null object
Lot Shape          2930 non-null object
Land Contour       2930 non-null object
Utilities          2930 non-null object
Lot Config         2930 non-null object
Land Slope         2930 non-null object
Neighborhood       2930 non-null object
Condition 1        2930 non-null object
Condition 2        2930 non-null object
Bldg Type          2930 non-null object
House Style        2930 non-null object
Overall Qual       2930 non-null int64
Overall Cond       2930 non-null int64
Year Built         2930 non-null int64
Year Remod/Add     2930 non-null int64
Roof Style         2930 non-null object
Roof Matl          2930 non-null object
Exterior 1st       2930 non-null object
Exterior 2nd       2930 non-null object
Mas Vnr Type       2907 non-null object
Mas Vnr Area       2907 non-null float64
Exter Qual         2930 non-null object
Exter Cond         2930 non-null object
Foundation         2930 non-null object
Bsmt Qual          2850 non-null object
Bsmt Cond          2850 non-null object
Bsmt Exposure      2847 non-null object
BsmtFin Type 1     2850 non-null object
BsmtFin SF 1       2929 non-null float64
BsmtFin Type 2     2849 non-null object
BsmtFin SF 2       2929 non-null float64
Bsmt Unf SF        2929 non-null float64
Total Bsmt SF      2929 non-null float64
Heating            2930 non-null object
Heating QC         2930 non-null object
Central Air        2930 non-null object
Electrical         2929 non-null object
1st Flr SF         2930 non-null int64
2nd Flr SF         2930 non-null int64
Low Qual Fin SF    2930 non-null int64
Gr Liv Area        2930 non-null int64
Bsmt Full Bath     2928 non-null float64
Bsmt Half Bath     2928 non-null float64
Full Bath          2930 non-null int64
Half Bath          2930 non-null int64
Bedroom AbvGr      2930 non-null int64
Kitchen AbvGr      2930 non-null int64
Kitchen Qual       2930 non-null object
TotRms AbvGrd      2930 non-null int64
Functional         2930 non-null object
Fireplaces         2930 non-null int64
Fireplace Qu       1508 non-null object
Garage Type        2773 non-null object
Garage Yr Blt      2771 non-null float64
Garage Finish      2771 non-null object
Garage Cars        2929 non-null float64
Garage Area        2929 non-null float64
Garage Qual        2771 non-null object
Garage Cond        2771 non-null object
Paved Drive        2930 non-null object
Wood Deck SF       2930 non-null int64
Open Porch SF      2930 non-null int64
Enclosed Porch     2930 non-null int64
3Ssn Porch         2930 non-null int64
Screen Porch       2930 non-null int64
Pool Area          2930 non-null int64
Pool QC            13 non-null object
Fence              572 non-null object
Misc Feature       106 non-null object
Misc Val           2930 non-null int64
Mo Sold            2930 non-null int64
Yr Sold            2930 non-null int64
Sale Type          2930 non-null object
Sale Condition     2930 non-null object
SalePrice          2930 non-null int64
dtypes: float64(11), int64(28), object(43)
memory usage: 1.8+ MB

Creating functions to transorm, train & test data¶

In [110]:

# transforming the features for training and testing.

def transform_features():
    train_df = house_df.iloc[:1460]
    test_df = house_df.iloc[1460:]
    return train_df,test_df

def select_features():
    
    
    return(['Gr Liv Area','Saleprice'])

def train_and_test():
    
    train,test = transform_features()
    features = select_features()
    features = features.remove('SalePrice')
    
    lr = LinearRegression()
    lr.fit(train[[features]],train['SalePrice'])
    predicted = lr.predict(test[[features]])
    rmse = mean_squared_error(test['SalePrice'],predicted)
    
    return rmse

Feature Engineering¶

Let's now start removing features with many missing values, diving deeper into potential categorical features, and transforming text and numerical columns.

1.) Remove column frame with more than 25% missing data.

2.) Imputing numerical column with most famous value with less than 5 % missing data.

3.) Imputing and dropping remaing numericals columns based on its significance.

4.) Creating New Column Like Year Built

5.) Dealing with the 'objective' column having missing values.

Removing Columns with more than 25% missing data.¶

In [111]:

prop = house_df.isnull().sum()/house_df.shape[0]
x = prop[prop >= .25]
x

Out[111]:

Alley           0.932423
Fireplace Qu    0.485324
Pool QC         0.995563
Fence           0.804778
Misc Feature    0.963823
dtype: float64

Above columns have more than 25% missing data. So, we will drop this column now and llater we will update this in transform_feature() function also.

In [112]:

cols = list(x.index)
cols

Out[112]:

['Alley', 'Fireplace Qu', 'Pool QC', 'Fence', 'Misc Feature']

In [113]:

house_df.drop(columns = cols,axis = 1,inplace = True)
house_df.head()

Out[113]:

	Order	PID	MS SubClass	MS Zoning	Lot Frontage	Lot Area	Street	Lot Shape	Land Contour	Utilities	...	Screen Porch	Misc Val	Mo Sold	Yr Sold	Sale Type	Sale Condition	SalePrice
0	1	526301100	20	RL	141.0	31770	Pave	IR1	Lvl	AllPub	...	0	0	5	2010	WD	Normal	215000
1	2	526350040	20	RH	80.0	11622	Pave	Reg	Lvl	AllPub	...	120	0	6	2010	WD	Normal	105000
2	3	526351010	20	RL	81.0	14267	Pave	IR1	Lvl	AllPub	...	0	12500	6	2010	WD	Normal	172000
3	4	526353030	20	RL	93.0	11160	Pave	Reg	Lvl	AllPub	...	0	0	4	2010	WD	Normal	244000
4	5	527105010	60	RL	74.0	13830	Pave	IR1	Lvl	AllPub	...	0	0	3	2010	WD	Normal	189900

5 rows × 77 columns

In [114]:

prop = house_df.isnull().sum()/house_df.shape[0]
x = prop[prop >= .25]
x

Out[114]:

Series([], dtype: float64)

Imputing numerical column with most famous value with less than 5 % missing data¶

In [115]:

x = prop[prop <= 0.05]

In [116]:

numeric_df = house_df[list(x.index)].select_dtypes(include = 'number')

In [117]:

a = numeric_df.isnull().sum()
x = a[a!= 0]
x

Out[117]:

Mas Vnr Area      23
BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Bsmt Full Bath     2
Bsmt Half Bath     2
Garage Cars        1
Garage Area        1
dtype: int64

In [118]:

mode_dict = {}
for each in list(x.index):
    mode_dict[each] = house_df[each].mode()[0]

In [119]:

house_df = house_df.fillna(mode_dict)

In [120]:

house_df.select_dtypes(include = 'number').isnull().sum()

Out[120]:

Order                0
PID                  0
MS SubClass          0
Lot Frontage       490
Lot Area             0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Mas Vnr Area         0
BsmtFin SF 1         0
BsmtFin SF 2         0
Bsmt Unf SF          0
Total Bsmt SF        0
1st Flr SF           0
2nd Flr SF           0
Low Qual Fin SF      0
Gr Liv Area          0
Bsmt Full Bath       0
Bsmt Half Bath       0
Full Bath            0
Half Bath            0
Bedroom AbvGr        0
Kitchen AbvGr        0
TotRms AbvGrd        0
Fireplaces           0
Garage Yr Blt      159
Garage Cars          0
Garage Area          0
Wood Deck SF         0
Open Porch SF        0
Enclosed Porch       0
3Ssn Porch           0
Screen Porch         0
Pool Area            0
Misc Val             0
Mo Sold              0
Yr Sold              0
SalePrice            0
dtype: int64

Since,only two numerical data have missing values.The year on which garage built is an irrelevant column but the Lot Frontage is a relevant column. So, we will drop Garage Yr Blt column and fill mean value in Lot Frontage column.

Imputing and dropping remaing numericals columns based on its significance.¶

In [121]:

house_df = house_df.drop('Garage Yr Blt',axis = 1)

In [122]:

house_df = house_df.fillna({'Lot Frontage': house_df['Lot Frontage'].mean()})

In [123]:

house_df.select_dtypes(include = 'number').notnull().sum()

Out[123]:

Order              2930
PID                2930
MS SubClass        2930
Lot Frontage       2930
Lot Area           2930
Overall Qual       2930
Overall Cond       2930
Year Built         2930
Year Remod/Add     2930
Mas Vnr Area       2930
BsmtFin SF 1       2930
BsmtFin SF 2       2930
Bsmt Unf SF        2930
Total Bsmt SF      2930
1st Flr SF         2930
2nd Flr SF         2930
Low Qual Fin SF    2930
Gr Liv Area        2930
Bsmt Full Bath     2930
Bsmt Half Bath     2930
Full Bath          2930
Half Bath          2930
Bedroom AbvGr      2930
Kitchen AbvGr      2930
TotRms AbvGrd      2930
Fireplaces         2930
Garage Cars        2930
Garage Area        2930
Wood Deck SF       2930
Open Porch SF      2930
Enclosed Porch     2930
3Ssn Porch         2930
Screen Porch       2930
Pool Area          2930
Misc Val           2930
Mo Sold            2930
Yr Sold            2930
SalePrice          2930
dtype: int64

Creating New Column¶

In [124]:

house_df['years_until_remod'] = house_df['Year Built'] - house_df['Year Remod/Add']

In [125]:

house_df['house_age'] = house_df['Year Built'] - house_df['Yr Sold']

In [126]:

house_df = house_df.drop(['Year Built','Year Remod/Add','Yr Sold'],axis = 1)

Dealing with the 'objective' column having missing values.¶

In [127]:

x = house_df.isnull().sum()
a = x[x>0]
a

Out[127]:

Mas Vnr Type       23
Bsmt Qual          80
Bsmt Cond          80
Bsmt Exposure      83
BsmtFin Type 1     80
BsmtFin Type 2     81
Electrical          1
Garage Type       157
Garage Finish     159
Garage Qual       159
Garage Cond       159
dtype: int64

In [128]:

house_df = house_df.dropna(axis = 0)

Since, all the object columns have missing values near to 5% so, it's best to drop all these rows.

In [129]:

x = house_df.isnull().sum()
x[x>0]

Out[129]:

Series([], dtype: int64)

Updating `transform_features` function¶

In [130]:

def transform_features(house_df):
    
    prop = house_df.isnull().sum()/house_df.shape[0]
    x = prop[prop >= .25]
    cols = list(x.index)
    house_df.drop(columns = cols,axis = 1,inplace = True)
    
    x = prop[prop <= 0.05]
    numeric_df = house_df[list(x.index)].select_dtypes(include = 'number')
    mode_dict = {}
    for each in list(x.index):
        mode_dict[each] = house_df[each].mode()[0]
    house_df = house_df.fillna(mode_dict)
    
    house_df.drop('Garage Yr Blt',axis = 1,inplace = True)
    house_df = house_df.fillna({'Lot Frontage': house_df['Lot Frontage'].mean()})
    
    house_df['years_until_remod'] = house_df['Year Built'] - house_df['Year Remod/Add']
    house_df['house_age'] = house_df['Year Built'] - house_df['Yr Sold']
    house_df = house_df.drop(['Year Built','Year Remod/Add','Yr Sold'],axis = 1)
    
    house_df = house_df.dropna(axis = 0)
    
    return house_df

Feature Selection¶

Numerical Column¶

In [131]:

a = house_df.select_dtypes(include = 'number').columns

In [132]:

len(a)

Out[132]:

In [133]:

%matplotlib inline

corr =  house_df[a].corr().loc[:, ['SalePrice']]
plt.figure(figsize = (3,10))

sns.heatmap(corr[corr > 0.4].dropna(),annot = True)

Out[133]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f01d46d6f60>

In [134]:

corr_high = corr[corr > 0.4].dropna()
corr_high

Out[134]:

	SalePrice
Overall Qual	0.793249
Mas Vnr Area	0.499652
BsmtFin SF 1	0.412925
Total Bsmt SF	0.630305
1st Flr SF	0.630424
Gr Liv Area	0.715867
Full Bath	0.555861
TotRms AbvGrd	0.523363
Fireplaces	0.452517
Garage Cars	0.652448
Garage Area	0.633467
SalePrice	1.000000
house_age	0.546237

In [135]:

house_df.drop(columns = list(corr[corr<0.4].dropna().index),axis = 1 , inplace = True)

In [136]:

house_df.select_dtypes(include = ['number']).columns

Out[136]:

Index(['Overall Qual', 'Mas Vnr Area', 'BsmtFin SF 1', 'Total Bsmt SF',
       '1st Flr SF', 'Gr Liv Area', 'Full Bath', 'TotRms AbvGrd', 'Fireplaces',
       'Garage Cars', 'Garage Area', 'SalePrice', 'house_age'],
      dtype='object')

In [137]:

a = house_df.corr()
plt.figure(figsize = (10,10))
sns.heatmap(a[a>0.8],annot = True)

Out[137]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f01d45bdef0>

In [138]:

house_df.drop(columns = ['1st Flr SF','TotRms AbvGrd','Garage Cars'],axis = 1 , inplace = True)

Object Column¶

In [139]:

## Create a list of column names from documentation that are *meant* to be categorical
nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]

In [140]:

transform_cat_col = []

for col in nominal_features:
    if col in house_df.columns:
        transform_cat_col.append(col)

        
transform_cat_col

Out[140]:

['MS Zoning',
 'Street',
 'Land Contour',
 'Lot Config',
 'Neighborhood',
 'Condition 1',
 'Condition 2',
 'Bldg Type',
 'House Style',
 'Roof Style',
 'Roof Matl',
 'Exterior 1st',
 'Exterior 2nd',
 'Mas Vnr Type',
 'Foundation',
 'Heating',
 'Central Air',
 'Garage Type',
 'Sale Type',
 'Sale Condition']

In [141]:

a = house_df[transform_cat_col].describe()
a

Out[141]:

	MS Zoning	Street	Land Contour	Lot Config	Neighborhood	Condition 1	Condition 2	Bldg Type	House Style	Roof Style	Roof Matl	Exterior 1st	Exterior 2nd	Mas Vnr Type	Foundation	Heating	Central Air	Garage Type	Sale Type	Sale Condition
count	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678
unique	6	2	4	5	28	9	8	5	8	6	8	14	16	5	5	4	2	6	10	6
top	RL	Pave	Lvl	Inside	NAmes	Norm	Norm	1Fam	1Story	Gable	CompShg	VinylSd	VinylSd	None	PConc	GasA	Y	Attchd	WD	Normal
freq	2112	2669	2409	1946	416	2324	2650	2258	1343	2106	2641	965	952	1549	1242	2650	2559	1676	2317	2226

In [142]:

x = a.loc['unique',:]
x[x>10]

Out[142]:

Neighborhood    28
Exterior 1st    14
Exterior 2nd    16
Name: unique, dtype: object

In [143]:

house_df.drop(x[x>10].index,axis = 1,inplace = True)
transform_cat_col = list(x[x<=10].index)

In [144]:

a = house_df[transform_cat_col].describe()
a

Out[144]:

	MS Zoning	Street	Land Contour	Lot Config	Condition 1	Condition 2	Bldg Type	House Style	Roof Style	Roof Matl	Mas Vnr Type	Foundation	Heating	Central Air	Garage Type	Sale Type	Sale Condition
count	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678	2678
unique	6	2	4	5	9	8	5	8	6	8	5	5	4	2	6	10	6
top	RL	Pave	Lvl	Inside	Norm	Norm	1Fam	1Story	Gable	CompShg	None	PConc	GasA	Y	Attchd	WD	Normal
freq	2112	2669	2409	1946	2324	2650	2258	1343	2106	2641	1549	1242	2650	2559	1676	2317	2226

In [145]:

x = a.loc['freq',:]/a.loc['count',:]
x[x>.8]

Out[145]:

Street            0.996639
Land Contour      0.899552
Condition 1       0.867812
Condition 2       0.989544
Bldg Type         0.843167
Roof Matl         0.986184
Heating           0.989544
Central Air       0.955564
Sale Type         0.865198
Sale Condition    0.831217
dtype: object

In [146]:

house_df.drop(list(x[x>.8].index),axis = 1, inplace = True)

In [147]:

text_cols = house_df.select_dtypes(include = ['object']).columns
for col in text_cols:
    house_df[col] = house_df[col].astype('category')
transformed_df = pd.concat([house_df,pd.get_dummies(house_df[text_cols])],axis = 1).drop(text_cols,axis = 1)

In [148]:

transformed_df.head()

Out[148]:

	Overall Qual	Mas Vnr Area	BsmtFin SF 1	Total Bsmt SF	Gr Liv Area	Full Bath	Fireplaces	Garage Area	SalePrice	house_age	...	Garage Qual_TA	Garage Cond_TA	Paved Drive_P	Paved Drive_Y
0	6	112.0	639.0	1080.0	1656	1	2	528.0	215000	-50	...	1	1	1	0
1	5	0.0	468.0	882.0	896	1	0	730.0	105000	-49	...	1	1	0	1
2	6	108.0	923.0	1329.0	1329	1	0	312.0	172000	-52	...	1	1	0	1
3	7	0.0	1065.0	2110.0	2110	2	2	522.0	244000	-42	...	1	1	0	1
4	5	0.0	791.0	928.0	1629	2	1	482.0	189900	-13	...	1	1	0	1

5 rows × 135 columns

In [149]:

house_df.select_dtypes(include = ['object']).columns

Out[149]:

Index([], dtype='object')

Updating `select_features` function¶

In [150]:

def select_features(house_df):
    
    
    a = house_df.select_dtypes(include = 'number').columns
    corr =  house_df[a].corr().loc[:, ['SalePrice']]
    house_df.drop(columns = list(corr[corr<0.4].dropna().index),axis = 1 , inplace = True)
    
    
    house_df.drop(columns = ['1st Flr SF','TotRms AbvGrd','Garage Cars'],axis = 1 , inplace = True)
    
    
    nominal_features = ["PID", "MS SubClass", "MS Zoning", "Street", "Alley", "Land Contour", "Lot Config", "Neighborhood", 
                    "Condition 1", "Condition 2", "Bldg Type", "House Style", "Roof Style", "Roof Matl", "Exterior 1st", 
                    "Exterior 2nd", "Mas Vnr Type", "Foundation", "Heating", "Central Air", "Garage Type", 
                    "Misc Feature", "Sale Type", "Sale Condition"]
    transform_cat_col = []

    for col in nominal_features:
        if col in house_df.columns:
            transform_cat_col.append(col)
            
    a = house_df[transform_cat_col].describe() 
    x = a.loc['unique',:]
    house_df.drop(x[x>10].index,axis = 1,inplace = True)
    transform_cat_col = list(x[x<=10].index)
    
    a = house_df[transform_cat_col].describe()
    x = a.loc['freq',:]/a.loc['count',:]
    house_df.drop(list(x[x>.8].index),axis = 1, inplace = True)
    
    text_cols = house_df.select_dtypes(include = ['object']).columns
    for col in text_cols:
        house_df[col] = house_df[col].astype('category')
    transformed_df = pd.concat([house_df,pd.get_dummies(house_df[text_cols])],axis = 1).drop(text_cols,axis = 1)
    
    
    
    return transformed_df

In [151]:

# df = transform_features(house_df)
# df = select_features(df)
# df.head()

Train And Test¶

updating `train_and_test` function¶

In [185]:

def train_and_test(house_df,k=0):
    
    features = house_df.select_dtypes(include = ['number','float']).columns
    features = features.drop('SalePrice')
    target = 'SalePrice'
    lr = LinearRegression()
    
    if k == 0:
        train = house_df.iloc[:1460]
        test = house_df.iloc[1460:]
        
        lr.fit(train[features],train[target])
        predict = lr.predict(test[features])
        avg_rmse = mean_squared_error(test[target],predict)**(1/2)
    if k==1:
        
        shuffled_df = house_df.sample(frac=1, )
        train = house_df[:1460]
        test = house_df[1460:]
        
        lr.fit(train[features], train["SalePrice"])
        predictions_one = lr.predict(test[features])        
        
        mse_one = mean_squared_error(test["SalePrice"], predictions_one)
        rmse_one = np.sqrt(mse_one)
        
        lr.fit(test[features], test["SalePrice"])
        predictions_two = lr.predict(train[features])        
       
        mse_two = mean_squared_error(train["SalePrice"], predictions_two)
        rmse_two = np.sqrt(mse_two)
        
        avg_rmse = np.mean([rmse_one, rmse_two])
    else:
        rmse_val =[]
        kf = KFold(n_splits=k, shuffle = True,random_state=3)
        for train_index,test_index in kf.split(house_df[features]):
            X_train,X_test = house_df[features].iloc[train_index],house_df[features].iloc[test_index]
            Y_train,Y_test = house_df[target].iloc[train_index],house_df[target].iloc[test_index]
            
            
            lr.fit(X_train,Y_train)
            predictions = lr.predict(X_test)
            mse = mean_squared_error(Y_test,predictions)
            rmse = mse**(1/2)
            rmse_val.append(rmse)
            
        avg_rmse = np.mean(rmse_val)
            
    return rmse

In [186]:

rmse_avg = train_and_test(transformed_df,k = 4)

In [187]:

rmse_avg

Out[187]:

28582.91263711336