Jack Kolberg-Edelbrock, MS
In this project, I worked with the Ames, Iowa housing dataset with the goal of predicting home prices based upon the many data columns within the dataset. Using a combination of data cleaning and feature engineering, I produced a dataset with information usable by a machine learning model. Afterwards, I examined individual columns in the dataset to determine which columns would be non-collinear and a good fit for use in the machine learning model. Finally, I performed K nearest neighbor and linear regression fits on multiple subsets of the chosen data to obtain a linear regression RMSE of \$25,032, or 14% on an average sale price of \\$175,778
The Ames, Iowa housing dataset represents a classic exercise in machine learning. This extensive dataset contains information ranging from the unquestionably important square footage of a house down to the value of the dilapidated shed that the owner forgot was in their backyard. The multitude of datapoints provided in the dataset challenges learners to perform extensive data cleaning as well as think critically about the interplay of seemingly different pieces of data before including them in a machine learning model.
In this project, I performed used feature engineering and machine learning to analyze the Ames, Iowa housing dataset.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, KFold
data = pd.read_csv('./Data/AmesHousing.tsv', delimiter = '\t')
data_copy = data.copy()
This dictionary contains information for remapping values in most columns of the input data. There are two purposes for this transformation:
Heating QC = ['Ex', 'Gd', 'TA', 'Fa', 'Po', np.nan]
Heating QC = [2, 1, 0, -1, -2, -2]
'MS SubClass': 20
--> 'MS SubClass': '1StoryNew'
transform_dictionary = {'MS SubClass': { 20:'1StoryNew', 30:'1StoryOld', 40:'1StoryFinishedAttic',
45:'1.5StoryUnfinished', 50:'1.5StoryFinished', 60:'2StoryNew',
70:'2StoryOld', 75:'2.5Story', 80:'SplitLevel',
85:'SplitFoyer', 90:'Duplex', 120:'1StoryPlannedNew',
150:'1.5StoryPlanned', 160:'2StoryPlannedNew', 180:'MultiLevelPlanned',
190:'2FamilyConversion'},
'MS Zoning': { 'C':'Commercial', 'FV':'FloatingVillage', 'I':'Industrial',
'RH':'HighDensityResidential', 'RL':'LowDensityResidential', 'RP':'LowDensityResidential_Park',
'RM':'MediumDensityResidential'},
'Street': {'Grvl':'Gravel', 'Pave':'Paved'},
'Lot Shape': {'Reg':'Regular', 'IR1':'SlightlyIrregular', 'IR2':'ModeratelyIrregular', 'IR3':'Irregular'},
'Alley': {'Grvl':'Gravel', 'Pave':'Pave', np.nan:'None'},
'Utilities': {'AllPub':'All', 'NoSewr':'NoSewer', 'NoSeWa':'NoWater/NoSewer', 'ELO':'ElectricOnly'},
'Lot Config': {'Inside':'Inner', 'Corner':'Corner', 'CulDSac':'CulDSac','FR2':'2SideFrontage',
'FR3':'3SideFrontage'},
'Land Slope': {'Gtl':'Gentle', 'Mod':'Moderate', 'Sev':'Severe'},
'Neighborhood': {'Blmngtn':'Bloomington Heights', 'Blueste':'Bluestem', 'BrDale':'Briardale',
'BrkSide':'Brookside', 'ClearCr':'Clear Creek', 'CollgCr':'College Creek',
'Crawfor':'Crawford', 'Edwards':'Edwards', 'Gilbert':'Gilbert',
'Greens':'Greens', 'GrnHill':'Green Hill', 'IDOTRR':'Iowa DOT and Railroad',
'Landmrk':'Landmark', 'MeadowV':'Meadow Village', 'Mitchel':'Mitchell',
'NAmes':'North Ames', 'NPkVill':'Northpark Villa', 'NWAmes':'Northwest Ames',
'NoRidge':'Northridge', 'NridgHt':'Northridge Heights', 'OldTown':'Old Town',
'SWISU':'Southwest ISU', 'Sawyer':'Sawyer', 'SawyerW':'Sawyer West',
'Somerst':'Somerset', 'StoneBr':'Stone Brook', 'Timber':'Timberland',
'Veenker':'Veenker'},
'Condition 1': {'Artery':'OnArtery', 'Feedr':'OnFeeder', 'Norm':'NoSpecialCond',
'RRNn':'NearRailroad', 'RRAn':'AdjacentToRailroad', 'PosN':'NearPark',
'PosA':'AdjacentToPark', 'RRNe':'NearRailroad', 'RRAe':'AdjacentToRailroad'},
'Condition 2': {'Artery':'OnArtery', 'Feedr':'OnFeeder', 'Norm':'NoSpecialCond',
'RRNn':'NearRailroad', 'RRAn':'AdjacentToRailroad', 'PosN':'NearPark',
'PosA':'AdjacentToPark', 'RRNe':'NearRailroad', 'RRAe':'AdjacentToRailroad'},
'Bldg Type': {'1Fam':'SingleFamily', '2FmCon':'2FamilyConversion', 'Duplx':'Duplex',
'TwnhsE':'TownhouseEnd', 'TwnhsI':'TownhouseInner'},
'Exter Qual': {'Ex':2, 'Gd':1, 'TA':0, 'Fa':-1, 'Po':-2},
'Roof Style': {'Flat':'Flat', 'Gable':'Gable', 'Gambrel':'Barn', 'Hip':'Hip', 'Mansard':'Mansard',
'Shed':'Shed'},
'Roof Matl': {'ClyTile':'ClayOrTile', 'CompShingle':'StandardShingle', 'Membran':'Membrane',
'Metal':'Metal', 'Roll':'Roll', 'Tar&Grv':'TarGravel',
'WdShake':'WoodShake', 'WdShngl':'WoodShingle'},
'Exterior 1st': {'AsbShng':'AsbestosShingles', 'AsphShn':'AsphaltShingles', 'BrkCommon':'CommonBrick',
'BrkFace':'FaceBrick', 'CBlock':'CinderBlock', 'CementBd':'CementBoard',
'HdBoard':'HardBoard', 'ImStucc':'ImitationStucco', 'MetalSd':'MetalSiding',
'Other':'Other', 'Plywood':'Plywood', 'PreCast':'PreCast',
'Stone':'Stone', 'VinylSd':'Vinyl', 'Wd Sdng':'WoodSiding',
'WdShing':'WoodShingles'},
'Exterior 2nd': {'AsbShng':'AsbestosShingles', 'AsphShn':'AsphaltShingles', 'BrkCommon':'CommonBrick',
'BrkFace':'FaceBrick', 'CBlock':'CinderBlock', 'CementBd':'CementBoard',
'HdBoard':'HardBoard', 'ImStucc':'ImitationStucco', 'MetalSd':'MetalSiding',
'Other':'Other', 'Plywood':'Plywood', 'PreCast':'PreCast',
'Stone':'Stone', 'VinylSd':'Vinyl', 'Wd Sdng':'WoodSiding',
'WdShing':'WoodShingles'},
'Foundation': {'BrkTil':'BrickTile', 'CBlock':'CinderBlock', 'PConc':'PouredConcrete',
'Slab':'ConcreteSlab', 'Stone':'Stone', 'Wood':'Wood'},
'Bsmt Cond': {'Ex':2, 'Gd':1, 'TA':0, 'Fa':-1, 'Po':-2, np.nan:-2},
'Bsmt Exposure': {'Gd':'Good', 'Av':'Average', 'Mn':'Minimal', 'No':'None', 'NA':'NoBasement'},
'BsmtFin Type 1': {'GLQ':'GoodLivingQuarters', 'ALQ':'AverageLivingQuarters', 'BLQ':'PoorLivingQuarters',
'Rec':'RecRoom', 'LwQ':'BadLivingQuarters', 'Unf':'Unfinished',
np.nan:'None'},
'BsmtFin Type 2': {'GLQ':'GoodLivingQuarters', 'ALQ':'AverageLivingQuarters', 'BLQ':'PoorLivingQuarters',
'Rec':'RecRoom', 'LwQ':'BadLivingQuarters', 'Unf':'Unfinished',
np.nan:'None'},
'Heating': {'Floor':'FloorFurnace', 'GasA':'ForcedAir', 'GasW':'Radiant',
'Grav':'GravityFurnace', 'OthW':'NonGasWaterHeating', 'Wall':'WallFurnace'},
'Heating QC': {'Ex':2, 'Gd':1, 'TA':0, 'Fa':-1, 'Po':-2},
'Central Air': {'N':0, 'Y':1},
'Electrical': {'SBrkr':'StandardBreakers', 'FuseA':'60ARomex', 'FuseF':'60AMixed',
'FuseP':'60AKnobTube', 'Mix':'Mixed'},
'Kitchen Qual': {'Ex':2, 'Gd':1, 'TA':0, 'Fa':-1, 'Po':-2},
'Functional': {'Typ':'Typical', 'Min1':'MinimalProblems', 'Min2':'MinimalProblems',
'Mod':'ModerateProblems', 'Maj1':'MajorProblems', 'Maj2':'MajorProblems',
'Sev':'SevereProblems', 'Sal':'SalvageOnly'},
'Fireplace Qu': {'Ex':'ExceptionalMasonry', 'Gd':'Masonry', 'TA':'Prefab', 'Fa':'Prefab',
'Po':'Stove', np.nan:'None'},
'Garage Type': {'2Types':'TwoTypes', 'Attchd':'Attached', 'Bsment':'Basement',
'BuiltIn':'BuiltIn', 'CarPort':'CarPort', 'Detchd':'Detached',
np.nan:'None'},
'Garage Qual': {'Ex':2, 'Gd':1, 'TA':0, 'Fa':-1, 'Po':-2, np.nan:-2},
'Paved Drive': {'Y':'Paved', 'P':'Mixed', 'N':'Unpaved'},
'Pool QC': {'Ex':2, 'Gd':1, 'TA':0, 'Fa':-1, 'Po':-2, np.nan:-2},
'Fence': {'GdPrv':'GoodPrivacy', 'MnPrv':'MinimalPrivacy', 'GdWo':'GoodWood',
'MnWw':'MinimumWire', np.nan:'None'},
'Sale Condition': {'Normal':'Normal', 'Abnorml':'Foreclosure', 'AdjLand':'AdjoiningSale',
'Alloca':'Allocation', 'Family':'Family',
'Partial':'Partial'},
'Mo Sold': {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun', 7:'Jul',
8:'Aug', 9:'Sep', 10:'Oct', 11:'Nov', 12:'Dec'}}
This dictionary contains the information needed to group individual neighborhoods (previously transformed by the dictionary above) into the more general 'geographic' groupings as defined in the main body of the project.
neighborhood_grouping = {'Bloomington Heights':'Far North', 'Bluestem':'Southwest', 'Briardale':'Near North',
'Brookside':'Downtown', 'Clear Creek':'Near West', 'College Creek':'Far West',
'Crawford':'Southwest ISU', 'Edwards':'Near West', 'Gilbert':'Far North',
'Greens':'Northwest', 'Green Hill':'Far Southwest', 'Iowa DOT and Railroad':'Downtown',
'Landmark':'Far West', 'Meadow Village':'South', 'Mitchell':'South',
'North Ames':'Near North', 'Northpark Villa':'Near North', 'Northwest Ames':'Near North',
'Northridge':'Northwest', 'Northridge Heights':'Northwest', 'Old Town':'Downtown',
'Southwest ISU':'Southwest', 'Sawyer':'Near West', 'Sawyer West':'Far West',
'Somerset':'Northwest', 'Stone Brook':'Far North', 'Timberland':'Far Southwest',
'Veenker':'Near West'}
This function performs the majority of the data cleaning required in this project.
Parameters
Returns
Process description
def transform_features(df_in, transforms):
df = df_in.copy()
# ===========================
# Remap original values in df
# ---------------------------
for col in transforms.keys():
df[col] = df[col].map(transforms[col])
message = '===== Successfully remapped the dataframe ====='
print(f'\n{"=" * len(message)}\n{message}\n{"=" * len(message)}\n{list(transforms.keys())}\n')
# =============================
# Pre-process numerical columns
# -----------------------------
# List numeric columns according to na fill value
numeric_cols_fill_mean = ['Lot Frontage', 'Lot Area', 'Overall Cond', 'Year Built', 'Year Remod/Add', 'Exter Qual',
'Bsmt Cond', 'Heating QC', '1st Flr SF', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
'Kitchen Qual', 'Garage Yr Blt', 'Garage Qual','Pool QC', 'SalePrice']
numeric_cols_fill_zero = ['Total Bsmt SF', '2nd Flr SF', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Fireplaces',
'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch',
'Screen Porch', 'Misc Val',]
already_dummy_cols = ['Central Air']
# Cast numeric columns as floats
df[numeric_cols_fill_mean] = df[numeric_cols_fill_mean].astype(float)
df[numeric_cols_fill_zero] = df[numeric_cols_fill_zero].astype(float)
df[already_dummy_cols] = df[already_dummy_cols].astype(int)
# Fill missing values
for col in numeric_cols_fill_mean:
df[col] = df[col].fillna(df[col].mean())
for col in numeric_cols_fill_zero:
df[col] = df[col].fillna(0)
# ===============================
# Pre-process categorical columns
# -------------------------------
nl_categorical_cols = ['MS SubClass', 'MS Zoning', 'Street', 'Alley', 'Lot Shape', 'Utilities', 'Lot Config',
'Land Slope', 'Bldg Type', 'Roof Style', 'Roof Matl', 'Foundation', 'Bsmt Exposure',
'Heating', 'Electrical', 'Functional', 'Garage Type', 'Paved Drive',
'Fence', 'Mo Sold', 'Yr Sold', 'Sale Condition', 'Fireplace Qu', 'Neighborhood']
nl_categorical_prefixes = ['SubClass', 'Zoning', 'Street', 'Alley', 'LotShape', 'Utilities', 'Lot',
'LotSlope', 'Bldg', 'Roof', 'RoofMatl', 'Foundation', 'BsmtExposure',
'Heating', 'Electric', 'Condition', 'Garage', 'Driveway', 'Fence', 'MonthSold',
'YearSold', 'Sale', 'Fireplace', 'Nhood']
special_categorical_cols = ['Condition 1', 'Condition 2', 'BsmtFin Type 1', 'BsmtFin Type 2', 'Exterior 1st', 'Exterior 2nd']
special_categorical_prefixes = ['Location', 'Location', 'BasementFinish', 'BasementFinish', 'Exterior', 'Exterior']
# Cast categorical columns as objects
df[nl_categorical_cols] = df[nl_categorical_cols].astype(object)
df[special_categorical_cols] = df[special_categorical_cols].astype(object)
# Cast pre-dummied columns as ints
df[already_dummy_cols] = df[already_dummy_cols].astype(int)
# Create normal dummies
nl_dummies = pd.DataFrame()
for i in range(len(nl_categorical_cols)):
dummy = pd.get_dummies(df[nl_categorical_cols[i]], nl_categorical_prefixes[i])
nl_dummies = pd.concat([nl_dummies, dummy], axis = 1)
# ===================
# Feature Engineering
# -------------------
# Convert feature built years to feature ages (2021 = ref year)
df['Years Since Remodel'] = 2021 - df['Year Remod/Add']
df['Garage Age'] = 2021 - df['Garage Yr Blt']
# Create column to represent historical homes (pre 1930)
df['Historical'] = (df['Year Built'] < 1930).astype(int)
# Create summative quantitative columns
df['Total SF'] = df['Total Bsmt SF'] + df['1st Flr SF'] + df['2nd Flr SF']
df['Porch SF'] = df['Open Porch SF'] + df['Enclosed Porch'] + df['3Ssn Porch'] + df['Screen Porch']
df['Bathrooms'] = df['Bsmt Full Bath'] + df['Full Bath'] + 0.5 * (df['Bsmt Half Bath'] + df['Half Bath'])
# Create summative dummies (ie, if a row has two of the same conditions, the value in the dummy can be 2)
location_dummies = pd.get_dummies(df['Condition 1'], 'Location') + pd.get_dummies(df['Condition 2'], 'Location')
bsmtfin_dummies = pd.get_dummies(df['BsmtFin Type 1'], 'BsmtFin') + pd.get_dummies(df['BsmtFin Type 2'], 'BsmtFin')
exterior_dummies = pd.get_dummies(df['Exterior 1st'], 'Exterior') + pd.get_dummies(df['Exterior 2nd'], 'Exterior')
# -------------------------------------------------------------------------------------------
# Function to drop columns
def drop_columns(df, cols, message):
df = df.drop(cols, axis = 1)
message = f'===== Dropped {len(cols)} {message} ====='
print(f'{"="*len(message)}\n{message}\n{"="*len(message)}\n{cols} \n')
return df
# -------------------------------------------------------------------------------------------
# ================
# Clean up dataset
# ----------------
# Append dummies back to df
df = pd.concat([df, nl_dummies, location_dummies, bsmtfin_dummies, exterior_dummies], axis = 1)
# Drop consumed columns
consumed_cols = ['Year Remod/Add', 'Garage Yr Blt', 'Year Built']
df = drop_columns(df, consumed_cols, 'columns after converting them to nominal data.')
# Drop combined columns
combined_cols = ['Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Open Porch SF',
'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Bsmt Full Bath',
'Full Bath', 'Bsmt Half Bath', 'Half Bath']
df = drop_columns(df, combined_cols, 'columns after combining them into summative quantitative columns')
# Drop categorical columns
df = drop_columns(df, nl_categorical_cols, 'columns after converting them to dummies.')
df = drop_columns(df, special_categorical_cols, 'columns after converting them to dummies and combining them.')
# Drop uncategorizable and unquantifiable columns
uncat_unquant_cols = ['Order', 'PID']
df = drop_columns(df, uncat_unquant_cols, 'columns that contained uncategorizable and unquantifiable data.')
# Drop duplicated columns
duplicate_cols = ['Exter Cond', 'Garage Finish', 'Garage Cond', 'Bsmt Unf SF',
'Bsmt Qual', 'Garage Cars', 'BsmtFin SF 1', 'BsmtFin SF 2',
'House Style', 'Gr Liv Area', 'TotRms AbvGrd', 'Land Contour']
df = drop_columns(df, duplicate_cols, 'columns which contained data represented in other columns.')
# Drop unimportant columns
unimportant_cols = ['Mas Vnr Type', 'Mas Vnr Area','Sale Type', 'Misc Feature', 'Low Qual Fin SF']
df = drop_columns(df, unimportant_cols, 'columns which did not contain useful information.')
# Drop unused categorical columns
unused_categorical = ['Exterior_Other', 'Exterior_WoodShingles', 'Condition_SalvageOnly']
df = drop_columns(df, unused_categorical, 'categorical columns which contained no information.')
message = f'===== After cleaning there are {data_copy.shape[1]} data columns ====='
print(f'{"=" * len(message)}\n{message}\n{"=" * len(message)}')
return df
Selects fit features on the basis of a correlation cutoff.
Parameters
Returns
def select_features(df, positive_cutoff, negative_cutoff, target_col, exclude = []):
strong_pos = df.index[df.loc[:, target_col] > positive_cutoff]
strong_neg = df.index[df.loc[:, target_col] < negative_cutoff]
# Drop excluded features
for item in exclude:
if (exclude in list(strong_pos)):
strong_pos = strong_pos.drop(exclude)
if (exclude in list(strong_neg)):
strong_neg = strong_neg.drop(exclude)
return strong_pos.tolist(), strong_neg.tolist()
This function automates fitting and testing of a k_nearest_neighbors model.
Parameters
Returns
def train_and_test_kneighbors(train_df, test_df, fit_params, target_param, num_neighbors = 5, k = 2):
total_df = pd.concat([train_df, test_df], axis = 0)
kf = KFold(k, shuffle = True, random_state = 1)
knn = KNeighborsRegressor(n_neighbors = num_neighbors)
mses = cross_val_score(knn, total_df[fit_params], total_df[target_param],
scoring = 'neg_mean_squared_error', cv = kf)
rmses = np.absolute(mses) ** 0.5
return np.mean(rmses), np.std(rmses)
This function automates fitting and testing of a linear_regression model.
Parameters
Returns
def train_and_test_linear_regression(train_df, test_df, fit_params, target_param, k = 0):
total_df = pd.concat([train_df, test_df], axis = 0)
kf = KFold(k, shuffle = True, random_state = 1)
lr = LinearRegression()
mses = cross_val_score(lr, total_df[fit_params], total_df[target_param],
scoring = 'neg_mean_squared_error', cv = kf)
rmses = np.absolute(mses) ** 0.5
return np.mean(rmses), np.std(rmses)
data_transform = transform_features(data_copy, transform_dictionary)
I am interested in predicting prices of 'average' homes in Ames, so removing houses that fall outside of the 'average' category is important to avoid skewing our model. To identify outlying datapoints, I plotted histograms of three important characteristics of each house: price, house square footage, and lot size.
def plot_outliers(data_, name, num_stdev):
[mean, stdev] = [np.mean(data_), np.std(data_)]
[plus_three, minus_three] = [mean + 3 * stdev, mean - 3 * stdev]
range_ = plus_three - minus_three
# Generate plot
ax = sns.histplot(data_, binwidth = (range_)/40, edgecolor = 'black')
ax.set_xlim(0, mean + 5 * stdev)
ax.set_yscale('log')
# Annotate plot (vertical line for mean and +/- 3 standard deviations)
text_y = ax.get_ylim()[1] * 0.7
[text_x1, text_x2] = [minus_three + 0.03 * range_, plus_three - 0.2 * range_]
ax.axvline(x = mean, color = 'black', linestyle = '-')
ax.text(mean + 0.01 * (plus_three - minus_three), text_y, 'Mean', fontsize = 'large', fontweight = 'bold')
ax.axvline(x = plus_three, color = 'red', linestyle = '--')
ax.text(text_x2, text_y, '+3σ', color = 'red', fontsize = 'large', fontweight = 'bold')
if(text_x1 > 0):
ax.text(text_x1, text_y, '-3σ', color = 'red', fontsize = 'large', fontweight = 'bold')
ax.axvline(x = minus_three,
color = 'red', linestyle = '--')
# ax.set(title = name)
plt.title(name, size = 15)
outliers = (data_ < minus_three) | (data_ > plus_three)
return outliers
plt.subplots(1, 3, figsize = [15, 4])
plt.subplot(1, 3, 1)
sf_outliers = plot_outliers(data_transform['Total SF'], 'Square Footage Outliers', 3)
plt.subplot(1, 3, 2)
price_outliers = plot_outliers(data_transform['SalePrice'], 'Price Outliers', 3)
plt.subplot(1, 3, 3)
lot_area_outliers = plot_outliers(data_transform['Lot Area'], 'Lot Area Outliers', 3)
plt.show()
There are a number of houses that have either square footage, price, or lot area 3σ outside the mean. We can safely discard these listings as outliers.
# Combine individual outlier datasets
all_outliers = sf_outliers | price_outliers | lot_area_outliers
message = f'===== {sum(all_outliers)} outliers identified and removed ====='
print(f'{"=" * len(message)}\n{message}\n{"=" * len(message)}')
# Drop outliers
data_transform = data_transform.loc[~all_outliers, :]
Parameters
Returns
def plot_corrs(corrs, cols_to_plot, comparison_col, chart_rows, chart_cols, fig_width = 10, fig_height = 10):
num_corrs = len(cols_to_plot)
corrs_per_cell = int(np.round_(num_corrs/(chart_rows * chart_cols)))
fig, ax = plt.subplots(chart_rows, chart_cols, figsize=(fig_width, fig_height))
subplot_num = 1
show_cbar = False
# For each subplot
for col in range(1, chart_cols + 1):
for row in range(1, chart_rows + 1):
start_index = (subplot_num - 1) * corrs_per_cell
end_index = start_index + corrs_per_cell
# Ensure we are not exceeding the bounds of our dataset
if (end_index >= len(cols_to_plot)):
end_index = len(cols_to_plot)
elif (end_index + corrs_per_cell) > len(cols_to_plot) + 1:
end_index = len(cols_to_plot)
if (col == chart_cols) & (row == chart_rows):
show_cbar = True
cell_cols = []
for k in range(start_index, end_index):
cell_cols.append(cols_to_plot[k])
# Plot the appropriate correlation values in this subplot
plt.subplot(chart_rows, chart_cols, subplot_num)
sns.heatmap(pd.DataFrame(corrs.loc[cell_cols, comparison_col]), cmap = 'coolwarm', square = True,
vmin = -1, center = 0, vmax = 1, annot = True, cbar = show_cbar)
subplot_num += 1
plt.show()
Parameters
Returns
def drop_cols(df, col_group, current_cols, cols_to_keep):
# Keep requested columns
cols_to_drop = current_cols
for col in cols_to_keep:
cols_to_drop.remove(col)
# Drop columns
df = df.drop(cols_to_drop, axis = 'columns')
# Print out results
message = f'===== {len(cols_to_drop)} {col_group} columns were removed from analysis ====='
print(f'{len(message) * "="}\n{message}\n{len(message) * "="}\n')
print(cols_to_drop)
message = f'===== {df.shape[1]} columns remain ====='
print(f'\n{len(message) * "="}\n{message}\n{len(message) * "="}')
return df
With 202 columns, choosing which ones to use for our model is a major challenge. In addition to choosing columns that predict the SalePrice accurately, there are several common pitfalls we need to avoid:
First, lets examine how each of the 202 columns in the dataset correlate with SalePrice.
correlations = data_transform.corr()
plot_corrs(correlations, correlations.columns.tolist(), 'SalePrice', 1, 5, 20, 20)
With 202 values, a single heatmap is impossible to interpret. Instead, lets break the correlations out into smaller groups.
When looking for a house, there are several numbers that typically drive a purchase. These "vital statistics" include the number of bedrooms, number of bathrooms, and the square footage of the house.
vital_cols = ['Total SF', 'Bedroom AbvGr', 'Bathrooms']
plot_corrs(correlations, vital_cols, 'SalePrice', 1, 3, 10, 2)
The strongest predictor of price amongst a home's vital statistics is the total square footage of the house. After that, the number of bathrooms (which is in part related to the total square footage) is the strongest indicator.
All of these values are important factors in home price, but it is important to note that number of bedrooms and number of bathrooms does correlate with SalePrice. Despite this collinearity, I kept all three columns for use in the model
The Ames, IA dataset provides us with several "quality metrics" which roll many different factors into one number.
quality_metric_cols = ['Exter Qual', 'Garage Qual', 'Bsmt Cond',
'Overall Cond', 'Kitchen Qual', 'Overall Qual']
functional_cols = data_transform.columns[data_transform.columns.str.contains('Condition')].tolist()
quality_metric_cols = quality_metric_cols + functional_cols
plot_corrs(correlations, quality_metric_cols, 'SalePrice', 1, 5, 20, 1)
Kitchen quality, exterior quality, and Overall Quality are strongly correlated with higher home prices. Surprisingly, the "Functional" condition of the house does not have a strong correlation with house price other than typical functionality predicts a higher price than any reduced functionality.
data_transform = drop_cols(data_transform, 'Quality Metrics', quality_metric_cols, ['Exter Qual', 'Kitchen Qual'])
Examining the correlation between neighborhood and sales price, we are confronted with two problems:
neighborhoods_original = data.loc[~all_outliers, ['Neighborhood', 'SalePrice']]
neighborhoods_map = transform_dictionary['Neighborhood']
plt.subplots(1, 1, figsize = [18, 5])
sns.stripplot(data = neighborhoods_original, x = 'Neighborhood', y = 'SalePrice')
x = plt.xticks(rotation = 60)
x = plt.title('Sale Prices by Neighborhood', size = 20)
Even with these problems, it does appear that there is a relationship between neighborhood and SalePrice. This is not surprising, since location can often drive real estate prices more strongly than house characteristics.
To circumvent the issues of insufficient data and collinearity, I grouped neighborhoods geographically, assuming that 'areas' of the city would have similar property values and desirabilities. Using a survey map of Ames, Iowa, I divided the 26 neighborhoods into 10 geographic groups as shown below: