The purpose of this notebook is to visually examine the nominal features, discard the useless ones among them, and create new factor variables.
The "main" plot used in this notebook is Gr Liv Area vs. SalePrice as the overall living area is the most correlated predictor (which is also very intuitive). Many of the nominal variables change the slopes of the regression lines for sub-groups of data points significantly.
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import IsolationForest
from utils import (
ALL_COLUMNS,
NOMINAL_VARIABLES,
TARGET_VARIABLES,
load_clean_data,
encode_ordinals,
print_column_list,
)
random_state = np.random.RandomState(42)
pd.set_option("display.max_columns", 120)
sns.set_style("white")
df = load_clean_data("data/data_clean_with_transformations.csv")
df.shape
(2898, 86)
df.head()
1st Flr SF | 1st Flr SF (box-cox-0) | 2nd Flr SF | 3Ssn Porch | Alley | Bedroom AbvGr | Bldg Type | Bsmt Cond | Bsmt Exposure | Bsmt Full Bath | Bsmt Half Bath | Bsmt Qual | Bsmt Unf SF | BsmtFin SF 1 | BsmtFin SF 2 | BsmtFin Type 1 | BsmtFin Type 2 | Central Air | Condition 1 | Condition 2 | Electrical | Enclosed Porch | Exter Cond | Exter Qual | Exterior 1st | Exterior 2nd | Fence | Fireplace Qu | Fireplaces | Foundation | Full Bath | Functional | Garage Area | Garage Cars | Garage Cond | Garage Finish | Garage Qual | Garage Type | Gr Liv Area | Gr Liv Area (box-cox-0) | Half Bath | Heating | Heating QC | House Style | Kitchen AbvGr | Kitchen Qual | Land Contour | Land Slope | Lot Area | Lot Area (box-cox-0.1) | Lot Config | Lot Shape | Low Qual Fin SF | MS SubClass | MS Zoning | Mas Vnr Area | Mas Vnr Type | Misc Feature | Misc Val | Mo Sold | Neighborhood | Open Porch SF | Overall Cond | Overall Qual | Paved Drive | Pool Area | Pool QC | Roof Matl | Roof Style | Sale Condition | Sale Type | Screen Porch | Street | TotRms AbvGrd | Total Bath | Total Bsmt SF | Total Porch SF | Total SF | Total SF (box-cox-0.2) | Utilities | Wood Deck SF | Year Built | Year Remod/Add | Yr Sold | SalePrice | SalePrice (box-cox-0) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Order | PID | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 | 526301100 | 1656.0 | 7.412160 | 0.0 | 0.0 | NA | 3 | 1Fam | Gd | Gd | 1 | 0 | TA | 441.0 | 639.0 | 0.0 | BLQ | Unf | Y | Norm | Norm | SBrkr | 0.0 | TA | TA | BrkFace | Plywood | NA | Gd | 2 | CBlock | 1 | Typ | 528.0 | 2 | TA | Fin | TA | Attchd | 1656.0 | 7.412160 | 0 | GasA | Fa | 1Story | 1 | TA | Lvl | Gtl | 31770.0 | 18.196923 | Corner | IR1 | 0.0 | 020 | RL | 112.0 | Stone | NA | 0.0 | 5 | Names | 62.0 | 5 | 6 | P | 0.0 | NA | CompShg | Hip | Normal | WD | 0.0 | Pave | 7 | 2.0 | 1080.0 | 272.0 | 2736.0 | 19.344072 | AllPub | 210.0 | 1960 | 1960 | 2010 | 215000.0 | 12.278393 |
2 | 526350040 | 896.0 | 6.797940 | 0.0 | 0.0 | NA | 2 | 1Fam | TA | No | 0 | 0 | TA | 270.0 | 468.0 | 144.0 | Rec | LwQ | Y | Feedr | Norm | SBrkr | 0.0 | TA | TA | VinylSd | VinylSd | MnPrv | NA | 0 | CBlock | 1 | Typ | 730.0 | 1 | TA | Unf | TA | Attchd | 896.0 | 6.797940 | 0 | GasA | TA | 1Story | 1 | TA | Lvl | Gtl | 11622.0 | 15.499290 | Inside | Reg | 0.0 | 020 | RH | 0.0 | None | NA | 0.0 | 6 | Names | 0.0 | 6 | 5 | Y | 0.0 | NA | CompShg | Gable | Normal | WD | 120.0 | Pave | 5 | 1.0 | 882.0 | 260.0 | 1778.0 | 17.333478 | AllPub | 140.0 | 1961 | 1961 | 2010 | 105000.0 | 11.561716 |
3 | 526351010 | 1329.0 | 7.192182 | 0.0 | 0.0 | NA | 3 | 1Fam | TA | No | 0 | 0 | TA | 406.0 | 923.0 | 0.0 | ALQ | Unf | Y | Norm | Norm | SBrkr | 0.0 | TA | TA | Wd Sdng | Wd Sdng | NA | NA | 0 | CBlock | 1 | Typ | 312.0 | 1 | TA | Unf | TA | Attchd | 1329.0 | 7.192182 | 1 | GasA | TA | 1Story | 1 | Gd | Lvl | Gtl | 14267.0 | 16.027549 | Corner | IR1 | 0.0 | 020 | RL | 108.0 | BrkFace | Gar2 | 12500.0 | 6 | Names | 36.0 | 6 | 6 | Y | 0.0 | NA | CompShg | Hip | Normal | WD | 0.0 | Pave | 6 | 1.5 | 1329.0 | 429.0 | 2658.0 | 19.203658 | AllPub | 393.0 | 1958 | 1958 | 2010 | 172000.0 | 12.055250 |
4 | 526353030 | 2110.0 | 7.654443 | 0.0 | 0.0 | NA | 3 | 1Fam | TA | No | 1 | 0 | TA | 1045.0 | 1065.0 | 0.0 | ALQ | Unf | Y | Norm | Norm | SBrkr | 0.0 | TA | Gd | BrkFace | BrkFace | NA | TA | 2 | CBlock | 2 | Typ | 522.0 | 2 | TA | Fin | TA | Attchd | 2110.0 | 7.654443 | 1 | GasA | Ex | 1Story | 1 | Ex | Lvl | Gtl | 11160.0 | 15.396064 | Corner | Reg | 0.0 | 020 | RL | 0.0 | None | NA | 0.0 | 4 | Names | 0.0 | 5 | 7 | Y | 0.0 | NA | CompShg | Hip | Normal | WD | 0.0 | Pave | 8 | 3.5 | 2110.0 | 0.0 | 4220.0 | 21.548042 | AllPub | 0.0 | 1968 | 1968 | 2010 | 244000.0 | 12.404924 |
5 | 527105010 | 928.0 | 6.833032 | 701.0 | 0.0 | NA | 3 | 1Fam | TA | No | 0 | 0 | Gd | 137.0 | 791.0 | 0.0 | GLQ | Unf | Y | Norm | Norm | SBrkr | 0.0 | TA | TA | VinylSd | VinylSd | MnPrv | TA | 1 | PConc | 2 | Typ | 482.0 | 2 | TA | Fin | TA | Attchd | 1629.0 | 7.395722 | 1 | GasA | Gd | 2Story | 1 | TA | Lvl | Gtl | 13830.0 | 15.946705 | Inside | IR1 | 0.0 | 060 | RL | 0.0 | None | NA | 0.0 | 3 | Gilbert | 34.0 | 5 | 5 | Y | 0.0 | NA | CompShg | Gable | Normal | WD | 0.0 | Pave | 6 | 2.5 | 928.0 | 246.0 | 2557.0 | 19.016856 | AllPub | 212.0 | 1997 | 1998 | 2010 | 189900.0 | 12.154253 |
Newly created variables are collected in the new_variables list.
new_variables = []
Certain characteristics of a house are assumed to have a "binary" influence on the sales price. For example, the existence of a pool could be an important predictor while the exact size of the pool can be deemed not so important.
The below cell creates boolean factor variables out of a set of numeric variables.
derived_variables = {
"has 2nd Flr": "2nd Flr SF",
"has Bsmt": "Total Bsmt SF",
"has Fireplace": "Fireplaces",
"has Garage": "Garage Area",
"has Pool": "Pool Area",
"has Porch": "Total Porch SF",
}
# Factorize numeric columns.
for factor_column, column in derived_variables.items():
df[factor_column] = df[column].apply(lambda x: 1 if x > 0 else 0)
derived_variables = list(derived_variables.keys())
new_variables.extend(derived_variables)
df[derived_variables].head()
has 2nd Flr | has Bsmt | has Fireplace | has Garage | has Pool | has Porch | ||
---|---|---|---|---|---|---|---|
Order | PID | ||||||
1 | 526301100 | 0 | 1 | 1 | 1 | 0 | 1 |
2 | 526350040 | 0 | 1 | 0 | 1 | 0 | 1 |
3 | 526351010 | 0 | 1 | 0 | 1 | 0 | 1 |
4 | 526353030 | 0 | 1 | 1 | 1 | 0 | 0 |
5 | 527105010 | 1 | 1 | 1 | 1 | 0 | 1 |
A second floor may have a positive effect on the sales price. However, having a second floor correlates with overall living space. The individual effect is therefore not as clear as it seems in the plot below.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="has 2nd Flr", s=15, data=df);
Nearly all houses in Ames, IA, have a basement. Therefore, has Bsmt is most likely not an important predictor.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="has Bsmt", s=15, data=df);
Bigger houses are more likely to have a fireplace. Thus, the variable has Fireplace might be an interesting predictor.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="has Fireplace", s=15, data=df);
Holding the overall living area fixed adding a garage seems to affect the price positively. Thus, has Garage seems like an interesting predictor as well.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="has Garage", s=15, data=df);
Unfortunately, almost no one in Ames, IA, has a pool. The predictor has Pool seems quite uninteresting.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="has Pool", s=15, data=df);
Most houses have a porch.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="has Porch", s=15, data=df);
The instructors' notes say:
For instructors who cover nominal variables in their class, I would suggest incorporating the neighborhood variable into their models by converting it to a set of dummy (indicator) variables. I have found that the coefficients for the continuous variables tend to have values with more realistic interpretations when used in conjunction with the neighborhood variable.
Indeed, plotting the price distributions by neighborhood reveals significant differences in the price level.
_, ax = plt.subplots(figsize=(10, 8))
sns.boxplot(x="Neighborhood", y="SalePrice", data=df, ax=ax)
ax.set_title("Prices by Neighborhood", fontsize=24)
ax.set_xlabel("Neighborhood", fontsize=18)
ax.tick_params(axis='x', labelrotation = 45)
ax.set_ylabel("House Price", fontsize=18);
The 28 neighborhoods are encoded as factor variables.
neighborhood = pd.get_dummies(df["Neighborhood"], prefix="nhood", dtype=int)
df = pd.concat([df, neighborhood], axis=1)
del df["Neighborhood"]
new_variables.extend(neighborhood.columns)
df[neighborhood.columns].shape
(2898, 28)
df[neighborhood.columns].head()
nhood_Blmngtn | nhood_Blueste | nhood_BrDale | nhood_BrkSide | nhood_ClearCr | nhood_CollgCr | nhood_Crawfor | nhood_Edwards | nhood_Gilbert | nhood_Greens | nhood_GrnHill | nhood_IDOTRR | nhood_Landmrk | nhood_MeadowV | nhood_Mitchel | nhood_Names | nhood_NoRidge | nhood_NPkVill | nhood_NridgHt | nhood_NWAmes | nhood_OldTown | nhood_SWISU | nhood_Sawyer | nhood_SawyerW | nhood_Somerst | nhood_StoneBr | nhood_Timber | nhood_Veenker | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Order | PID | ||||||||||||||||||||||||||||
1 | 526301100 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 526350040 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 526351010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 526353030 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 527105010 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
This section investigates the rest of the nominal variables with regard to which realizations / encoding might be a useful predictor.
print_column_list(set(NOMINAL_VARIABLES) - set(["Neighborhood"]))
Alley Type of alley access to property Bldg Type Type of dwelling Central Air Central air conditioning Condition 1 Proximity to various conditions Condition 2 Proximity to various conditions (if more than one is present) Exterior 1st Exterior covering on house Exterior 2nd Exterior covering on house (if more than one material) Foundation Type of foundation Garage Type Garage location Heating Type of heating House Style Style of dwelling Land Contour Flatness of the property Lot Config Lot configuration MS SubClass Identifies the type of dwelling involved in the sale. MS Zoning Identifies the general zoning classification of the sale. Mas Vnr Type Masonry veneer type Misc Feature Miscellaneous feature not covered in other categories Roof Matl Roof material Roof Style Type of roof Sale Condition Condition of sale Sale Type Type of sale Street Type of road access to property
Almost no house has access to an alley.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Alley", s=15, data=df);
del df["Alley"]
The type of a building clearly affects the valuation. The two types of townhouses as well as the 2-family condo and duplex type are summarized into a single category. This makes sense a) semantically, and b) by looking at the two sub-clusters in the scatter plot.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Bldg Type", s=15, data=df);
# Unify the two townhouse types into one.
df["Bldg Type"] = df["Bldg Type"].apply(
lambda x: "Twnhs" if x in ("TwnhsE", "TwnhsI") else x
)
# Unify the two kinds of 2-family homes.
df["Bldg Type"] = df["Bldg Type"].apply(
lambda x: "2Fam" if x in ("2FmCon", "Duplx") else x
)
build_type = pd.get_dummies(df["Bldg Type"], prefix="build_type", dtype=int)
df = pd.concat([df, build_type], axis=1)
del df["Bldg Type"]
new_variables.extend(build_type.columns)
df[build_type.columns].head()
build_type_1Fam | build_type_2Fam | build_type_Twnhs | ||
---|---|---|---|---|
Order | PID | |||
1 | 526301100 | 1 | 0 | 0 |
2 | 526350040 | 1 | 0 | 0 |
3 | 526351010 | 1 | 0 | 0 |
4 | 526353030 | 1 | 0 | 0 |
5 | 527105010 | 1 | 0 | 0 |
Air conditioning clearly increases the valuation ("steeper" regression line with respect to the overall living area).
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Central Air", s=15, data=df);
Use a new variable name to cleary show that the variable's dtype is changed from str to int.
df["air_cond"] = df["Central Air"].apply(lambda x: 1 if x == "Y" else 0).astype(int)
del df["Central Air"]
new_variables.append("air_cond")
df[["air_cond"]].head()
air_cond | ||
---|---|---|
Order | PID | |
1 | 526301100 | 1 |
2 | 526350040 | 1 |
3 | 526351010 | 1 |
4 | 526353030 | 1 |
5 | 527105010 | 1 |
The columns Condition 1 and Condition 2 have the same realizations and can be regarded as "tags" given to a house indicating the nearby presence of a) a major street, b) a railroad, or c) a park.
The default tag "Norm" (implying no "condition") is given to 86% of the houses (this realization should therefore not be regarded as a tag!).
From the comparison of the grouped scatter plots below, it can be assumed that the proximity of a major street decreases the valuation (lower regression slope through the cloud of blue and orange dots). Therefore, a factor variable major_street is extracted indicating the proximity of an "artery" or "feeder" street.
Further, a factor variable railway is extracted as a relatively high proportion of the houses has such a tag. From the plots, a railway seems to not affect the valuations strongly.
Lastly, a factor variable park is extracted. From the plots, this does not seem to affect the valuation much.
List the "raw" realizations:
(
(
df["Condition 1"].value_counts() + df["Condition 2"].value_counts()
)
.sort_values(ascending=False)[1:]
)
Feedr 174 Artery 97 RRAn 48 PosN 43 RRAe 29 PosA 24 RRNn 11 RRNe 6 Name: count, dtype: int64
# Condition 2 is only filled with anything other than "Norm"
# if Condition 1 already has such a tag.
assert not ((df["Condition 1"] == "Norm") & (df["Condition 2"] != "Norm")).any()
# 86% of the houses actually have no tag.
round(100* (df["Condition 1"] == "Norm").sum() / df.shape[0])
86
From a simple scatter plot it is hard to see any significant impact by a predictor.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Condition 1", s=15, data=df);
However, plotting the groups seperately reveals different slopes.
street = ["Artery", "Feedr"]
railway = ["RRNn", "RRAn", "RRNe", "RRAe"]
park = ["PosA", "PosN"]
plot = sns.lmplot(
x="Gr Liv Area", y="SalePrice", col="Condition 1", hue="Condition 1",
col_order=["Norm"] + street + railway + park,
data=df, robust=True, col_wrap=4, ci=None, truncate=True, scatter_kws={"s": 15},
)
# Adjust font sizes.
for ax in plot.axes:
ax.set_title(ax.get_title(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=16)
ax.set_ylabel(ax.get_ylabel(), fontsize=16)
Extract factor variables major_street, railway, and park.
df["major_street"] = 0
df.loc[
df["Condition 1"].isin(street) | df["Condition 2"].isin(street),
"major_street",
] = 1
df["railway"] = 0
df.loc[
df["Condition 1"].isin(railway) | df["Condition 2"].isin(railway),
"railway",
] = 1
df["park"] = 0
df.loc[
df["Condition 1"].isin(park) | df["Condition 2"].isin(park),
"park",
] = 1
del df["Condition 1"]
del df["Condition 2"]
new_variables.extend(["major_street", "railway", "park"])
Show summary of counts:
df[["major_street", "railway", "park"]].sum()
major_street 264 railway 94 park 60 dtype: int64
df[["major_street", "railway", "park"]].head()
major_street | railway | park | ||
---|---|---|---|---|
Order | PID | |||
1 | 526301100 | 0 | 0 | 0 |
2 | 526350040 | 1 | 0 | 0 |
3 | 526351010 | 0 | 0 | 0 |
4 | 526353030 | 0 | 0 | 0 |
5 | 527105010 | 0 | 0 | 0 |
This dimensions tells the main material with which the houses are made of. The category is too diverse and the various grouped scatter plots did not reveal differing slopes. For simplicity, this variable is dropped.
This variable actually also represents tags that could be associated with a house (possibly up to two different tags).
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Exterior 1st", s=15, data=df);
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Exterior 2nd", s=15, data=df);
del df["Exterior 1st"]
del df["Exterior 2nd"]
# Also discard the associated ordinal variables.
del df["Exter Cond"]
del df["Exter Qual"]
The type of foundation appears to have an effect. However, only three of the six realizations occur in a large number. Factor variables found_BrkTil, found_CBlock, and found_PConc are extracted but not regarded as "interesting".
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Foundation", s=15, data=df);
df["Foundation"].value_counts()
Foundation PConc 1282 CBlock 1242 BrkTil 310 Slab 48 Stone 11 Wood 5 Name: count, dtype: int64
plot = sns.lmplot(
x="Gr Liv Area", y="SalePrice", col="Foundation", hue="Foundation",
col_order=["PConc", "CBlock", "BrkTil"],
data=df, robust=True, col_wrap=4, ci=None, truncate=True, scatter_kws={"s": 15},
)
# Adjust font sizes.
for ax in plot.axes:
ax.set_title(ax.get_title(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=16)
ax.set_ylabel(ax.get_ylabel(), fontsize=16)
foundation = pd.get_dummies(df["Foundation"], prefix="found", dtype=int)
# Only keep the top 3 realizations.
del foundation["found_Slab"]
del foundation["found_Stone"]
del foundation["found_Wood"]
df = pd.concat([df, foundation], axis=1)
del df["Foundation"]
new_variables.extend(["found_BrkTil", "found_CBlock", "found_PConc"])
df[foundation.columns].head()
found_BrkTil | found_CBlock | found_PConc | ||
---|---|---|---|---|
Order | PID | |||
1 | 526301100 | 0 | 1 | 0 |
2 | 526350040 | 0 | 1 | 0 |
3 | 526351010 | 0 | 1 | 0 |
4 | 526353030 | 0 | 1 | 0 |
5 | 527105010 | 0 | 0 | 1 |
As can be expected, the Garage Type looks very similar to the above has Garage variable. Therefore, it is dropped.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Garage Type", s=15, data=df);
del df["Garage Type"]
Most of the houses have gas. The variable is not helpful.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Heating", s=15, data=df);
del df["Heating"]
# Also discard the associated ordinal variable.
del df["Heating QC"]
In summary, this variable is very similar to the above derived variable has 2nd Flr. Therefore, it is dropped.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="House Style", s=15, data=df);
del df["House Style"]
This variable is assumed to contain the same information as the ordinal variable Land Slope and is dropped.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Land Contour", s=15, data=df);
del df["Land Contour"]
This variable shows no good pattern and is dropped.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Lot Config", s=15, data=df);
del df["Lot Config"]
By looking at this variable's realizations, one can see that several distinct features are lumped together in one. In particular, the above variables has 2nd Flr and *build_type_** and the future age related features at the bottom of this notebook together should comprise the same patterns in a more advantagous way. Thus, the column is dropped.
list(ALL_COLUMNS["MS SubClass"]["lookups"].values())
['1-STORY 1946 & NEWER ALL STYLES', '1-STORY 1945 & OLDER', '1-STORY W/FINISHED ATTIC ALL AGES', '1-1/2 STORY - UNFINISHED ALL AGES', '1-1/2 STORY FINISHED ALL AGES', '2-STORY 1946 & NEWER', '2-STORY 1945 & OLDER', '2-1/2 STORY ALL AGES', 'SPLIT OR MULTI-LEVEL', 'SPLIT FOYER', 'DUPLEX - ALL STYLES AND AGES', '1-STORY PUD (Planned Unit Development) - 1946 & NEWER', '1-1/2 STORY PUD - ALL AGES', '2-STORY PUD - 1946 & NEWER', 'PUD - MULTILEVEL - INCL SPLIT LEV/FOYER', '2 FAMILY CONVERSION - ALL STYLES AND AGES']
del df["MS SubClass"]
This variable is dropped as most houses are located in a "residential" zone.
df["MS Zoning"].value_counts()
MS Zoning RL 2252 RM 459 FV 131 RH 27 C 25 A 2 I 2 RP 0 Name: count, dtype: int64
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="MS Zoning", s=15, data=df);
del df["MS Zoning"]
None of the groups have a slope differing from the overall one.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Mas Vnr Type", s=15, data=df);
del df["Mas Vnr Type"]
This variable is basically a "other" field with no pattern.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Misc Feature", s=15, data=df);
del df["Misc Feature"]
Roofs in Ames, IA, are not special enough to make a difference in the price. Even "hip" roofs seem already priced in bigger houses.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Roof Matl", s=15, data=df);
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Roof Style", s=15, data=df);
del df["Roof Matl"]
del df["Roof Style"]
Partial and abnormal (= foreclosure) sales seem to make a change with higher and lower prices respectively. These two types will be encoded in factor variables partial_sale and abnormal_sale. The impact seems to be not big though.
df["Sale Condition"].value_counts()
Sale Condition Normal 2396 Partial 233 Abnorml 189 Family 46 Alloca 22 AdjLand 12 Name: count, dtype: int64
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Sale Condition", s=15, data=df);
plot = sns.lmplot(
x="Gr Liv Area", y="SalePrice", col="Sale Condition", hue="Sale Condition",
data=df, robust=True, col_wrap=4, ci=None, truncate=True, scatter_kws={"s": 15},
)
# Adjust font sizes.
for ax in plot.axes:
ax.set_title(ax.get_title(), fontsize=20)
ax.set_xlabel(ax.get_xlabel(), fontsize=16)
ax.set_ylabel(ax.get_ylabel(), fontsize=16)
df["partial_sale"] = df["Sale Condition"].apply(lambda x: 1 if x == "Partial" else 0)
df["abnormal_sale"] = df["Sale Condition"].apply(lambda x: 1 if x == "Abnorml" else 0)
Homes that are sold for the first time cleare are priced higher. A factor variable new_home is introduced.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="Sale Type", s=15, data=df);
df["new_home"] = df["Sale Type"].apply(lambda x: 1 if x == "New" else 0)
new_variables.extend(["partial_sale", "abnormal_sale", "new_home"])
del df["Sale Condition"]
del df["Sale Type"]
Show summary of counts:
df[["partial_sale", "abnormal_sale", "new_home"]].sum()
partial_sale 233 abnormal_sale 189 new_home 227 dtype: int64
df[["partial_sale", "abnormal_sale", "new_home"]].head()
partial_sale | abnormal_sale | new_home | ||
---|---|---|---|---|
Order | PID | |||
1 | 526301100 | 0 | 0 | 0 |
2 | 526350040 | 0 | 0 | 0 |
3 | 526351010 | 0 | 0 | 0 |
4 | 526353030 | 0 | 0 | 0 |
5 | 527105010 | 0 | 0 | 0 |
Looking at the value counts this variable is pretty useless.
df["Street"].value_counts()
Street Pave 2886 Grvl 12 Name: count, dtype: int64
del df["Street"]
The dataset was put together over several years. Therefore, the variables with year numbers need to be aligned to indicate the right ages.
# For one house the year of being remodeled is one year
# before it was built. That input error is corrected.
input_error = (df["Year Remod/Add"] < df["Year Built"])
assert input_error.sum() == 1
df.loc[input_error, "Year Remod/Add"] = df.loc[input_error, "Year Built"]
Introduce a factor variable remodeled. Almost half the houses were remodeled at some point in time.
remodeled = (df["Year Remod/Add"] > df["Year Built"])
df["remodeled"] = 0
df.loc[remodeled, "remodeled"] = 1
round(100 * remodeled.sum() / df.shape[0])
46
Create discrete variables years_since_built and years_since_remodeled.
df["years_since_built"] = df["Yr Sold"] - df["Year Built"]
df["years_since_remodeled"] = df["Yr Sold"] - df["Year Remod/Add"]
df[["years_since_built", "years_since_remodeled"]].hist(bins=20);
Two factor variables recently_built and recently_remodeled are created indicating that the corresponding action took place in the last 10 years. The two scatter plots below suggest that these groups of "recent vs. old" affect the price.
df["recently_built"] = df["years_since_built"].apply(lambda x: 1 if x <= 10 else 0)
df["recently_remodeled"] = df["years_since_remodeled"].apply(lambda x: 1 if x <= 10 else 0)
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="recently_built", s=15, data=df);
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="recently_remodeled", s=15, data=df);
del df["Yr Sold"]
del df["Year Built"]
del df["Year Remod/Add"]
age_columns = [
"remodeled", "years_since_built", "years_since_remodeled",
"recently_built", "recently_remodeled",
]
new_variables.extend(age_columns)
df[age_columns].head()
remodeled | years_since_built | years_since_remodeled | recently_built | recently_remodeled | ||
---|---|---|---|---|---|---|
Order | PID | |||||
1 | 526301100 | 0 | 50 | 50 | 0 | 0 |
2 | 526350040 | 0 | 49 | 49 | 0 | 0 |
3 | 526351010 | 0 | 52 | 52 | 0 | 0 |
4 | 526353030 | 0 | 42 | 42 | 0 | 0 |
5 | 527105010 | 1 | 13 | 12 | 0 | 0 |
The instructors' notes state:
Five observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will quickly indicate these
points). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these five unusual observations) before assigning it to students.
To apply a more "rigorous" approach, outlier detection is conducted with a so-called Isolation Forest.
# Use only numeric columns that are strongly correlated with the target.
# This mitigates the risk that a "not so good" chosen factor variable introduced
# in this notebook causes an observation to be removed as an outlier.
with open("data/correlated_variables.json", "r") as file:
content = json.loads(file.read())
strongly_correlated = content["strongly_correlated"]
df_encoded = encode_ordinals(df[list(set(strongly_correlated) & set(df.columns))])
iso = IsolationForest(n_estimators=100, bootstrap=True, contamination=0.005, random_state=random_state)
outliers = pd.DataFrame(
iso.fit_predict(df_encoded), columns=["outlier"], index=df.index
)
outliers["outlier"] = outliers["outlier"].apply(lambda x: 1 if x < 0 else 0)
df = pd.concat([df, outliers], axis=1)
The five aforementioned outliers are among the ones detected.
sns.scatterplot(x="Gr Liv Area", y="SalePrice", hue="outlier", s=15, data=df);
We concur with the notes from the paper and remove only the sales of houses with more than 4000 square feet.
# Remove the outliers.
df = df[df["Gr Liv Area"] <= 4000]
# Re-order the columns for convenience.
final_columns = (
sorted(set(list(ALL_COLUMNS.keys()) + new_variables) & set(df.columns))
+ TARGET_VARIABLES
)
df = df[final_columns]
Discarding useless and adding new predictors changed the final dataset significantly.
df.shape
(2893, 109)
df.head()
1st Flr SF | 1st Flr SF (box-cox-0) | 2nd Flr SF | 3Ssn Porch | Bedroom AbvGr | Bsmt Cond | Bsmt Exposure | Bsmt Full Bath | Bsmt Half Bath | Bsmt Qual | Bsmt Unf SF | BsmtFin SF 1 | BsmtFin SF 2 | BsmtFin Type 1 | BsmtFin Type 2 | Electrical | Enclosed Porch | Fence | Fireplace Qu | Fireplaces | Full Bath | Functional | Garage Area | Garage Cars | Garage Cond | Garage Finish | Garage Qual | Gr Liv Area | Gr Liv Area (box-cox-0) | Half Bath | Kitchen AbvGr | Kitchen Qual | Land Slope | Lot Area | Lot Area (box-cox-0.1) | Lot Shape | Low Qual Fin SF | Mas Vnr Area | Misc Val | Mo Sold | Open Porch SF | Overall Cond | Overall Qual | Paved Drive | Pool Area | Pool QC | Screen Porch | TotRms AbvGrd | Total Bath | Total Bsmt SF | Total Porch SF | Total SF | Total SF (box-cox-0.2) | Utilities | Wood Deck SF | abnormal_sale | air_cond | build_type_1Fam | build_type_2Fam | build_type_Twnhs | found_BrkTil | found_CBlock | found_PConc | has 2nd Flr | has Bsmt | has Fireplace | has Garage | has Pool | has Porch | major_street | new_home | nhood_Blmngtn | nhood_Blueste | nhood_BrDale | nhood_BrkSide | nhood_ClearCr | nhood_CollgCr | nhood_Crawfor | nhood_Edwards | nhood_Gilbert | nhood_Greens | nhood_GrnHill | nhood_IDOTRR | nhood_Landmrk | nhood_MeadowV | nhood_Mitchel | nhood_NPkVill | nhood_NWAmes | nhood_Names | nhood_NoRidge | nhood_NridgHt | nhood_OldTown | nhood_SWISU | nhood_Sawyer | nhood_SawyerW | nhood_Somerst | nhood_StoneBr | nhood_Timber | nhood_Veenker | park | partial_sale | railway | recently_built | recently_remodeled | remodeled | years_since_built | years_since_remodeled | SalePrice | SalePrice (box-cox-0) | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Order | PID | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
1 | 526301100 | 1656.0 | 7.412160 | 0.0 | 0.0 | 3 | Gd | Gd | 1 | 0 | TA | 441.0 | 639.0 | 0.0 | BLQ | Unf | SBrkr | 0.0 | NA | Gd | 2 | 1 | Typ | 528.0 | 2 | TA | Fin | TA | 1656.0 | 7.412160 | 0 | 1 | TA | Gtl | 31770.0 | 18.196923 | IR1 | 0.0 | 112.0 | 0.0 | 5 | 62.0 | 5 | 6 | P | 0.0 | NA | 0.0 | 7 | 2.0 | 1080.0 | 272.0 | 2736.0 | 19.344072 | AllPub | 210.0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 50 | 50 | 215000.0 | 12.278393 |
2 | 526350040 | 896.0 | 6.797940 | 0.0 | 0.0 | 2 | TA | No | 0 | 0 | TA | 270.0 | 468.0 | 144.0 | Rec | LwQ | SBrkr | 0.0 | MnPrv | NA | 0 | 1 | Typ | 730.0 | 1 | TA | Unf | TA | 896.0 | 6.797940 | 0 | 1 | TA | Gtl | 11622.0 | 15.499290 | Reg | 0.0 | 0.0 | 0.0 | 6 | 0.0 | 6 | 5 | Y | 0.0 | NA | 120.0 | 5 | 1.0 | 882.0 | 260.0 | 1778.0 | 17.333478 | AllPub | 140.0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 49 | 49 | 105000.0 | 11.561716 |
3 | 526351010 | 1329.0 | 7.192182 | 0.0 | 0.0 | 3 | TA | No | 0 | 0 | TA | 406.0 | 923.0 | 0.0 | ALQ | Unf | SBrkr | 0.0 | NA | NA | 0 | 1 | Typ | 312.0 | 1 | TA | Unf | TA | 1329.0 | 7.192182 | 1 | 1 | Gd | Gtl | 14267.0 | 16.027549 | IR1 | 0.0 | 108.0 | 12500.0 | 6 | 36.0 | 6 | 6 | Y | 0.0 | NA | 0.0 | 6 | 1.5 | 1329.0 | 429.0 | 2658.0 | 19.203658 | AllPub | 393.0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 52 | 52 | 172000.0 | 12.055250 |
4 | 526353030 | 2110.0 | 7.654443 | 0.0 | 0.0 | 3 | TA | No | 1 | 0 | TA | 1045.0 | 1065.0 | 0.0 | ALQ | Unf | SBrkr | 0.0 | NA | TA | 2 | 2 | Typ | 522.0 | 2 | TA | Fin | TA | 2110.0 | 7.654443 | 1 | 1 | Ex | Gtl | 11160.0 | 15.396064 | Reg | 0.0 | 0.0 | 0.0 | 4 | 0.0 | 5 | 7 | Y | 0.0 | NA | 0.0 | 8 | 3.5 | 2110.0 | 0.0 | 4220.0 | 21.548042 | AllPub | 0.0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42 | 42 | 244000.0 | 12.404924 |
5 | 527105010 | 928.0 | 6.833032 | 701.0 | 0.0 | 3 | TA | No | 0 | 0 | Gd | 137.0 | 791.0 | 0.0 | GLQ | Unf | SBrkr | 0.0 | MnPrv | TA | 1 | 2 | Typ | 482.0 | 2 | TA | Fin | TA | 1629.0 | 7.395722 | 1 | 1 | TA | Gtl | 13830.0 | 15.946705 | IR1 | 0.0 | 0.0 | 0.0 | 3 | 34.0 | 5 | 5 | Y | 0.0 | NA | 0.0 | 6 | 2.5 | 928.0 | 246.0 | 2557.0 | 19.016856 | AllPub | 212.0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 13 | 12 | 189900.0 | 12.154253 |
df.to_csv("data/data_clean_with_transformations_and_factors.csv")