Predicting House Sale Price

House%20Price3.jpg

Introduction

The data used for this project is drawn from a paper written in 2017 by Dean de Cock. You can read more about why the data was collected here. You can also read about the different columns in the data here.

The paper presents a data set describing the sale of individual residential property in Ames, Iowa from 2006 to 2010. The data set contains 2930 observations and a large number of explanatory variables (23 nominal, 23 ordinal, 14 discrete, and 20 continuous) involved in assessing home values. The data set was provided for students as a final project in an undergraduate regression course.

One important value of an estimated regression equation is its ability to predict the effects on Y of a change in one or more values of the independent variables. The value of this is obvious. Careful policy cannot be made without estimates of the effects that may result. It is the desire for particular results that drive the formation of most policy. Regression models can be, and have been, invaluable aids in forming such policies.

Project Goal

The goal for this project is to use regression analysis in various forms of python coding and available library resources (sklearn.model, statsmodels, ...) to:

  • determine what primary independent variables (house features) of the total 79 available are that best can be used as a predictor estimate for house sale price.
  • determine how good the predictor equation is as a fit via various calculated values: RMSE (root mean square error), R-Squared value).
In [1]:
# import a whole pile of key python libary
# modules to execute various code commands.

import pandas as pd
import numpy as np
import random
import string
import matplotlib.pyplot as plt
import seaborn as sns
from numpy.random import seed, randint
from IPython.display import HTML
from IPython.display import display, Markdown
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')

# read data file provided and observe data structure and contents.
df = pd.read_csv('AmesHousing.txt', delimiter='\t', na_values=['NaN'])
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   object 
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   object 
 7   Alley            198 non-null    object 
 8   Lot Shape        2930 non-null   object 
 9   Land Contour     2930 non-null   object 
 10  Utilities        2930 non-null   object 
 11  Lot Config       2930 non-null   object 
 12  Land Slope       2930 non-null   object 
 13  Neighborhood     2930 non-null   object 
 14  Condition 1      2930 non-null   object 
 15  Condition 2      2930 non-null   object 
 16  Bldg Type        2930 non-null   object 
 17  House Style      2930 non-null   object 
 18  Overall Qual     2930 non-null   int64  
 19  Overall Cond     2930 non-null   int64  
 20  Year Built       2930 non-null   int64  
 21  Year Remod/Add   2930 non-null   int64  
 22  Roof Style       2930 non-null   object 
 23  Roof Matl        2930 non-null   object 
 24  Exterior 1st     2930 non-null   object 
 25  Exterior 2nd     2930 non-null   object 
 26  Mas Vnr Type     2907 non-null   object 
 27  Mas Vnr Area     2907 non-null   float64
 28  Exter Qual       2930 non-null   object 
 29  Exter Cond       2930 non-null   object 
 30  Foundation       2930 non-null   object 
 31  Bsmt Qual        2850 non-null   object 
 32  Bsmt Cond        2850 non-null   object 
 33  Bsmt Exposure    2847 non-null   object 
 34  BsmtFin Type 1   2850 non-null   object 
 35  BsmtFin SF 1     2929 non-null   float64
 36  BsmtFin Type 2   2849 non-null   object 
 37  BsmtFin SF 2     2929 non-null   float64
 38  Bsmt Unf SF      2929 non-null   float64
 39  Total Bsmt SF    2929 non-null   float64
 40  Heating          2930 non-null   object 
 41  Heating QC       2930 non-null   object 
 42  Central Air      2930 non-null   object 
 43  Electrical       2929 non-null   object 
 44  1st Flr SF       2930 non-null   int64  
 45  2nd Flr SF       2930 non-null   int64  
 46  Low Qual Fin SF  2930 non-null   int64  
 47  Gr Liv Area      2930 non-null   int64  
 48  Bsmt Full Bath   2928 non-null   float64
 49  Bsmt Half Bath   2928 non-null   float64
 50  Full Bath        2930 non-null   int64  
 51  Half Bath        2930 non-null   int64  
 52  Bedroom AbvGr    2930 non-null   int64  
 53  Kitchen AbvGr    2930 non-null   int64  
 54  Kitchen Qual     2930 non-null   object 
 55  TotRms AbvGrd    2930 non-null   int64  
 56  Functional       2930 non-null   object 
 57  Fireplaces       2930 non-null   int64  
 58  Fireplace Qu     1508 non-null   object 
 59  Garage Type      2773 non-null   object 
 60  Garage Yr Blt    2771 non-null   float64
 61  Garage Finish    2771 non-null   object 
 62  Garage Cars      2929 non-null   float64
 63  Garage Area      2929 non-null   float64
 64  Garage Qual      2771 non-null   object 
 65  Garage Cond      2771 non-null   object 
 66  Paved Drive      2930 non-null   object 
 67  Wood Deck SF     2930 non-null   int64  
 68  Open Porch SF    2930 non-null   int64  
 69  Enclosed Porch   2930 non-null   int64  
 70  3Ssn Porch       2930 non-null   int64  
 71  Screen Porch     2930 non-null   int64  
 72  Pool Area        2930 non-null   int64  
 73  Pool QC          13 non-null     object 
 74  Fence            572 non-null    object 
 75  Misc Feature     106 non-null    object 
 76  Misc Val         2930 non-null   int64  
 77  Mo Sold          2930 non-null   int64  
 78  Yr Sold          2930 non-null   int64  
 79  Sale Type        2930 non-null   object 
 80  Sale Condition   2930 non-null   object 
 81  SalePrice        2930 non-null   int64  
dtypes: float64(11), int64(28), object(43)
memory usage: 1.8+ MB
None

Initial Column Removals

There are certain columns which have a high quantity of missing values as shown in the summary of information above. I will remove those at this point of the analysis that have more than 25% missing values.

In [2]:
df = df.drop(['Alley', 'Fireplace Qu', 'Pool QC',\
              'Fence', 'Misc Feature'], axis = 1)

Start with Numerical Data

In [3]:
# extract numerical data classified as 'flota64 and 'int64'.
numerical_df = df.select_dtypes(include=['float64', 'int64'])
print(numerical_df.info(), '\n')
print(numerical_df.head(3), '\n')

# identify columns with missing values and isolate ones with less than 5%.
df_null_counts = numerical_df.isnull().sum()
print(df_null_counts), ']n'
fill_missing_values = df[df_null_counts[(df_null_counts>0) & (df_null_counts<146)].index]

# confirm successful isolation
print(fill_missing_values.isnull().sum())

# fill missing values with column mean rather than removing them.
fill_missing_values = fill_missing_values.fillna(fill_missing_values.mean())

# confirm no missing values present in any columns.
fill_missing_values.isna().sum().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 39 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   Lot Frontage     2440 non-null   float64
 4   Lot Area         2930 non-null   int64  
 5   Overall Qual     2930 non-null   int64  
 6   Overall Cond     2930 non-null   int64  
 7   Year Built       2930 non-null   int64  
 8   Year Remod/Add   2930 non-null   int64  
 9   Mas Vnr Area     2907 non-null   float64
 10  BsmtFin SF 1     2929 non-null   float64
 11  BsmtFin SF 2     2929 non-null   float64
 12  Bsmt Unf SF      2929 non-null   float64
 13  Total Bsmt SF    2929 non-null   float64
 14  1st Flr SF       2930 non-null   int64  
 15  2nd Flr SF       2930 non-null   int64  
 16  Low Qual Fin SF  2930 non-null   int64  
 17  Gr Liv Area      2930 non-null   int64  
 18  Bsmt Full Bath   2928 non-null   float64
 19  Bsmt Half Bath   2928 non-null   float64
 20  Full Bath        2930 non-null   int64  
 21  Half Bath        2930 non-null   int64  
 22  Bedroom AbvGr    2930 non-null   int64  
 23  Kitchen AbvGr    2930 non-null   int64  
 24  TotRms AbvGrd    2930 non-null   int64  
 25  Fireplaces       2930 non-null   int64  
 26  Garage Yr Blt    2771 non-null   float64
 27  Garage Cars      2929 non-null   float64
 28  Garage Area      2929 non-null   float64
 29  Wood Deck SF     2930 non-null   int64  
 30  Open Porch SF    2930 non-null   int64  
 31  Enclosed Porch   2930 non-null   int64  
 32  3Ssn Porch       2930 non-null   int64  
 33  Screen Porch     2930 non-null   int64  
 34  Pool Area        2930 non-null   int64  
 35  Misc Val         2930 non-null   int64  
 36  Mo Sold          2930 non-null   int64  
 37  Yr Sold          2930 non-null   int64  
 38  SalePrice        2930 non-null   int64  
dtypes: float64(11), int64(28)
memory usage: 892.9 KB
None 

   Order        PID  MS SubClass  Lot Frontage  Lot Area  Overall Qual  \
0      1  526301100           20         141.0     31770             6   
1      2  526350040           20          80.0     11622             5   
2      3  526351010           20          81.0     14267             6   

   Overall Cond  Year Built  Year Remod/Add  Mas Vnr Area  ...  Wood Deck SF  \
0             5        1960            1960         112.0  ...           210   
1             6        1961            1961           0.0  ...           140   
2             6        1958            1958         108.0  ...           393   

   Open Porch SF  Enclosed Porch  3Ssn Porch  Screen Porch  Pool Area  \
0             62               0           0             0          0   
1              0               0           0           120          0   
2             36               0           0             0          0   

   Misc Val  Mo Sold  Yr Sold  SalePrice  
0         0        5     2010     215000  
1         0        6     2010     105000  
2     12500        6     2010     172000  

[3 rows x 39 columns] 

Order                0
PID                  0
MS SubClass          0
Lot Frontage       490
Lot Area             0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Mas Vnr Area        23
BsmtFin SF 1         1
BsmtFin SF 2         1
Bsmt Unf SF          1
Total Bsmt SF        1
1st Flr SF           0
2nd Flr SF           0
Low Qual Fin SF      0
Gr Liv Area          0
Bsmt Full Bath       2
Bsmt Half Bath       2
Full Bath            0
Half Bath            0
Bedroom AbvGr        0
Kitchen AbvGr        0
TotRms AbvGrd        0
Fireplaces           0
Garage Yr Blt      159
Garage Cars          1
Garage Area          1
Wood Deck SF         0
Open Porch SF        0
Enclosed Porch       0
3Ssn Porch           0
Screen Porch         0
Pool Area            0
Misc Val             0
Mo Sold              0
Yr Sold              0
SalePrice            0
dtype: int64
Mas Vnr Area      23
BsmtFin SF 1       1
BsmtFin SF 2       1
Bsmt Unf SF        1
Total Bsmt SF      1
Bsmt Full Bath     2
Bsmt Half Bath     2
Garage Cars        1
Garage Area        1
dtype: int64
Out[3]:
0

Continue Numerical Data Cleaning

Years Until Remodelled

  • Data regarding specific years ('Year Remod/Add', 'Year Built') may not be usable unto themselves as predictor variables. Howerve, calculating the difference between the two may be of value to include.

House Square Feet Data

  • There are numerous columns with house square feet values with some for specific floors of house. It may be most efficient to consolidate the individual floor square feet into total house square feet as a single independent variable. One of the columns is called 'GR Liv Area'. I have a feeling that is the sum of the 1st and 2nd level square feet. I will check to see if that is true.
In [4]:
# drop numerical columns with missing values from original file. 
df_temp = df.drop(['Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2',\
                   'Bsmt Unf SF', 'Total Bsmt SF', 'Bsmt Full Bath',\
                   'Bsmt Half Bath', 'Garage Cars', 'Garage Area',\
                  ], axis = 1)

# restore numerical columns to original file that had
# missing values replaced with column mean.
df2 = pd.concat([df_temp, fill_missing_values], axis=1)

# add new variables based on calculations.
df2['yrs_until_remod'] = df2['Year Remod/Add'] - df2['Year Built']
df2['Total_SF'] = df2['1st Flr SF'] + df2['2nd Flr SF'] + df2['Total Bsmt SF']
df2['1st_plus_2nd'] = df2['1st Flr SF'] + df2['2nd Flr SF']
print(df2.info())

# drop variables that would unlikely contribute to house price prediciton.
df2 = df2.drop(['PID', 'Lot Frontage', 'Year Built',\
                'Year Remod/Add', 'Garage Yr Blt', \
                'Misc Val', 'Mo Sold', 'Yr Sold'], axis = 1)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 80 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   PID              2930 non-null   int64  
 2   MS SubClass      2930 non-null   int64  
 3   MS Zoning        2930 non-null   object 
 4   Lot Frontage     2440 non-null   float64
 5   Lot Area         2930 non-null   int64  
 6   Street           2930 non-null   object 
 7   Lot Shape        2930 non-null   object 
 8   Land Contour     2930 non-null   object 
 9   Utilities        2930 non-null   object 
 10  Lot Config       2930 non-null   object 
 11  Land Slope       2930 non-null   object 
 12  Neighborhood     2930 non-null   object 
 13  Condition 1      2930 non-null   object 
 14  Condition 2      2930 non-null   object 
 15  Bldg Type        2930 non-null   object 
 16  House Style      2930 non-null   object 
 17  Overall Qual     2930 non-null   int64  
 18  Overall Cond     2930 non-null   int64  
 19  Year Built       2930 non-null   int64  
 20  Year Remod/Add   2930 non-null   int64  
 21  Roof Style       2930 non-null   object 
 22  Roof Matl        2930 non-null   object 
 23  Exterior 1st     2930 non-null   object 
 24  Exterior 2nd     2930 non-null   object 
 25  Mas Vnr Type     2907 non-null   object 
 26  Exter Qual       2930 non-null   object 
 27  Exter Cond       2930 non-null   object 
 28  Foundation       2930 non-null   object 
 29  Bsmt Qual        2850 non-null   object 
 30  Bsmt Cond        2850 non-null   object 
 31  Bsmt Exposure    2847 non-null   object 
 32  BsmtFin Type 1   2850 non-null   object 
 33  BsmtFin Type 2   2849 non-null   object 
 34  Heating          2930 non-null   object 
 35  Heating QC       2930 non-null   object 
 36  Central Air      2930 non-null   object 
 37  Electrical       2929 non-null   object 
 38  1st Flr SF       2930 non-null   int64  
 39  2nd Flr SF       2930 non-null   int64  
 40  Low Qual Fin SF  2930 non-null   int64  
 41  Gr Liv Area      2930 non-null   int64  
 42  Full Bath        2930 non-null   int64  
 43  Half Bath        2930 non-null   int64  
 44  Bedroom AbvGr    2930 non-null   int64  
 45  Kitchen AbvGr    2930 non-null   int64  
 46  Kitchen Qual     2930 non-null   object 
 47  TotRms AbvGrd    2930 non-null   int64  
 48  Functional       2930 non-null   object 
 49  Fireplaces       2930 non-null   int64  
 50  Garage Type      2773 non-null   object 
 51  Garage Yr Blt    2771 non-null   float64
 52  Garage Finish    2771 non-null   object 
 53  Garage Qual      2771 non-null   object 
 54  Garage Cond      2771 non-null   object 
 55  Paved Drive      2930 non-null   object 
 56  Wood Deck SF     2930 non-null   int64  
 57  Open Porch SF    2930 non-null   int64  
 58  Enclosed Porch   2930 non-null   int64  
 59  3Ssn Porch       2930 non-null   int64  
 60  Screen Porch     2930 non-null   int64  
 61  Pool Area        2930 non-null   int64  
 62  Misc Val         2930 non-null   int64  
 63  Mo Sold          2930 non-null   int64  
 64  Yr Sold          2930 non-null   int64  
 65  Sale Type        2930 non-null   object 
 66  Sale Condition   2930 non-null   object 
 67  SalePrice        2930 non-null   int64  
 68  Mas Vnr Area     2930 non-null   float64
 69  BsmtFin SF 1     2930 non-null   float64
 70  BsmtFin SF 2     2930 non-null   float64
 71  Bsmt Unf SF      2930 non-null   float64
 72  Total Bsmt SF    2930 non-null   float64
 73  Bsmt Full Bath   2930 non-null   float64
 74  Bsmt Half Bath   2930 non-null   float64
 75  Garage Cars      2930 non-null   float64
 76  Garage Area      2930 non-null   float64
 77  yrs_until_remod  2930 non-null   int64  
 78  Total_SF         2930 non-null   float64
 79  1st_plus_2nd     2930 non-null   int64  
dtypes: float64(12), int64(30), object(38)
memory usage: 1.8+ MB
None

What is Data Leakage in Machine Learning?

Data leakage can cause you to create overly optimistic if not completely invalid predictive models.

Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn invalidate the estimated performance of the mode being constructed.

How could data leakage pose a problem?

  • "The reality is that as a data scientist, you’re at risk of producing a data leakage situation any time you prepare, clean your data, impute missing values, remove outliers, etc. You might be distorting the data in the process of preparing it to the point that you’ll build a model that works well on your “clean” dataset, but will totally suck when applied in the real-world situation where you actually want to apply it."

Visualize House Square Feet by House Locations

In [5]:
# use seaborn library to create boxplot.
fig, ax = plt.subplots()
fig.set_size_inches(12, 10)

column=['Gr Liv Area', '1st_plus_2nd', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Total_SF']
ax = sns.boxplot(data=df2[column], palette='colorblind')

plt.title('Boxplot of Square Feet by House Location', fontsize=20, pad=20)
plt.xlabel('House Locations', fontsize=20, labelpad = 18) 
plt.ylabel('Square Feet', fontsize=20, labelpad = 18)
plt.yticks(fontsize=13)
plt.xticks(fontsize=13)
Out[5]:
(array([0, 1, 2, 3, 4, 5]),
 [Text(0, 0, 'Gr Liv Area'),
  Text(1, 0, '1st_plus_2nd'),
  Text(2, 0, 'Total Bsmt SF'),
  Text(3, 0, '1st Flr SF'),
  Text(4, 0, '2nd Flr SF'),
  Text(5, 0, 'Total_SF')])

House Square Feet Decision Time

The graph above confirms the sum of 1st and 2nd floor square footage is equal to 'Gr Liv Area' square feet. So I will remove columns 'Gr Liv Area' and '1st_plus_2nd'.

Rather than analyzing square feet by each floor separately, I feel it would be most appropriate to analyze total square feet since there are houses with no second floor (i.e. 'Ranch-Type' or 'Bungalow') and some with no basement. So, I will eliminate the columns of square feet by floor.

In [6]:
df2 = df2.drop(['Gr Liv Area', '1st_plus_2nd', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF'], axis = 1)

numerical_df2 = df2.select_dtypes(include=['float64', 'int64'])
print(numerical_df2.info())

numerical_df2.isna().sum().sum()

null_series = numerical_df2.isnull().sum()
print(null_series)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 29 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2930 non-null   int64  
 1   MS SubClass      2930 non-null   int64  
 2   Lot Area         2930 non-null   int64  
 3   Overall Qual     2930 non-null   int64  
 4   Overall Cond     2930 non-null   int64  
 5   Low Qual Fin SF  2930 non-null   int64  
 6   Full Bath        2930 non-null   int64  
 7   Half Bath        2930 non-null   int64  
 8   Bedroom AbvGr    2930 non-null   int64  
 9   Kitchen AbvGr    2930 non-null   int64  
 10  TotRms AbvGrd    2930 non-null   int64  
 11  Fireplaces       2930 non-null   int64  
 12  Wood Deck SF     2930 non-null   int64  
 13  Open Porch SF    2930 non-null   int64  
 14  Enclosed Porch   2930 non-null   int64  
 15  3Ssn Porch       2930 non-null   int64  
 16  Screen Porch     2930 non-null   int64  
 17  Pool Area        2930 non-null   int64  
 18  SalePrice        2930 non-null   int64  
 19  Mas Vnr Area     2930 non-null   float64
 20  BsmtFin SF 1     2930 non-null   float64
 21  BsmtFin SF 2     2930 non-null   float64
 22  Bsmt Unf SF      2930 non-null   float64
 23  Bsmt Full Bath   2930 non-null   float64
 24  Bsmt Half Bath   2930 non-null   float64
 25  Garage Cars      2930 non-null   float64
 26  Garage Area      2930 non-null   float64
 27  yrs_until_remod  2930 non-null   int64  
 28  Total_SF         2930 non-null   float64
dtypes: float64(9), int64(20)
memory usage: 664.0 KB
None
Order              0
MS SubClass        0
Lot Area           0
Overall Qual       0
Overall Cond       0
Low Qual Fin SF    0
Full Bath          0
Half Bath          0
Bedroom AbvGr      0
Kitchen AbvGr      0
TotRms AbvGrd      0
Fireplaces         0
Wood Deck SF       0
Open Porch SF      0
Enclosed Porch     0
3Ssn Porch         0
Screen Porch       0
Pool Area          0
SalePrice          0
Mas Vnr Area       0
BsmtFin SF 1       0
BsmtFin SF 2       0
Bsmt Unf SF        0
Bsmt Full Bath     0
Bsmt Half Bath     0
Garage Cars        0
Garage Area        0
yrs_until_remod    0
Total_SF           0
dtype: int64

Compute Pairwise Correlation

In [7]:
# compute the pairwise correlation coefficients between all of the columns in train_subset
df2_subset = df2[null_series.index]
corr_df = df2_subset.corr(method ='pearson')
Sale_Price_Corr = abs(corr_df['SalePrice'])

sorted_corrs = Sale_Price_Corr.sort_values(ascending=True)
print(sorted_corrs)
BsmtFin SF 2       0.005890
Order              0.031408
3Ssn Porch         0.032225
Bsmt Half Bath     0.035815
Low Qual Fin SF    0.037660
Pool Area          0.068403
MS SubClass        0.085092
Overall Cond       0.101697
Screen Porch       0.112151
Kitchen AbvGr      0.119814
Enclosed Porch     0.128787
Bedroom AbvGr      0.143913
Bsmt Unf SF        0.182805
yrs_until_remod    0.240168
Lot Area           0.266549
Bsmt Full Bath     0.275894
Half Bath          0.285056
Open Porch SF      0.312951
Wood Deck SF       0.327143
BsmtFin SF 1       0.432794
Fireplaces         0.474558
TotRms AbvGrd      0.495474
Mas Vnr Area       0.505784
Full Bath          0.545604
Garage Area        0.640385
Garage Cars        0.647861
Total_SF           0.793054
Overall Qual       0.799262
SalePrice          1.000000
Name: SalePrice, dtype: float64

Check for Collinearity

The candidate features listed above are sorted by how strongly they're correlated with the SalePrice column. I will keep only the features that have a correlation of 0.3 or higher. This cutoff is a bit arbitrary, however I will start with that.

The next thing I will look for is potential collinearity between some of these feature columns. Collinearity is when 2 feature columns are highly correlated and stand the risk of duplicating information. If we have 2 features that convey the same information using 2 different measures or metrics, we don't need to keep both.

Generate a correlation matrix heatmap using Seaborn to visually compare the correlations and look for problematic pairwise feature correlations is an effective way to pick out potential cases of collinearity. Looking for outlier values in the heatmap makes this visual representation easier than just looking at the table of pairwise comparisons.

In [8]:
# plot seaborn heatmap.
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(12,10))
strong_corrs = sorted_corrs[sorted_corrs > 0.3]

corrmat = df2_subset[strong_corrs.index]
corr2_df = corrmat.corr(method ='pearson')

ax = sns.heatmap(corr2_df)

Observations and Action

The diagonal of white squares reflect the correlations of the variables with themselves (i.e. perfect correlation of 1.0)

Based on the correlation matrix heatmap (i.e. color shade comparisons), we can tell that the following pair of columns are strongly correlated:

  • Garage Area and Garage Cars

If we read the descriptions of these two columns from the data documentation, we can tell that each of the columns reflects very similar information. I'll remove 'Garage Cars'.

(Almost) Final Numerical House Features

Establish the final numerical house features to test predictability of house sale price.

In [9]:
final_corr_cols = strong_corrs.drop(['Garage Cars'])
print(final_corr_cols, '\n')
features = final_corr_cols.drop(['SalePrice']).index
display(Markdown('<h3><span style="color:blue"> Ten Numerical House Features  </span></h3>'))
print(features, '\n')
Open Porch SF    0.312951
Wood Deck SF     0.327143
BsmtFin SF 1     0.432794
Fireplaces       0.474558
TotRms AbvGrd    0.495474
Mas Vnr Area     0.505784
Full Bath        0.545604
Garage Area      0.640385
Total_SF         0.793054
Overall Qual     0.799262
SalePrice        1.000000
Name: SalePrice, dtype: float64 

Ten Numerical House Features

Index(['Open Porch SF', 'Wood Deck SF', 'BsmtFin SF 1', 'Fireplaces',
       'TotRms AbvGrd', 'Mas Vnr Area', 'Full Bath', 'Garage Area', 'Total_SF',
       'Overall Qual'],
      dtype='object') 

Create Train and Test Samples

Now I'm ready to split the 'cleaned' numerical data into 'Train' and 'Test' groups. I will split the data equally (50%) between the two.

There seems to be arguments both ways to either split the full data set into train and test without shuffling randomly or to include random shuffling. I have chosen to include randomizing.

In [10]:
# split dataframe file in half into 'Train' and 'Test' dataframes.
# include randimize shuffling to prevent confounding of results.
train=df2.sample(frac=0.5,random_state=1) #random state is a seed value
test=df2.drop(train.index)

# confirm equal split of data.
print(len(train), len(test), '\n')

target = 'SalePrice'
clean_test = test[final_corr_cols.index].dropna()

# use sklearn linear model to calculate magnitude of errors (MSE & RMSE).
lr = LinearRegression()
lr.fit(train[features], train['SalePrice'])

train_predictions = lr.predict(train[features])
test_predictions = lr.predict(clean_test[features])

train_mse = mean_squared_error(train_predictions, train[target])
test_mse = mean_squared_error(test_predictions, clean_test[target])

train_rmse = np.sqrt(train_mse)
test_rmse = np.sqrt(test_mse)

print('Train Data Root Mean Square Error =', '{:.0f}'.format(train_rmse), '\n')
print('Test Data Root Mean Square Error =', '{:.0f}'.format(test_rmse))
1465 1465 

Train Data Root Mean Square Error = 37231 

Test Data Root Mean Square Error = 34335

Observations and Next Step

RMSE

  • Root mean square error (RMSE) is a method of measuring the difference between values predicted by a model and their actual values. It is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are. In other words, it tells you how concentrated the data is around the line of best fit. Lower values of RMSE indicate better fit.

So what do the two Root Mean Square Error values above, one from 'Train' data and the other from 'Test' data tell us? Well, the value for 'Test' data is lower than the value from 'Train' which I suppose is good (i.e. lower error). But, does it tell us anything about how well the 10 features included in the Multiple Regression model fit as a predictor model for housing sale price? As a numerical individual value, I really don't think so.

R-Squared

  • I prefer to know the calculated R-Squared or (Adjusted R-Squared) value which we can obtain with other Python library modules (e.g. statsmodels). R-Squared tells me the estimated total explained variability on the depdendent variable offered by the independent variables in the model.
  • For example, suppose the calculated R-Squared value in this project was 30%. Is that good? Well, it tells me that the variables I chose in the model only explains 30% of the total variablity of house sale price. This means that variables not included in the model account for the other 70% unexplained variability. So, no; that is not a good result.
  • On the other hand, if R-Squared turned out to be say, 80%, I would say that is a good start. That leaves 20% unexplained variability. The importance of the predictor equation would have an impact on how far I would want to go in terms of finding the key independent variables that would at least generate an R-Squared or (Adjusted R-Squared) value of, 80%, 85%, 90% ...

In Multiple Linear Regression, adjusted R-Squared (corrected for sample size and regression coefficients) is more appropriate than R-Squared as an increasing number of X variables also increases R-Squared. Adjusted R-Squared is always lower than the R-Squared.

Adjusted-R²

  • The concept behind Adjusted-R² is simple. To get Adjusted-R², we penalize R² each time a new regression variable is added. Specifically, we scale (1-R²) by a factor that is directly proportional to the number of regression variables. Greater is the number of regression variables in the model, greater is this scaling factor and greater is the downward adjustment to R².
In [11]:
# create column groups from 1 variable to 10 and compare
# calculated R-Squared values for each.
col1 = ['Overall Qual']
col2 = ['Overall Qual', 'Total_SF']
col3 = ['Overall Qual', 'Total_SF', 'Garage Area']
col4 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath']
col5 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
         'Mas Vnr Area']
col6 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
         'Mas Vnr Area', 'TotRms AbvGrd']
col7 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
         'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces']
col8 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
         'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1']
col9 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
         'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1',\
         'Wood Deck SF']
col10 = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
         'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1',\
         'Wood Deck SF', 'Open Porch SF']
list = [col1, col2, col3, col4, col5, col6, col7, col8, col9, col10]
j = 0
for n in list:
    X = test[n]
    Y = test['SalePrice']
 
    # with statsmodels
    X = sm.add_constant(X) # adding a constant
    j = j + 1 
    model = sm.OLS(Y, X).fit()
    predictions = model.predict(X)
    display(Markdown('<h3><span style="color:blue"> {} Features  </span></h3>'.format(j)))
    print(n, '\n')
    rsquared = model.rsquared
    print('R-Squared Value =', "{:.2f}%".format(rsquared*100))

1 Features

['Overall Qual'] 

R-Squared Value = 64.03%

2 Features

['Overall Qual', 'Total_SF'] 

R-Squared Value = 76.90%

3 Features

['Overall Qual', 'Total_SF', 'Garage Area'] 

R-Squared Value = 78.40%

4 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath'] 

R-Squared Value = 78.41%

5 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area'] 

R-Squared Value = 79.06%

6 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd'] 

R-Squared Value = 79.07%

7 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces'] 

R-Squared Value = 79.84%

8 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1'] 

R-Squared Value = 80.76%

9 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1', 'Wood Deck SF'] 

R-Squared Value = 80.93%

10 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'TotRms AbvGrd', 'Fireplaces', 'BsmtFin SF 1', 'Wood Deck SF', 'Open Porch SF'] 

R-Squared Value = 80.94%

R-Squared Results

Of the 10 features included in the model, the following three seem to have little additional impact on total explained variability: 'TotRms AbvGrd', 'Wood Deck SF', 'Open Porch SF'.

I will rerun the Multiple Regression Analysis after removing these three variables.

In [12]:
# limiting the independent variables to seven.
col = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
         'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1']
X = test[col]
Y = test['SalePrice']
 
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

# with statsmodels
X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
display(Markdown('<h3><span style="color:blue"> {} Features  </span></h3>'.format(7)))
print(col, '\n')

print_model = model.summary()
print(print_model)
Intercept: 
 -84039.20489769935
Coefficients: 
 [2.20689106e+04 2.99955690e+01 5.00006747e+01 7.77247138e+03
 3.30347939e+01 1.03369485e+04 1.88517499e+01]

7 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1'] 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.807
Model:                            OLS   Adj. R-squared:                  0.806
Method:                 Least Squares   F-statistic:                     871.3
Date:                Wed, 14 Jul 2021   Prob (F-statistic):               0.00
Time:                        16:33:11   Log-Likelihood:                -17379.
No. Observations:                1465   AIC:                         3.477e+04
Df Residuals:                    1457   BIC:                         3.482e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -8.404e+04   4386.895    -19.157      0.000   -9.26e+04   -7.54e+04
Overall Qual  2.207e+04    925.548     23.844      0.000    2.03e+04    2.39e+04
Total_SF        29.9956      1.850     16.214      0.000      26.367      33.624
Garage Area     50.0007      5.424      9.219      0.000      39.361      60.640
Full Bath     7772.4714   2142.082      3.628      0.000    3570.577     1.2e+04
Mas Vnr Area    33.0348      5.665      5.832      0.000      21.923      44.147
Fireplaces    1.034e+04   1613.944      6.405      0.000    7171.046    1.35e+04
BsmtFin SF 1    18.8517      2.309      8.164      0.000      14.322      23.381
==============================================================================
Omnibus:                      544.756   Durbin-Watson:                   1.624
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            50591.128
Skew:                          -0.767   Prob(JB):                         0.00
Kurtosis:                      31.748   Cond. No.                     1.37e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.37e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Results with 7 Features

Finding 7 features out of the original 29 numerical columns that yield an adjusted R-Squared value of 80.6% I would say is very good for a start.

I woud like to investigate 'Sale Condition' as a variable and determine what to do with it.

Types of Sale Conditions

In [13]:
# determine how many unique 'values' there are for 'Sale Conditions'.
print(df2['Sale Condition'].value_counts(dropna=False))
Normal     2413
Partial     245
Abnorml     190
Family       46
Alloca       24
AdjLand      12
Name: Sale Condition, dtype: int64

Investigate Sale Conditions

I see in the output above that there are 6 types of 'Sale Condition' categories. Let's generate a box plot and compare house sale price for each Sale Condition category.

In [14]:
fig, ax = plt.subplots()
fig.set_size_inches(12, 10)

sns.boxplot(y=df2['SalePrice'], x=df2['Sale Condition'], 
                 data=df2, 
                 palette='colorblind')

plt.title('Boxplot of House Sale Price by Sale Condition', fontsize=20, pad=20)
plt.xlabel('Sale Condition', fontsize=18, labelpad = 18) 
plt.ylabel('House Sale Price', fontsize=18, labelpad = 18)
plt.yticks(fontsize=13)
plt.xticks(fontsize=13)
Out[14]:
(array([0, 1, 2, 3, 4, 5]),
 [Text(0, 0, 'Normal'),
  Text(1, 0, 'Partial'),
  Text(2, 0, 'Family'),
  Text(3, 0, 'Abnorml'),
  Text(4, 0, 'Alloca'),
  Text(5, 0, 'AdjLand')])

Numerical Data Within Sale Condition = Normal

The Sale Condition classed as 'Normal' represents just over 82% of all houses in this data file. Including the other five types of 'Sale Condition' categories shown above may 'muddy' the water regarding house sale price predictability. There may be significant differences in their respective means.

Let's see if we confine the data to Sale Condition = 'Normal' only if the overall model predictability increases (i.e R-Squared).

In [15]:
# filter 'Sale Condition' to one category = 'Normal'.
df2_normal = df2[df2['Sale Condition'] == 'Normal']
print(df2_normal['Sale Condition'].value_counts())

# split dataframe file in half into 'Train' and 'Test' dataframes.
# include randimize shuffling to prevent confounding of results.
train=df2_normal.sample(frac=0.5,random_state=1) #random state is a seed value
test=df2_normal.drop(train.index)
print(len(train), len(test), '\n')

col = ['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath',\
         'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1']
X = test[col]
Y = test['SalePrice']
 
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

# with statsmodels
X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
display(Markdown('<h3><span style="color:blue"> {} Features  </span></h3>'.format(7)))
print(col, '\n')

print_model = model.summary()
print(print_model)
Normal    2413
Name: Sale Condition, dtype: int64
1206 1207 

Intercept: 
 -80393.62619124033
Coefficients: 
 [18838.59607641    39.75151283    42.1614011   3792.99175005
    39.89610188  7595.27313596    23.99813149]

7 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Full Bath', 'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1'] 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.848
Model:                            OLS   Adj. R-squared:                  0.847
Method:                 Least Squares   F-statistic:                     954.4
Date:                Wed, 14 Jul 2021   Prob (F-statistic):               0.00
Time:                        16:33:11   Log-Likelihood:                -14065.
No. Observations:                1207   AIC:                         2.815e+04
Df Residuals:                    1199   BIC:                         2.819e+04
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -8.039e+04   4066.339    -19.771      0.000   -8.84e+04   -7.24e+04
Overall Qual  1.884e+04    852.965     22.086      0.000    1.72e+04    2.05e+04
Total_SF        39.7515      1.866     21.300      0.000      36.090      43.413
Garage Area     42.1614      5.019      8.400      0.000      32.314      52.009
Full Bath     3792.9918   1921.934      1.974      0.049      22.264    7563.719
Mas Vnr Area    39.8961      5.277      7.560      0.000      29.542      50.250
Fireplaces    7595.2731   1429.507      5.313      0.000    4790.660    1.04e+04
BsmtFin SF 1    23.9981      2.196     10.928      0.000      19.690      28.307
==============================================================================
Omnibus:                      401.861   Durbin-Watson:                   1.678
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3654.498
Skew:                           1.272   Prob(JB):                         0.00
Kurtosis:                      11.136   Cond. No.                     1.36e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.36e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Sale Condition = 'Normal' Result

We see that by creating a predictor model within Sale Condition = 'Normal, the adjusted R-Squared value increased from 80.6% to 84.7%

Whether someone would want to limit a house sale price predictor model to only 'Normal' sale condition or across all 6 catagories would depend on how critical it would be to maximize adjusted R-Squared value.

One More Change to Consider

Looking at the output table of coefficients above, I see that coefficient 'Full Bath' has a P>|t| value of 0.049.

This indicates a somewhat 'weak' variable in the equation relative to the others. I will take it out of the equation and see if Adjusted R-Squared lowers.

In [16]:
# omit category 'Full Bath' from equation.
col = ['Overall Qual', 'Total_SF', 'Garage Area', \
         'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1']
X = test[col] # here we have 10 variables for multiple regression.
Y = test['SalePrice']
 
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, Y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

# with statsmodels
X = sm.add_constant(X) # adding a constant
 
model = sm.OLS(Y, X).fit()
predictions = model.predict(X) 
display(Markdown('<h3><span style="color:blue"> {} Features  </span></h3>'.format(6)))
print(col, '\n')

print_model = model.summary()
print(print_model)
Intercept: 
 -79844.43133164069
Coefficients: 
 [19114.77818564    41.32715704    42.72905189    39.54209009
  7479.86412996    23.02168246]

6 Features

['Overall Qual', 'Total_SF', 'Garage Area', 'Mas Vnr Area', 'Fireplaces', 'BsmtFin SF 1'] 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.847
Model:                            OLS   Adj. R-squared:                  0.847
Method:                 Least Squares   F-statistic:                     1110.
Date:                Wed, 14 Jul 2021   Prob (F-statistic):               0.00
Time:                        16:33:11   Log-Likelihood:                -14067.
No. Observations:                1207   AIC:                         2.815e+04
Df Residuals:                    1200   BIC:                         2.818e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const        -7.984e+04   4061.696    -19.658      0.000   -8.78e+04   -7.19e+04
Overall Qual  1.911e+04    842.421     22.690      0.000    1.75e+04    2.08e+04
Total_SF        41.3272      1.689     24.470      0.000      38.014      44.641
Garage Area     42.7291      5.017      8.517      0.000      32.886      52.572
Mas Vnr Area    39.5421      5.281      7.488      0.000      29.182      49.903
Fireplaces    7479.8641   1430.032      5.231      0.000    4674.223    1.03e+04
BsmtFin SF 1    23.0217      2.142     10.747      0.000      18.819      27.225
==============================================================================
Omnibus:                      390.909   Durbin-Watson:                   1.681
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3420.680
Skew:                           1.241   Prob(JB):                         0.00
Kurtosis:                      10.865   Cond. No.                     1.36e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.36e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Result within 'Normal' Sale Condition and 6 Features

The adjusted R-Squared value remained the same: 84.7%. This I would say is quite good with only 6 house features used.

Linear Regression with Category Type Data

Let's see if we can glean some category type columns from the data set and yield reasonable predictions of house sale price.

In [17]:
# extract non numeric columns from original data set.
text_df2 = df2.select_dtypes(include=['object'])

text_df3 = pd.DataFrame()
print(text_df2.info(), '\n')

# remove 'object' based columns with less than 5 unique values.
# will most likely not be significant in predictor model.
for n in text_df2:
    print(n, '    Unique Values Count =    ', text_df2[n].nunique())
    new = text_df2[n].nunique()
    if new > 4:
        text_df3 = pd.concat([text_df3,text_df2[n]], axis=1)
    else:
        None
print('\n')    
print(text_df3.head(3), '\n')
print(text_df3.info(), '\n')

df_null_counts2 = text_df3.isnull().sum()
print(df_null_counts2, '\n')
fill_missing_values2 = df[df_null_counts2[df_null_counts2>0].index]

print(fill_missing_values2.isnull().sum(), '\n')

# determine qty. of each unique value.
# this will determine what to change the missing values
# to - the mode among the unique values in each category.
for n in fill_missing_values2:
    print(fill_missing_values2[n].value_counts(dropna=False))
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 38 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   MS Zoning       2930 non-null   object
 1   Street          2930 non-null   object
 2   Lot Shape       2930 non-null   object
 3   Land Contour    2930 non-null   object
 4   Utilities       2930 non-null   object
 5   Lot Config      2930 non-null   object
 6   Land Slope      2930 non-null   object
 7   Neighborhood    2930 non-null   object
 8   Condition 1     2930 non-null   object
 9   Condition 2     2930 non-null   object
 10  Bldg Type       2930 non-null   object
 11  House Style     2930 non-null   object
 12  Roof Style      2930 non-null   object
 13  Roof Matl       2930 non-null   object
 14  Exterior 1st    2930 non-null   object
 15  Exterior 2nd    2930 non-null   object
 16  Mas Vnr Type    2907 non-null   object
 17  Exter Qual      2930 non-null   object
 18  Exter Cond      2930 non-null   object
 19  Foundation      2930 non-null   object
 20  Bsmt Qual       2850 non-null   object
 21  Bsmt Cond       2850 non-null   object
 22  Bsmt Exposure   2847 non-null   object
 23  BsmtFin Type 1  2850 non-null   object
 24  BsmtFin Type 2  2849 non-null   object
 25  Heating         2930 non-null   object
 26  Heating QC      2930 non-null   object
 27  Central Air     2930 non-null   object
 28  Electrical      2929 non-null   object
 29  Kitchen Qual    2930 non-null   object
 30  Functional      2930 non-null   object
 31  Garage Type     2773 non-null   object
 32  Garage Finish   2771 non-null   object
 33  Garage Qual     2771 non-null   object
 34  Garage Cond     2771 non-null   object
 35  Paved Drive     2930 non-null   object
 36  Sale Type       2930 non-null   object
 37  Sale Condition  2930 non-null   object
dtypes: object(38)
memory usage: 870.0+ KB
None 

MS Zoning     Unique Values Count =     7
Street     Unique Values Count =     2
Lot Shape     Unique Values Count =     4
Land Contour     Unique Values Count =     4
Utilities     Unique Values Count =     3
Lot Config     Unique Values Count =     5
Land Slope     Unique Values Count =     3
Neighborhood     Unique Values Count =     28
Condition 1     Unique Values Count =     9
Condition 2     Unique Values Count =     8
Bldg Type     Unique Values Count =     5
House Style     Unique Values Count =     8
Roof Style     Unique Values Count =     6
Roof Matl     Unique Values Count =     8
Exterior 1st     Unique Values Count =     16
Exterior 2nd     Unique Values Count =     17
Mas Vnr Type     Unique Values Count =     5
Exter Qual     Unique Values Count =     4
Exter Cond     Unique Values Count =     5
Foundation     Unique Values Count =     6
Bsmt Qual     Unique Values Count =     5
Bsmt Cond     Unique Values Count =     5
Bsmt Exposure     Unique Values Count =     4
BsmtFin Type 1     Unique Values Count =     6
BsmtFin Type 2     Unique Values Count =     6
Heating     Unique Values Count =     6
Heating QC     Unique Values Count =     5
Central Air     Unique Values Count =     2
Electrical     Unique Values Count =     5
Kitchen Qual     Unique Values Count =     5
Functional     Unique Values Count =     8
Garage Type     Unique Values Count =     6
Garage Finish     Unique Values Count =     3
Garage Qual     Unique Values Count =     5
Garage Cond     Unique Values Count =     5
Paved Drive     Unique Values Count =     3
Sale Type     Unique Values Count =     10
Sale Condition     Unique Values Count =     6


  MS Zoning Lot Config Neighborhood Condition 1 Condition 2 Bldg Type  \
0        RL     Corner        NAmes        Norm        Norm      1Fam   
1        RH     Inside        NAmes       Feedr        Norm      1Fam   
2        RL     Corner        NAmes        Norm        Norm      1Fam   

  House Style Roof Style Roof Matl Exterior 1st  ... Heating Heating QC  \
0      1Story        Hip   CompShg      BrkFace  ...    GasA         Fa   
1      1Story      Gable   CompShg      VinylSd  ...    GasA         TA   
2      1Story        Hip   CompShg      Wd Sdng  ...    GasA         TA   

  Electrical Kitchen Qual Functional Garage Type Garage Qual Garage Cond  \
0      SBrkr           TA        Typ      Attchd          TA          TA   
1      SBrkr           TA        Typ      Attchd          TA          TA   
2      SBrkr           Gd        Typ      Attchd          TA          TA   

  Sale Type Sale Condition  
0       WD          Normal  
1       WD          Normal  
2       WD          Normal  

[3 rows x 28 columns] 

<class 'pandas.core.frame.DataFrame'>
Index: 2930 entries, 0 to 2929
Data columns (total 28 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   MS Zoning       2930 non-null   object
 1   Lot Config      2930 non-null   object
 2   Neighborhood    2930 non-null   object
 3   Condition 1     2930 non-null   object
 4   Condition 2     2930 non-null   object
 5   Bldg Type       2930 non-null   object
 6   House Style     2930 non-null   object
 7   Roof Style      2930 non-null   object
 8   Roof Matl       2930 non-null   object
 9   Exterior 1st    2930 non-null   object
 10  Exterior 2nd    2930 non-null   object
 11  Mas Vnr Type    2907 non-null   object
 12  Exter Cond      2930 non-null   object
 13  Foundation      2930 non-null   object
 14  Bsmt Qual       2850 non-null   object
 15  Bsmt Cond       2850 non-null   object
 16  BsmtFin Type 1  2850 non-null   object
 17  BsmtFin Type 2  2849 non-null   object
 18  Heating         2930 non-null   object
 19  Heating QC      2930 non-null   object
 20  Electrical      2929 non-null   object
 21  Kitchen Qual    2930 non-null   object
 22  Functional      2930 non-null   object
 23  Garage Type     2773 non-null   object
 24  Garage Qual     2771 non-null   object
 25  Garage Cond     2771 non-null   object
 26  Sale Type       2930 non-null   object
 27  Sale Condition  2930 non-null   object
dtypes: object(28)
memory usage: 663.8+ KB
None 

MS Zoning           0
Lot Config          0
Neighborhood        0
Condition 1         0
Condition 2         0
Bldg Type           0
House Style         0
Roof Style          0
Roof Matl           0
Exterior 1st        0
Exterior 2nd        0
Mas Vnr Type       23
Exter Cond          0
Foundation          0
Bsmt Qual          80
Bsmt Cond          80
BsmtFin Type 1     80
BsmtFin Type 2     81
Heating             0
Heating QC          0
Electrical          1
Kitchen Qual        0
Functional          0
Garage Type       157
Garage Qual       159
Garage Cond       159
Sale Type           0
Sale Condition      0
dtype: int64 

Mas Vnr Type       23
Bsmt Qual          80
Bsmt Cond          80
BsmtFin Type 1     80
BsmtFin Type 2     81
Electrical          1
Garage Type       157
Garage Qual       159
Garage Cond       159
dtype: int64 

None       1752
BrkFace     880
Stone       249
BrkCmn       25
NaN          23
CBlock        1
Name: Mas Vnr Type, dtype: int64
TA     1283
Gd     1219
Ex      258
Fa       88
NaN      80
Po        2
Name: Bsmt Qual, dtype: int64
TA     2616
Gd      122
Fa      104
NaN      80
Po        5
Ex        3
Name: Bsmt Cond, dtype: int64
GLQ    859
Unf    851
ALQ    429
Rec    288
BLQ    269
LwQ    154
NaN     80
Name: BsmtFin Type 1, dtype: int64
Unf    2499
Rec     106
LwQ      89
NaN      81
BLQ      68
ALQ      53
GLQ      34
Name: BsmtFin Type 2, dtype: int64
SBrkr    2682
FuseA     188
FuseF      50
FuseP       8
Mix         1
NaN         1
Name: Electrical, dtype: int64
Attchd     1731
Detchd      782
BuiltIn     186
NaN         157
Basment      36
2Types       23
CarPort      15
Name: Garage Type, dtype: int64
TA     2615
NaN     159
Fa      124
Gd       24
Po        5
Ex        3
Name: Garage Qual, dtype: int64
TA     2665
NaN     159
Fa       74
Gd       15
Po       14
Ex        3
Name: Garage Cond, dtype: int64

Missing Value Fill In Decisions

  • TA = Typical/Average
  • Gd = Good

  • Mas Vnr Type - This is masonry veneer type and 60% is None. Therefore I will remove this column.

  • Bsmt Qual - Split the 80 NaN values between TA and Gd.
  • Bsmt Cond - 89% is TA, so allocate the missng values to TA.
  • BsmtFin Type 1 - Split the 80 NaN values between GLQ and Unf.
  • BsmtFin Type 2 - 85% is Unf, so allocate the missng values to Unf.
  • Electrical - 92% is SBrkr, so allocate the missng values to SBrkr.
  • Garage Type - Split the 157 NaN values between Attchd (70%) and Detchd (30%).
  • Garage Qual - 89% is TA, so allocate the missng values to TA.
  • Garage Cond - 91% is TA, so allocate the missng values to TA.
In [18]:
fill_missing_values2['Bsmt Qual'].replace('nan', np.nan, inplace=True)    
mask = fill_missing_values2['Bsmt Qual'].isna() 
ind = fill_missing_values2['Bsmt Qual'].loc[mask].sample(frac=0.5).index
fill_missing_values2.loc[ind, 'Bsmt Qual'] = 'TA'
fill_missing_values2['Bsmt Qual'] = fill_missing_values2['Bsmt Qual'].fillna('Gd')    
print(fill_missing_values2['Bsmt Qual']. value_counts(dropna=False)) 

fill_missing_values2['Bsmt Cond'] = fill_missing_values2['Bsmt Cond'].fillna('TA')
print(fill_missing_values2['Bsmt Cond']. value_counts(dropna=False))

fill_missing_values2['BsmtFin Type 1'].replace('nan', np.nan, inplace=True)    
mask = fill_missing_values2['BsmtFin Type 1'].isna() 
ind = fill_missing_values2['BsmtFin Type 1'].loc[mask].sample(frac=0.5).index
fill_missing_values2.loc[ind, 'BsmtFin Type 1'] = 'GLQ'
fill_missing_values2['BsmtFin Type 1'] = fill_missing_values2['BsmtFin Type 1'].fillna('Unf')   
print(fill_missing_values2['BsmtFin Type 1']. value_counts(dropna=False))

fill_missing_values2['BsmtFin Type 2'] = fill_missing_values2['BsmtFin Type 2'].fillna('Unf')
print(fill_missing_values2['BsmtFin Type 2']. value_counts(dropna=False))

fill_missing_values2['Electrical'] = fill_missing_values2['Electrical'].fillna('SBrkr')
print(fill_missing_values2['Electrical']. value_counts(dropna=False))

fill_missing_values2['Garage Type'].replace('nan', np.nan, inplace=True)    
mask = fill_missing_values2['Garage Type'].isna() 
ind = fill_missing_values2['Garage Type'].loc[mask].sample(frac=0.7).index
fill_missing_values2.loc[ind, 'Garage Type'] = 'Attchd'
fill_missing_values2['Garage Type'] = fill_missing_values2['Garage Type'].fillna('Detchd')    
print(fill_missing_values2['Garage Type']. value_counts(dropna=False))

fill_missing_values2['Garage Qual'] = fill_missing_values2['Garage Qual'].fillna('TA')
print(fill_missing_values2['Garage Qual']. value_counts(dropna=False))

fill_missing_values2['Garage Cond'] = fill_missing_values2['Garage Cond'].fillna('TA')
print(fill_missing_values2['Garage Cond']. value_counts(dropna=False))
print(fill_missing_values2.isnull().sum())
TA    1323
Gd    1259
Ex     258
Fa      88
Po       2
Name: Bsmt Qual, dtype: int64
TA    2696
Gd     122
Fa     104
Po       5
Ex       3
Name: Bsmt Cond, dtype: int64
GLQ    899
Unf    891
ALQ    429
Rec    288
BLQ    269
LwQ    154
Name: BsmtFin Type 1, dtype: int64
Unf    2580
Rec     106
LwQ      89
BLQ      68
ALQ      53
GLQ      34
Name: BsmtFin Type 2, dtype: int64
SBrkr    2683
FuseA     188
FuseF      50
FuseP       8
Mix         1
Name: Electrical, dtype: int64
Attchd     1841
Detchd      829
BuiltIn     186
Basment      36
2Types       23
CarPort      15
Name: Garage Type, dtype: int64
TA    2774
Fa     124
Gd      24
Po       5
Ex       3
Name: Garage Qual, dtype: int64
TA    2824
Fa      74
Gd      15
Po      14
Ex       3
Name: Garage Cond, dtype: int64
Mas Vnr Type      23
Bsmt Qual          0
Bsmt Cond          0
BsmtFin Type 1     0
BsmtFin Type 2     0
Electrical         0
Garage Type        0
Garage Qual        0
Garage Cond        0
dtype: int64

Create Categorical Dataframe with No Missing Values

In [19]:
# replace columns with missing values with
# modified ones having no missing values.
df_cat_temp = text_df3.drop(['Mas Vnr Type', 'Bsmt Qual', 'Bsmt Cond', \
                   'BsmtFin Type 1', 'BsmtFin Type 2', 'Electrical', \
                   'Garage Type', 'Garage Qual', 'Garage Cond',\
                  ], axis = 1)

text_df3 = pd.concat([df_cat_temp, fill_missing_values2], axis=1)
text_df3 = text_df3.drop(['Mas Vnr Type'], axis = 1)

print(text_df3.info())
# confirm no missing values.
print(text_df3.isnull().sum().sum())
<class 'pandas.core.frame.DataFrame'>
Index: 2930 entries, 0 to 2929
Data columns (total 27 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   MS Zoning       2930 non-null   object
 1   Lot Config      2930 non-null   object
 2   Neighborhood    2930 non-null   object
 3   Condition 1     2930 non-null   object
 4   Condition 2     2930 non-null   object
 5   Bldg Type       2930 non-null   object
 6   House Style     2930 non-null   object
 7   Roof Style      2930 non-null   object
 8   Roof Matl       2930 non-null   object
 9   Exterior 1st    2930 non-null   object
 10  Exterior 2nd    2930 non-null   object
 11  Exter Cond      2930 non-null   object
 12  Foundation      2930 non-null   object
 13  Heating         2930 non-null   object
 14  Heating QC      2930 non-null   object
 15  Kitchen Qual    2930 non-null   object
 16  Functional      2930 non-null   object
 17  Sale Type       2930 non-null   object
 18  Sale Condition  2930 non-null   object
 19  Bsmt Qual       2930 non-null   object
 20  Bsmt Cond       2930 non-null   object
 21  BsmtFin Type 1  2930 non-null   object
 22  BsmtFin Type 2  2930 non-null   object
 23  Electrical      2930 non-null   object
 24  Garage Type     2930 non-null   object
 25  Garage Qual     2930 non-null   object
 26  Garage Cond     2930 non-null   object
dtypes: object(27)
memory usage: 640.9+ KB
None
0

Linear Regression for Categorical Variables

In [20]:
X = text_df3
Y = df2['SalePrice']

# create 'dummy' columns in order to be able to
# perform linear regression analysis on categorical data.
X = pd.get_dummies(data=X, drop_first=True)
print('There are', len(X.columns), 'dummy columns created.', '\n')
print(len(X))
from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .50, random_state = 40)

model = sm.OLS(Y_test, X_test).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)
There are 184 dummy columns created. 

2930
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.822
Model:                            OLS   Adj. R-squared:                  0.798
Method:                 Least Squares   F-statistic:                     34.14
Date:                Wed, 14 Jul 2021   Prob (F-statistic):               0.00
Time:                        16:33:12   Log-Likelihood:                -17284.
No. Observations:                1465   AIC:                         3.492e+04
Df Residuals:                    1290   BIC:                         3.584e+04
Df Model:                         174                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
MS Zoning_C (all)       7.772e+04   1.83e+04      4.242      0.000    4.18e+04    1.14e+05
MS Zoning_FV            1.146e+05   1.67e+04      6.877      0.000    8.19e+04    1.47e+05
MS Zoning_I (all)       1.914e+04   3.77e+04      0.507      0.612   -5.49e+04    9.32e+04
MS Zoning_RH            1.229e+05   1.78e+04      6.902      0.000    8.79e+04    1.58e+05
MS Zoning_RL            1.205e+05   1.47e+04      8.168      0.000    9.15e+04    1.49e+05
MS Zoning_RM            1.218e+05   1.51e+04      8.085      0.000    9.23e+04    1.51e+05
Lot Config_CulDSac      1.007e+04   4692.480      2.145      0.032     861.303    1.93e+04
Lot Config_FR2          -1.14e+04   6381.047     -1.786      0.074   -2.39e+04    1120.779
Lot Config_FR3         -1.718e+04   1.51e+04     -1.139      0.255   -4.68e+04    1.24e+04
Lot Config_Inside      -1556.7795   2606.623     -0.597      0.550   -6670.465    3556.906
Neighborhood_Blueste   -7801.7947   1.67e+04     -0.468      0.640   -4.05e+04    2.49e+04
Neighborhood_BrDale    -1.873e+04   1.53e+04     -1.223      0.222   -4.88e+04    1.13e+04
Neighborhood_BrkSide   -4.506e+04   1.27e+04     -3.541      0.000      -7e+04   -2.01e+04
Neighborhood_ClearCr     149.3453   1.34e+04      0.011      0.991   -2.62e+04    2.65e+04
Neighborhood_CollgCr   -2.714e+04   1.01e+04     -2.688      0.007    -4.7e+04   -7329.576
Neighborhood_Crawfor    1.662e+04   1.13e+04      1.471      0.142   -5545.276    3.88e+04
Neighborhood_Edwards   -5.512e+04    1.1e+04     -5.012      0.000   -7.67e+04   -3.35e+04
Neighborhood_Gilbert    -3.51e+04   1.07e+04     -3.289      0.001    -5.6e+04   -1.42e+04
Neighborhood_Greens     5.276e+04   3.69e+04      1.430      0.153   -1.96e+04    1.25e+05
Neighborhood_GrnHill    9.973e-10   1.96e-10      5.081      0.000    6.12e-10    1.38e-09
Neighborhood_IDOTRR    -5.078e+04   1.39e+04     -3.648      0.000   -7.81e+04   -2.35e+04
Neighborhood_Landmrk   -2.621e+04   3.78e+04     -0.693      0.489      -1e+05     4.8e+04
Neighborhood_MeadowV    -7.18e+04   1.56e+04     -4.601      0.000   -1.02e+05   -4.12e+04
Neighborhood_Mitchel   -4.192e+04   1.12e+04     -3.740      0.000   -6.39e+04   -1.99e+04
Neighborhood_NAmes     -3.858e+04   1.07e+04     -3.600      0.000   -5.96e+04   -1.76e+04
Neighborhood_NPkVill    1.702e+04   2.81e+04      0.606      0.545   -3.81e+04    7.21e+04
Neighborhood_NWAmes    -1.965e+04   1.14e+04     -1.722      0.085    -4.2e+04    2735.529
Neighborhood_NoRidge    5.284e+04   1.16e+04      4.550      0.000    3.01e+04    7.56e+04
Neighborhood_NridgHt    2.508e+04   1.05e+04      2.386      0.017    4460.308    4.57e+04
Neighborhood_OldTown   -5.102e+04   1.26e+04     -4.035      0.000   -7.58e+04   -2.62e+04
Neighborhood_SWISU     -3.731e+04   1.31e+04     -2.839      0.005   -6.31e+04   -1.15e+04
Neighborhood_Sawyer    -3.895e+04   1.13e+04     -3.446      0.001   -6.11e+04   -1.68e+04
Neighborhood_SawyerW   -2.856e+04    1.1e+04     -2.595      0.010   -5.02e+04   -6969.205
Neighborhood_Somerst   -5970.6198   1.21e+04     -0.494      0.622   -2.97e+04    1.78e+04
Neighborhood_StoneBr    5.144e+04   1.19e+04      4.315      0.000    2.81e+04    7.48e+04
Neighborhood_Timber    -4661.3655   1.13e+04     -0.413      0.680   -2.68e+04    1.75e+04
Neighborhood_Veenker    1.575e+04   1.59e+04      0.994      0.321   -1.53e+04    4.68e+04
Condition 1_Feedr       4821.2500   7373.224      0.654      0.513   -9643.574    1.93e+04
Condition 1_Norm        8019.4388   5917.846      1.355      0.176   -3590.219    1.96e+04
Condition 1_PosA        4.331e+04   1.91e+04      2.265      0.024    5796.701    8.08e+04
Condition 1_PosN        1.471e+04   1.07e+04      1.372      0.170   -6329.459    3.57e+04
Condition 1_RRAe       -1.227e+04   1.11e+04     -1.109      0.268    -3.4e+04    9427.839
Condition 1_RRAn        6990.9353   9366.367      0.746      0.456   -1.14e+04    2.54e+04
Condition 1_RRNe       -1.342e+04   2.14e+04     -0.626      0.531   -5.55e+04    2.86e+04
Condition 1_RRNn        -364.2623   2.23e+04     -0.016      0.987    -4.4e+04    4.33e+04
Condition 2_Feedr         1.9e+04   2.53e+04      0.752      0.452   -3.06e+04    6.86e+04
Condition 2_Norm        8378.1302   2.16e+04      0.389      0.698   -3.39e+04    5.07e+04
Condition 2_PosA        3.544e+04   4.52e+04      0.784      0.433   -5.33e+04    1.24e+05
Condition 2_PosN       -1.678e+04   3.42e+04     -0.491      0.623   -8.38e+04    5.02e+04
Condition 2_RRAe         9.18e-11   1.48e-10      0.622      0.534   -1.98e-10    3.81e-10
Condition 2_RRAn        1.651e+04    4.2e+04      0.393      0.694   -6.59e+04    9.89e+04
Condition 2_RRNn        3.047e+04   4.19e+04      0.727      0.467   -5.17e+04    1.13e+05
Bldg Type_2fmCon       -1.135e+04   6992.811     -1.624      0.105   -2.51e+04    2365.144
Bldg Type_Duplex        1905.3256   6147.617      0.310      0.757   -1.02e+04     1.4e+04
Bldg Type_Twnhs        -5.808e+04   7731.124     -7.513      0.000   -7.32e+04   -4.29e+04
Bldg Type_TwnhsE       -4.571e+04   5125.030     -8.920      0.000   -5.58e+04   -3.57e+04
House Style_1.5Unf     -1.789e+04   1.19e+04     -1.504      0.133   -4.12e+04    5448.042
House Style_1Story     -1.327e+04   3737.956     -3.551      0.000   -2.06e+04   -5939.984
House Style_2.5Fin      2097.6550   1.84e+04      0.114      0.909   -3.39e+04    3.81e+04
House Style_2.5Unf       3.62e+04   1.16e+04      3.115      0.002    1.34e+04     5.9e+04
House Style_2Story      4525.0635   3955.381      1.144      0.253   -3234.622    1.23e+04
House Style_SFoyer     -1.592e+04   7405.048     -2.150      0.032   -3.05e+04   -1397.046
House Style_SLvl       -1.327e+04   6201.639     -2.139      0.033   -2.54e+04   -1100.577
Roof Style_Gable        2.398e+04   2.15e+04      1.114      0.265   -1.82e+04    6.62e+04
Roof Style_Gambrel      1.739e+04   2.38e+04      0.732      0.464   -2.92e+04     6.4e+04
Roof Style_Hip          4.057e+04   2.17e+04      1.871      0.062   -1972.228    8.31e+04
Roof Style_Mansard      1.622e+04   2.61e+04      0.621      0.534    -3.5e+04    6.74e+04
Roof Style_Shed         6741.2864    3.1e+04      0.217      0.828   -5.42e+04    6.76e+04
Roof Matl_CompShg       6.758e+04   1.89e+04      3.570      0.000    3.04e+04    1.05e+05
Roof Matl_Membran       1.007e+05   3.81e+04      2.645      0.008     2.6e+04    1.75e+05
Roof Matl_Metal         7.109e+04   3.69e+04      1.929      0.054   -1223.448    1.43e+05
Roof Matl_Roll          8.284e+04    3.8e+04      2.181      0.029    8312.239    1.57e+05
Roof Matl_Tar&Grv       9.898e+04   1.98e+04      5.002      0.000    6.02e+04    1.38e+05
Roof Matl_WdShake        1.07e+05   2.38e+04      4.493      0.000    6.03e+04    1.54e+05
Roof Matl_WdShngl       1.521e+05   2.32e+04      6.567      0.000    1.07e+05    1.98e+05
Exterior 1st_AsphShn   -4.236e+04   5.45e+04     -0.778      0.437   -1.49e+05    6.45e+04
Exterior 1st_BrkComm    5.465e+04   2.73e+04      2.003      0.045    1114.294    1.08e+05
Exterior 1st_BrkFace    5.277e+04   1.69e+04      3.125      0.002    1.96e+04    8.59e+04
Exterior 1st_CBlock     1.036e+05      4e+04      2.589      0.010    2.51e+04    1.82e+05
Exterior 1st_CemntBd    7.774e+04   2.38e+04      3.262      0.001     3.1e+04    1.24e+05
Exterior 1st_HdBoard    2.162e+04   1.62e+04      1.333      0.183   -1.02e+04    5.34e+04
Exterior 1st_ImStucc    -998.7501   4.03e+04     -0.025      0.980      -8e+04     7.8e+04
Exterior 1st_MetalSd     798.1393   1.85e+04      0.043      0.966   -3.54e+04     3.7e+04
Exterior 1st_Plywood    3.003e+04   1.59e+04      1.888      0.059   -1180.718    6.12e+04
Exterior 1st_PreCast    7.169e+04   2.06e+04      3.483      0.001    3.13e+04    1.12e+05
Exterior 1st_Stone      8852.2871   4.82e+04      0.184      0.854   -8.58e+04    1.03e+05
Exterior 1st_Stucco     2.899e+04   1.84e+04      1.573      0.116   -7173.441    6.52e+04
Exterior 1st_VinylSd    9010.0302   1.72e+04      0.525      0.600   -2.47e+04    4.27e+04
Exterior 1st_Wd Sdng    1.444e+04   1.56e+04      0.929      0.353   -1.61e+04     4.5e+04
Exterior 1st_WdShing    1.805e+04   1.71e+04      1.054      0.292   -1.55e+04    5.16e+04
Exterior 2nd_AsphShn    2.656e+04    4.2e+04      0.632      0.528   -5.59e+04    1.09e+05
Exterior 2nd_Brk Cmn    -2.08e+04   3.07e+04     -0.677      0.499   -8.11e+04    3.95e+04
Exterior 2nd_BrkFace    -4.02e+04   1.94e+04     -2.071      0.039   -7.83e+04   -2125.412
Exterior 2nd_CBlock     1.532e+04   4.48e+04      0.342      0.733   -7.27e+04    1.03e+05
Exterior 2nd_CmentBd   -3.239e+04   2.46e+04     -1.318      0.188   -8.06e+04    1.58e+04
Exterior 2nd_HdBoard   -6978.1506   1.74e+04     -0.402      0.688   -4.11e+04    2.71e+04
Exterior 2nd_ImStucc   -3109.0141   2.14e+04     -0.145      0.885   -4.52e+04     3.9e+04
Exterior 2nd_MetalSd    1.956e+04   1.93e+04      1.015      0.310   -1.82e+04    5.73e+04
Exterior 2nd_Other      3.283e+04   3.99e+04      0.823      0.411   -4.55e+04    1.11e+05
Exterior 2nd_Plywood   -5383.4469   1.67e+04     -0.323      0.747   -3.81e+04    2.73e+04
Exterior 2nd_PreCast    7.169e+04   2.06e+04      3.483      0.001    3.13e+04    1.12e+05
Exterior 2nd_Stone     -3.235e+04   3.12e+04     -1.036      0.300   -9.36e+04    2.89e+04
Exterior 2nd_Stucco    -1.605e+04   1.95e+04     -0.823      0.411   -5.43e+04    2.22e+04
Exterior 2nd_VinylSd    1.071e+04   1.79e+04      0.598      0.550   -2.44e+04    4.58e+04
Exterior 2nd_Wd Sdng    4019.5004   1.67e+04      0.241      0.809   -2.87e+04    3.67e+04
Exterior 2nd_Wd Shng   -7688.7615   1.74e+04     -0.442      0.659   -4.18e+04    2.64e+04
Exter Cond_Fa          -2.337e+04   1.91e+04     -1.224      0.221   -6.08e+04    1.41e+04
Exter Cond_Gd          -1.071e+04   1.81e+04     -0.591      0.555   -4.63e+04    2.49e+04
Exter Cond_Po          -2035.1372   4.06e+04     -0.050      0.960   -8.17e+04    7.76e+04
Exter Cond_TA          -1.104e+04    1.8e+04     -0.614      0.539   -4.63e+04    2.42e+04
Foundation_CBlock       2378.3209   4607.669      0.516      0.606   -6661.025    1.14e+04
Foundation_PConc        6226.2124   4892.425      1.273      0.203   -3371.770    1.58e+04
Foundation_Slab        -2.776e+04   9012.495     -3.080      0.002   -4.54e+04   -1.01e+04
Foundation_Stone        2.063e+04   1.41e+04      1.463      0.144   -7037.951    4.83e+04
Foundation_Wood         9958.1167   2.58e+04      0.386      0.699   -4.06e+04    6.05e+04
Heating_GasA            1.309e+05   1.99e+04      6.579      0.000    9.18e+04     1.7e+05
Heating_GasW            1.611e+05   2.22e+04      7.244      0.000    1.17e+05    2.05e+05
Heating_Grav            1.277e+05   2.55e+04      5.007      0.000    7.77e+04    1.78e+05
Heating_OthW            1.226e+05   3.78e+04      3.241      0.001    4.84e+04    1.97e+05
Heating_Wall             1.38e+05   2.75e+04      5.021      0.000    8.41e+04    1.92e+05
Heating QC_Fa          -1.527e+04   6553.711     -2.330      0.020   -2.81e+04   -2414.048
Heating QC_Gd          -6667.1623   2992.798     -2.228      0.026   -1.25e+04    -795.878
Heating QC_Po           3753.1146   3.87e+04      0.097      0.923   -7.22e+04    7.97e+04
Heating QC_TA          -1.347e+04   2981.541     -4.519      0.000   -1.93e+04   -7624.306
Kitchen Qual_Fa        -7.084e+04   9187.102     -7.711      0.000   -8.89e+04   -5.28e+04
Kitchen Qual_Gd        -4.806e+04   5397.136     -8.904      0.000   -5.86e+04   -3.75e+04
Kitchen Qual_Po          5.66e-11   3.93e-11      1.441      0.150   -2.05e-11    1.34e-10
Kitchen Qual_TA        -6.727e+04   5931.792    -11.340      0.000   -7.89e+04   -5.56e+04
Functional_Maj2        -6.411e+04   2.18e+04     -2.934      0.003   -1.07e+05   -2.13e+04
Functional_Min1        -2.567e+04   1.36e+04     -1.885      0.060   -5.24e+04    1040.244
Functional_Min2        -2.354e+04   1.37e+04     -1.722      0.085   -5.04e+04    3285.456
Functional_Mod          -1.83e+04   1.52e+04     -1.203      0.229   -4.82e+04    1.16e+04
Functional_Sal         -5.615e+04   3.96e+04     -1.416      0.157   -1.34e+05    2.16e+04
Functional_Sev         -9.226e+04    3.8e+04     -2.427      0.015   -1.67e+05   -1.77e+04
Functional_Typ         -2.368e+04   1.19e+04     -1.982      0.048   -4.71e+04    -239.588
Sale Type_CWD            523.9004   1.62e+04      0.032      0.974   -3.13e+04    3.24e+04
Sale Type_Con          -3.638e+04   3.59e+04     -1.015      0.310   -1.07e+05     3.4e+04
Sale Type_ConLD         1.206e+04   1.21e+04      0.998      0.318   -1.16e+04    3.58e+04
Sale Type_ConLI        -1030.9229   1.73e+04     -0.060      0.952   -3.49e+04    3.29e+04
Sale Type_ConLw        -1.856e+04    1.8e+04     -1.033      0.302   -5.38e+04    1.67e+04
Sale Type_New            750.2250   2.23e+04      0.034      0.973    -4.3e+04    4.45e+04
Sale Type_Oth           8665.0277   1.88e+04      0.462      0.644   -2.82e+04    4.55e+04
Sale Type_VWD           2.215e-11   4.06e-11      0.546      0.585   -5.75e-11    1.02e-10
Sale Type_WD             121.7545   6439.496      0.019      0.985   -1.25e+04    1.28e+04
Sale Condition_AdjLand  6863.9745   1.95e+04      0.351      0.725   -3.15e+04    4.52e+04
Sale Condition_Alloca   2.228e+04   1.19e+04      1.867      0.062   -1130.113    4.57e+04
Sale Condition_Family   8252.8469   8775.724      0.940      0.347   -8963.408    2.55e+04
Sale Condition_Normal   1.338e+04   4419.296      3.028      0.003    4710.496    2.21e+04
Sale Condition_Partial  2.675e+04   2.14e+04      1.248      0.212   -1.53e+04    6.88e+04
Bsmt Qual_Fa           -5.658e+04   8527.603     -6.635      0.000   -7.33e+04   -3.99e+04
Bsmt Qual_Gd           -4.634e+04   4848.777     -9.557      0.000   -5.59e+04   -3.68e+04
Bsmt Qual_Po           -5.846e+04   3.24e+04     -1.802      0.072   -1.22e+05    5193.998
Bsmt Qual_TA           -5.738e+04   5874.823     -9.767      0.000   -6.89e+04   -4.59e+04
Bsmt Cond_Fa           -1.145e+04   2.61e+04     -0.438      0.661   -6.27e+04    3.98e+04
Bsmt Cond_Gd           -7222.2303    2.6e+04     -0.278      0.781   -5.82e+04    4.38e+04
Bsmt Cond_Po           -8412.8794    4.4e+04     -0.191      0.848   -9.48e+04    7.79e+04
Bsmt Cond_TA           -9877.2139   2.55e+04     -0.387      0.699      -6e+04    4.02e+04
BsmtFin Type 1_BLQ      5070.6075   4208.352      1.205      0.228   -3185.358    1.33e+04
BsmtFin Type 1_GLQ      1.358e+04   3604.645      3.767      0.000    6505.651    2.06e+04
BsmtFin Type 1_LwQ       406.6923   5158.051      0.079      0.937   -9712.396    1.05e+04
BsmtFin Type 1_Rec     -1928.0510   4094.445     -0.471      0.638   -9960.553    6104.451
BsmtFin Type 1_Unf     -9188.9446   3416.012     -2.690      0.007   -1.59e+04   -2487.396
BsmtFin Type 2_BLQ     -5489.7372   9494.822     -0.578      0.563   -2.41e+04    1.31e+04
BsmtFin Type 2_GLQ      9316.2355   1.24e+04      0.752      0.452    -1.5e+04    3.36e+04
BsmtFin Type 2_LwQ     -2532.2160   9417.837     -0.269      0.788    -2.1e+04    1.59e+04
BsmtFin Type 2_Rec     -1.057e+04   9362.503     -1.129      0.259   -2.89e+04    7795.787
BsmtFin Type 2_Unf     -5783.4160   7646.558     -0.756      0.450   -2.08e+04    9217.637
Electrical_FuseF       -1.321e+04   8654.824     -1.527      0.127   -3.02e+04    3766.981
Electrical_FuseP        1.735e+04   1.67e+04      1.042      0.298   -1.53e+04       5e+04
Electrical_Mix          1.423e-11    9.4e-12      1.514      0.130   -4.21e-12    3.27e-11
Electrical_SBrkr        7584.6268   4270.795      1.776      0.076    -793.839     1.6e+04
Garage Type_Attchd      3001.5896   1.39e+04      0.216      0.829   -2.42e+04    3.02e+04
Garage Type_Basment    -3002.0647   1.72e+04     -0.175      0.861   -3.67e+04    3.07e+04
Garage Type_BuiltIn     1.908e+04   1.46e+04      1.309      0.191   -9519.408    4.77e+04
Garage Type_CarPort    -1.969e+04   2.06e+04     -0.955      0.340   -6.01e+04    2.07e+04
Garage Type_Detchd     -8135.3689   1.39e+04     -0.585      0.559   -3.54e+04    1.91e+04
Garage Qual_Fa         -2.174e+05   4.66e+04     -4.668      0.000   -3.09e+05   -1.26e+05
Garage Qual_Gd         -1.873e+05   4.74e+04     -3.949      0.000    -2.8e+05   -9.43e+04
Garage Qual_Po         -1.975e+05   5.39e+04     -3.662      0.000   -3.03e+05   -9.17e+04
Garage Qual_TA         -2.053e+05   4.62e+04     -4.446      0.000   -2.96e+05   -1.15e+05
Garage Cond_Fa           1.79e+05   2.36e+04      7.598      0.000    1.33e+05    2.25e+05
Garage Cond_Gd          1.682e+05    2.6e+04      6.470      0.000    1.17e+05    2.19e+05
Garage Cond_Po          1.595e+05   2.67e+04      5.978      0.000    1.07e+05    2.12e+05
Garage Cond_TA          1.735e+05   2.31e+04      7.513      0.000    1.28e+05    2.19e+05
==============================================================================
Omnibus:                      188.369   Durbin-Watson:                   1.923
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              711.512
Skew:                           0.584   Prob(JB):                    3.14e-155
Kurtosis:                       6.208   Cond. No.                     1.16e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.73e-28. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Initial Adjusted R-Squared Result for Categorical Variables

I would say an initial adjusted R-Squared value of 79.8% for categorical type data is quite good.

However, there were many dummy columns: 184. I looked over the P>|t| values in the output table above and decided to remove dummy columns that had 'weak' values greater than 0.050.

In [21]:
text_df4 = text_df3.drop(['Condition 1', 'Condition 2', 'Lot Config',\
                          'House Style', 'Roof Style', 'Exterior 2nd',\
                          'Exter Cond', 'Functional', 'Electrical',\
                          'Foundation', 'Garage Type', 'Sale Type',\
                          'Sale Condition', 'BsmtFin Type 2','Bsmt Cond'\
                         ], axis = 1)
print(len(text_df4))
print(text_df4.columns)

Y = df2['SalePrice']

X = pd.get_dummies(data=text_df4, drop_first=True)
print('There are', len(X.columns), 'dummy columns created.', '\n')

from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .50, random_state = 40)

regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)


model = sm.OLS(Y_test, X_test).fit()
predictions = model.predict(X)

print_model = model.summary()
print(print_model)
2930
Index(['MS Zoning', 'Neighborhood', 'Bldg Type', 'Roof Matl', 'Exterior 1st',
       'Heating', 'Heating QC', 'Kitchen Qual', 'Bsmt Qual', 'BsmtFin Type 1',
       'Garage Qual', 'Garage Cond'],
      dtype='object')
There are 89 dummy columns created. 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.781
Model:                            OLS   Adj. R-squared:                  0.768
Method:                 Least Squares   F-statistic:                     59.45
Date:                Wed, 14 Jul 2021   Prob (F-statistic):               0.00
Time:                        16:33:12   Log-Likelihood:                -17433.
No. Observations:                1465   AIC:                         3.503e+04
Df Residuals:                    1381   BIC:                         3.548e+04
Df Model:                          83                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
MS Zoning_C (all)     7.855e+04   1.53e+04      5.147      0.000    4.86e+04    1.08e+05
MS Zoning_FV          1.119e+05   1.38e+04      8.079      0.000    8.47e+04    1.39e+05
MS Zoning_I (all)     1.068e+05   3.53e+04      3.022      0.003    3.75e+04    1.76e+05
MS Zoning_RH          1.025e+05   1.46e+04      6.996      0.000    7.37e+04    1.31e+05
MS Zoning_RL          1.132e+05   1.14e+04      9.905      0.000    9.08e+04    1.36e+05
MS Zoning_RM          1.188e+05   1.17e+04     10.170      0.000    9.59e+04    1.42e+05
Neighborhood_Blueste -3909.0526   1.72e+04     -0.227      0.821   -3.77e+04    2.99e+04
Neighborhood_BrDale  -1.985e+04   1.57e+04     -1.267      0.205   -5.06e+04    1.09e+04
Neighborhood_BrkSide -6.404e+04   1.28e+04     -4.999      0.000   -8.92e+04   -3.89e+04
Neighborhood_ClearCr -7503.7012   1.37e+04     -0.548      0.584   -3.44e+04    1.93e+04
Neighborhood_CollgCr -3.027e+04   1.06e+04     -2.848      0.004   -5.11e+04   -9424.382
Neighborhood_Crawfor  1.195e+04   1.17e+04      1.022      0.307    -1.1e+04    3.49e+04
Neighborhood_Edwards -6.523e+04   1.14e+04     -5.717      0.000   -8.76e+04   -4.28e+04
Neighborhood_Gilbert -2.759e+04   1.11e+04     -2.477      0.013   -4.94e+04   -5738.240
Neighborhood_Greens   4.245e+04   3.81e+04      1.113      0.266   -3.23e+04    1.17e+05
Neighborhood_GrnHill  1.543e-09   2.66e-10      5.801      0.000    1.02e-09    2.06e-09
Neighborhood_IDOTRR  -7.089e+04   1.41e+04     -5.030      0.000   -9.85e+04   -4.32e+04
Neighborhood_Landmrk -6480.9426   4.01e+04     -0.162      0.872   -8.51e+04    7.21e+04
Neighborhood_MeadowV -6.335e+04   1.62e+04     -3.922      0.000    -9.5e+04   -3.17e+04
Neighborhood_Mitchel -4.799e+04   1.16e+04     -4.137      0.000   -7.08e+04   -2.52e+04
Neighborhood_NAmes   -4.758e+04   1.12e+04     -4.259      0.000   -6.95e+04   -2.57e+04
Neighborhood_NPkVill  1939.1548   1.59e+04      0.122      0.903   -2.93e+04    3.32e+04
Neighborhood_NWAmes  -2.142e+04   1.18e+04     -1.814      0.070   -4.46e+04    1739.797
Neighborhood_NoRidge  6.321e+04   1.21e+04      5.244      0.000    3.96e+04    8.69e+04
Neighborhood_NridgHt  2.447e+04   1.11e+04      2.204      0.028    2688.219    4.62e+04
Neighborhood_OldTown -6.391e+04   1.29e+04     -4.956      0.000   -8.92e+04   -3.86e+04
Neighborhood_SWISU   -5.158e+04   1.34e+04     -3.862      0.000   -7.78e+04   -2.54e+04
Neighborhood_Sawyer  -4.915e+04   1.17e+04     -4.216      0.000    -7.2e+04   -2.63e+04
Neighborhood_SawyerW -3.718e+04   1.14e+04     -3.265      0.001   -5.95e+04   -1.48e+04
Neighborhood_Somerst -8254.2430   1.26e+04     -0.656      0.512   -3.29e+04    1.64e+04
Neighborhood_StoneBr  6.476e+04   1.24e+04      5.233      0.000    4.05e+04     8.9e+04
Neighborhood_Timber  -5745.4786   1.19e+04     -0.483      0.629   -2.91e+04    1.76e+04
Neighborhood_Veenker  1.301e+04   1.62e+04      0.805      0.421   -1.87e+04    4.47e+04
Bldg Type_2fmCon     -5685.7843   6903.046     -0.824      0.410   -1.92e+04    7855.805
Bldg Type_Duplex     -5991.4774   5720.495     -1.047      0.295   -1.72e+04    5230.322
Bldg Type_Twnhs      -6.448e+04   7915.962     -8.146      0.000      -8e+04    -4.9e+04
Bldg Type_TwnhsE     -5.284e+04   5194.145    -10.173      0.000    -6.3e+04   -4.27e+04
Roof Matl_CompShg     7.702e+04   1.39e+04      5.549      0.000    4.98e+04    1.04e+05
Roof Matl_Membran      1.08e+05   3.55e+04      3.043      0.002    3.84e+04    1.78e+05
Roof Matl_Metal         6.3e+04   3.55e+04      1.774      0.076   -6665.233    1.33e+05
Roof Matl_Roll        1.064e+05   3.67e+04      2.897      0.004    3.44e+04    1.79e+05
Roof Matl_Tar&Grv      9.96e+04   1.74e+04      5.733      0.000    6.55e+04    1.34e+05
Roof Matl_WdShake     1.176e+05   1.99e+04      5.900      0.000    7.85e+04    1.57e+05
Roof Matl_WdShngl     1.483e+05   1.91e+04      7.777      0.000    1.11e+05    1.86e+05
Exterior 1st_AsphShn  5992.4559   3.84e+04      0.156      0.876   -6.93e+04    8.13e+04
Exterior 1st_BrkComm  3.728e+04    2.1e+04      1.776      0.076   -3901.442    7.85e+04
Exterior 1st_BrkFace  3.682e+04   1.04e+04      3.554      0.000    1.65e+04    5.71e+04
Exterior 1st_CBlock   8.831e+04   3.57e+04      2.473      0.014    1.82e+04    1.58e+05
Exterior 1st_CemntBd  4.514e+04   1.07e+04      4.237      0.000    2.42e+04     6.6e+04
Exterior 1st_HdBoard  1.536e+04   9140.801      1.680      0.093   -2572.690    3.33e+04
Exterior 1st_ImStucc -2.453e+04   3.83e+04     -0.641      0.521   -9.96e+04    5.05e+04
Exterior 1st_MetalSd  1.898e+04   8822.900      2.151      0.032    1670.512    3.63e+04
Exterior 1st_Plywood  2.563e+04   9770.808      2.623      0.009    6459.394    4.48e+04
Exterior 1st_PreCast  1.308e+05    4.1e+04      3.193      0.001    5.04e+04    2.11e+05
Exterior 1st_Stone    -4.04e+04   3.86e+04     -1.048      0.295   -1.16e+05    3.53e+04
Exterior 1st_Stucco   1.331e+04   1.12e+04      1.187      0.235   -8682.606    3.53e+04
Exterior 1st_VinylSd  1.974e+04   8974.599      2.200      0.028    2139.127    3.73e+04
Exterior 1st_Wd Sdng  1.781e+04   8888.009      2.003      0.045     370.057    3.52e+04
Exterior 1st_WdShing  9286.0695   1.12e+04      0.829      0.407   -1.27e+04    3.13e+04
Heating_GasA          1.486e+05   1.48e+04     10.048      0.000     1.2e+05    1.78e+05
Heating_GasW          1.659e+05   1.77e+04      9.397      0.000    1.31e+05       2e+05
Heating_Grav          1.211e+05    2.2e+04      5.499      0.000    7.79e+04    1.64e+05
Heating_OthW          1.809e+05   3.55e+04      5.089      0.000    1.11e+05    2.51e+05
Heating_Wall          1.035e+05   2.18e+04      4.748      0.000    6.07e+04    1.46e+05
Heating QC_Fa        -1.965e+04   6503.085     -3.022      0.003   -3.24e+04   -6896.695
Heating QC_Gd        -1.052e+04   3070.148     -3.426      0.001   -1.65e+04   -4497.127
Heating QC_Po        -4.995e+04   3.75e+04     -1.331      0.183   -1.24e+05    2.36e+04
Heating QC_TA        -1.699e+04   3024.539     -5.616      0.000   -2.29e+04   -1.11e+04
Kitchen Qual_Fa      -8.664e+04   9345.350     -9.271      0.000   -1.05e+05   -6.83e+04
Kitchen Qual_Gd      -5.822e+04   5570.721    -10.451      0.000   -6.91e+04   -4.73e+04
Kitchen Qual_Po       9.502e-12   2.04e-11      0.466      0.641   -3.05e-11    4.95e-11
Kitchen Qual_TA      -8.256e+04   6056.496    -13.632      0.000   -9.44e+04   -7.07e+04
Bsmt Qual_Fa         -6.956e+04   8480.149     -8.202      0.000   -8.62e+04   -5.29e+04
Bsmt Qual_Gd         -5.182e+04   4975.917    -10.414      0.000   -6.16e+04   -4.21e+04
Bsmt Qual_Po         -4.957e+04   3.26e+04     -1.520      0.129   -1.14e+05    1.44e+04
Bsmt Qual_TA         -6.735e+04   5890.731    -11.433      0.000   -7.89e+04   -5.58e+04
BsmtFin Type 1_BLQ    7391.9778   4240.091      1.743      0.081    -925.738    1.57e+04
BsmtFin Type 1_GLQ    7554.7384   3612.208      2.091      0.037     468.730    1.46e+04
BsmtFin Type 1_LwQ     941.2438   5122.629      0.184      0.854   -9107.732     1.1e+04
BsmtFin Type 1_Rec     -99.0146   4103.114     -0.024      0.981   -8148.024    7949.995
BsmtFin Type 1_Unf   -9216.8850   3397.949     -2.712      0.007   -1.59e+04   -2551.185
Garage Qual_Fa       -2.071e+05    4.4e+04     -4.710      0.000   -2.93e+05   -1.21e+05
Garage Qual_Gd       -1.757e+05    4.5e+04     -3.902      0.000   -2.64e+05   -8.74e+04
Garage Qual_Po       -1.931e+05   5.16e+04     -3.744      0.000   -2.94e+05   -9.19e+04
Garage Qual_TA        -1.95e+05   4.36e+04     -4.471      0.000   -2.81e+05   -1.09e+05
Garage Cond_Fa        1.779e+05   1.72e+04     10.335      0.000    1.44e+05    2.12e+05
Garage Cond_Gd        1.768e+05   2.05e+04      8.618      0.000    1.37e+05    2.17e+05
Garage Cond_Po        1.833e+05   2.14e+04      8.577      0.000    1.41e+05    2.25e+05
Garage Cond_TA         1.82e+05   1.67e+04     10.922      0.000    1.49e+05    2.15e+05
==============================================================================
Omnibus:                      184.219   Durbin-Watson:                   1.909
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              589.136
Skew:                           0.622   Prob(JB):                    1.18e-128
Kurtosis:                       5.847   Cond. No.                     1.29e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.16e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Second Adjusted R-Squared Result for Categorical Variables

Reducing the number of dummy columns from 184 to 89, reduced the adjusted R-squared from 79.9% to 76.9%.

I would consider this not too big of a loss in return for substantially reducing the number of dummy cvolumns.

Categorical Data Within Sale Condition = Normal

In [22]:
# filter categorical data to only within Sale Condition = 'Normal'.
text_df3['SalePrice'] = df2['SalePrice']
text_df3_normal = text_df3[text_df3['Sale Condition'] == 'Normal']
print(text_df3_normal['Sale Condition'].value_counts(dropna=False))

text_df4_normal = text_df3_normal.drop(['Condition 1', 'Condition 2', 'Lot Config',\
                          'House Style', 'Roof Style', 'Exterior 2nd',\
                          'Exter Cond', 'Functional', 'Electrical',\
                          'Foundation', 'Garage Type', 'Sale Type',\
                          'Sale Condition', 'BsmtFin Type 2','Bsmt Cond'\
                         ], axis = 1)

print(len(text_df4_normal))

text_df5_normal = text_df4_normal.drop(['SalePrice'], axis=1)
print(text_df5_normal.columns)
Y = text_df4_normal['SalePrice']

X2 = pd.get_dummies(data=text_df5_normal, drop_first=True)
print('There are', len(X2.columns), 'dummy columns created.', '\n')

from sklearn import linear_model
from sklearn.model_selection import train_test_split

X2_train, X2_test, Y_train, Y_test = train_test_split(X2, Y, test_size = .50, random_state = 40)

regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X2_train, Y_train)
predicted = regr.predict(X2_test)


model = sm.OLS(Y_test, X2_test).fit()
predictions = model.predict(X2)

print_model = model.summary()
print(print_model)
Normal    2413
Name: Sale Condition, dtype: int64
2413
Index(['MS Zoning', 'Neighborhood', 'Bldg Type', 'Roof Matl', 'Exterior 1st',
       'Heating', 'Heating QC', 'Kitchen Qual', 'Bsmt Qual', 'BsmtFin Type 1',
       'Garage Qual', 'Garage Cond'],
      dtype='object')
There are 87 dummy columns created. 

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.733
Method:                 Least Squares   F-statistic:                     43.48
Date:                Wed, 14 Jul 2021   Prob (F-statistic):          4.94e-285
Time:                        16:33:12   Log-Likelihood:                -14351.
No. Observations:                1207   AIC:                         2.886e+04
Df Residuals:                    1128   BIC:                         2.926e+04
Df Model:                          78                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
MS Zoning_C (all)     1.085e+05    2.5e+04      4.345      0.000    5.95e+04    1.58e+05
MS Zoning_FV          1.145e+05   2.03e+04      5.648      0.000    7.47e+04    1.54e+05
MS Zoning_I (all)     1.126e+05   3.91e+04      2.883      0.004     3.6e+04    1.89e+05
MS Zoning_RH          1.268e+05   1.91e+04      6.634      0.000    8.93e+04    1.64e+05
MS Zoning_RL          1.242e+05   1.65e+04      7.517      0.000    9.18e+04    1.57e+05
MS Zoning_RM          1.213e+05   1.72e+04      7.050      0.000    8.76e+04    1.55e+05
Neighborhood_Blueste -2.227e+04   2.15e+04     -1.035      0.301   -6.45e+04       2e+04
Neighborhood_BrDale  -3.424e+04   1.81e+04     -1.896      0.058   -6.97e+04    1184.675
Neighborhood_BrkSide -7.483e+04   1.51e+04     -4.949      0.000   -1.04e+05   -4.52e+04
Neighborhood_ClearCr -1.411e+04   1.51e+04     -0.936      0.350   -4.37e+04    1.55e+04
Neighborhood_CollgCr -4.333e+04   1.28e+04     -3.393      0.001   -6.84e+04   -1.83e+04
Neighborhood_Crawfor -2.362e+04   1.43e+04     -1.654      0.098   -5.16e+04    4396.245
Neighborhood_Edwards -7.059e+04   1.34e+04     -5.264      0.000   -9.69e+04   -4.43e+04
Neighborhood_Gilbert -4.151e+04   1.34e+04     -3.100      0.002   -6.78e+04   -1.52e+04
Neighborhood_Greens   -348.2828   2.26e+04     -0.015      0.988   -4.47e+04     4.4e+04
Neighborhood_GrnHill  7.842e+04   2.91e+04      2.698      0.007    2.14e+04    1.35e+05
Neighborhood_IDOTRR  -7.771e+04   1.71e+04     -4.532      0.000   -1.11e+05   -4.41e+04
Neighborhood_Landmrk  1.171e-10    6.5e-11      1.800      0.072   -1.05e-11    2.45e-10
Neighborhood_MeadowV -7.688e+04   1.81e+04     -4.257      0.000   -1.12e+05   -4.14e+04
Neighborhood_Mitchel  -5.87e+04   1.39e+04     -4.236      0.000   -8.59e+04   -3.15e+04
Neighborhood_NAmes   -6.244e+04   1.33e+04     -4.708      0.000   -8.85e+04   -3.64e+04
Neighborhood_NPkVill -2.678e+04   1.67e+04     -1.602      0.109   -5.96e+04    6012.781
Neighborhood_NWAmes  -3.258e+04    1.4e+04     -2.330      0.020      -6e+04   -5147.418
Neighborhood_NoRidge  5.408e+04    1.4e+04      3.863      0.000    2.66e+04    8.15e+04
Neighborhood_NridgHt  1.426e+04   1.31e+04      1.086      0.278   -1.15e+04       4e+04
Neighborhood_OldTown -6.951e+04   1.54e+04     -4.517      0.000   -9.97e+04   -3.93e+04
Neighborhood_SWISU    -7.07e+04   1.58e+04     -4.485      0.000   -1.02e+05   -3.98e+04
Neighborhood_Sawyer   -7.12e+04   1.38e+04     -5.155      0.000   -9.83e+04   -4.41e+04
Neighborhood_SawyerW -5.362e+04   1.33e+04     -4.041      0.000   -7.96e+04   -2.76e+04
Neighborhood_Somerst -5235.2425   1.66e+04     -0.315      0.753   -3.79e+04    2.74e+04
Neighborhood_StoneBr  2.043e+04   1.42e+04      1.444      0.149   -7339.072    4.82e+04
Neighborhood_Timber  -1.065e+04   1.46e+04     -0.730      0.465   -3.93e+04     1.8e+04
Neighborhood_Veenker -9760.4948   1.96e+04     -0.498      0.619   -4.82e+04    2.87e+04
Bldg Type_2fmCon     -1.274e+04   7542.857     -1.689      0.091   -2.75e+04    2056.776
Bldg Type_Duplex      -772.1272   6664.154     -0.116      0.908   -1.38e+04    1.23e+04
Bldg Type_Twnhs      -5.756e+04   8138.598     -7.073      0.000   -7.35e+04   -4.16e+04
Bldg Type_TwnhsE     -5.539e+04   6042.165     -9.166      0.000   -6.72e+04   -4.35e+04
Roof Matl_Membran    -2.742e-10   6.99e-11     -3.922      0.000   -4.11e-10   -1.37e-10
Roof Matl_Metal        -1.8e-10   9.73e-11     -1.850      0.065   -3.71e-10    1.09e-11
Roof Matl_Roll        1.971e+04   4.02e+04      0.491      0.624   -5.91e+04    9.85e+04
Roof Matl_Tar&Grv     3.238e+04   1.28e+04      2.532      0.011    7289.735    5.75e+04
Roof Matl_WdShake     8511.2381   1.69e+04      0.504      0.614   -2.46e+04    4.16e+04
Roof Matl_WdShngl     2.785e+04   2.64e+04      1.056      0.291   -2.39e+04    7.96e+04
Exterior 1st_AsphShn -2.173e+04   2.89e+04     -0.753      0.452   -7.84e+04    3.49e+04
Exterior 1st_BrkComm  2.964e+04   3.06e+04      0.969      0.333   -3.04e+04    8.97e+04
Exterior 1st_BrkFace  4.041e+04   1.27e+04      3.184      0.001    1.55e+04    6.53e+04
Exterior 1st_CBlock   8.276e+04   3.79e+04      2.181      0.029    8314.362    1.57e+05
Exterior 1st_CemntBd  3.042e+04    1.4e+04      2.177      0.030    3002.746    5.78e+04
Exterior 1st_HdBoard  7463.6995   1.19e+04      0.626      0.532   -1.59e+04    3.09e+04
Exterior 1st_ImStucc -3.786e+04   3.88e+04     -0.976      0.329   -1.14e+05    3.82e+04
Exterior 1st_MetalSd  8397.7574   1.16e+04      0.724      0.469   -1.44e+04    3.12e+04
Exterior 1st_Plywood  2.042e+04   1.26e+04      1.625      0.104   -4230.240    4.51e+04
Exterior 1st_PreCast  1.853e-10    5.2e-11      3.566      0.000    8.33e-11    2.87e-10
Exterior 1st_Stucco  -5450.2247   1.41e+04     -0.386      0.700   -3.32e+04    2.23e+04
Exterior 1st_VinylSd  8815.1323   1.18e+04      0.748      0.455   -1.43e+04    3.19e+04
Exterior 1st_Wd Sdng  1.152e+04   1.15e+04      0.998      0.319   -1.11e+04    3.42e+04
Exterior 1st_WdShing  -134.6544   1.39e+04     -0.010      0.992   -2.75e+04    2.72e+04
Heating_GasA          2.088e+05    2.8e+04      7.448      0.000    1.54e+05    2.64e+05
Heating_GasW          2.495e+05   2.94e+04      8.473      0.000    1.92e+05    3.07e+05
Heating_Grav          1.856e+05   3.38e+04      5.494      0.000    1.19e+05    2.52e+05
Heating_OthW          5.717e-11   3.84e-11      1.490      0.136   -1.81e-11    1.32e-10
Heating_Wall          1.469e+05   3.52e+04      4.173      0.000    7.78e+04    2.16e+05
Heating QC_Fa        -2.376e+04   7329.521     -3.241      0.001   -3.81e+04   -9376.961
Heating QC_Gd        -1.095e+04   3441.011     -3.182      0.002   -1.77e+04   -4197.537
Heating QC_Po        -3.991e+04   3.76e+04     -1.061      0.289   -1.14e+05    3.39e+04
Heating QC_TA        -1.283e+04   3227.587     -3.975      0.000   -1.92e+04   -6497.862
Kitchen Qual_Fa      -1.034e+05   1.03e+04    -10.036      0.000   -1.24e+05   -8.32e+04
Kitchen Qual_Gd      -5.445e+04   6320.963     -8.614      0.000   -6.69e+04    -4.2e+04
Kitchen Qual_Po      -6.085e-12   1.09e-11     -0.560      0.575   -2.74e-11    1.52e-11
Kitchen Qual_TA      -7.774e+04   6634.536    -11.718      0.000   -9.08e+04   -6.47e+04
Bsmt Qual_Fa         -7.916e+04   9619.650     -8.229      0.000    -9.8e+04   -6.03e+04
Bsmt Qual_Gd         -4.715e+04   5818.418     -8.104      0.000   -5.86e+04   -3.57e+04
Bsmt Qual_Po         -3.609e+04   6.43e+04     -0.561      0.575   -1.62e+05    9.01e+04
Bsmt Qual_TA         -6.296e+04   6669.427     -9.440      0.000    -7.6e+04   -4.99e+04
BsmtFin Type 1_BLQ   -3889.3236   4536.134     -0.857      0.391   -1.28e+04    5010.886
BsmtFin Type 1_GLQ    3728.2450   3918.461      0.951      0.342   -3960.047    1.14e+04
BsmtFin Type 1_LwQ     520.0943   5792.363      0.090      0.928   -1.08e+04    1.19e+04
BsmtFin Type 1_Rec    1508.9451   4611.080      0.327      0.744   -7538.313    1.06e+04
BsmtFin Type 1_Unf   -9197.2286   3779.375     -2.434      0.015   -1.66e+04   -1781.833
Garage Qual_Fa       -8248.8121   2.09e+04     -0.395      0.693   -4.92e+04    3.27e+04
Garage Qual_Gd        3.415e+04   2.25e+04      1.517      0.129      -1e+04    7.83e+04
Garage Qual_Po         -3.7e+04   4.45e+04     -0.832      0.405   -1.24e+05    5.02e+04
Garage Qual_TA        4000.0555   2.03e+04      0.197      0.844   -3.59e+04    4.39e+04
Garage Cond_Fa       -6282.1362   2.12e+04     -0.296      0.767    -4.8e+04    3.54e+04
Garage Cond_Gd       -1.704e+04   2.37e+04     -0.718      0.473   -6.36e+04    2.95e+04
Garage Cond_Po        9512.0120   2.38e+04      0.400      0.689   -3.71e+04    5.62e+04
Garage Cond_TA        6716.6715   2.04e+04      0.329      0.742   -3.33e+04    4.67e+04
==============================================================================
Omnibus:                      276.095   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1418.389
Skew:                           0.963   Prob(JB):                    1.00e-308
Kurtosis:                       7.949   Cond. No.                     1.37e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.2e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Result within 'Normal' Sale Condition for Categorical Data

It's interesting that there was a slight reduction in Adjusted R-Squared value (76.9% to 73.2%) for categorical data whereas for numeical data there was an increase.

I don't have an explanation for that.

Combine Numerical and Categorical Data

In [23]:
df2_numerical = df2[features]

df_combined = pd.concat([df2_numerical, X], axis=1)
print('There are', len(df_combined.columns), 'numerical and dummy columns created.', '\n')

X = df_combined
Y = df2['SalePrice']

from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .50, random_state = 40)

regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)


model = sm.OLS(Y_test, X_test).fit()
predictions = model.predict(X)
R2 = model.rsquared
print_model = model.summary()
print(R2)
print(print_model)
There are 99 numerical and dummy columns created. 

0.9091012292453284
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.909
Model:                            OLS   Adj. R-squared:                  0.903
Method:                 Least Squares   F-statistic:                     147.4
Date:                Wed, 14 Jul 2021   Prob (F-statistic):               0.00
Time:                        16:33:12   Log-Likelihood:                -16790.
No. Observations:                1465   AIC:                         3.377e+04
Df Residuals:                    1371   BIC:                         3.427e+04
Df Model:                          93                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Open Porch SF           20.3018     10.768      1.885      0.060      -0.822      41.425
Wood Deck SF            16.0119      5.834      2.745      0.006       4.568      27.456
BsmtFin SF 1            18.5239      2.372      7.808      0.000      13.870      23.178
Fireplaces            4041.6344   1265.329      3.194      0.001    1559.443    6523.826
TotRms AbvGrd         2622.5292    661.540      3.964      0.000    1324.788    3920.270
Mas Vnr Area            24.6705      4.902      5.033      0.000      15.054      34.287
Full Bath             5464.7134   1903.616      2.871      0.004    1730.397    9199.029
Garage Area             33.7477      4.075      8.281      0.000      25.753      41.742
Total_SF                28.7634      1.826     15.754      0.000      25.182      32.345
Overall Qual           1.03e+04    897.242     11.474      0.000    8535.001    1.21e+04
MS Zoning_C (all)      2.54e+04      1e+04      2.536      0.011    5752.664    4.51e+04
MS Zoning_FV          5.107e+04   9283.669      5.501      0.000    3.29e+04    6.93e+04
MS Zoning_I (all)      1.55e+04    2.3e+04      0.673      0.501   -2.97e+04    6.07e+04
MS Zoning_RH          3.491e+04   9722.546      3.591      0.000    1.58e+04     5.4e+04
MS Zoning_RL          3.836e+04   7758.322      4.945      0.000    2.31e+04    5.36e+04
MS Zoning_RM           3.61e+04   7987.933      4.519      0.000    2.04e+04    5.18e+04
Neighborhood_Blueste  1.264e+04   1.12e+04      1.126      0.261   -9390.840    3.47e+04
Neighborhood_BrDale   1347.1087   1.04e+04      0.130      0.897    -1.9e+04    2.17e+04
Neighborhood_BrkSide -8867.8292   8450.784     -1.049      0.294   -2.54e+04    7710.038
Neighborhood_ClearCr  7647.9629   8955.433      0.854      0.393   -9919.872    2.52e+04
Neighborhood_CollgCr -5542.2506   6986.466     -0.793      0.428   -1.92e+04    8163.071
Neighborhood_Crawfor  2.284e+04   7644.191      2.988      0.003    7845.442    3.78e+04
Neighborhood_Edwards  -1.35e+04   7572.481     -1.783      0.075   -2.84e+04    1353.648
Neighborhood_Gilbert  -590.9699   7282.476     -0.081      0.935   -1.49e+04    1.37e+04
Neighborhood_Greens   3.157e+04   2.48e+04      1.273      0.203   -1.71e+04    8.02e+04
Neighborhood_GrnHill -3.803e-10   9.24e-11     -4.114      0.000   -5.62e-10   -1.99e-10
Neighborhood_IDOTRR  -1.515e+04   9261.233     -1.636      0.102   -3.33e+04    3017.459
Neighborhood_Landmrk  5061.9801    2.6e+04      0.195      0.846   -4.59e+04     5.6e+04
Neighborhood_MeadowV -1.113e+04   1.06e+04     -1.047      0.295    -3.2e+04    9724.297
Neighborhood_Mitchel -7522.1580   7655.764     -0.983      0.326   -2.25e+04    7496.123
Neighborhood_NAmes   -1.199e+04   7397.484     -1.621      0.105   -2.65e+04    2523.067
Neighborhood_NPkVill  7502.2339   1.04e+04      0.724      0.469   -1.28e+04    2.78e+04
Neighborhood_NWAmes   -2.11e+04   7725.641     -2.731      0.006   -3.63e+04   -5942.982
Neighborhood_NoRidge  2.307e+04   8058.976      2.863      0.004    7264.624    3.89e+04
Neighborhood_NridgHt  8619.0458   7339.920      1.174      0.240   -5779.645     2.3e+04
Neighborhood_OldTown -1.822e+04   8472.203     -2.151      0.032   -3.48e+04   -1600.793
Neighborhood_SWISU   -1.071e+04   8769.045     -1.222      0.222   -2.79e+04    6489.568
Neighborhood_Sawyer  -6494.5450   7709.365     -0.842      0.400   -2.16e+04    8628.885
Neighborhood_SawyerW -4023.6034   7468.926     -0.539      0.590   -1.87e+04    1.06e+04
Neighborhood_Somerst -7900.8433   8241.671     -0.959      0.338   -2.41e+04    8266.808
Neighborhood_StoneBr  4.274e+04   8087.805      5.285      0.000    2.69e+04    5.86e+04
Neighborhood_Timber  -4207.0559   7778.087     -0.541      0.589   -1.95e+04    1.11e+04
Neighborhood_Veenker -1.888e+04   1.06e+04     -1.774      0.076   -3.98e+04    1999.795
Bldg Type_2fmCon     -1.466e+04   4555.086     -3.219      0.001   -2.36e+04   -5727.760
Bldg Type_Duplex     -2.536e+04   4036.460     -6.283      0.000   -3.33e+04   -1.74e+04
Bldg Type_Twnhs      -2.879e+04   5253.901     -5.480      0.000   -3.91e+04   -1.85e+04
Bldg Type_TwnhsE     -2.433e+04   3537.217     -6.879      0.000   -3.13e+04   -1.74e+04
Roof Matl_CompShg     1.929e+04   9241.513      2.087      0.037    1156.717    3.74e+04
Roof Matl_Membran     6.771e+04   2.31e+04      2.928      0.003    2.24e+04    1.13e+05
Roof Matl_Metal       2.369e+04   2.32e+04      1.021      0.308   -2.18e+04    6.92e+04
Roof Matl_Roll        1.599e+04   2.41e+04      0.663      0.507   -3.13e+04    6.33e+04
Roof Matl_Tar&Grv     3.698e+04   1.14e+04      3.236      0.001    1.46e+04    5.94e+04
Roof Matl_WdShake     3.233e+04   1.31e+04      2.468      0.014    6632.421     5.8e+04
Roof Matl_WdShngl     6.258e+04   1.28e+04      4.872      0.000    3.74e+04    8.78e+04
Exterior 1st_AsphShn  2.096e+04   2.49e+04      0.841      0.400   -2.79e+04    6.99e+04
Exterior 1st_BrkComm -4506.4695   1.37e+04     -0.330      0.742   -3.13e+04    2.23e+04
Exterior 1st_BrkFace  1.311e+04   6778.901      1.934      0.053    -184.361    2.64e+04
Exterior 1st_CBlock   5.722e+04   2.32e+04      2.461      0.014    1.16e+04    1.03e+05
Exterior 1st_CemntBd  1.413e+04   6970.717      2.027      0.043     456.150    2.78e+04
Exterior 1st_HdBoard -5052.6089   5957.403     -0.848      0.397   -1.67e+04    6634.003
Exterior 1st_ImStucc -2.622e+04    2.5e+04     -1.049      0.295   -7.53e+04    2.28e+04
Exterior 1st_MetalSd   673.7221   5746.047      0.117      0.907   -1.06e+04    1.19e+04
Exterior 1st_Plywood -2347.2648   6374.201     -0.368      0.713   -1.49e+04    1.02e+04
Exterior 1st_PreCast  7.743e+04   2.67e+04      2.903      0.004    2.51e+04     1.3e+05
Exterior 1st_Stone   -1.233e+04    2.5e+04     -0.492      0.623   -6.15e+04    3.68e+04
Exterior 1st_Stucco   2544.5360   7296.113      0.349      0.727   -1.18e+04    1.69e+04
Exterior 1st_VinylSd  -321.3025   5852.141     -0.055      0.956   -1.18e+04    1.12e+04
Exterior 1st_Wd Sdng -5321.5633   5809.087     -0.916      0.360   -1.67e+04    6074.098
Exterior 1st_WdShing -5007.7650   7276.786     -0.688      0.491   -1.93e+04    9267.076
Heating_GasA          5.297e+04   9924.084      5.337      0.000    3.35e+04    7.24e+04
Heating_GasW          4.977e+04   1.19e+04      4.184      0.000    2.64e+04    7.31e+04
Heating_Grav          4.248e+04   1.45e+04      2.935      0.003    1.41e+04    7.09e+04
Heating_OthW          4.252e+04   2.34e+04      1.814      0.070   -3467.524    8.85e+04
Heating_Wall          7.082e+04   1.44e+04      4.923      0.000    4.26e+04     9.9e+04
Heating QC_Fa        -1.076e+04   4226.869     -2.546      0.011   -1.91e+04   -2471.661
Heating QC_Gd        -2756.7916   2004.725     -1.375      0.169   -6689.451    1175.868
Heating QC_Po        -1.514e+04   2.43e+04     -0.623      0.533   -6.28e+04    3.25e+04
Heating QC_TA        -8192.6848   1976.887     -4.144      0.000   -1.21e+04   -4314.634
Kitchen Qual_Fa        -3.7e+04   6277.411     -5.895      0.000   -4.93e+04   -2.47e+04
Kitchen Qual_Gd      -2.991e+04   3709.966     -8.063      0.000   -3.72e+04   -2.26e+04
Kitchen Qual_Po       3.443e-11   2.01e-11      1.714      0.087   -4.99e-12    7.38e-11
Kitchen Qual_TA      -3.733e+04   4116.143     -9.069      0.000   -4.54e+04   -2.93e+04
Bsmt Qual_Fa         -3.054e+04   5595.860     -5.457      0.000   -4.15e+04   -1.96e+04
Bsmt Qual_Gd         -2.643e+04   3299.828     -8.009      0.000   -3.29e+04      -2e+04
Bsmt Qual_Po         -2420.1087   2.11e+04     -0.114      0.909   -4.39e+04    3.91e+04
Bsmt Qual_TA          -3.06e+04   3938.270     -7.769      0.000   -3.83e+04   -2.29e+04
BsmtFin Type 1_BLQ    -115.4930   2765.819     -0.042      0.967   -5541.189    5310.203
BsmtFin Type 1_GLQ    4015.4595   2371.161      1.693      0.091    -636.037    8666.956
BsmtFin Type 1_LwQ   -5308.4878   3369.415     -1.575      0.115   -1.19e+04    1301.280
BsmtFin Type 1_Rec   -5191.8383   2702.452     -1.921      0.055   -1.05e+04     109.550
BsmtFin Type 1_Unf    -174.3089   2596.436     -0.067      0.946   -5267.727    4919.109
Garage Qual_Fa       -1.264e+05   2.89e+04     -4.372      0.000   -1.83e+05   -6.97e+04
Garage Qual_Gd       -1.166e+05   2.96e+04     -3.944      0.000   -1.75e+05   -5.86e+04
Garage Qual_Po       -1.305e+05   3.37e+04     -3.868      0.000   -1.97e+05   -6.43e+04
Garage Qual_TA       -1.233e+05   2.87e+04     -4.301      0.000    -1.8e+05    -6.7e+04
Garage Cond_Fa        6.408e+04   1.17e+04      5.490      0.000    4.12e+04     8.7e+04
Garage Cond_Gd        6.739e+04   1.37e+04      4.907      0.000    4.04e+04    9.43e+04
Garage Cond_Po        5.565e+04   1.43e+04      3.879      0.000    2.75e+04    8.38e+04
Garage Cond_TA        7.144e+04   1.13e+04      6.298      0.000    4.92e+04    9.37e+04
==============================================================================
Omnibus:                      155.772   Durbin-Watson:                   1.955
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1248.394
Skew:                           0.034   Prob(JB):                    8.22e-272
Kurtosis:                       7.522   Cond. No.                     2.83e+19
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.34e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Result for Combined Numerical and Categorical Data

Wow! A nice increase in Adjusted R-Squared value to 90.3% for the combined data types!

Combine Numerical and Categorical Data within Sale Condition 'Normal'

In [24]:
df2_numerical2 = df2_normal[features]

df_combined = pd.concat([df2_numerical2, X2], axis=1)
print('There are', len(df_combined.columns), 'numerical and dummy columns created.', '\n')

X = df_combined
Y = text_df4_normal['SalePrice']

from sklearn import linear_model
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .50, random_state = 40)

regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)


model = sm.OLS(Y_test, X_test).fit()
predictions = model.predict(X)
R2 = model.rsquared
print_model = model.summary()
print(R2)
print(print_model)
There are 97 numerical and dummy columns created. 

0.9167741975924575
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.917
Model:                            OLS   Adj. R-squared:                  0.910
Method:                 Least Squares   F-statistic:                     139.9
Date:                Wed, 14 Jul 2021   Prob (F-statistic):               0.00
Time:                        16:33:12   Log-Likelihood:                -13688.
No. Observations:                1207   AIC:                         2.755e+04
Df Residuals:                    1118   BIC:                         2.801e+04
Df Model:                          88                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
Open Porch SF           36.8597     10.786      3.417      0.001      15.697      58.023
Wood Deck SF            28.6674      5.418      5.292      0.000      18.038      39.297
BsmtFin SF 1            18.0298      2.559      7.045      0.000      13.008      23.052
Fireplaces            5563.2346   1249.703      4.452      0.000    3111.207    8015.262
TotRms AbvGrd         2424.6744    650.033      3.730      0.000    1149.252    3700.097
Mas Vnr Area            27.2985      4.787      5.703      0.000      17.906      36.691
Full Bath             1759.9739   1837.162      0.958      0.338   -1844.700    5364.648
Garage Area             28.2365      4.173      6.767      0.000      20.049      36.424
Total_SF                31.6868      1.759     18.010      0.000      28.235      35.139
Overall Qual          9434.9449    873.553     10.801      0.000    7720.957    1.11e+04
MS Zoning_C (all)     2720.8656   1.48e+04      0.183      0.855   -2.64e+04    3.19e+04
MS Zoning_FV          2.848e+04   1.21e+04      2.359      0.018    4795.595    5.22e+04
MS Zoning_I (all)    -6105.6821   2.29e+04     -0.266      0.790   -5.11e+04    3.89e+04
MS Zoning_RH          2.357e+04   1.15e+04      2.043      0.041     937.888    4.62e+04
MS Zoning_RL          2.319e+04      1e+04      2.317      0.021    3549.366    4.28e+04
MS Zoning_RM          1.595e+04   1.04e+04      1.528      0.127   -4531.386    3.64e+04
Neighborhood_Blueste  2422.5880   1.26e+04      0.192      0.848   -2.23e+04    2.72e+04
Neighborhood_BrDale    429.4518   1.07e+04      0.040      0.968   -2.06e+04    2.14e+04
Neighborhood_BrkSide -1.215e+04   8943.024     -1.359      0.174   -2.97e+04    5394.958
Neighborhood_ClearCr  -913.7958   8853.141     -0.103      0.918   -1.83e+04    1.65e+04
Neighborhood_CollgCr -1.014e+04   7524.299     -1.348      0.178   -2.49e+04    4619.264
Neighborhood_Crawfor  3596.7432   8382.840      0.429      0.668   -1.29e+04       2e+04
Neighborhood_Edwards -1.452e+04   7963.161     -1.824      0.068   -3.01e+04    1100.896
Neighborhood_Gilbert -5754.1292   7844.948     -0.733      0.463   -2.11e+04    9638.351
Neighborhood_Greens   6110.1197   1.34e+04      0.458      0.647   -2.01e+04    3.23e+04
Neighborhood_GrnHill  1.166e+05    1.7e+04      6.878      0.000    8.33e+04     1.5e+05
Neighborhood_IDOTRR  -1.252e+04   1.01e+04     -1.235      0.217   -3.24e+04    7362.550
Neighborhood_Landmrk  2.547e-11   4.63e-11      0.550      0.583   -6.54e-11    1.16e-10
Neighborhood_MeadowV -3283.5589   1.07e+04     -0.306      0.760   -2.44e+04    1.78e+04
Neighborhood_Mitchel  -1.22e+04   8185.906     -1.491      0.136   -2.83e+04    3857.468
Neighborhood_NAmes   -1.945e+04   7856.447     -2.475      0.013   -3.49e+04   -4032.254
Neighborhood_NPkVill -2301.9005   9749.512     -0.236      0.813   -2.14e+04    1.68e+04
Neighborhood_NWAmes   -2.11e+04   8180.734     -2.579      0.010   -3.71e+04   -5044.493
Neighborhood_NoRidge  2.318e+04   8307.113      2.790      0.005    6878.596    3.95e+04
Neighborhood_NridgHt  7312.4096   7703.803      0.949      0.343   -7803.131    2.24e+04
Neighborhood_OldTown -2.097e+04   9079.112     -2.310      0.021   -3.88e+04   -3157.249
Neighborhood_SWISU   -1.806e+04   9279.565     -1.947      0.052   -3.63e+04     143.048
Neighborhood_Sawyer  -1.488e+04   8198.978     -1.814      0.070    -3.1e+04    1210.927
Neighborhood_SawyerW -1.476e+04   7801.044     -1.892      0.059   -3.01e+04     546.082
Neighborhood_Somerst   375.9867   9718.144      0.039      0.969   -1.87e+04    1.94e+04
Neighborhood_StoneBr  1.576e+04   8283.393      1.902      0.057    -493.936     3.2e+04
Neighborhood_Timber    486.6076   8529.129      0.057      0.955   -1.62e+04    1.72e+04
Neighborhood_Veenker     6.5250   1.15e+04      0.001      1.000   -2.25e+04    2.25e+04
Bldg Type_2fmCon     -1.602e+04   4474.601     -3.580      0.000   -2.48e+04   -7241.156
Bldg Type_Duplex     -2.321e+04   4239.439     -5.476      0.000   -3.15e+04   -1.49e+04
Bldg Type_Twnhs      -2.732e+04   4824.966     -5.663      0.000   -3.68e+04   -1.79e+04
Bldg Type_TwnhsE     -2.268e+04   3678.040     -6.168      0.000   -2.99e+04   -1.55e+04
Roof Matl_Membran    -1.226e-10   3.62e-11     -3.386      0.001   -1.94e-10   -5.16e-11
Roof Matl_Metal      -4.376e-11   4.26e-11     -1.027      0.305   -1.27e-10    3.99e-11
Roof Matl_Roll       -3916.0750   2.34e+04     -0.167      0.867   -4.99e+04    4.21e+04
Roof Matl_Tar&Grv     1.658e+04   7465.319      2.221      0.027    1931.572    3.12e+04
Roof Matl_WdShake     1175.0583   9867.517      0.119      0.905   -1.82e+04    2.05e+04
Roof Matl_WdShngl     2.916e+04   1.56e+04      1.874      0.061   -1366.337    5.97e+04
Exterior 1st_AsphShn  1.445e+04   1.69e+04      0.858      0.391   -1.86e+04    4.75e+04
Exterior 1st_BrkComm  2.537e+04   1.78e+04      1.424      0.155   -9584.071    6.03e+04
Exterior 1st_BrkFace  2.448e+04   7441.494      3.289      0.001    9877.437    3.91e+04
Exterior 1st_CBlock   3.892e+04   2.21e+04      1.761      0.079   -4448.732    8.23e+04
Exterior 1st_CemntBd  8104.1408   8154.578      0.994      0.321   -7895.860    2.41e+04
Exterior 1st_HdBoard -2097.8774   6972.085     -0.301      0.764   -1.58e+04    1.16e+04
Exterior 1st_ImStucc -2.359e+04   2.28e+04     -1.036      0.300   -6.83e+04    2.11e+04
Exterior 1st_MetalSd  4204.2140   6753.348      0.623      0.534   -9046.451    1.75e+04
Exterior 1st_Plywood -1870.4736   7352.423     -0.254      0.799   -1.63e+04    1.26e+04
Exterior 1st_PreCast -1.096e-11   2.76e-11     -0.396      0.692   -6.52e-11    4.33e-11
Exterior 1st_Stucco  -1562.3802   8274.193     -0.189      0.850   -1.78e+04    1.47e+04
Exterior 1st_VinylSd  2888.3940   6883.072      0.420      0.675   -1.06e+04    1.64e+04
Exterior 1st_Wd Sdng -1252.1381   6752.181     -0.185      0.853   -1.45e+04     1.2e+04
Exterior 1st_WdShing  1839.9761   8156.956      0.226      0.822   -1.42e+04    1.78e+04
Heating_GasA          2.805e+04    1.7e+04      1.648      0.100   -5346.703    6.14e+04
Heating_GasW           2.23e+04    1.8e+04      1.237      0.217   -1.31e+04    5.77e+04
Heating_Grav          2.609e+04   2.01e+04      1.299      0.194   -1.33e+04    6.55e+04
Heating_OthW          3.487e-11   1.38e-11      2.532      0.011    7.85e-12    6.19e-11
Heating_Wall           5.03e+04   2.09e+04      2.408      0.016    9308.230    9.13e+04
Heating QC_Fa        -3639.2942   4281.086     -0.850      0.395    -1.2e+04    4760.573
Heating QC_Gd        -1740.8398   2018.133     -0.863      0.389   -5700.594    2218.915
Heating QC_Po         -1.35e+04   2.18e+04     -0.618      0.537   -5.64e+04    2.94e+04
Heating QC_TA        -3367.1913   1890.807     -1.781      0.075   -7077.121     342.738
Kitchen Qual_Fa      -3.782e+04   6246.000     -6.055      0.000   -5.01e+04   -2.56e+04
Kitchen Qual_Gd      -3.038e+04   3726.586     -8.152      0.000   -3.77e+04   -2.31e+04
Kitchen Qual_Po       2.105e-11    1.7e-11      1.241      0.215   -1.22e-11    5.43e-11
Kitchen Qual_TA       -3.77e+04   4001.877     -9.421      0.000   -4.56e+04   -2.99e+04
Bsmt Qual_Fa         -2.399e+04   5737.936     -4.182      0.000   -3.53e+04   -1.27e+04
Bsmt Qual_Gd         -1.792e+04   3490.543     -5.133      0.000   -2.48e+04   -1.11e+04
Bsmt Qual_Po          1.201e+04   3.74e+04      0.321      0.748   -6.13e+04    8.54e+04
Bsmt Qual_TA         -2.274e+04   4049.648     -5.616      0.000   -3.07e+04   -1.48e+04
BsmtFin Type 1_BLQ   -3406.7497   2665.320     -1.278      0.201   -8636.343    1822.844
BsmtFin Type 1_GLQ     944.7496   2300.554      0.411      0.681   -3569.140    5458.639
BsmtFin Type 1_LwQ   -5071.0262   3433.707     -1.477      0.140   -1.18e+04    1666.209
BsmtFin Type 1_Rec   -3492.4030   2728.033     -1.280      0.201   -8845.044    1860.238
BsmtFin Type 1_Unf   -2532.4512   2643.337     -0.958      0.338   -7718.911    2654.009
Garage Qual_Fa         557.7551   1.22e+04      0.046      0.964   -2.35e+04    2.46e+04
Garage Qual_Gd        8025.3972   1.32e+04      0.609      0.542   -1.78e+04    3.39e+04
Garage Qual_Po       -5521.7926   2.59e+04     -0.214      0.831   -5.62e+04    4.52e+04
Garage Qual_TA        2112.1843   1.19e+04      0.177      0.859   -2.12e+04    2.55e+04
Garage Cond_Fa         -17.8318   1.24e+04     -0.001      0.999   -2.44e+04    2.44e+04
Garage Cond_Gd        -767.7169   1.38e+04     -0.055      0.956   -2.79e+04    2.64e+04
Garage Cond_Po         976.9037   1.39e+04      0.070      0.944   -2.63e+04    2.83e+04
Garage Cond_TA        4982.1890   1.19e+04      0.418      0.676   -1.84e+04    2.84e+04
==============================================================================
Omnibus:                      273.816   Durbin-Watson:                   2.075
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2136.856
Skew:                           0.825   Prob(JB):                         0.00
Kurtosis:                       9.306   Cond. No.                     1.28e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.4e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Result for Combined Numerical and Categorical Data within Sale Condition 'Normal'

So, we've reached a maximum Adjusted R-Squared value of 91.0% It looks to be a good sign that for all combined data types regarding the chosen independent variables, there wasn't much difference in total explained variablity whether predicting across all Sale Condition types or just within 'Normal' type.

That's enough analysis for now. Don't want to over anaylze and create pseudo results.

What is Data Leakage in Machine Learning?

Data leakage can cause you to create overly optimistic if not completely invalid predictive models.

Data leakage is when information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and in turn invalidate the estimated performance of the mode being constructed.

How could data leakage pose a problem?

  • "The reality is that as a data scientist, you’re at risk of producing a data leakage situation any time you prepare, clean your data, impute missing values, remove outliers, etc. You might be distorting the data in the process of preparing it to the point that you’ll build a model that works well on your “clean” dataset, but will totally suck when applied in the real-world situation where you actually want to apply it."

Was there any data leakage in this project analysis? I can't say for sure as I tried to prevent it.

Create Summary Table of R-Squared Values

In [25]:
pd.set_option('display.max_colwidth', 90)
tab = pd.read_csv('R Squared.csv')
display(tab)
pd.reset_option('display.max_colwidth')
Num Data Description Features Dummy Columns R-Squared Adjusted R-Squared
0 1 Numerical Data 7 0 80.70% 80.60%
1 2 Numerical Data within Sale Condition 'Normal' 6 0 84.70% 84.70%
2 3 Categorical Data with 28 Features/184 Dummy Columns 28 184 82.20% 79.80%
3 4 Categorical Data with 12 Features/89 Dummy Columns 12 89 78.10% 76.80%
4 5 Categorical Data within Sale Condition 'Normal' 12 87 74.90% 73.20%
5 6 Numerical + Categorical Data Combined (1 + 3 above) 19 89 90.90% 90.30%
6 7 Numerical a+Categorical Data Combined (2 + 4 above) 19 89 91.70% 91.00%

Conclusions

It took a lot of time and work to screen through and clean the data numerical and non-numerical columns and to come up with various predictor model results as shown in the summary table above.

There were some judgement calls to make along the way in terms of what to drop and what to include.

With a final Adjusted R_Squared value of 91% for combined numerical and non-numerical data, I would say that is very good!

This is a long report and if you completely read through it, accolades to you and thank you!!