#!/usr/bin/env python # coding: utf-8 # # Regression Analysis for Estimating Prices # **By [Czarina Luna](https://www.linkedin.com/in/czarinagluna/)** # # ### Contents # * [I. Overview](#I.-Overview) # * [II. Business Problem](#II.-Business-Problem) # * [III. Data Understanding](#III.-Data-Understanding) # * [3.1 Exploratory Data Analysis](#3.1-Exploratory-Data-Analysis) # * [3.2 Geospatial Mapping](#3.2-Geospatial-Mapping) # * [3.3 Feature Engineering](#3.3-Feature-Engineering) # * [IV. Data Modeling](#IV.-Data-Modeling) # * [4.1 Linear Regression](#4.1-Linear-Regression) # * [4.2 Decision Tree](#4.2-Decision-Tree) # * [4.3 Random Forest](#4.3-Random-Forest) # * [V. Results](#V.-Results) # * [VI. Further Research](#VI.-Further-Research) # ## I. Overview # # This project analyzes and models over 21,000 real estate transactions in King County, Washington to improve home valuation and real estate advisory. The features most important to the *bestimate* model are square footage of living space, distance to Seattle, total distance to both Seattle and Redmond, distance to Redmond, and square footage of living space of the nearest 15 neighbors. The *bestimate* model, a tuned random forest, performs best in predicting house prices from over 20 features: it explains 88% of the variance in the data, and its predictions are about USD 107,000 off from the actual prices on average. # ## II. Business Problem # # Real estate agencies in King County, Washington may be able to improve their advisory services by identifying the features that factor most heavily into home valuation. Doing so allows real estate agents to provide clients with more accurate prices supported by historical records. Using publicly available data, I describe patterns in real estate transactions, such as the features that likely drive prices. To take it a step further, I develop a prototype for a client-facing application that serves as a home value estimator, modeled from the data. # ## III. Data Understanding # # The [King County data](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction) spans one year, 2014 to 2015, and contains over 21,000 real estate transactions. Among the 21 variables featured, `price` is the dependent variable to predict. A full description of all column features is provided along with other data files in the repository's [data folder](https://github.com/czarinagluna/regression-analysis-for-estimating-prices/tree/main/Data).
# # To prepare the data for analysis, the following pre-processing steps are taken: # * Convert `date` to a datetime format and create `month` feature # * Remove outliers in `price` and correct an erroneous `bedrooms` value # * Replace values for `waterfront`, `view`, `condition`, `grade` with a number scale for consistency # * Create `basement` feature using `sqft_basement` to flag whether the house has a basement # * Create `renovated` feature using `yr_renovated` to flag whether the house has been renovated # * Create `age` feature by subtracting `yr_built` from the year of the `date` sold # In[1]: import pandas as pd import numpy as np # Import the dataset data = pd.read_csv('data/kc_house_data.csv') # In[2]: pd.set_option('display.max_columns', 30) # Quick look at the data structure data.head(3) # In[3]: # Convert date to datetime format data['date'] = pd.to_datetime(data['date']) # In[4]: data['month'] = pd.DatetimeIndex(data['date']).month # Summary statistics of numerical features: # In[5]: data.drop(columns=['id', 'zipcode', 'lat', 'long']).describe().T.astype('int') # **Data Cleaning** # In[6]: import matplotlib.pyplot as plt import seaborn as sns # Create dataframe with price in millions price_data = data[['price']]/1000000 plt.figure(figsize=(12, 2)) plt.style.use('bmh') sns.boxplot(x='price', data=price_data, color='teal') plt.title('Distribution of Prices (USD Million)') plt.savefig('data/images/fig0.png', dpi=200, transparent=True) plt.show() # In[7]: # Remove price outliers below the 0.5th and above the 99.5th percentiles data = data[(data.price < data.price.quantile(.995)) & (data.price > data.price.quantile(.005))] # In[8]: plt.figure(figsize=(8, 2)) sns.boxplot(x='bedrooms', data=data, color='honeydew') plt.title('Distribution of Bedrooms') plt.savefig('data/images/fig1.png', dpi=200, transparent=True) plt.show() # The listing recorded with 33 bedrooms is likely a typographical error, so it is corrected to 3 below.
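# Before overwriting the value, the suspicious listing can be inspected as a quick sanity check. This cell is not part of the original notebook and the threshold of 10 bedrooms is arbitrary; it is only a minimal sketch:

# In[ ]:

# List any implausible bedroom counts alongside a few related columns
suspect = data.loc[data['bedrooms'] > 10, ['bedrooms', 'bathrooms', 'sqft_living', 'price']]
print(suspect)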
# In[9]: data.loc[(data.bedrooms == 33), 'bedrooms'] = 3 # In[10]: # Replace missing values with no waterfront data.loc[data['waterfront'].isnull(), 'waterfront'] = 'NO' data['waterfront'] = data['waterfront'].apply(lambda x: 0 if x == 'NO' else 1) # In[11]: # Replace missing values with no view data.loc[data.view.isnull(), 'view'] = 'NONE' data['view'] = data['view'].replace({'NONE':0, 'FAIR':1, 'AVERAGE':2, 'GOOD':3, 'EXCELLENT':4}) # In[12]: data['condition'] = data['condition'].replace({'Poor':0, 'Fair':1, 'Average':2, 'Good':3, 'Very Good':4}) # In[13]: data['grade'] = data['grade'].map(lambda x: int(x.split(' ')[0])) # keep only the numeric grade from the string # In[14]: # Replace '?' placeholder values with 0 square footage of basement data['sqft_basement'] = data['sqft_basement'].replace('?', '0').astype('float') data['basement'] = data['sqft_basement'].apply(lambda x: 0 if x == 0 else 1) # In[15]: # Replace missing values with not renovated data.loc[data['yr_renovated'].isnull(), 'yr_renovated'] = 0 data['renovated'] = data['yr_renovated'].apply(lambda x: 0 if x == 0 else 1) # In[16]: data['age'] = data['date'].dt.year - data['yr_built'] # ### 3.1 Exploratory Data Analysis # Month-over-month comparison: # In[17]: date_data = data[['date', 'price']] date_data = date_data.set_index('date', drop=True).astype('float64') # Resample data to count transactions per month and drop the last, partial month monthly_count = date_data.resample('MS').count() monthly_count = monthly_count.drop(monthly_count.index[-1]) # Resample data to calculate mean price per month and drop the last, partial month monthly_avg = date_data.resample('MS').mean() monthly_avg = monthly_avg.drop(monthly_avg.index[-1]) # In[18]: fig, axes = plt.subplots(ncols=2, figsize=(12,4)) monthly_count['price'].plot(title='Total Count', ax=axes[0], colormap='Set3', marker='.') monthly_avg['price'].plot(title='Average Price', ax=axes[1], colormap='BrBG_r', marker='.') axes[0].grid(axis='x', linewidth=0) axes[1].grid(axis='x', linewidth=0) plt.suptitle('Houses Sold Per Month', fontsize=14) plt.savefig('data/images/fig2.png', dpi=200, transparent=True) plt.show() # The number of houses sold per month peaked in spring and summer, then declined to its lowest point in the first month of the new year. The average price of houses sold per month followed roughly the same pattern. # # *** # Check for multicollinearity: # In[19]: def check_correlation(data, threshold): '''Checks for the highest correlated feature pairs in a dataframe given a threshold''' df = data.corr().abs().stack().reset_index().sort_values(0, ascending=False) df['pairs'] = list(zip(df['level_0'], df['level_1'])) df = df.set_index(['pairs']).drop(columns=['level_1', 'level_0']) df.columns = ['correlation'] df = df.drop_duplicates() df = df[(df['correlation'] > threshold) & (df['correlation'] < 1)] return df # In[20]: check_correlation(data, 0.75) # In[21]: # Drop highly collinear features and non-predictive columns such as id to_drop = ['id', 'date', 'yr_renovated', 'sqft_above', 'sqft_basement', 'yr_built'] data = data.drop(columns=to_drop) # In[22]: plt.figure(figsize=(10,8)) plt.style.use('default') corr = data.corr() mask = np.triu(np.ones_like(corr, dtype=bool)) sns.heatmap(corr.abs(), mask=mask, cmap='YlGnBu') plt.title('Feature Correlation with Price', fontsize=14) plt.savefig('data/images/fig3.png', dpi=200, transparent=True) plt.show() # Price is strongly correlated with `sqft_living`, `grade`, `sqft_living15`, and `bathrooms`.
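# The same ranking can be read directly from the correlation matrix. The short sketch below is not part of the original notebook; it simply lists the features most correlated with price:

# In[ ]:

# Rank features by absolute correlation with price, excluding price itself
price_corr = data.corr()['price'].abs().sort_values(ascending=False)
print(price_corr.drop('price').head(10))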
# # Let's visualize their relationships and distributions. # In[23]: select_features = ['price', 'bathrooms', 'sqft_living', 'grade'] sns.pairplot(data, vars=select_features, diag_kind='kde', corner=True).fig.suptitle('Relationships between Price and Select Features', fontsize=16, y=1) plt.savefig('data/images/fig4.png', dpi=200, transparent=True) plt.show() # ### 3.2 Geospatial Mapping # # > '*Location, location, location*' # # **Location is important in real estate and in analysis!** # # Let's map the data points. # In[24]: import geopandas as gpd from shapely.geometry import Point, Polygon def map_data(data_, title, name, colormap): '''Plots the data points using latitude and longitude on top of the King County map''' plt.style.use('seaborn') fig, ax = plt.subplots(figsize=(8,8)) king_county_fig = gpd.read_file('data/map/Incorporated_Areas_of_King_County___city_area.shp') geometry = [Point(xy) for xy in zip(data_['long'], data_['lat'])] geo_data = gpd.GeoDataFrame(data_, crs='EPSG:4326', geometry=geometry) king_county_fig.plot(ax=ax, alpha=0.4, color='grey') geo_data.plot(column='price', ax=ax, alpha=0.4, legend=True, markersize=8, cmap=plt.get_cmap(colormap), legend_kwds={'shrink': 0.7, 'format': '%.0f'}) plt.xlim(data['long'].min(), data['long'].max()) plt.ylim(data['lat'].min(), data['lat'].max()) plt.ticklabel_format(style='plain') plt.title(title, fontsize=14) plt.savefig(f'data/images/{name}.png', dpi=200, transparent=True) plt.show() # In[25]: # Create dataframe of houses sold under a million dollars UnderMillion = data.loc[(data['price'] < 1000000)] # Create dataframe of houses sold over a million dollars OverMillion = data.loc[(data['price'] >= 1000000)] # In[26]: # Store dataframe for retrieval in map notebook get_ipython().run_line_magic('store', 'UnderMillion') # In[27]: # Store dataframe for retrieval in map notebook get_ipython().run_line_magic('store', 'OverMillion') # In[28]: map_data(UnderMillion, 'House Prices Sold Under $1 Million', 'fig5', 'PuBuGn') # The geographic patterns reveal the parts of the county where the more expensive houses are concentrated, shown in darker colors, such as the island at the center of the map. The houses sold for over a million dollars cluster in roughly the same spots as those darker dots. # # The highest-priced houses in the data are located in the following map: # In[29]: map_data(OverMillion, 'House Prices Sold Over $1 Million', 'fig6', 'Greens') # *** # **Interactive Map** # # I created choropleth maps ([map notebook](https://github.com/czarinagluna/regression-analysis-for-estimating-prices/blob/main/map.ipynb)) to further understand how house prices vary by location. # # To interact with the maps, please use the [notebook viewer](https://nbviewer.org/github/czarinagluna/regression-analysis-for-estimating-prices/blob/main/map.ipynb).
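# The interactive maps themselves are built in the separate map notebook. Purely as an illustration of the approach, a zipcode-level choropleth could be sketched with folium along the lines below; the GeoJSON file and its `key_on` field are hypothetical and would need to match an actual boundary file:

# In[ ]:

import folium

# Median sale price per zipcode (zipcode is still a column at this point)
zip_prices = data.groupby('zipcode')['price'].median().reset_index()
zip_prices['zipcode'] = zip_prices['zipcode'].astype(str)

king_county_map = folium.Map(location=[47.5, -122.1], zoom_start=9)
folium.Choropleth(
    geo_data='data/map/king_county_zipcodes.geojson',  # hypothetical boundary file
    data=zip_prices,
    columns=['zipcode', 'price'],
    key_on='feature.properties.ZIPCODE',  # depends on the GeoJSON schema
    fill_color='PuBuGn',
    legend_name='Median Sale Price (USD)',
).add_to(king_county_map)
king_county_map.save('data/images/choropleth_sketch.html')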
# ![](data/images/fig7.png) # ### 3.3 Feature Engineering # King County Population: # In[30]: # Import data from population census population = pd.read_csv('data/population.csv') population_density = population.loc[:, ['City Name', 'Total Population 2010', 'Population Density (Persons / Square Mile) 2010']] population_density = population_density.rename(columns={'City Name':'city', 'Total Population 2010':'population', 'Population Density (Persons / Square Mile) 2010':'pop_density'}) # In[31]: king_county_cities = ['Algona', 'Auburn', 'Beaux Arts Village', 'Bellevue', 'Black Diamond', 'Bothell', 'Burien', 'Carnation', 'Clyde Hill', 'Covington', 'Des Moines', 'Duvall', 'Enumclaw', 'Federal Way', 'Hunts Point', 'Issaquah', 'Kenmore', 'Kent', 'Kirkland', 'Lake Forest Park', 'Maple Valley', 'Medina', 'Mercer Island', 'Milton', 'Newcastle', 'Normandy Park', 'North Bend', 'Pacific', 'Redmond', 'Renton', 'Sammamish', 'SeaTac', 'Seattle', 'Shoreline', 'Skykomish', 'Snoqualmie', 'Tukwila', 'Woodinville', 'Yarrow Point'] city_names = pd.DataFrame(king_county_cities, columns=['city']) # In[32]: import warnings warnings.filterwarnings('ignore') king_county_population = pd.merge(population_density, city_names, on=['city']) king_county_population = king_county_population.append({'city':'Vashon','population':10624, 'pop_density':287.22}, ignore_index=True) # In[33]: zipcodes = pd.read_csv('data/zipcodes.csv') zipcodes['city'] = zipcodes['zipcode'].apply(lambda x: x.split('(')[1].strip('()')) zipcodes['zipcode'] = zipcodes['zipcode'].apply(lambda x: int(x.split()[0])) # In[34]: # Map population data to cities king_county_zipcode = pd.merge(king_county_population, zipcodes, on=['city']) engineered_data = pd.merge(data, king_county_zipcode, how='inner', on=['zipcode']) # Distance to Seattle and Redmond: # In[35]: from haversine import haversine, Unit def find_distance(center, lat, long): '''Uses haversine function to find distance between a given center and a point's latitude and longitude''' house = (lat, long) distance = haversine(center, house, unit=Unit.MILES) return distance def add_distance(data): '''Applies the find_distance function to dataframe to add columns for distances to Seattle and Redmond''' seattle = (47.608013, -122.335167) redmond = (47.673988, -122.121513) data['to_seattle'] = data.apply(lambda x: find_distance(seattle, x['lat'], x['long']), axis=1) data['to_redmond'] = data.apply(lambda x: find_distance(redmond, x['lat'], x['long']), axis=1) data['distance'] = data['to_seattle'] + data['to_redmond'] data = data.drop(columns=['lat', 'long']) return data # In[36]: engineered_data = add_distance(engineered_data) # ## IV. 
Data Modeling # **Train Test Split** # In[37]: from sklearn.model_selection import train_test_split X = data.drop(columns=['price']) y = data['price'] X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=112221) # Baseline Model: # In[38]: # Simple linear regression using one independent variable from sklearn.linear_model import LinearRegression lr_model = LinearRegression() lr_model.fit(X_tr[['sqft_living']], y_tr) baseline_pred = lr_model.predict(X_te[['sqft_living']]) # In[39]: # Evaluate using R squared and RMSE scores from sklearn.model_selection import cross_val_score from sklearn.metrics import mean_squared_error baseline_cross_val = cross_val_score(lr_model, X_tr[['sqft_living']], y_tr, scoring='neg_root_mean_squared_error', cv=10) baseline_cross_val_mean = -(baseline_cross_val.mean()) baseline_rmse = mean_squared_error(y_te, baseline_pred, squared=False) baseline_score = lr_model.score(X_te[['sqft_living']], y_te) # In[40]: metric_table = pd.DataFrame(columns = ['model', 'cv_rmse', 'test_rmse', 'test_score']) metric_table = metric_table.append({'model': 'simple_lr', 'cv_rmse': f'{baseline_cross_val_mean:.2f}', 'test_rmse': f'{baseline_rmse:.2f}', 'test_score': f'{baseline_score:.2f}'}, ignore_index=True) # In[41]: metric_table # ### 4.1 Linear Regression # # Multiple Linear Regression: # In[42]: def run_model(regressor, name, X_train=X_tr, X_test=X_te, y_train=y_tr, y_test=y_te): '''Fits a regression model on train set, predicts on the test set, and returns cross validation and test scores''' regressor.fit(X_train, y_train) y_pred = regressor.predict(X_test) cross_val = cross_val_score(regressor, X_train, y_train, scoring='neg_root_mean_squared_error', cv=10) cross_val_mean = -(cross_val.mean()) rmse = mean_squared_error(y_test, y_pred, squared=False) score = regressor.score(X_test, y_test) return metric_table.append({'model': name, 'cv_rmse': f'{cross_val_mean:.2f}', 'test_rmse': f'{rmse:.2f}', 'test_score': f'{score:.2f}'}, ignore_index=True) # In[43]: # Use multiple numerical features num_feat = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_living15', 'sqft_lot15', 'age'] metric_table = run_model(lr_model, 'multiple_lr', X_tr[num_feat], X_te[num_feat]) metric_table # Standardization: # In[44]: from sklearn.preprocessing import StandardScaler scaler = StandardScaler() X_sc_tr = pd.DataFrame(scaler.fit_transform(X_tr[num_feat], y_tr), columns=X_tr[num_feat].columns, index=X_tr[num_feat].index) X_sc_te = pd.DataFrame(scaler.transform(X_te[num_feat]), columns=X_te[num_feat].columns, index=X_te[num_feat].index) metric_table = run_model(lr_model, 'standardized_lr', X_sc_tr, X_sc_te) metric_table # In[45]: # Data with engineered features such as population and distance X_ = engineered_data.drop(columns=['price']) y_ = engineered_data['price'] X_train, X_test, y_train, y_test = train_test_split(X_, y_, random_state=112221) # In[46]: # Store for retrieval in grid search notebook get_ipython().run_line_magic('store', 'X_train') # In[47]: # Store for retrieval in grid search notebook get_ipython().run_line_magic('store', 'y_train') # In[48]: num_feat.extend(['population', 'pop_density', 'to_seattle', 'to_redmond', 'distance']) X_scaled_train = pd.DataFrame(scaler.fit_transform(X_train[num_feat], y_train), columns=X_train[num_feat].columns, index=X_train[num_feat].index) X_scaled_test = pd.DataFrame(scaler.transform(X_test[num_feat]), columns=X_test[num_feat].columns, index=X_test[num_feat].index) metric_table = run_model(lr_model, 'feature_lr', 
X_scaled_train, X_scaled_test, y_train, y_test) metric_table # Machine Learning Pipeline: # In[49]: from sklearn.pipeline import Pipeline from sklearn.impute import SimpleImputer from sklearn.preprocessing import OneHotEncoder num_pipe = Pipeline(steps=[ ('num_imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler()) ]) # Perform one hot encoding on categorical features cat_pipe = Pipeline(steps=[ ('cat_imputer', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder(handle_unknown='ignore')) ]) # In[50]: from sklearn.compose import ColumnTransformer num_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_living15', 'sqft_lot15', 'age', 'population', 'pop_density', 'to_seattle', 'to_redmond', 'distance'] cat_features = ['waterfront', 'view', 'condition', 'grade', 'zipcode', 'month', 'basement', 'renovated', 'city'] col_transformer = ColumnTransformer([ ('num_transformer', num_pipe, num_features), ('cat_transformer', cat_pipe, cat_features) ]) # In[51]: # Multiple linear regression using all features lr_pipe = Pipeline(steps=[ ('col_transformer', col_transformer), ('lr', LinearRegression()) ]) metric_table = run_model(lr_pipe, 'pipeline_lr', X_train, X_test, y_train, y_test) metric_table # Ridge and Lasso Regression: # In[52]: from sklearn.linear_model import Ridge ridge_pipe = Pipeline(steps=[ ('col_transformer', col_transformer), ('ridge', Ridge()) ]) metric_table = run_model(ridge_pipe, 'ridge_lr', X_train, X_test, y_train, y_test) metric_table # In[53]: from sklearn.linear_model import Lasso lasso_pipe = Pipeline(steps=[ ('col_transformer', col_transformer), ('lasso', Lasso(max_iter=2000)) ]) metric_table = run_model(lasso_pipe, 'lasso_lr', X_train, X_test, y_train, y_test) metric_table # Polynomial Regression: # In[54]: from sklearn.preprocessing import PolynomialFeatures poly_col_transformer = ColumnTransformer([ ('num_transformer', num_pipe, num_features), ('polynomial', PolynomialFeatures(2), num_features), ('cat_transformer', cat_pipe, cat_features), ]) poly_pipe = Pipeline(steps=[ ('poly_col_transformer', poly_col_transformer), ('lr', LinearRegression()) ]) metric_table = run_model(poly_pipe, 'poly_lr', X_train, X_test, y_train, y_test) metric_table # ### 4.2 Decision Tree # In[55]: from sklearn.tree import DecisionTreeRegressor dt_pipe = Pipeline(steps=[ ('col_transformer', col_transformer), ('decision_tree', DecisionTreeRegressor()) ]) metric_table = run_model(dt_pipe, 'default_dt', X_train, X_test, y_train, y_test) metric_table # In[56]: # Store for retrieval in grid search notebook get_ipython().run_line_magic('store', 'dt_pipe') # In[57]: poly_dt = Pipeline(steps=[ ('poly_col_transformer', poly_col_transformer), ('decision_tree', DecisionTreeRegressor()) ]) metric_table = run_model(poly_dt, 'poly_dt', X_train, X_test, y_train, y_test) metric_table # Grid search ([notebook](https://github.com/czarinagluna/regression-analysis-for-estimating-prices/blob/main/gridsearch.ipynb)) is performed to find the optimal parameters of the Decision Tree model. 
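# The search itself runs in the separate grid search notebook using the stored `dt_pipe`, `X_train`, and `y_train`. As a minimal sketch only (the parameter grid below is illustrative, not the exact grid used), it could look like this:

# In[ ]:

from sklearn.model_selection import GridSearchCV

# Illustrative grid; parameter names are prefixed with the pipeline step name
dt_param_grid = {
    'decision_tree__max_depth': [None, 10, 20],
    'decision_tree__min_samples_leaf': [1, 5, 10],
    'decision_tree__min_samples_split': [2, 15, 45],
}
dt_grid = GridSearchCV(dt_pipe, dt_param_grid, scoring='neg_root_mean_squared_error', cv=5, n_jobs=-1)
dt_grid.fit(X_train, y_train)
print(dt_grid.best_params_)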
# In[58]: tuned_dt = Pipeline(steps=[ ('col_transformer', col_transformer), ('decision_tree', DecisionTreeRegressor(criterion='friedman_mse', max_depth=None, min_samples_leaf=5, min_samples_split=45)) ]) metric_table = run_model(tuned_dt, 'tuned_dt', X_train, X_test, y_train, y_test) metric_table # In[59]: poly_tuned_dt = Pipeline(steps=[ ('poly_col_transformer', poly_col_transformer), ('decision_tree', DecisionTreeRegressor(criterion='friedman_mse', max_depth=None, min_samples_leaf=5, min_samples_split=45)) ]) metric_table = run_model(poly_tuned_dt, 'poly_tuned_dt', X_train, X_test, y_train, y_test) metric_table # ### 4.3 Random Forest # In[60]: from sklearn.ensemble import RandomForestRegressor rf_pipe = Pipeline(steps=[ ('col_transformer', col_transformer), ('random_forest', RandomForestRegressor()) ]) metric_table = run_model(rf_pipe, 'default_rf', X_train, X_test, y_train, y_test) metric_table # In[61]: # Store for retrieval in grid search notebook get_ipython().run_line_magic('store', 'rf_pipe') # Grid search ([notebook](https://github.com/czarinagluna/regression-analysis-for-estimating-prices/blob/main/gridsearch.ipynb)) is performed to find the optimal parameters of the Random Forest model. # In[62]: tuned_rf = Pipeline(steps=[ ('col_transformer', col_transformer), ('random_forest', RandomForestRegressor(criterion='friedman_mse', max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=150)) ]) metric_table = run_model(tuned_rf, 'tuned_rf', X_train, X_test, y_train, y_test) metric_table # In[63]: poly_tuned_rf = Pipeline(steps=[ ('poly_col_transformer', poly_col_transformer), ('random_forest', RandomForestRegressor(criterion='friedman_mse', max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=150)) ]) metric_table = run_model(poly_tuned_rf, 'poly_tuned_rf', X_train, X_test, y_train, y_test) metric_table # In[64]: # The best model with the best estimate bestimate_model = tuned_rf bestimate_model.fit(X_train, y_train) bestimate_pred = bestimate_model.predict(X_test) # ## V. 
Results # # Model Performance: # In[65]: # Visualize the metrics evaluated per model metric_table[['cv_rmse', 'test_rmse', 'test_score']] = metric_table[['cv_rmse', 'test_rmse', 'test_score']].astype('float') plt.style.use('bmh') fig, axes = plt.subplots(nrows=2, figsize=(18,12)) # First metric is the R-squared score sns.barplot(x='model', y='test_score', data=metric_table, palette='BuPu', ax=axes[0], zorder=99) axes[0].set_xlabel('') axes[0].set_ylabel('R²') axes[0].set_ylim(0, 1) axes[0].set_title('Comparison of Model Performance: R²') # Second metric is the root mean squared error sns.barplot(x='model', y='test_rmse', data=metric_table, palette='BuPu', ax=axes[1], zorder=99) axes[1].set_xlabel('') axes[1].set_ylabel('RMSE') axes[1].set_title('Comparison of Model Performance: Root Mean Squared Error') plt.savefig('data/images/fig8.png', dpi=200, transparent=True) plt.show() # **Baseline to *Bestimate* Model** # # From the baseline model `simple_lr` to the `tuned_rf`, our *bestimate* model, performance improved significantly: # * from an R-squared of 0.47 to 0.88, and # * from a Root Mean Squared Error of USD 217,000 to USD 107,000 # # To visualize the difference, let's plot how far the predictions of the baseline model are from the actual prices versus how much closer the predictions of the bestimate model are: # In[66]: plt.style.use('default') fig, axes = plt.subplots(ncols=2, figsize=(12,6)) # Perfect line where the predicted price is equal to the actual price perfect_line = np.arange(0, 3) # Baseline simple linear regression model, compared against its own test split y_te axes[0].plot(perfect_line, linestyle='--', color='black', label='perfect prediction') axes[0].scatter(y_te/1000000, baseline_pred/1000000, alpha=0.1, color='cornflowerblue', label='price') axes[0].set_xlabel('Actual Price in Millions') axes[0].set_ylabel('Predicted Price in Millions') axes[0].set_title('Baseline Model \n R²: 47% RMSE: $217,000') axes[0].set_xlim(0, 2) axes[0].set_ylim(0, 2) axes[0].legend() # Bestimate random forest model axes[1].plot(perfect_line, linestyle='--', color='black', label='perfect prediction') axes[1].scatter(y_test/1000000, bestimate_pred/1000000, alpha=0.1, color='plum', label='price') axes[1].set_xlabel('Actual Price in Millions') axes[1].set_ylabel('Predicted Price in Millions') axes[1].set_title('Bestimate Model \n R²: 88% RMSE: $107,000') axes[1].set_xlim(0, 2) axes[1].set_ylim(0, 2) axes[1].legend() plt.savefig('data/images/fig9.png', dpi=200, transparent=True) plt.show() # Important features under the *bestimate* model: # In[67]: from sklearn.inspection import permutation_importance result = permutation_importance(bestimate_model, X_test, y_test, n_repeats=10, random_state=112221, n_jobs=2) sorted_idx = result.importances_mean.argsort() # In[68]: plt.style.use('bmh') fig, ax = plt.subplots(figsize=(16,12)) ax.boxplot(result.importances[sorted_idx].T, vert=False, labels=X_test.columns[sorted_idx], ) ax.set_title('Permutation Importances', fontsize=16) ax.set_xlabel('Permutation Importance (decrease in R² score)') ax.set_ylabel('Feature') plt.savefig('data/images/fig10.png', dpi=200, transparent=True) plt.show() # For advisory purposes, the top five features with the highest mean permutation importance to the model are: # 1. Square footage of living space # 2. Distance to Seattle # 3. Total distance to both Seattle and Redmond # 4. Distance to Redmond # 5. Square footage of living space of the nearest 15 neighbors
# # The other important features that follow are the overall grade related to the construction and design of the house, the quality of the view from the house, the zipcode, whether the house is on a waterfront, and the square footage of the lot. # *** # **Model Deployment** # # Finally, I deploy the Random Forest regression model ([demo](https://streamlit.io/)) as a prototype of a client-facing application that serves as a Home Value Estimator; a minimal sketch of such an app is included at the end of this notebook. # ## VI. Further Research # # For further research, the data should be updated with more recent King County transactions, and more features should be added, such as information on comparable and neighboring houses. # # Contact # # Feel free to contact me with any questions and connect with me on [LinkedIn](https://www.linkedin.com/in/czarinagluna/).
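# ## Appendix: Home Value Estimator Sketch
# As referenced in the Model Deployment section, this is a minimal sketch of what the client-facing Streamlit prototype could look like. It is illustrative only: the saved model path, the defaults file, and the input fields are assumptions, and a real app would need to supply every column the pipeline was trained on.

# In[ ]:

import joblib
import pandas as pd
import streamlit as st

# Hypothetical artifacts: a saved copy of the fitted bestimate pipeline and a
# one-row CSV of median/default values for every training column
model = joblib.load('data/bestimate_model.joblib')
defaults = pd.read_csv('data/feature_defaults.csv')

st.title('King County Home Value Estimator')
sqft_living = st.number_input('Square footage of living space', value=2000)
bedrooms = st.slider('Bedrooms', 1, 10, 3)
bathrooms = st.slider('Bathrooms', 1.0, 8.0, 2.0, step=0.25)
zipcode = st.text_input('Zipcode', '98052')

if st.button('Estimate price'):
    # Start from the default row and overwrite the user-provided fields
    house = defaults.copy()
    house.loc[0, 'sqft_living'] = sqft_living
    house.loc[0, 'bedrooms'] = bedrooms
    house.loc[0, 'bathrooms'] = bathrooms
    house.loc[0, 'zipcode'] = int(zipcode)
    st.write(f'Estimated price: ${model.predict(house)[0]:,.0f}')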