In preceding notebooks, we performed preliminary assessments of data quality and refined the question to be answered. We found a small number of observations that clearly indicated whether to replace values or drop a whole row. We determined that predicting the adult weekend ticket price was our primary aim. We threw away records with missing price data, but not before making the most of the other available data to look for any patterns among the states. We didn't see any and decided to treat all states equally; the state label didn't seem to be particularly useful.
In this notebook, we'll start to build machine learning models. Before diving into a machine learning model, however, we'll consider how useful the mean value is as a predictor. We never want to go to stakeholders with a machine learning model only to have the CEO point out that it performs worse than just guessing the average! Our first model is always a baseline performance comparator for any subsequent model. Next, we'll build up a process for efficiently creating robust models to compare against that baseline. We can validate steps with our own functions that check expected equivalences between, say, pandas and sklearn implementations.
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
from library.sb_utils import save_file
ski_data = pd.read_csv('../data/ski_data_step3_features.csv')
ski_data.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Name | Alyeska Resort | Eaglecrest Ski Area | Hilltop Ski Area | Arizona Snowbowl | Sunrise Park Resort |
Region | Alaska | Alaska | Alaska | Arizona | Arizona |
state | Alaska | Alaska | Alaska | Arizona | Arizona |
summit_elev | 3939 | 2600 | 2090 | 11500 | 11100 |
vertical_drop | 2500 | 1540 | 294 | 2300 | 1800 |
base_elev | 250 | 1200 | 1796 | 9200 | 9200 |
trams | 1 | 0 | 0 | 0 | 0 |
fastSixes | 0 | 0 | 0 | 1 | 0 |
fastQuads | 2 | 0 | 0 | 0 | 1 |
quad | 2 | 0 | 0 | 2 | 2 |
triple | 0 | 0 | 1 | 2 | 3 |
double | 0 | 4 | 0 | 1 | 1 |
surface | 2 | 0 | 2 | 2 | 0 |
total_chairs | 7 | 4 | 3 | 8 | 7 |
Runs | 76.0 | 36.0 | 13.0 | 55.0 | 65.0 |
TerrainParks | 2.0 | 1.0 | 1.0 | 4.0 | 2.0 |
LongestRun_mi | 1.0 | 2.0 | 1.0 | 2.0 | 1.2 |
SkiableTerrain_ac | 1610.0 | 640.0 | 30.0 | 777.0 | 800.0 |
Snow Making_ac | 113.0 | 60.0 | 30.0 | 104.0 | 80.0 |
daysOpenLastYear | 150.0 | 45.0 | 150.0 | 122.0 | 115.0 |
yearsOpen | 60.0 | 44.0 | 36.0 | 81.0 | 49.0 |
averageSnowfall | 669.0 | 350.0 | 69.0 | 260.0 | 250.0 |
AdultWeekend | 85.0 | 53.0 | 34.0 | 89.0 | 78.0 |
projectedDaysOpen | 150.0 | 90.0 | 152.0 | 122.0 | 104.0 |
NightSkiing_ac | 550.0 | NaN | 30.0 | NaN | 80.0 |
resorts_per_state | 3 | 3 | 3 | 2 | 2 |
resorts_per_100kcapita | 0.410091 | 0.410091 | 0.410091 | 0.027477 | 0.027477 |
resorts_per_100ksq_mile | 0.450867 | 0.450867 | 0.450867 | 1.75454 | 1.75454 |
resort_skiable_area_ac_state_ratio | 0.70614 | 0.280702 | 0.013158 | 0.492708 | 0.507292 |
resort_days_open_state_ratio | 0.434783 | 0.130435 | 0.434783 | 0.514768 | 0.485232 |
resort_terrain_park_state_ratio | 0.5 | 0.25 | 0.25 | 0.666667 | 0.333333 |
resort_night_skiing_state_ratio | 0.948276 | NaN | 0.051724 | NaN | 1.0 |
total_chairs_runs_ratio | 0.092105 | 0.111111 | 0.230769 | 0.145455 | 0.107692 |
total_chairs_skiable_ratio | 0.004348 | 0.00625 | 0.1 | 0.010296 | 0.00875 |
fastQuads_runs_ratio | 0.026316 | 0.0 | 0.0 | 0.0 | 0.015385 |
fastQuads_skiable_ratio | 0.001242 | 0.0 | 0.0 | 0.0 | 0.00125 |
Big Mountain is our resort. Separate it from the rest of the data to use later.
big_mountain = ski_data[ski_data.Name == 'Big Mountain Resort']
big_mountain.T
124 | |
---|---|
Name | Big Mountain Resort |
Region | Montana |
state | Montana |
summit_elev | 6817 |
vertical_drop | 2353 |
base_elev | 4464 |
trams | 0 |
fastSixes | 0 |
fastQuads | 3 |
quad | 2 |
triple | 6 |
double | 0 |
surface | 3 |
total_chairs | 14 |
Runs | 105.0 |
TerrainParks | 4.0 |
LongestRun_mi | 3.3 |
SkiableTerrain_ac | 3000.0 |
Snow Making_ac | 600.0 |
daysOpenLastYear | 123.0 |
yearsOpen | 72.0 |
averageSnowfall | 333.0 |
AdultWeekend | 81.0 |
projectedDaysOpen | 123.0 |
NightSkiing_ac | 600.0 |
resorts_per_state | 12 |
resorts_per_100kcapita | 1.122778 |
resorts_per_100ksq_mile | 8.161045 |
resort_skiable_area_ac_state_ratio | 0.140121 |
resort_days_open_state_ratio | 0.129338 |
resort_terrain_park_state_ratio | 0.148148 |
resort_night_skiing_state_ratio | 0.84507 |
total_chairs_runs_ratio | 0.133333 |
total_chairs_skiable_ratio | 0.004667 |
fastQuads_runs_ratio | 0.028571 |
fastQuads_skiable_ratio | 0.001 |
ski_data.shape
(278, 36)
ski_data = ski_data[ski_data.Name != 'Big Mountain Resort']
ski_data.shape
(277, 36)
So far, we've treated the ski resort data as a single entity. In machine learning, if we train our model on all of our data, we end up with no data set aside to evaluate model performance. We could keep building more and more complex models that fit the data better and better without realising we were overfitting. By partitioning the data into training and testing splits, without letting the model (or any missing-value imputation) learn anything about the test split, we get a somewhat independent assessment of how our model might perform in the future. An often overlooked subtlety is that people all too frequently use the test set to assess the performance of multiple models and then pick the best one; once the test set has influenced model selection, it no longer gives an unbiased estimate of future performance. Instead, we use held-out data and/or k-fold cross-validation within the training set to simulate additional test sets and compare models, keeping the formal test set as a final check on expected future performance.
What partition sizes would we have with a 70/30 train/test split?
len(ski_data) * .7, len(ski_data) * .3
(193.89999999999998, 83.1)
# Generate test and train sets for X and Y variables; set random state to get reproducible results
X_train, X_test, y_train, y_test = train_test_split(ski_data.drop(columns='AdultWeekend'),
ski_data.AdultWeekend, test_size=0.3,
random_state=47)
# Check the shapes of X train and test sets
X_train.shape, X_test.shape
((193, 35), (84, 35))
# Check the shapes of train Y train and test sets
y_train.shape, y_test.shape
((193,), (84,))
names_list = ['Name', 'state', 'Region']
names_train = X_train[names_list]
names_test = X_test[names_list]
X_train.drop(columns=names_list, inplace=True)
X_test.drop(columns=names_list, inplace=True)
X_train.shape, X_test.shape
((193, 32), (84, 32))
# Check the `dtypes` attribute of `X_train` to verify all features are numeric
X_train.dtypes
summit_elev                           int64
vertical_drop                         int64
base_elev                             int64
trams                                 int64
fastSixes                             int64
fastQuads                             int64
quad                                  int64
triple                                int64
double                                int64
surface                               int64
total_chairs                          int64
Runs                                  float64
TerrainParks                          float64
LongestRun_mi                         float64
SkiableTerrain_ac                     float64
Snow Making_ac                        float64
daysOpenLastYear                      float64
yearsOpen                             float64
averageSnowfall                       float64
projectedDaysOpen                     float64
NightSkiing_ac                        float64
resorts_per_state                     int64
resorts_per_100kcapita                float64
resorts_per_100ksq_mile               float64
resort_skiable_area_ac_state_ratio    float64
resort_days_open_state_ratio          float64
resort_terrain_park_state_ratio       float64
resort_night_skiing_state_ratio       float64
total_chairs_runs_ratio               float64
total_chairs_skiable_ratio            float64
fastQuads_runs_ratio                  float64
fastQuads_skiable_ratio               float64
dtype: object
# Repeat this check for the test split in `X_test`
X_test.dtypes
summit_elev                           int64
vertical_drop                         int64
base_elev                             int64
trams                                 int64
fastSixes                             int64
fastQuads                             int64
quad                                  int64
triple                                int64
double                                int64
surface                               int64
total_chairs                          int64
Runs                                  float64
TerrainParks                          float64
LongestRun_mi                         float64
SkiableTerrain_ac                     float64
Snow Making_ac                        float64
daysOpenLastYear                      float64
yearsOpen                             float64
averageSnowfall                       float64
projectedDaysOpen                     float64
NightSkiing_ac                        float64
resorts_per_state                     int64
resorts_per_100kcapita                float64
resorts_per_100ksq_mile               float64
resort_skiable_area_ac_state_ratio    float64
resort_days_open_state_ratio          float64
resort_terrain_park_state_ratio       float64
resort_night_skiing_state_ratio       float64
total_chairs_runs_ratio               float64
total_chairs_skiable_ratio            float64
fastQuads_runs_ratio                  float64
fastQuads_skiable_ratio               float64
dtype: object
We have only numeric features in X now!
We'll begin by determining how good the mean is as a predictor. In other words, what if we simply say our best guess is the average price?
# Calculate the mean of `y_train`
train_mean = y_train.mean()
train_mean
63.84730569948186
`sklearn`'s `DummyRegressor` easily does this:
# Sanity Check
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
dumb_reg.constant_
array([[63.8473057]])
Having established the grand mean, we need to determine how closely it matches, or explains, the actual values. There are many ways of assessing how well one set of values agrees with another, which brings us to the subject of metrics.
One measure is $R^2$, the coefficient of determination. This is a measure of the proportion of variance in the dependent variable (our ticket price) that is predicted by our "model". The Wikipedia article on the coefficient of determination gives a nice explanation of how negative values can arise; this is frequently a source of confusion for newcomers who, reasonably, ask how a squared value can be negative.
Recall the mean can be denoted by $\bar{y}$, where
$$\bar{y} = \frac{1}{n}\sum_{i=1}^ny_i$$

where $y_i$ are the individual values of the dependent variable.
The total sum of squares can be expressed as

$$SS_{tot} = \sum_i(y_i-\bar{y})^2$$

The above formula should be familiar: it's simply the variance without the denominator that scales (divides) by the sample size.
The residual sum of squares is similarly defined to be
$$SS_{res} = \sum_i(y_i-\hat{y}_i)^2$$

where $\hat{y}_i$ are our predicted values for the dependent variable.
The coefficient of determination, $R^2$, here is given by
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Putting it into words, it's one minus the ratio of the residual variance to the original variance. Thus, the baseline model here, which always predicts $\bar{y}$, should give $R^2=0$. A model that perfectly predicts the observed values would have no residual error and so give $R^2=1$. Models that do worse than predicting the mean will have increased the sum of squares of residuals and so produce a negative $R^2$.
# Calculate the R^2 as defined above
def r_squared(y, ypred):
    """R-squared score.

    Calculate the R-squared, or coefficient of determination, of the input.

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    ybar = np.mean(y)
    sum_sq_tot = np.sum((y - ybar)**2)
    sum_sq_res = np.sum((y - ypred)**2)
    R2 = 1.0 - sum_sq_res / sum_sq_tot
    return R2
We make our predictions by creating an array the length of the training set, filled with the single value of the mean.
y_tr_pred_ = train_mean * np.ones(len(y_train))
y_tr_pred_[:5]
array([63.8473057, 63.8473057, 63.8473057, 63.8473057, 63.8473057])
Remember `sklearn`'s dummy regressor?
y_tr_pred = dumb_reg.predict(X_train)
y_tr_pred[:5]
array([63.8473057, 63.8473057, 63.8473057, 63.8473057, 63.8473057])
We can see that `DummyRegressor` produces exactly the same results and saves us from having to broadcast the mean (or whichever other statistic we used; check out the documentation to see what's available) to an array of the appropriate length. It also gives us an object with `fit()` and `predict()` methods, so we can use it as conveniently as any other `sklearn` estimator.
r_squared(y_train, y_tr_pred)
0.0
Exactly as expected, if we use the average value as our prediction, we get an $R^2$ of zero on our training set. What if we use this "model" to predict unseen values from the test set? Remember, of course, that our "model" is trained on the training set; we still use the training set mean as our prediction.
Make predictions by creating an array the length of the test set, filled with the single value of the (training) mean.
y_te_pred = train_mean * np.ones(len(y_test))
r_squared(y_test, y_te_pred)
-0.0015364646867073173
Generally, we expect performance on a test set to be slightly worse than on the training set, and since we got an $R^2$ of zero on the training set, there's nowhere to go but negative!
While $R^2$ is a common metric and is interpretable in terms of the amount of variance explained, it's less helpful if we want an idea of how "close" our predictions are to the true values. Metrics that summarise the difference between predicted and actual values are the mean absolute error and the mean squared error.
This is very simply the average of the absolute errors:
$$MAE = \frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|$$

def mae(y, ypred):
    """Mean absolute error.

    Calculate the mean absolute error of the arguments

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    abs_error = np.abs(y - ypred)
    mae = np.mean(abs_error)
    return mae
mae(y_train, y_tr_pred)
18.149503610835193
mae(y_test, y_te_pred)
18.672179249938317
Mean absolute error is arguably the most intuitive of all the metrics; it essentially tells us that, on average, we might expect to be off by around \$19 if we guessed ticket price based on an average of known values.
Another common metric (and an important one internally for optimizing machine learning models) is the mean squared error. This is simply the average of the square of the errors:
$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$$

# Calculate the MSE as defined above
def mse(y, ypred):
    """Mean square error.

    Calculate the mean square error of the arguments

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    sq_error = (y - ypred)**2
    mse = np.mean(sq_error)
    return mse
mse(y_train, y_tr_pred)
616.9493046578431
mse(y_test, y_te_pred)
574.1671108060107
So here, we get a slightly better MSE on the test set than we did on the train set. And what does a squared error mean anyway? To convert this back to our measurement space, we often take the square root, to form the root mean square error thus:
np.sqrt([mse(y_train, y_tr_pred), mse(y_test, y_te_pred)])
array([24.83846422, 23.96178438])
Functions are good, but we don't want to define our own every time we want to assess performance. `sklearn.metrics` provides many commonly used metrics, including the ones above.
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.0, -0.0015364646867073173)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(18.149503610835193, 18.672179249938317)
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
(616.9493046578431, 574.1671108060107)
In a Jupyter code cell, running `r2_score?` will bring up the docstring for the function, and `r2_score??` will bring up the actual source code of the function! Here we try the functions out and compare the behaviour of `sklearn`'s function with ours.
# train set - sklearn
# correct order, incorrect order
r2_score(y_train, y_tr_pred), r2_score(y_tr_pred, y_train)
(0.0, -3.054984985780873e+30)
# test set - sklearn
# correct order, incorrect order
r2_score(y_test, y_te_pred), r2_score(y_te_pred, y_test)
(-0.0015364646867073173, -2.8431378228302645e+30)
# train set - using our homebrew function
# correct order, incorrect order
r_squared(y_train, y_tr_pred), r_squared(y_tr_pred, y_train)
(0.0, -3.054984985780873e+30)
# test set - using our homebrew function
# correct order, incorrect order
r_squared(y_test, y_te_pred), r_squared(y_te_pred, y_test)
(-0.0015364646867073173, -2.8431378228302645e+30)
We can get very different results by swapping the argument order, and it's worth highlighting because this mistake is made all too often in the real world! Frequently the argument order doesn't matter, but it will bite us when we use a function that does care. It's sloppy, bad practice, and if we don't make a habit of putting arguments in the right order, sooner or later we'll forget.
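As a small defensive habit (a suggestion, not part of the original workflow), we can pass the metric arguments by keyword so a swapped order becomes impossible:

# Passing by keyword makes the role of each argument explicit
r2_score(y_true=y_train, y_pred=y_tr_pred)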
Remember: `func?` in a code cell brings up the function's documentation.

Recall that when performing EDA, we imputed (filled in) some missing values in pandas. We can impute missing values using scikit-learn as well, but we must learn the imputation values from the train split alone and then apply them to the test split, so that we can assess how well our imputation works on unseen data.
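As a minimal sketch of that learn-on-train, apply-to-test pattern using scikit-learn's `SimpleImputer` (we return to this properly when building pipelines below):

# Learn the imputation values (medians) from the training split only...
imputer = SimpleImputer(strategy='median')
imputer.fit(X_train)
# ...then apply those same learned values to both splits
X_tr_imputed = imputer.transform(X_train)
X_te_imputed = imputer.transform(X_test)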
We have missing values. Recall from our data exploration that many distributions were skewed. Our first thought might be to impute missing values using the median.
# These are the values we'll use to fill in any missing values
X_defaults_median = X_train.median()
X_defaults_median
summit_elev                           2175.000000
vertical_drop                          750.000000
base_elev                             1280.000000
trams                                    0.000000
fastSixes                                0.000000
fastQuads                                0.000000
quad                                     0.000000
triple                                   1.000000
double                                   1.000000
surface                                  2.000000
total_chairs                             6.000000
Runs                                    29.000000
TerrainParks                             2.000000
LongestRun_mi                            1.000000
SkiableTerrain_ac                      170.000000
Snow Making_ac                          96.500000
daysOpenLastYear                       107.000000
yearsOpen                               57.000000
averageSnowfall                        120.000000
projectedDaysOpen                      112.000000
NightSkiing_ac                          70.000000
resorts_per_state                       15.000000
resorts_per_100kcapita                   0.248243
resorts_per_100ksq_mile                 24.428973
resort_skiable_area_ac_state_ratio       0.050000
resort_days_open_state_ratio             0.070595
resort_terrain_park_state_ratio          0.069444
resort_night_skiing_state_ratio          0.066804
total_chairs_runs_ratio                  0.200000
total_chairs_skiable_ratio               0.040323
fastQuads_runs_ratio                     0.000000
fastQuads_skiable_ratio                  0.000000
dtype: float64
X_tr = X_train.fillna(X_defaults_median)
X_te = X_test.fillna(X_defaults_median)
As we have features measured in many different units, with numbers that vary by orders of magnitude, we'll start by scaling them to put them all on a consistent scale. The `StandardScaler` scales each feature to zero mean and unit variance.
# Call the StandardScaler's `fit()` method on `X_tr` to fit the scaler,
# then use its `transform()` method to apply the scaling to both the train and test split
# data (`X_tr` and `X_te`), naming the results `X_tr_scaled` and `X_te_scaled`, respectively
scaler = StandardScaler()
scaler.fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
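As a quick optional check (assuming no zero-variance columns), the scaled training features should now have roughly zero mean and unit variance:

# Each scaled training feature should have ~0 mean and ~1 standard deviation
print(np.allclose(X_tr_scaled.mean(axis=0), 0), np.allclose(X_tr_scaled.std(axis=0), 1))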
lm = LinearRegression().fit(X_tr_scaled, y_train)
#Call the `predict()` method of the model (`lm`) on both the (scaled) train and test data
#Assign the predictions to `y_tr_pred` and `y_te_pred`, respectively
y_tr_pred = lm.predict(X_tr_scaled)
y_te_pred = lm.predict(X_te_scaled)
# r^2 - train, test
median_r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
median_r2
(0.8237204449411376, 0.7251410286259974)
Recall that we estimated ticket prices by simply using a known average. As expected, this produced an $R^2$ of zero for both the training and test sets, because $R^2$ tells us how much of the variance we're explaining beyond what we'd get by using just the mean. Here, we see that our simple linear regression model explains over 80% of the variance on the train set and over 70% on the test set. Clearly we are onto something, although the much lower value for the test set is indicative of overfitting. This isn't a surprise, as we've made no effort to select a parsimonious set of features or deal with multicollinearity in our data.
# Now calculate the mean absolute error scores using `sklearn`'s `mean_absolute_error` function as we did above for R^2
# MAE - train, test
median_mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
median_mae
(8.495768235382354, 9.696652536263656)
Using this model, then, on average we'd expect to estimate a ticket price within \$9 or so of the real price. This is much, much better than the \$19 from just guessing using the average. There may be something to this machine learning lark after all!
# And also do the same using `sklearn`'s `mean_squared_error`
# MSE - train, test
median_mse = mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
median_mse
(108.75554891895914, 157.57287631288543)
We chose the median for filling missing values because of the skew of many of our predictor feature distributions; now let's try the mean.
# As we did for the median above, calculate mean values for imputing missing values
# These are the values we'll use to fill in any missing values
X_defaults_mean = X_train.mean()
X_defaults_mean
summit_elev                           4042.036269
vertical_drop                         1057.264249
base_elev                             2975.487047
trams                                    0.103627
fastSixes                                0.093264
fastQuads                                0.673575
quad                                     0.948187
triple                                   1.414508
double                                   1.746114
surface                                  2.476684
total_chairs                             7.455959
Runs                                    41.387435
TerrainParks                             2.447205
LongestRun_mi                            1.301579
SkiableTerrain_ac                      458.691099
Snow Making_ac                         128.935294
daysOpenLastYear                       109.761290
yearsOpen                               56.895833
averageSnowfall                        160.112903
projectedDaysOpen                      114.900621
NightSkiing_ac                          84.843478
resorts_per_state                       16.523316
resorts_per_100kcapita                   0.442984
resorts_per_100ksq_mile                 42.862331
resort_skiable_area_ac_state_ratio       0.096680
resort_days_open_state_ratio             0.121639
resort_terrain_park_state_ratio          0.113116
resort_night_skiing_state_ratio          0.150272
total_chairs_runs_ratio                  0.266321
total_chairs_skiable_ratio               0.070053
fastQuads_runs_ratio                     0.010619
fastQuads_skiable_ratio                  0.001700
dtype: float64
By eye, we can immediately tell that our replacement values are much higher than those from using the median.
X_tr = X_train.fillna(X_defaults_mean)
X_te = X_test.fillna(X_defaults_mean)
scaler = StandardScaler()
scaler.fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
lm = LinearRegression().fit(X_tr_scaled, y_train)
y_tr_pred = lm.predict(X_tr_scaled)
y_te_pred = lm.predict(X_te_scaled)
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.8221207605475709, 0.7290195691422242)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(8.510780313012354, 9.565093916371973)
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
(109.74247309324214, 155.34936226136)
These results don't seem very different from those we got when using the median for imputing missing values. Perhaps it doesn't make much difference here. Maybe our overfitting is worse than we thought. Maybe other feature transformations, such as taking the log, would help. We could also try a subset of features rather than using all of them as inputs.
To perform the median/mean comparison, we copied and pasted a lot of code just to change the function for imputing missing values. It would make more sense to write a function that performed the sequence of steps: impute missing values, scale the features, train the model, and assess performance.
These are common steps, and `sklearn` provides something much better than writing custom functions. One of the most important and useful components of `sklearn` is the pipeline. In place of pandas's `fillna` DataFrame method, there is `sklearn`'s `SimpleImputer`. Remember that the first linear model above performed the steps: impute missing values with the median, scale the features, and fit a linear regression, and all of these steps were learned from the train split and then applied to the test split for assessment.
The pipeline below defines exactly those same steps. Crucially, the resultant `Pipeline` object has a `fit()` method and a `predict()` method, just like the `LinearRegression()` object itself. Just as we might create a linear regression model and train it with `.fit()` and predict with `.predict()`, we can wrap the entire process of imputing, feature scaling, and regression in a single object that we can train with `.fit()` and predict with `.predict()`. And that's basically a pipeline: a model on steroids.
pipe = make_pipeline(
SimpleImputer(strategy='median'),
StandardScaler(),
LinearRegression()
)
type(pipe)
sklearn.pipeline.Pipeline
hasattr(pipe, 'fit'), hasattr(pipe, 'predict')
(True, True)
Here, a single call to the pipeline's `fit()` method combines the steps of learning the imputation (determining what values to use to fill the missing ones), the scaling (determining the mean to subtract and the variance to divide by), and then training the model. It does this all in the one call with the training data as arguments.
# Call the pipe's `fit()` method with `X_train` and `y_train` as arguments
pipe.fit(X_train, y_train)
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('linearregression', LinearRegression())])
y_tr_pred = pipe.predict(X_train)
y_te_pred = pipe.predict(X_test)
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.8237204449411376, 0.7251410286259974)
And compare with our earlier (non-pipeline) result:
median_r2
(0.8237204449411376, 0.7251410286259974)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(8.495768235382354, 9.696652536263656)
Compare with our earlier result:
median_mae
(8.495768235382354, 9.696652536263656)
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
(108.75554891895914, 157.57287631288543)
Compare with our earlier result:
median_mse
(108.75554891895914, 157.57287631288543)
These results confirm the pipeline is doing exactly what's expected, and the results are identical to our earlier steps. This allows us to move faster, but with confidence.
We suspected the model was overfitting. This is no real surprise given the number of features we blindly used. It's likely that a judicious subset of features would generalize better. `sklearn` has a number of feature selection functions available. The one we'll use here is `SelectKBest`, which, as we might guess, selects the k best features (see the scikit-learn documentation for details). `f_regression` is simply the score function we're using because we're performing regression; it's important to choose a score function appropriate for the machine learning task.
Redefine our pipeline to include this feature selection step:
pipe = make_pipeline(
SimpleImputer(strategy='median'),
StandardScaler(),
SelectKBest(score_func=f_regression),
LinearRegression()
)
pipe.fit(X_train, y_train)
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('selectkbest', SelectKBest(score_func=<function f_regression at 0x7fd9b00e3ca0>)), ('linearregression', LinearRegression())])
y_tr_pred = pipe.predict(X_train)
y_te_pred = pipe.predict(X_test)
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.760478965339582, 0.681569974499793)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(9.757130441228263, 10.585905291034962)
This has made things worse! Clearly, selecting a subset of features has an impact on performance. `SelectKBest` defaults to k=10. Let's create a new pipeline with a different value of k:
# Modify the `SelectKBest` step to use a value of 15 for k
pipe15 = make_pipeline(
SimpleImputer(strategy='median'),
StandardScaler(),
SelectKBest(score_func=f_regression, k=15),
LinearRegression()
)
pipe15.fit(X_train, y_train)
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('selectkbest', SelectKBest(k=15, score_func=<function f_regression at 0x7fd9b00e3ca0>)), ('linearregression', LinearRegression())])
y_tr_pred = pipe15.predict(X_train)
y_te_pred = pipe15.predict(X_test)
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.7922946911681397, 0.66079117939879)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(9.214834764542976, 10.496823817105572)
We could keep going, trying different values of k, training a model, measuring performance on the test set, and then picking the model with the best test set performance. There's a fundamental problem with this approach: we'd be tuning the model to our arbitrary test set! If we continue this way, we'll end up with a model that works well on the particular quirks of our test set but fails to generalize to new data. The whole point of keeping a test set is for it to be a set of unseen data on which to assess expected future performance.
The way around this is a technique called cross-validation. We partition the training set into k folds, train our model on k-1 of those folds, and calculate performance on the fold not used in training. This procedure then cycles through k times with a different fold held back each time. Thus we end up building k models on k sets of data with k estimates of how the model performs on unseen data but without having to touch the test set.
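As a rough sketch of what `cross_validate` (used in the next cell) does under the hood, this is illustrative only, not the exact implementation:

from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)  # for regression, cv=5 means 5 non-shuffled folds
fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    fold_model = clone(pipe15)  # a fresh, unfitted copy of the pipeline for each fold
    fold_model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    fold_scores.append(fold_model.score(X_train.iloc[val_idx], y_train.iloc[val_idx]))
fold_scores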
# Run 5-Fold Cross validation
cv_results = cross_validate(pipe15, X_train, y_train, cv=5)
# get scores
cv_scores = cv_results['test_score']
cv_scores
array([0.60510478, 0.67731713, 0.75047442, 0.58935004, 0.50041885])
Without using the same random state for initializing the CV folds, your exact numbers may differ.
np.mean(cv_scores), np.std(cv_scores)
(0.6245330431201284, 0.08445948393083175)
These results highlight that assessing model performance is inherently open to variability. We'll get different results depending on the quirks of which points end up in which fold. An advantage of cross-validation is that we also obtain an estimate of the variability, or uncertainty, in our performance estimate.
np.round((np.mean(cv_scores) - 2 * np.std(cv_scores), np.mean(cv_scores) + 2 * np.std(cv_scores)), 2)
array([0.46, 0.79])
Pulling the above together, we now have a modeling pipeline and cross-validation as a way to assess its performance without touching the test split.
Now we'll use cross-validation for multiple values of k and pick the value that gives the best performance. `make_pipeline` automatically names each step in lowercase, and parameters of each step are then accessed by appending a double underscore followed by the parameter name. We know the name of the step will be 'selectkbest', and we know the parameter is 'k'.
We can also list the names of all the parameters in a pipeline as follows:
# Call `pipe`'s `get_params()` method to get a dict of available parameters and print their names
# using dict's `keys()` method
pipe.get_params().keys()
dict_keys(['memory', 'steps', 'verbose', 'simpleimputer', 'standardscaler', 'selectkbest', 'linearregression', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'selectkbest__k', 'selectkbest__score_func', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize', 'linearregression__positive'])
The above can be particularly useful as our pipelines become more complex (we can even nest pipelines within pipelines).
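For example (a quick illustration, not a required step), nested parameters follow the pattern step name, double underscore, parameter name:

# Look up the current value of k in the pipeline's SelectKBest step
pipe.get_params()['selectkbest__k']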
k = [k+1 for k in range(len(X_train.columns))]
grid_params = {'selectkbest__k': k}
Now we have a range of `k` values to investigate. Is 1 feature best? 2? 3? 4? All of them? We could write a for loop to iterate over each possible value, doing all the housekeeping ourselves to track the best value of k. But this is a common task, so `sklearn` has a built-in tool for it: `GridSearchCV`. This takes the pipeline object (in fact, it takes anything with `fit()` and `predict()` methods). In simple cases with no feature selection, imputation, or feature scaling, we may see the classifier or regressor object itself passed directly into `GridSearchCV`. The other key input is the set of parameters and values to search over. Optional arguments include the cross-validation strategy and the number of CPUs to use.
lr_grid_cv = GridSearchCV(pipe, param_grid=grid_params, cv=5, n_jobs=-1)
lr_grid_cv.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('selectkbest', SelectKBest(score_func=<function f_regression at 0x7fd9b00e3ca0>)), ('linearregression', LinearRegression())]), n_jobs=-1, param_grid={'selectkbest__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, ...]})
score_mean = lr_grid_cv.cv_results_['mean_test_score']
score_std = lr_grid_cv.cv_results_['std_test_score']
cv_k = [k for k in lr_grid_cv.cv_results_['param_selectkbest__k']]
# Print the `best_params_` attribute of `lr_grid_cv`
lr_grid_cv.best_params_
{'selectkbest__k': 6}
# Assign the value of k from the above dict of `best_params_` and assign it to `best_k`
best_k = lr_grid_cv.best_params_['selectkbest__k']
plt.subplots(figsize=(10, 5))
plt.errorbar(cv_k, score_mean, yerr=score_std)
plt.axvline(x=best_k, c='r', ls='--', alpha=.5)
plt.xlabel('k')
plt.ylabel('CV score (r-squared)')
plt.title('Pipeline mean CV score (error bars +/- 1sd)');
The above suggests a good value for k is 6, consistent with the grid search's `best_params_`. There was an initial rapid increase with k, followed by a slow decline. Also noticeable is that the variance of the results increases for larger values of k. As we increasingly overfit, we can expect greater swings in performance as different points move in and out of the training and validation folds.
Which features were most useful? Let's step into our best model, shown below. Starting with the fitted grid search object, we get the best estimator, then the named step 'selectkbest', whose `get_support()` method gives a logical mask of the selected features.
selected = lr_grid_cv.best_estimator_.named_steps.selectkbest.get_support()
Similarly, instead of the 'selectkbest' named step, we can access the named step for the linear regression model and, from that, grab the model coefficients via its `coef_` attribute:
# Get the linear model coefficients from the `coef_` attribute and store in `coefs`,
# get the matching feature names from the column names of the dataframe,
# and display the results as a pandas Series with `coefs` as the values and `features` as the index,
# sorting the values in descending order
coefs = lr_grid_cv.best_estimator_.named_steps.linearregression.coef_
features = X_train.columns[selected]
pd.Series(coefs, index=features).sort_values(ascending=False)
vertical_drop        10.333065
Snow Making_ac        6.376653
fastQuads             4.980331
total_chairs          3.416636
Runs                  3.233735
SkiableTerrain_ac    -3.259420
dtype: float64
These results suggest that vertical drop is our biggest positive feature. This makes intuitive sense and is consistent with what we saw during the EDA work. We also see that the area covered by snow-making equipment is a strong positive; people like guaranteed skiing! The skiable terrain area, however, is negatively associated with ticket price. This seems odd: will people pay less for larger resorts? There could be all manner of reasons for this. It could be an effect whereby larger resorts can host more visitors at any one time and so can charge less per ticket.
As mentioned previously, the data are missing information about visitor numbers. Bear in mind, too, that the coefficient for skiable terrain is negative in the context of this particular model. For example, if we kept the total number of chairs and fastQuads constant but increased the skiable terrain, we might imagine the resort is worse off because its chairlift capacity is stretched more thinly.
A model that can work very well in a lot of cases is the random forest. For regression, this is provided by `sklearn`'s `RandomForestRegressor` class. It's also time to stop the bad practice of repeatedly checking performance on the test split. Instead, we'll go straight from defining the pipeline to assessing performance using cross-validation; `cross_validate` performs the fitting as part of the process. This first pass uses the default settings for the random forest, and we'll then investigate some different hyperparameters.
# Define a pipeline comprising the steps:
# SimpleImputer() with a strategy of 'median'
# StandardScaler(),
# and then RandomForestRegressor() with a random state of 47
RF_pipe = make_pipeline(
SimpleImputer(strategy='median'),
StandardScaler(),
RandomForestRegressor(random_state=47)
)
# Call `cross_validate` to estimate the pipeline's performance.
# Pass it the random forest pipe object, `X_train` and `y_train`,
# and get it to use 5-fold cross-validation
rf_default_cv_results = cross_validate(RF_pipe, X_train, y_train, cv=5)
rf_cv_scores = rf_default_cv_results['test_score']
rf_cv_scores
array([0.6711019 , 0.78433505, 0.75960138, 0.59978811, 0.56699816])
np.mean(rf_cv_scores), np.std(rf_cv_scores)
(0.6763649193918256, 0.08536820259910885)
Random forest has a number of hyperparameters that can be explored; here, however, we'll limit ourselves to exploring some different values for the number of trees. We'll try it with and without feature scaling, and try both the mean and the median as strategies for imputing missing values.
n_est = [int(n) for n in np.logspace(start=1, stop=3, num=20)]
grid_params = {
'randomforestregressor__n_estimators': n_est,
'standardscaler': [StandardScaler(), None],
'simpleimputer__strategy': ['mean', 'median']
}
grid_params
{'randomforestregressor__n_estimators': [10, 12, 16, 20, 26, 33, 42, 54, 69, 88, 112, 143, 183, 233, 297, 379, 483, 615, 784, 1000], 'standardscaler': [StandardScaler(), None], 'simpleimputer__strategy': ['mean', 'median']}
# Call `GridSearchCV` with the random forest pipeline, passing in the above `grid_params`
# dict for parameters to evaluate, 5-fold cross-validation, and all available CPU cores (if desired)
rf_grid_cv = GridSearchCV(RF_pipe, param_grid=grid_params, cv=5, n_jobs=-1)
# Now call the `GridSearchCV`'s `fit()` method with `X_train` and `y_train` as arguments
# to actually start the grid search. This may take a minute or two.
rf_grid_cv.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('randomforestregressor', RandomForestRegressor(random_state=47))]), n_jobs=-1, param_grid={'randomforestregressor__n_estimators': [10, 12, 16, 20, 26, 33, 42, 54, 69, 88, 112, 143, 183, 233, 297, 379, 483, 615, 784, 1000], 'simpleimputer__strategy': ['mean', 'median'], 'standardscaler': [StandardScaler(), None]})
# Print the best params (`best_params_` attribute) from the grid search
rf_grid_cv.best_params_
{'randomforestregressor__n_estimators': 233, 'simpleimputer__strategy': 'mean', 'standardscaler': None}
It looks like imputing with the mean gives a slight edge over the median here, and scaling the features doesn't help (the best parameters use no scaler).
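If we want to see how each combination fared rather than just the winner, the grid search's `cv_results_` can be put into a DataFrame (an optional check):

# Tabulate the grid search results and sort by mean cross-validation score
rf_cv_df = pd.DataFrame(rf_grid_cv.cv_results_)
cols = ['param_simpleimputer__strategy', 'param_standardscaler',
        'param_randomforestregressor__n_estimators', 'mean_test_score']
rf_cv_df[cols].sort_values('mean_test_score', ascending=False).head()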
rf_best_cv_results = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train, cv=5)
rf_best_scores = rf_best_cv_results['test_score']
rf_best_scores
array([0.67199826, 0.78788539, 0.7776494 , 0.61844222, 0.60114645])
np.mean(rf_best_scores), np.std(rf_best_scores)
(0.691424343745292, 0.07822193232981417)
We've marginally improved upon the default CV results. Random forest has many more hyperparameters we could tune, but we won't dive into that here.
# Plot a barplot of the random forest's feature importances,
# assigning the `feature_importances_` attribute of
# `rf_grid_cv.best_estimator_.named_steps.randomforestregressor` to the name `imps` to then
# create a pandas Series object of the feature importances, with the index given by the
# training data column names, sorting the values in descending order
plt.subplots(figsize=(10, 5))
imps = rf_grid_cv.best_estimator_.named_steps.randomforestregressor.feature_importances_
rf_feat_imps = pd.Series(imps, index=X_train.columns).sort_values(ascending=False)
rf_feat_imps.plot(kind='bar')
plt.xlabel('features')
plt.ylabel('importance')
plt.title('Best random forest regressor feature importances');
Encouragingly, the dominant top four features are in common with our linear model.
Time to select our final model to use for further business modeling! It would be good to revisit the above model selection; there is undoubtedly more that could be done to explore possible hyperparameters. It would also be worthwhile to investigate removing the least useful features. Gathering, calculating, and storing features adds business cost and dependencies, so if features genuinely are not needed they should be removed. Building a simpler model with fewer features can also have the advantage of being easier to sell (and/or explain) to stakeholders. There certainly seem to be four strong features here, and a model using only those would probably work well. However, we want to explore some scenarios in which other features vary, so we'll keep the fuller model for now. The business is waiting for this model, and we have something we're confident performs much better than guessing with the average price.
Or, rather, we have two "somethings": a best linear model and a best random forest model. We need to finally choose between them. We can calculate the mean absolute error using cross-validation; although `cross_validate` defaults to the $R^2$ metric for scoring regression, we can specify the mean absolute error as an alternative via its `scoring` parameter.
# 'neg_mean_absolute_error' uses the (negative of) the mean absolute error
lr_neg_mae = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train,
scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
lr_mae_mean = np.mean(-1 * lr_neg_mae['test_score'])
lr_mae_std = np.std(-1 * lr_neg_mae['test_score'])
lr_mae_mean, lr_mae_std
(10.891687453692906, 1.867599419260583)
mean_absolute_error(y_test, lr_grid_cv.best_estimator_.predict(X_test))
10.314388905802957
rf_neg_mae = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train,
scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
rf_mae_mean = np.mean(-1 * rf_neg_mae['test_score'])
rf_mae_std = np.std(-1 * rf_neg_mae['test_score'])
rf_mae_mean, rf_mae_std
(9.829508627130718, 1.0814411685916026)
mean_absolute_error(y_test, rf_grid_cv.best_estimator_.predict(X_test))
9.539958614347025
The random forest model has a lower cross-validation mean absolute error by about \$1. It also exhibits less variability. Verifying performance on the test set produces results consistent with the cross-validation estimates.
Finally, we need to advise the business whether it should undertake further data collection. Would more data be useful? We're often led to believe more data is always good, but gathering data invariably has a cost associated with it. We can assess this trade-off by seeing how performance varies with differing data set sizes; the `learning_curve` function does this conveniently.
fractions = [.2, .25, .3, .35, .4, .45, .5, .6, .75, .8, 1.0]
train_size, train_scores, test_scores = learning_curve(pipe, X_train, y_train, train_sizes=fractions)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.subplots(figsize=(10, 5))
plt.errorbar(train_size, test_scores_mean, yerr=test_scores_std)
plt.xlabel('Training set size')
plt.ylabel('CV scores')
plt.title('Cross-validation score as training set size increases');
This shows that we seem to have plenty of data. There's an initial rapid improvement in model scores, as we'd expect, but the curve has essentially levelled off by a training set size of around 40-50.
best_model = rf_grid_cv.best_estimator_
best_model.version = '1.0'
best_model.pandas_version = pd.__version__
best_model.numpy_version = np.__version__
best_model.sklearn_version = sklearn_version
best_model.X_columns = [col for col in X_train.columns]
best_model.build_datetime = datetime.datetime.now()
# save the model
modelpath = '../models'
save_file(best_model, 'ski_resort_pricing_model.pkl', modelpath)
A file already exists with this name. Do you want to overwrite? (Y/N)Y Writing file. "../models/ski_resort_pricing_model.pkl"
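Assuming `save_file` (a helper from our own library) pickles the model to the path shown above, the saved model can be reloaded later for the business modeling work, for example:

# Reload the saved model; the filename and path mirror the `save_file` call above
with open(os.path.join(modelpath, 'ski_resort_pricing_model.pkl'), 'rb') as f:
    loaded_model = pickle.load(f)
loaded_model.version, loaded_model.build_datetime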