In preceding notebooks, we performed preliminary assessments of data quality and refined the question to be answered. We found a small number of observations that clearly indicated whether to replace values or drop a whole row. We determined that predicting the adult weekend ticket price was our primary aim. We threw away records with missing price data, but not before making the most of the other available data to look for any patterns among the states. We didn't see any and decided to treat all states equally; the state label didn't seem to be particularly useful.
In this notebook, we'll start to build machine learning models. Before diving into a machine learning model, however, we'll consider how useful the mean value is as a predictor. We never want to go to stakeholders with a machine learning model only to have the CEO point out that it performs worse than just guessing the average! Our first model is always a baseline performance comparator for any subsequent model. Next, we'll build up a process for efficiently creating robust models to compare against that baseline. We can validate steps with our own functions that check expected equivalences between, say, pandas and sklearn implementations.
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
from library.sb_utils import save_file
ski_data = pd.read_csv('../data/ski_data_step3_features.csv')
ski_data.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
Name | Alyeska Resort | Eaglecrest Ski Area | Hilltop Ski Area | Arizona Snowbowl | Sunrise Park Resort |
Region | Alaska | Alaska | Alaska | Arizona | Arizona |
state | Alaska | Alaska | Alaska | Arizona | Arizona |
summit_elev | 3939 | 2600 | 2090 | 11500 | 11100 |
vertical_drop | 2500 | 1540 | 294 | 2300 | 1800 |
base_elev | 250 | 1200 | 1796 | 9200 | 9200 |
trams | 1 | 0 | 0 | 0 | 0 |
fastSixes | 0 | 0 | 0 | 1 | 0 |
fastQuads | 2 | 0 | 0 | 0 | 1 |
quad | 2 | 0 | 0 | 2 | 2 |
triple | 0 | 0 | 1 | 2 | 3 |
double | 0 | 4 | 0 | 1 | 1 |
surface | 2 | 0 | 2 | 2 | 0 |
total_chairs | 7 | 4 | 3 | 8 | 7 |
Runs | 76.0 | 36.0 | 13.0 | 55.0 | 65.0 |
TerrainParks | 2.0 | 1.0 | 1.0 | 4.0 | 2.0 |
LongestRun_mi | 1.0 | 2.0 | 1.0 | 2.0 | 1.2 |
SkiableTerrain_ac | 1610.0 | 640.0 | 30.0 | 777.0 | 800.0 |
Snow Making_ac | 113.0 | 60.0 | 30.0 | 104.0 | 80.0 |
daysOpenLastYear | 150.0 | 45.0 | 150.0 | 122.0 | 115.0 |
yearsOpen | 60.0 | 44.0 | 36.0 | 81.0 | 49.0 |
averageSnowfall | 669.0 | 350.0 | 69.0 | 260.0 | 250.0 |
AdultWeekend | 85.0 | 53.0 | 34.0 | 89.0 | 78.0 |
projectedDaysOpen | 150.0 | 90.0 | 152.0 | 122.0 | 104.0 |
NightSkiing_ac | 550.0 | NaN | 30.0 | NaN | 80.0 |
resorts_per_state | 3 | 3 | 3 | 2 | 2 |
resorts_per_100kcapita | 0.410091 | 0.410091 | 0.410091 | 0.027477 | 0.027477 |
resorts_per_100ksq_mile | 0.450867 | 0.450867 | 0.450867 | 1.75454 | 1.75454 |
resort_skiable_area_ac_state_ratio | 0.70614 | 0.280702 | 0.013158 | 0.492708 | 0.507292 |
resort_days_open_state_ratio | 0.434783 | 0.130435 | 0.434783 | 0.514768 | 0.485232 |
resort_terrain_park_state_ratio | 0.5 | 0.25 | 0.25 | 0.666667 | 0.333333 |
resort_night_skiing_state_ratio | 0.948276 | NaN | 0.051724 | NaN | 1.0 |
total_chairs_runs_ratio | 0.092105 | 0.111111 | 0.230769 | 0.145455 | 0.107692 |
total_chairs_skiable_ratio | 0.004348 | 0.00625 | 0.1 | 0.010296 | 0.00875 |
fastQuads_runs_ratio | 0.026316 | 0.0 | 0.0 | 0.0 | 0.015385 |
fastQuads_skiable_ratio | 0.001242 | 0.0 | 0.0 | 0.0 | 0.00125 |
Big Mountain is our resort. Separate it from the rest of the data to use later.
big_mountain = ski_data[ski_data.Name == 'Big Mountain Resort']
big_mountain.T
124 | |
---|---|
Name | Big Mountain Resort |
Region | Montana |
state | Montana |
summit_elev | 6817 |
vertical_drop | 2353 |
base_elev | 4464 |
trams | 0 |
fastSixes | 0 |
fastQuads | 3 |
quad | 2 |
triple | 6 |
double | 0 |
surface | 3 |
total_chairs | 14 |
Runs | 105.0 |
TerrainParks | 4.0 |
LongestRun_mi | 3.3 |
SkiableTerrain_ac | 3000.0 |
Snow Making_ac | 600.0 |
daysOpenLastYear | 123.0 |
yearsOpen | 72.0 |
averageSnowfall | 333.0 |
AdultWeekend | 81.0 |
projectedDaysOpen | 123.0 |
NightSkiing_ac | 600.0 |
resorts_per_state | 12 |
resorts_per_100kcapita | 1.122778 |
resorts_per_100ksq_mile | 8.161045 |
resort_skiable_area_ac_state_ratio | 0.140121 |
resort_days_open_state_ratio | 0.129338 |
resort_terrain_park_state_ratio | 0.148148 |
resort_night_skiing_state_ratio | 0.84507 |
total_chairs_runs_ratio | 0.133333 |
total_chairs_skiable_ratio | 0.004667 |
fastQuads_runs_ratio | 0.028571 |
fastQuads_skiable_ratio | 0.001 |
ski_data.shape
(278, 36)
ski_data = ski_data[ski_data.Name != 'Big Mountain Resort']
ski_data.shape
(277, 36)
So far, we've treated the ski resort data as a single entity. In machine learning, if we train our model on all of our data, we end up with no data set aside to evaluate model performance. We could keep building more and more complex models that fit the data better and better without realising we were overfitting. By partitioning the data into training and testing splits, without letting the model (or any missing-value imputation) learn anything about the test split, we get a somewhat independent assessment of how our model might perform in the future. An often overlooked subtlety is that people all too frequently use the test set to assess the performance of multiple models and then pick the best one; once the test set has influenced model selection, it no longer gives an unbiased estimate of future performance. Instead, we use held-out data and/or k-fold cross-validation within the training set to simulate additional test sets and compare models, keeping the formal test set as a final check on expected future performance.
What partition sizes would we have with a 70/30 train/test split?
len(ski_data) * .7, len(ski_data) * .3
(193.89999999999998, 83.1)
# Generate test and train sets for X and Y variables; set random state to get reproducible results
X_train, X_test, y_train, y_test = train_test_split(ski_data.drop(columns='AdultWeekend'),
ski_data.AdultWeekend, test_size=0.3,
random_state=47)
# Check the shapes of X train and test sets
X_train.shape, X_test.shape
((193, 35), (84, 35))
# Check the shapes of train Y train and test sets
y_train.shape, y_test.shape
((193,), (84,))
names_list = ['Name', 'state', 'Region']
names_train = X_train[names_list]
names_test = X_test[names_list]
X_train.drop(columns=names_list, inplace=True)
X_test.drop(columns=names_list, inplace=True)
X_train.shape, X_test.shape
((193, 32), (84, 32))
# Check the `dtypes` attribute of `X_train` to verify all features are numeric
X_train.dtypes
summit_elev                           int64
vertical_drop                         int64
base_elev                             int64
trams                                 int64
fastSixes                             int64
fastQuads                             int64
quad                                  int64
triple                                int64
double                                int64
surface                               int64
total_chairs                          int64
Runs                                  float64
TerrainParks                          float64
LongestRun_mi                         float64
SkiableTerrain_ac                     float64
Snow Making_ac                        float64
daysOpenLastYear                      float64
yearsOpen                             float64
averageSnowfall                       float64
projectedDaysOpen                     float64
NightSkiing_ac                        float64
resorts_per_state                     int64
resorts_per_100kcapita                float64
resorts_per_100ksq_mile               float64
resort_skiable_area_ac_state_ratio    float64
resort_days_open_state_ratio          float64
resort_terrain_park_state_ratio       float64
resort_night_skiing_state_ratio       float64
total_chairs_runs_ratio               float64
total_chairs_skiable_ratio            float64
fastQuads_runs_ratio                  float64
fastQuads_skiable_ratio               float64
dtype: object
# Repeat this check for the test split in `X_test`
X_test.dtypes
summit_elev                           int64
vertical_drop                         int64
base_elev                             int64
trams                                 int64
fastSixes                             int64
fastQuads                             int64
quad                                  int64
triple                                int64
double                                int64
surface                               int64
total_chairs                          int64
Runs                                  float64
TerrainParks                          float64
LongestRun_mi                         float64
SkiableTerrain_ac                     float64
Snow Making_ac                        float64
daysOpenLastYear                      float64
yearsOpen                             float64
averageSnowfall                       float64
projectedDaysOpen                     float64
NightSkiing_ac                        float64
resorts_per_state                     int64
resorts_per_100kcapita                float64
resorts_per_100ksq_mile               float64
resort_skiable_area_ac_state_ratio    float64
resort_days_open_state_ratio          float64
resort_terrain_park_state_ratio       float64
resort_night_skiing_state_ratio       float64
total_chairs_runs_ratio               float64
total_chairs_skiable_ratio            float64
fastQuads_runs_ratio                  float64
fastQuads_skiable_ratio               float64
dtype: object
We have only numeric features in X now!
We'll begin by determining how good the mean is as a predictor. In other words, what if we simply say our best guess is the average price?
# Calculate the mean of `y_train`
train_mean = y_train.mean()
train_mean
63.84730569948186
`sklearn`'s `DummyRegressor` easily does this:
# Sanity Check
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train, y_train)
dumb_reg.constant_
array([[63.8473057]])
Having established the grand mean, we need to determine how closely it matches, or explains, the actual values. There are many ways of assessing how well one set of values agrees with another, which brings us to the subject of metrics.
One measure is $R^2$, the coefficient of determination. This is a measure of the proportion of variance in the dependent variable (our ticket price) that is predicted by our "model". The Wikipedia article on the coefficient of determination gives a nice explanation of how negative values can arise; this is frequently a source of confusion for newcomers who, reasonably, ask how a squared value can be negative.
Recall the mean can be denoted by $\bar{y}$, where
$$\bar{y} = \frac{1}{n}\sum_{i=1}^ny_i$$

where $y_i$ are the individual values of the dependent variable.
The total sum of squares can be expressed as

$$SS_{tot} = \sum_i(y_i-\bar{y})^2$$

The above formula should be familiar: it's simply the variance without the denominator that scales (divides) by the sample size.
The residual sum of squares is similarly defined to be
$$SS_{res} = \sum_i(y_i-\hat{y}_i)^2$$

where $\hat{y}_i$ are our predicted values for the dependent variable.
The coefficient of determination, $R^2$, here is given by
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Putting it into words, it's one minus the ratio of the residual variance to the original variance. Thus, the baseline model here, which always predicts $\bar{y}$, should give $R^2=0$. A model that perfectly predicts the observed values would have no residual error and so give $R^2=1$. Models that do worse than predicting the mean will have increased the sum of squares of residuals and so produce a negative $R^2$.
# Calculate the R^2 as defined above
def r_squared(y, ypred):
    """R-squared score.

    Calculate the R-squared, or coefficient of determination, of the input.

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    ybar = np.mean(y)
    sum_sq_tot = np.sum((y - ybar)**2)
    sum_sq_res = np.sum((y - ypred)**2)
    R2 = 1.0 - sum_sq_res / sum_sq_tot
    return R2
We make our predictions by creating an array the length of the training set, filled with the single value of the mean.
y_tr_pred_ = train_mean * np.ones(len(y_train))
y_tr_pred_[:5]
array([63.8473057, 63.8473057, 63.8473057, 63.8473057, 63.8473057])
Remember `sklearn`'s dummy regressor?
y_tr_pred = dumb_reg.predict(X_train)
y_tr_pred[:5]
array([63.8473057, 63.8473057, 63.8473057, 63.8473057, 63.8473057])
We can see that `DummyRegressor` produces exactly the same results and saves us from having to broadcast the mean (or whichever other statistic we used; check out the documentation to see what's available) to an array of the appropriate length. It also gives us an object with `fit()` and `predict()` methods, so we can use it as conveniently as any other `sklearn` estimator.
r_squared(y_train, y_tr_pred)
0.0
Exactly as expected, if we use the average value as our prediction, we get an $R^2$ of zero on our training set. What if we use this "model" to predict unseen values from the test set? Remember, of course, that our "model" is trained on the training set; we still use the training set mean as our prediction.
Make predictions by creating an array the length of the test set, filled with the single value of the (training) mean.
y_te_pred = train_mean * np.ones(len(y_test))
r_squared(y_test, y_te_pred)
-0.0015364646867073173
Generally, we expect performance on a test set to be slightly worse than on the training set, and since we got an $R^2$ of zero on the training set, there's nowhere to go but negative!
While $R^2$ is a common metric and is interpretable in terms of the amount of variance explained, it's less helpful if we want an idea of how "close" our predictions are to the true values. Metrics that summarise the difference between predicted and actual values are the mean absolute error and the mean squared error.
This is very simply the average of the absolute errors:
$$MAE = \frac{1}{n}\sum_{i=1}^n|y_i - \hat{y}_i|$$

def mae(y, ypred):
    """Mean absolute error.

    Calculate the mean absolute error of the arguments

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    abs_error = np.abs(y - ypred)
    mae = np.mean(abs_error)
    return mae
mae(y_train, y_tr_pred)
18.149503610835193
mae(y_test, y_te_pred)
18.672179249938317
Mean absolute error is arguably the most intuitive of all the metrics; it essentially tells us that, on average, we might expect to be off by around \$19 if we guessed ticket price based on an average of known values.
Another common metric (and an important one internally for optimizing machine learning models) is the mean squared error. This is simply the average of the square of the errors:
$$MSE = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2$$

# Calculate the MSE as defined above
def mse(y, ypred):
    """Mean square error.

    Calculate the mean square error of the arguments

    Arguments:
    y -- the observed values
    ypred -- the predicted values
    """
    sq_error = (y - ypred)**2
    mse = np.mean(sq_error)
    return mse
mse(y_train, y_tr_pred)
616.9493046578431
mse(y_test, y_te_pred)
574.1671108060107
So here, we get a slightly better MSE on the test set than we did on the train set. And what does a squared error mean anyway? To convert this back to our measurement space, we often take the square root, to form the root mean square error thus:
np.sqrt([mse(y_train, y_tr_pred), mse(y_test, y_te_pred)])
array([24.83846422, 23.96178438])
Functions are good, but we don't want to define our own every time we want to assess performance. `sklearn.metrics` provides many commonly used metrics, including the ones above.
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.0, -0.0015364646867073173)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(18.149503610835193, 18.672179249938317)
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
(616.9493046578431, 574.1671108060107)
In a Jupyter code cell, running `r2_score?` will bring up the docstring for the function, and `r2_score??` will bring up the actual source code of the function! Here we try the functions out and compare the behaviour of `sklearn`'s function with ours.
# train set - sklearn
# correct order, incorrect order
r2_score(y_train, y_tr_pred), r2_score(y_tr_pred, y_train)
(0.0, -3.054984985780873e+30)
# test set - sklearn
# correct order, incorrect order
r2_score(y_test, y_te_pred), r2_score(y_te_pred, y_test)
(-0.0015364646867073173, -2.8431378228302645e+30)
# train set - using our homebrew function
# correct order, incorrect order
r_squared(y_train, y_tr_pred), r_squared(y_tr_pred, y_train)
(0.0, -3.054984985780873e+30)
# test set - using our homebrew function
# correct order, incorrect order
r_squared(y_test, y_te_pred), r_squared(y_te_pred, y_test)
(-0.0015364646867073173, -2.8431378228302645e+30)
We can get very different results by swapping the argument order, and it's worth highlighting because this mistake is made all too often in the real world! Frequently the argument order doesn't matter, but it will bite us when we use a function that does care. It's sloppy, bad practice, and if we don't make a habit of putting arguments in the right order, sooner or later we'll forget.
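As a small defensive habit (a suggestion, not part of the original workflow), we can pass the metric arguments by keyword so a swapped order becomes impossible:

# Passing by keyword makes the role of each argument explicit
r2_score(y_true=y_train, y_pred=y_tr_pred)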
Remember: `func?` in a code cell brings up the function's documentation.

Recall that when performing EDA, we imputed (filled in) some missing values in pandas. We can impute missing values using scikit-learn as well, but we must learn the imputation values from the train split alone and then apply them to the test split, so that we can assess how well our imputation works on unseen data.
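As a minimal sketch of that learn-on-train, apply-to-test pattern using scikit-learn's `SimpleImputer` (we return to this properly when building pipelines below):

# Learn the imputation values (medians) from the training split only...
imputer = SimpleImputer(strategy='median')
imputer.fit(X_train)
# ...then apply those same learned values to both splits
X_tr_imputed = imputer.transform(X_train)
X_te_imputed = imputer.transform(X_test)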
We have missing values. Recall from our data exploration that many distributions were skewed. Our first thought might be to impute missing values using the median.
# These are the values we'll use to fill in any missing values
X_defaults_median = X_train.median()
X_defaults_median
summit_elev                           2175.000000
vertical_drop                          750.000000
base_elev                             1280.000000
trams                                    0.000000
fastSixes                                0.000000
fastQuads                                0.000000
quad                                     0.000000
triple                                   1.000000
double                                   1.000000
surface                                  2.000000
total_chairs                             6.000000
Runs                                    29.000000
TerrainParks                             2.000000
LongestRun_mi                            1.000000
SkiableTerrain_ac                      170.000000
Snow Making_ac                          96.500000
daysOpenLastYear                       107.000000
yearsOpen                               57.000000
averageSnowfall                        120.000000
projectedDaysOpen                      112.000000
NightSkiing_ac                          70.000000
resorts_per_state                       15.000000
resorts_per_100kcapita                   0.248243
resorts_per_100ksq_mile                 24.428973
resort_skiable_area_ac_state_ratio       0.050000
resort_days_open_state_ratio             0.070595
resort_terrain_park_state_ratio          0.069444
resort_night_skiing_state_ratio          0.066804
total_chairs_runs_ratio                  0.200000
total_chairs_skiable_ratio               0.040323
fastQuads_runs_ratio                     0.000000
fastQuads_skiable_ratio                  0.000000
dtype: float64
X_tr = X_train.fillna(X_defaults_median)
X_te = X_test.fillna(X_defaults_median)
As we have features measured in many different units, with numbers that vary by orders of magnitude, we'll start by scaling them to put them all on a consistent scale. The `StandardScaler` scales each feature to zero mean and unit variance.
# Call the StandardScaler's `fit()` method on `X_tr` to fit the scaler,
# then use its `transform()` method to apply the scaling to both the train and test split
# data (`X_tr` and `X_te`), naming the results `X_tr_scaled` and `X_te_scaled`, respectively
scaler = StandardScaler()
scaler.fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
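As a quick optional check (assuming no zero-variance columns), the scaled training features should now have roughly zero mean and unit variance:

# Each scaled training feature should have ~0 mean and ~1 standard deviation
print(np.allclose(X_tr_scaled.mean(axis=0), 0), np.allclose(X_tr_scaled.std(axis=0), 1))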
lm = LinearRegression().fit(X_tr_scaled, y_train)
#Call the `predict()` method of the model (`lm`) on both the (scaled) train and test data
#Assign the predictions to `y_tr_pred` and `y_te_pred`, respectively
y_tr_pred = lm.predict(X_tr_scaled)
y_te_pred = lm.predict(X_te_scaled)
# r^2 - train, test
median_r2 = r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
median_r2
(0.8237204449411376, 0.7251410286259974)
Recall that we estimated ticket prices by simply using a known average. As expected, this produced an $R^2$ of zero for both the training and test sets, because $R^2$ tells us how much of the variance we're explaining beyond what we'd get by using just the mean. Here, we see that our simple linear regression model explains over 80% of the variance on the train set and over 70% on the test set. Clearly we are onto something, although the much lower value for the test set is indicative of overfitting. This isn't a surprise, as we've made no effort to select a parsimonious set of features or deal with multicollinearity in our data.
# Now calculate the mean absolute error scores using `sklearn`'s `mean_absolute_error` function as we did above for R^2
# MAE - train, test
median_mae = mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
median_mae
(8.495768235382354, 9.696652536263656)
Using this model, then, on average we'd expect to estimate a ticket price within \$9 or so of the real price. This is much, much better than the \$19 from just guessing using the average. There may be something to this machine learning lark after all!
# And also do the same using `sklearn`'s `mean_squared_error`
# MSE - train, test
median_mse = mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
median_mse
(108.75554891895914, 157.57287631288543)
We chose the median for filling missing values because of the skew of many of our predictor feature distributions; now let's try the mean.
# As we did for the median above, calculate mean values for imputing missing values
# These are the values we'll use to fill in any missing values
X_defaults_mean = X_train.mean()
X_defaults_mean
summit_elev                           4042.036269
vertical_drop                         1057.264249
base_elev                             2975.487047
trams                                    0.103627
fastSixes                                0.093264
fastQuads                                0.673575
quad                                     0.948187
triple                                   1.414508
double                                   1.746114
surface                                  2.476684
total_chairs                             7.455959
Runs                                    41.387435
TerrainParks                             2.447205
LongestRun_mi                            1.301579
SkiableTerrain_ac                      458.691099
Snow Making_ac                         128.935294
daysOpenLastYear                       109.761290
yearsOpen                               56.895833
averageSnowfall                        160.112903
projectedDaysOpen                      114.900621
NightSkiing_ac                          84.843478
resorts_per_state                       16.523316
resorts_per_100kcapita                   0.442984
resorts_per_100ksq_mile                 42.862331
resort_skiable_area_ac_state_ratio       0.096680
resort_days_open_state_ratio             0.121639
resort_terrain_park_state_ratio          0.113116
resort_night_skiing_state_ratio          0.150272
total_chairs_runs_ratio                  0.266321
total_chairs_skiable_ratio               0.070053
fastQuads_runs_ratio                     0.010619
fastQuads_skiable_ratio                  0.001700
dtype: float64
By eye, we can immediately tell that our replacement values are much higher than those from using the median.
X_tr = X_train.fillna(X_defaults_mean)
X_te = X_test.fillna(X_defaults_mean)
scaler = StandardScaler()
scaler.fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
lm = LinearRegression().fit(X_tr_scaled, y_train)
y_tr_pred = lm.predict(X_tr_scaled)
y_te_pred = lm.predict(X_te_scaled)
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.8221207605475709, 0.7290195691422242)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(8.510780313012354, 9.565093916371973)
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
(109.74247309324214, 155.34936226136)
These results don't seem very different from those we got when using the median for imputing missing values. Perhaps it doesn't make much difference here. Maybe our overfitting is worse than we thought. Maybe other feature transformations, such as taking the log, would help. We could also try a subset of features rather than using all of them as inputs.
To perform the median/mean comparison, we copied and pasted a lot of code just to change the function for imputing missing values. It would make more sense to write a function that performed the sequence of steps: impute missing values, scale the features, train the model, and assess performance.
These are common steps, and `sklearn` provides something much better than writing custom functions. One of the most important and useful components of `sklearn` is the pipeline. In place of pandas's `fillna` DataFrame method, there is `sklearn`'s `SimpleImputer`. Remember that the first linear model above performed the steps: impute missing values with the median, scale the features, and fit a linear regression, and all of these steps were learned from the train split and then applied to the test split for assessment.
The pipeline below defines exactly those same steps. Crucially, the resultant `Pipeline` object has a `fit()` method and a `predict()` method, just like the `LinearRegression()` object itself. Just as we might create a linear regression model and train it with `.fit()` and predict with `.predict()`, we can wrap the entire process of imputing, feature scaling, and regression in a single object that we can train with `.fit()` and predict with `.predict()`. And that's basically a pipeline: a model on steroids.
pipe = make_pipeline(
SimpleImputer(strategy='median'),
StandardScaler(),
LinearRegression()
)
type(pipe)
sklearn.pipeline.Pipeline
hasattr(pipe, 'fit'), hasattr(pipe, 'predict')
(True, True)
Here, a single call to the pipeline's `fit()` method combines the steps of learning the imputation (determining what values to use to fill the missing ones), the scaling (determining the mean to subtract and the variance to divide by), and then training the model. It does this all in the one call with the training data as arguments.
# Call the pipe's `fit()` method with `X_train` and `y_train` as arguments
pipe.fit(X_train, y_train)
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('linearregression', LinearRegression())])
y_tr_pred = pipe.predict(X_train)
y_te_pred = pipe.predict(X_test)
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.8237204449411376, 0.7251410286259974)
And compare with our earlier (non-pipeline) result:
median_r2
(0.8237204449411376, 0.7251410286259974)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(8.495768235382354, 9.696652536263656)
Compare with our earlier result:
median_mae
(8.495768235382354, 9.696652536263656)
mean_squared_error(y_train, y_tr_pred), mean_squared_error(y_test, y_te_pred)
(108.75554891895914, 157.57287631288543)
Compare with our earlier result:
median_mse
(108.75554891895914, 157.57287631288543)
These results confirm the pipeline is doing exactly what's expected, and the results are identical to our earlier steps. This allows us to move faster, but with confidence.
We suspected the model was overfitting. This is no real surprise given the number of features we blindly used. It's likely that a judicious subset of features would generalize better. `sklearn` has a number of feature selection functions available. The one we'll use here is `SelectKBest`, which, as we might guess, selects the k best features (see the scikit-learn documentation for details). `f_regression` is simply the score function we're using because we're performing regression; it's important to choose a score function appropriate for the machine learning task.
Redefine our pipeline to include this feature selection step:
pipe = make_pipeline(
SimpleImputer(strategy='median'),
StandardScaler(),
SelectKBest(score_func=f_regression),
LinearRegression()
)
pipe.fit(X_train, y_train)
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('selectkbest', SelectKBest(score_func=<function f_regression at 0x7fd9b00e3ca0>)), ('linearregression', LinearRegression())])
y_tr_pred = pipe.predict(X_train)
y_te_pred = pipe.predict(X_test)
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.760478965339582, 0.681569974499793)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(9.757130441228263, 10.585905291034962)
This has made things worse! Clearly, selecting a subset of features has an impact on performance. `SelectKBest` defaults to k=10. Let's create a new pipeline with a different value of k:
# Modify the `SelectKBest` step to use a value of 15 for k
pipe15 = make_pipeline(
SimpleImputer(strategy='median'),
StandardScaler(),
SelectKBest(score_func=f_regression, k=15),
LinearRegression()
)
pipe15.fit(X_train, y_train)
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('selectkbest', SelectKBest(k=15, score_func=<function f_regression at 0x7fd9b00e3ca0>)), ('linearregression', LinearRegression())])
y_tr_pred = pipe15.predict(X_train)
y_te_pred = pipe15.predict(X_test)
r2_score(y_train, y_tr_pred), r2_score(y_test, y_te_pred)
(0.7922946911681397, 0.66079117939879)
mean_absolute_error(y_train, y_tr_pred), mean_absolute_error(y_test, y_te_pred)
(9.214834764542976, 10.496823817105572)
We could keep going, trying different values of k, training a model, measuring performance on the test set, and then picking the model with the best test set performance. There's a fundamental problem with this approach: we'd be tuning the model to our arbitrary test set! If we continue this way, we'll end up with a model that works well on the particular quirks of our test set but fails to generalize to new data. The whole point of keeping a test set is for it to be a set of unseen data on which to assess expected future performance.
The way around this is a technique called cross-validation. We partition the training set into k folds, train our model on k-1 of those folds, and calculate performance on the fold not used in training. This procedure then cycles through k times with a different fold held back each time. Thus we end up building k models on k sets of data with k estimates of how the model performs on unseen data but without having to touch the test set.
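As a rough sketch of what `cross_validate` (used in the next cell) does under the hood, this is illustrative only, not the exact implementation:

from sklearn.base import clone
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)  # for regression, cv=5 means 5 non-shuffled folds
fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    fold_model = clone(pipe15)  # a fresh, unfitted copy of the pipeline for each fold
    fold_model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    fold_scores.append(fold_model.score(X_train.iloc[val_idx], y_train.iloc[val_idx]))
fold_scores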
# Run 5-Fold Cross validation
cv_results = cross_validate(pipe15, X_train, y_train, cv=5)
# get scores
cv_scores = cv_results['test_score']
cv_scores
array([0.60510478, 0.67731713, 0.75047442, 0.58935004, 0.50041885])
Without using the same random state for initializing the CV folds, your exact numbers may differ.
np.mean(cv_scores), np.std(cv_scores)
(0.6245330431201284, 0.08445948393083175)
These results highlight that assessing model performance is inherently open to variability. We'll get different results depending on the quirks of which points end up in which fold. An advantage of cross-validation is that we also obtain an estimate of the variability, or uncertainty, in our performance estimate.
np.round((np.mean(cv_scores) - 2 * np.std(cv_scores), np.mean(cv_scores) + 2 * np.std(cv_scores)), 2)
array([0.46, 0.79])
Pulling the above together, we now have a modeling pipeline and cross-validation as a way to assess its performance without touching the test split.
Now we'll use cross-validation for multiple values of k and pick the value that gives the best performance. `make_pipeline` automatically names each step in lowercase, and parameters of each step are then accessed by appending a double underscore followed by the parameter name. We know the name of the step will be 'selectkbest', and we know the parameter is 'k'.
We can also list the names of all the parameters in a pipeline as follows:
# Call `pipe`'s `get_params()` method to get a dict of available parameters and print their names
# using dict's `keys()` method
pipe.get_params().keys()
dict_keys(['memory', 'steps', 'verbose', 'simpleimputer', 'standardscaler', 'selectkbest', 'linearregression', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'selectkbest__k', 'selectkbest__score_func', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize', 'linearregression__positive'])
The above can be particularly useful as our pipelines become more complex (we can even nest pipelines within pipelines).
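For example (a quick illustration, not a required step), nested parameters follow the pattern step name, double underscore, parameter name:

# Look up the current value of k in the pipeline's SelectKBest step
pipe.get_params()['selectkbest__k']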
k = [k+1 for k in range(len(X_train.columns))]
grid_params = {'selectkbest__k': k}
Now we have a range of `k` values to investigate. Is 1 feature best? 2? 3? 4? All of them? We could write a for loop to iterate over each possible value, doing all the housekeeping ourselves to track the best value of k. But this is a common task, so `sklearn` has a built-in tool for it: `GridSearchCV`. This takes the pipeline object (in fact, it takes anything with `fit()` and `predict()` methods). In simple cases with no feature selection, imputation, or feature scaling, we may see the classifier or regressor object itself passed directly into `GridSearchCV`. The other key input is the set of parameters and values to search over. Optional arguments include the cross-validation strategy and the number of CPUs to use.
lr_grid_cv = GridSearchCV(pipe, param_grid=grid_params, cv=5, n_jobs=-1)
lr_grid_cv.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('selectkbest', SelectKBest(score_func=<function f_regression at 0x7fd9b00e3ca0>)), ('linearregression', LinearRegression())]), n_jobs=-1, param_grid={'selectkbest__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, ...]})
score_mean = lr_grid_cv.cv_results_['mean_test_score']
score_std = lr_grid_cv.cv_results_['std_test_score']
cv_k = [k for k in lr_grid_cv.cv_results_['param_selectkbest__k']]
# Print the `best_params_` attribute of `lr_grid_cv`
lr_grid_cv.best_params_
{'selectkbest__k': 6}
# Assign the value of k from the above dict of `best_params_` and assign it to `best_k`
best_k = lr_grid_cv.best_params_['selectkbest__k']
plt.subplots(figsize=(10, 5))
plt.errorbar(cv_k, score_mean, yerr=score_std)
plt.axvline(x=best_k, c='r', ls='--', alpha=.5)
plt.xlabel('k')
plt.ylabel('CV score (r-squared)')
plt.title('Pipeline mean CV score (error bars +/- 1sd)');
The above suggests a good value for k is 6, consistent with the grid search's `best_params_`. There was an initial rapid increase with k, followed by a slow decline. Also noticeable is that the variance of the results increases for larger values of k. As we increasingly overfit, we can expect greater swings in performance as different points move in and out of the training and validation folds.
Which features were most useful? Let's step into our best model, shown below. Starting with the fitted grid search object, we get the best estimator, then the named step 'selectkbest', whose `get_support()` method gives a logical mask of the selected features.
selected = lr_grid_cv.best_estimator_.named_steps.selectkbest.get_support()
Similarly, instead of the 'selectkbest' named step, we can access the named step for the linear regression model and, from that, grab the model coefficients via its `coef_` attribute:
# Get the linear model coefficients from the `coef_` attribute and store in `coefs`,
# get the matching feature names from the column names of the dataframe,
# and display the results as a pandas Series with `coefs` as the values and `features` as the index,
# sorting the values in descending order
coefs = lr_grid_cv.best_estimator_.named_steps.linearregression.coef_
features = X_train.columns[selected]
pd.Series(coefs, index=features).sort_values(ascending=False)
vertical_drop        10.333065
Snow Making_ac        6.376653
fastQuads             4.980331
total_chairs          3.416636
Runs                  3.233735
SkiableTerrain_ac    -3.259420
dtype: float64
These results suggest that vertical drop is our biggest positive feature. This makes intuitive sense and is consistent with what we saw during the EDA work. We also see that the area covered by snow-making equipment is a strong positive; people like guaranteed skiing! The skiable terrain area, however, is negatively associated with ticket price. This seems odd: will people pay less for larger resorts? There could be all manner of reasons for this. It could be an effect whereby larger resorts can host more visitors at any one time and so can charge less per ticket.
As mentioned previously, the data are missing information about visitor numbers. Bear in mind, too, that the coefficient for skiable terrain is negative in the context of this particular model. For example, if we kept the total number of chairs and fastQuads constant but increased the skiable terrain, we might imagine the resort is worse off because its chairlift capacity is stretched more thinly.
A model that can work very well in a lot of cases is the random forest. For regression, this is provided by `sklearn`'s `RandomForestRegressor` class. It's also time to stop the bad practice of repeatedly checking performance on the test split. Instead, we'll go straight from defining the pipeline to assessing performance using cross-validation; `cross_validate` performs the fitting as part of the process. This first pass uses the default settings for the random forest, and we'll then investigate some different hyperparameters.
# Define a pipeline comprising the steps:
# SimpleImputer() with a strategy of 'median'
# StandardScaler(),
# and then RandomForestRegressor() with a random state of 47
RF_pipe = make_pipeline(
SimpleImputer(strategy='median'),
StandardScaler(),
RandomForestRegressor(random_state=47)
)
# Call `cross_validate` to estimate the pipeline's performance.
# Pass it the random forest pipe object, `X_train` and `y_train`,
# and get it to use 5-fold cross-validation
rf_default_cv_results = cross_validate(RF_pipe, X_train, y_train, cv=5)
rf_cv_scores = rf_default_cv_results['test_score']
rf_cv_scores
array([0.6711019 , 0.78433505, 0.75960138, 0.59978811, 0.56699816])
np.mean(rf_cv_scores), np.std(rf_cv_scores)
(0.6763649193918256, 0.08536820259910885)
Random forest has a number of hyperparameters that can be explored; here, however, we'll limit ourselves to exploring some different values for the number of trees. We'll try it with and without feature scaling, and try both the mean and the median as strategies for imputing missing values.
n_est = [int(n) for n in np.logspace(start=1, stop=3, num=20)]
grid_params = {
'randomforestregressor__n_estimators': n_est,
'standardscaler': [StandardScaler(), None],
'simpleimputer__strategy': ['mean', 'median']
}
grid_params
{'randomforestregressor__n_estimators': [10, 12, 16, 20, 26, 33, 42, 54, 69, 88, 112, 143, 183, 233, 297, 379, 483, 615, 784, 1000], 'standardscaler': [StandardScaler(), None], 'simpleimputer__strategy': ['mean', 'median']}
# Call `GridSearchCV` with the random forest pipeline, passing in the above `grid_params`
# dict for parameters to evaluate, 5-fold cross-validation, and all available CPU cores (if desired)
rf_grid_cv = GridSearchCV(RF_pipe, param_grid=grid_params, cv=5, n_jobs=-1)
# Now call the `GridSearchCV`'s `fit()` method with `X_train` and `y_train` as arguments
# to actually start the grid search. This may take a minute or two.
rf_grid_cv.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler()), ('randomforestregressor', RandomForestRegressor(random_state=47))]), n_jobs=-1, param_grid={'randomforestregressor__n_estimators': [10, 12, 16, 20, 26, 33, 42, 54, 69, 88, 112, 143, 183, 233, 297, 379, 483, 615, 784, 1000], 'simpleimputer__strategy': ['mean', 'median'], 'standardscaler': [StandardScaler(), None]})
# Print the best params (`best_params_` attribute) from the grid search
rf_grid_cv.best_params_
{'randomforestregressor__n_estimators': 233, 'simpleimputer__strategy': 'mean', 'standardscaler': None}
It looks like imputing with the mean gives a slight edge over the median here, and scaling the features doesn't help (the best parameters use no scaler).
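If we want to see how each combination fared rather than just the winner, the grid search's `cv_results_` can be put into a DataFrame (an optional check):

# Tabulate the grid search results and sort by mean cross-validation score
rf_cv_df = pd.DataFrame(rf_grid_cv.cv_results_)
cols = ['param_simpleimputer__strategy', 'param_standardscaler',
        'param_randomforestregressor__n_estimators', 'mean_test_score']
rf_cv_df[cols].sort_values('mean_test_score', ascending=False).head()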
rf_best_cv_results = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train, cv=5)
rf_best_scores = rf_best_cv_results['test_score']
rf_best_scores
array([0.67199826, 0.78788539, 0.7776494 , 0.61844222, 0.60114645])
np.mean(rf_best_scores), np.std(rf_best_scores)
(0.691424343745292, 0.07822193232981417)
We've marginally improved upon the default CV results. Random forest has many more hyperparameters we could tune, but we won't dive into that here.
# Plot a barplot of the random forest's feature importances,
# assigning the `feature_importances_` attribute of
# `rf_grid_cv.best_estimator_.named_steps.randomforestregressor` to the name `imps` to then
# create a pandas Series object of the feature importances, with the index given by the
# training data column names, sorting the values in descending order
plt.subplots(figsize=(10, 5))
imps = rf_grid_cv.best_estimator_.named_steps.randomforestregressor.feature_importances_
rf_feat_imps = pd.Series(imps, index=X_train.columns).sort_values(ascending=False)
rf_feat_imps.plot(kind='bar')
plt.xlabel('features')
plt.ylabel('importance')
plt.title('Best random forest regressor feature importances');
Encouragingly, the dominant top four features are in common with our linear model.
Time to select our final model to use for further business modeling! It would be good to revisit the above model selection; there is undoubtedly more that could be done to explore possible hyperparameters. It would also be worthwhile to investigate removing the least useful features. Gathering, calculating, and storing features adds business cost and dependencies, so if features genuinely are not needed they should be removed. Building a simpler model with fewer features can also have the advantage of being easier to sell (and/or explain) to stakeholders. There certainly seem to be four strong features here, and a model using only those would probably work well. However, we want to explore some scenarios in which other features vary, so we'll keep the fuller model for now. The business is waiting for this model, and we have something we're confident performs much better than guessing with the average price.
Or, rather, we have two "somethings": a best linear model and a best random forest model. We need to finally choose between them. We can calculate the mean absolute error using cross-validation; although `cross_validate` defaults to the $R^2$ metric for scoring regression, we can specify the mean absolute error as an alternative via its `scoring` parameter.
# 'neg_mean_absolute_error' uses the (negative of) the mean absolute error
lr_neg_mae = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train,
scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
lr_mae_mean = np.mean(-1 * lr_neg_mae['test_score'])
lr_mae_std = np.std(-1 * lr_neg_mae['test_score'])
lr_mae_mean, lr_mae_std
(10.891687453692906, 1.867599419260583)
mean_absolute_error(y_test, lr_grid_cv.best_estimator_.predict(X_test))
10.314388905802957
rf_neg_mae = cross_validate(rf_grid_cv.best_estimator_, X_train, y_train,
scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
rf_mae_mean = np.mean(-1 * rf_neg_mae['test_score'])
rf_mae_std = np.std(-1 * rf_neg_mae['test_score'])
rf_mae_mean, rf_mae_std
(9.829508627130718, 1.0814411685916026)
mean_absolute_error(y_test, rf_grid_cv.best_estimator_.predict(X_test))
9.539958614347025
The random forest model has a lower cross-validation mean absolute error by about \$1. It also exhibits less variability. Verifying performance on the test set produces results consistent with the cross-validation estimates.
Finally, we need to advise the business whether it should undertake further data collection. Would more data be useful? We're often led to believe more data is always good, but gathering data invariably has a cost associated with it. We can assess this trade-off by seeing how performance varies with differing data set sizes; the `learning_curve` function does this conveniently.
fractions = [.2, .25, .3, .35, .4, .45, .5, .6, .75, .8, 1.0]
train_size, train_scores, test_scores = learning_curve(pipe, X_train, y_train, train_sizes=fractions)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.subplots(figsize=(10, 5))
plt.errorbar(train_size, test_scores_mean, yerr=test_scores_std)
plt.xlabel('Training set size')
plt.ylabel('CV scores')
plt.title('Cross-validation score as training set size increases');
This shows that we seem to have plenty of data. There's an initial rapid improvement in model scores, as we'd expect, but the curve has essentially levelled off by a training set size of around 40-50.
best_model = rf_grid_cv.best_estimator_
best_model.version = '1.0'
best_model.pandas_version = pd.__version__
best_model.numpy_version = np.__version__
best_model.sklearn_version = sklearn_version
best_model.X_columns = [col for col in X_train.columns]
best_model.build_datetime = datetime.datetime.now()
# save the model
modelpath = '../models'
save_file(best_model, 'ski_resort_pricing_model.pkl', modelpath)
A file already exists with this name. Do you want to overwrite? (Y/N)Y Writing file. "../models/ski_resort_pricing_model.pkl"
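Assuming `save_file` (a helper from our own library) pickles the model to the path shown above, the saved model can be reloaded later for the business modeling work, for example:

# Reload the saved model; the filename and path mirror the `save_file` call above
with open(os.path.join(modelpath, 'ski_resort_pricing_model.pkl'), 'rb') as f:
    loaded_model = pickle.load(f)
loaded_model.version, loaded_model.build_datetime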