In this notebook, we now take our model for ski resort ticket price and leverage it to gain some insights into what price Big Mountain's facilities might actually support, as well as explore the sensitivity of changes to various resort parameters. Note, this relies on the implicit assumption that all other resorts are largely setting prices based on how much people value certain facilities. This means comparable prices are set correctly.
We can now use our model to gain insight into what Big Mountain's ideal ticket price could/should be, and how that might change under various scenarios.
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.model_selection import cross_validate
# This isn't exactly production-grade, but a quick check for development
# These checks can save some head-scratching in development when moving from
# one python environment to another, for example
expected_model_version = '1.0'
model_path = '../models/ski_resort_pricing_model.pkl'
if os.path.exists(model_path):
with open(model_path, 'rb') as f:
model = pickle.load(f)
if model.version != expected_model_version:
print("Expected model version doesn't match version loaded")
if model.sklearn_version != sklearn_version:
print("Warning: model created under different sklearn version")
else:
print("Expected model not found")
# Load our model features
ski_data = pd.read_csv('../data/ski_data_step3_features.csv')
# Locate and view Big Mountain
big_mountain = ski_data[ski_data.Name == 'Big Mountain Resort']
big_mountain.T
124 | |
---|---|
Name | Big Mountain Resort |
Region | Montana |
state | Montana |
summit_elev | 6817 |
vertical_drop | 2353 |
base_elev | 4464 |
trams | 0 |
fastSixes | 0 |
fastQuads | 3 |
quad | 2 |
triple | 6 |
double | 0 |
surface | 3 |
total_chairs | 14 |
Runs | 105.0 |
TerrainParks | 4.0 |
LongestRun_mi | 3.3 |
SkiableTerrain_ac | 3000.0 |
Snow Making_ac | 600.0 |
daysOpenLastYear | 123.0 |
yearsOpen | 72.0 |
averageSnowfall | 333.0 |
AdultWeekend | 81.0 |
projectedDaysOpen | 123.0 |
NightSkiing_ac | 600.0 |
resorts_per_state | 12 |
resorts_per_100kcapita | 1.122778 |
resorts_per_100ksq_mile | 8.161045 |
resort_skiable_area_ac_state_ratio | 0.140121 |
resort_days_open_state_ratio | 0.129338 |
resort_terrain_park_state_ratio | 0.148148 |
resort_night_skiing_state_ratio | 0.84507 |
total_chairs_runs_ratio | 0.133333 |
total_chairs_skiable_ratio | 0.004667 |
fastQuads_runs_ratio | 0.028571 |
fastQuads_skiable_ratio | 0.001 |
This next step requires some careful thought. We want to refit the model using all available data. But should we include Big Mountain data? The motivation for this entire project is based on the sense that Big Mountain needs to adjust its pricing. One way to phrase this problem: We want to train a model to predict Big Mountain's ticket price based on data from all other resorts! We don't want Big Mountain's current price to bias this. We want to calculate a price based only on its competitors.
# Remove Big Mountain from model design matrix and target
X = ski_data.loc[ski_data.Name != "Big Mountain Resort", model.X_columns]
y = ski_data.loc[ski_data.Name != "Big Mountain Resort", 'AdultWeekend']
# Check vector lengths
len(X), len(y)
(277, 277)
# Apply our model to our design matrix of features and response vector
model.fit(X, y)
Pipeline(steps=[('simpleimputer', SimpleImputer()), ('standardscaler', None), ('randomforestregressor', RandomForestRegressor(n_estimators=233, random_state=47))])
# Cross-validate the model
cv_results = cross_validate(model, X, y, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
# Grab our 5 fold cv scores
cv_results['test_score']
array([-12.29858906, -9.13243332, -11.56087007, -8.30959188, -10.61311198])
# View mean absolute error stats
mae_mean, mae_std = np.mean(-1 * cv_results['test_score']), np.std(-1 * cv_results['test_score'])
mae_mean, mae_std
(10.382919263140293, 1.4814013000415358)
These numbers will inevitably be different to those in the previous step that used a different training data set. They should, however, be consistent. It's important to appreciate that estimates of model performance are subject to the noise and uncertainty of data!
# Run Big Mountain data
X_bm = ski_data.loc[ski_data.Name == "Big Mountain Resort", model.X_columns]
y_bm = ski_data.loc[ski_data.Name == "Big Mountain Resort", 'AdultWeekend']
# Obtain predictions
bm_pred = model.predict(X_bm).item()
# Grab actual price; this and the above combine to form our loss function
y_bm = y_bm.values.item()
print(f'Big Mountain Resort modelled price is ${bm_pred:.2f}, actual price is ${y_bm:.2f}.')
print(f'Even with the expected mean absolute error of ${mae_mean:.2f}, this suggests there is room for an increase.')
Big Mountain Resort modelled price is $93.82, actual price is $81.00. Even with the expected mean absolute error of $10.38, this suggests there is room for an increase.
This result should be looked at optimistically and doubtfully! The validity of our model lies in the assumption that other resorts accurately set their prices according to what the market (the ticket-buying public) supports. The fact that our resort seems to be charging that much less than what's predicted suggests our resort might be undercharging. But if Big Mountain is missing the market, are others? It's reasonable to expect that some resorts will be "overpriced" and some "underpriced." Or if resorts are pretty good at pricing strategies, it could be that our model is simply lacking some key data? Certainly we know nothing about operating costs, for example, and they would surely help.
Features that came up as important in the modeling (not just our final, random forest model) included:
A handy glossary of skiing terms can be found on the ski.com site. Some potentially relevant contextual information is that vertical drop, although nominally the height difference from the summit to the base, is generally taken from the highest lift-served point.
It's often useful to define custom functions for visualizing data in meaningful ways. The function below takes a feature name as an input and plots a histogram of the values of that feature. It then marks where Big Mountain sits in the distribution by marking Big Mountain's value with a vertical line using matplotlib
's axvline function. It also performs a little cleaning up of missing values and adds descriptive labels and a title.
# Add code to the `plot_compare` function that displays a vertical, dashed line
# on the histogram to indicate Big Mountain's position in the distribution
# Hint: plt.axvline() plots a vertical line, its position for 'feature1'
# would be `big_mountain['feature1'].values, we'd like a red line, which can be
# specified with c='r', a dashed linestyle is produced by ls='--',
# and it's nice to give it a slightly reduced alpha value, such as 0.8.
# Don't forget to give it a useful label (e.g. 'Big Mountain') so it's listed
# in the legend.
def plot_compare(feat_name, description, state=None, figsize=(10, 5)):
"""Graphically compare distributions of features.
Plot histogram of values for all resorts and reference line to mark
Big Mountain's position.
Arguments:
feat_name - the feature column name in the data
description - text description of the feature
state - select a specific state (None for all states)
figsize - (optional) figure size
"""
plt.subplots(figsize=figsize)
# quirk that hist sometimes objects to NaNs, sometimes doesn't
# filtering only for finite values tidies this up
if state is None:
ski_x = ski_data[feat_name]
else:
ski_x = ski_data.loc[ski_data.state == state, feat_name]
ski_x = ski_x[np.isfinite(ski_x)]
plt.hist(ski_x, bins=30)
plt.axvline(x=big_mountain[feat_name].values, c='r', ls='--', alpha=0.8, label='Big Montain')
plt.xlabel(description)
plt.ylabel('frequency')
plt.title(description + ' distribution for resorts in market share')
plt.legend()
Look at where Big Mountain sits overall amongst all resorts for price and for just other resorts in Montana.
plot_compare('AdultWeekend', 'Adult weekend ticket price ($)')
plot_compare('AdultWeekend', 'Adult weekend ticket price ($) - Montana only', state='Montana')
plot_compare('vertical_drop', 'Vertical drop (feet)')
Big Mountain is doing well for vertical drop, but there are still quite a few resorts with a greater drop.
plot_compare('Snow Making_ac', 'Area covered by snow makers (acres)')
Big Mountain is very high up the league table of snow making area.
plot_compare('total_chairs', 'Total number of chairs')
Big Mountain has amongst the highest number of total chairs, resorts with more appear to be outliers.
plot_compare('fastQuads', 'Number of fast quads')
Most resorts have no fast quads. Big Mountain has 3, which puts it high up that league table. There are some values much higher, but they are rare.
plot_compare('Runs', 'Total number of runs')
Big Mountain compares well for the number of runs. There are some resorts with more, but not many.
plot_compare('LongestRun_mi', 'Longest run length (miles)')
Big Mountain has one of the longest runs. Although it is just over half the length of the longest, the longer ones are rare.
plot_compare('trams', 'Number of trams')
The vast majority of resorts, such as Big Mountain, have no trams.
plot_compare('SkiableTerrain_ac', 'Skiable terrain area (acres)')
Big Mountain is amongst the resorts with the largest amount of skiable terrain.
Big Mountain Resort has been reviewing potential scenarios for either cutting costs or increasing revenue (from ticket prices). Ticket price is not determined by any set of parameters; the resort is free to set whatever price it likes. However, the resort operates within a market where people pay more for certain facilities, and less for others. Being able to sense how facilities support a given ticket price is valuable business intelligence. This is where the utility of our model comes in.
The business has shortlisted some options:
The expected number of visitors over the season is 350,000 and, on average, visitors ski for five days. Assume the provided data includes the additional lift that Big Mountain recently installed.
expected_visitors = 350_000
all_feats = ['vertical_drop', 'Snow Making_ac', 'total_chairs', 'fastQuads',
'Runs', 'LongestRun_mi', 'trams', 'SkiableTerrain_ac']
big_mountain[all_feats]
vertical_drop | Snow Making_ac | total_chairs | fastQuads | Runs | LongestRun_mi | trams | SkiableTerrain_ac | |
---|---|---|---|---|---|---|---|---|
124 | 2353 | 600.0 | 14 | 3 | 105.0 | 3.3 | 0 | 3000.0 |
# In this function, we copy the Big Mountain data into a new data frame.
# For each feature, and each of its deltas,
# we create a modified scenario dataframe (bm2), and make a ticket price prediction
# for it. The difference between the scenario's prediction and the current
# prediction is then calculated and returned.
# Complete the code to increment each feature by the associated delta
def predict_increase(features, deltas):
"""Increase in modelled ticket price by applying delta to feature.
Arguments:
features - list, names of the features in the ski_data dataframe to change
deltas - list, the amounts by which to increase the values of the features
Outputs:
Amount of increase in the predicted ticket price
"""
bm2 = X_bm.copy()
for f, d in zip(features, deltas):
bm2[f] += d
return model.predict(bm2).item() - model.predict(X_bm).item()
Close up to 10 of the least used runs. The number of runs is the only parameter varying.
[i for i in range(-1, -11, -1)]
[-1, -2, -3, -4, -5, -6, -7, -8, -9, -10]
runs_delta = [i for i in range(-1, -11, -1)]
price_deltas = [predict_increase(['Runs'], [delta]) for delta in runs_delta]
price_deltas
[-0.24034334763948095, -0.6180257510729632, -0.8841201716738141, -0.8841201716738141, -0.8841201716738141, -1.5836909871244558, -1.5836909871244558, -1.6523605150214564, -1.7982832618025668, -1.7982832618025668]
# Here, we create two plots, side by side, for the predicted ticket price change (delta) for each
# condition (number of runs closed) in the scenario and the associated predicted revenue
# change on the assumption that each of the expected visitors buys 5 tickets
# There are two things to do here:
# 1 - use a list comprehension to create a list of the number of runs closed from `runs_delta`
# 2 - use a list comprehension to create a list of predicted revenue changes from `price_deltas`
runs_closed = [-1 * num for num in runs_delta] #1
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
fig.subplots_adjust(wspace=0.5)
ax[0].plot(runs_closed, price_deltas, 'o-')
ax[0].set(xlabel='Runs closed', ylabel='Change ($)', title='Ticket price')
revenue_deltas = [5 * expected_visitors * price for price in price_deltas] #2
ax[1].plot(runs_closed, revenue_deltas, 'o-')
ax[1].set(xlabel='Runs closed', ylabel='Change ($)', title='Revenue');
The model says closing one run makes no difference. Closing 2 and 3 successively reduces support for ticket price and so revenue. If Big Mountain closes down 3 runs, it seems they may as well close down 4 or 5, as there's no further loss in ticket price. Increasing the closures down to 6 or more leads to a large drop.
In this scenario, Big Mountain is adding a run, increasing the vertical drop by 150 feet, and installing an additional chair lift.
# Call `predict_increase` with a list of the features 'Runs', 'vertical_drop', and 'total_chairs'
# and associated deltas of 1, 150, and 1
ticket2_increase = predict_increase(['Runs', 'vertical_drop', 'total_chairs'], [1, 150, 1])
revenue2_increase = 5 * expected_visitors * ticket2_increase
print(f'This scenario supports a ticket price increase of ${ticket2_increase:.3f}')
print('Over the season, this could be expected to amount to a revenue increase $'+'{:,}'.format(round(revenue2_increase,2)))
This scenario supports a ticket price increase of $0.695 Over the season, this could be expected to amount to a revenue increase $1,216,738.2
In this scenario, we are repeating the previous one, but adding 2 acres of snow making.
# Repeat scenario 2 conditions, but add an increase of 2 to `Snow Making_ac`
ticket3_increase = predict_increase(['Runs', 'vertical_drop', 'total_chairs', 'Snow Making_ac'], [1, 150, 1, 2])
revenue3_increase = 5 * expected_visitors * ticket3_increase
print(f'This scenario increases the ticket price by ${ticket3_increase:.3f}')
print(f'Over the season, this could be expected to amount to $'+'{:,}'.format(round(revenue2_increase,2)))
This scenario increases the ticket price by $0.695 Over the season, this could be expected to amount to $1,216,738.2
Such a small increase in the snow making area makes no difference!
This scenario calls for increasing the longest run by .2 miles and guaranteeing its snow coverage by adding 4 acres of snow making capability.
# Predict the increase from adding 0.2 miles to `LongestRun_mi` and 4 to `Snow Making_ac`
predict_increase(['LongestRun_mi', 'Snow Making_ac'], [0.2, 4])
0.0
No difference whatsoever. Although the longest run feature was used in the linear model, the random forest model (the one we chose because of its better performance) only has longest run way down in the feature importance list.
The data on which our study is based speaks to revenue growth and the assets that best support it. This is, however, only the top line of the Income Statement. Additional data on Big Mountain's cost structure relative to competitors would be highly valuable. This would allow us to compute asset utilization, working capital, leverage, and profitability metrics, as well as a one or more earnings forecasts and forecasts of free cashflow to the firm. This would assist management in capital budgeting and/or allow them to place a value on the enterprise, if the owners wish to engage an investment bank and sell the business to obtain greater personal liquidity.