In this second part of the blog series on forecasting, I discuss forecasting steps, evaluation of forecasting methods, model selection, combining models for robust and accurate forecasting, and forecast uncertainty.
In Part 1, I did exploratory data analysis of the sales time series of a French retailer. In this blog I will apply various time series models in Python and R to forecast sales for the next 4 quarters. The forecasting methods I will cover are:
For each of these models, I will provide a short description for an intuitive understanding of the method and give references for a more academic explanation. For any forecasting model, the general steps are as follows:
1. EDA
2. Forecast on the test set
3. Evaluate the forecast
4. Check residuals: plot the residuals, the ACF/PACF and Q-Q plots. Conditions A and B below are essential, while C and D are useful. Residuals should be:
   A. Uncorrelated
   B. Zero mean
   C. Constant variance
   D. Normally distributed
   The first two ensure that there is no more information left in the data to extract, while the last two keep the variability of the point forecast narrow.
5. Select model(s)
We evaluate a forecasting model by comparing the fitted & predicted values against the actual values in the training and test sets. Note that residuals are the differences between the training data and the fitted values, while forecast errors are the differences between the test data and the predicted values. We use residuals to check the performance of the model, and forecast errors to check the accuracy/uncertainty of the future forecast.
As a general rule, if the data has no outliers, RMSE (Root Mean Square Error) is a good metric to use. %MAPE (Mean Absolute Percentage Error) provides a more intuitive understanding as it is expressed as a percentage. We do not use %MAPE if the series is intermittent, to avoid division by zero. Note that RMSE is not scale-independent, but to keep things simple I will use RMSE and MAPE.
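As a quick illustration of the two metrics, here is a minimal sketch (with made-up numbers, not the retail series) of how RMSE and %MAPE are computed:

```python
import numpy as np

# Hypothetical actuals and forecasts, for illustration only
actual   = np.array([100.0, 120.0,  90.0, 110.0])
forecast = np.array([ 98.0, 125.0,  95.0, 105.0])

errors = actual - forecast
rmse_val = np.sqrt(np.mean(errors ** 2))            # penalizes large errors more heavily
mape_val = np.mean(np.abs(errors / actual)) * 100   # unit-free, undefined if any actual is 0

print(f"RMSE: {rmse_val:.2f}, %MAPE: {mape_val:.2f}%")
```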
#collapse-hide
#Author: Sandeep Pawar
#Version: 1.0
#Date Mar 27, 2020
import pandas as pd
import numpy as np
import itertools
#Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
plt.style.use('seaborn-white')
pd.plotting.register_matplotlib_converters()
%matplotlib inline
#statistics libraries
import statsmodels.api as sm
import scipy
from scipy.stats import anderson
from statsmodels.tools.eval_measures import rmse
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import month_plot, seasonal_plot, plot_acf, plot_pacf, quarter_plot
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.holtwinters import ExponentialSmoothing, SimpleExpSmoothing
from statsmodels.stats.diagnostic import acorr_ljungbox as ljung
from statsmodels.tsa.statespace.tools import diff as diff
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pmdarima as pm
from pmdarima import ARIMA, auto_arima
from scipy import signal
from scipy.stats import shapiro
from scipy.stats import boxcox
from scipy.special import inv_boxcox
from sklearn.preprocessing import StandardScaler
from scipy.stats import jarque_bera as jb
from itertools import combinations
import fbprophet as Prophet
#library to use R in Python
import rpy2
from rpy2.robjects import pandas2ri
pandas2ri.activate()
import warnings
warnings.filterwarnings("ignore")
np.random.seed(786)
#Printing library versions
print('Pandas:', pd.__version__)
print('Statsmodels:', sm.__version__)
print('Scipy:', scipy.__version__)
print('Rpy2:', rpy2.__version__)
print('Numpy:', np.__version__)
Pandas: 0.25.0 Statsmodels: 0.11.0 Scipy: 1.4.1 Rpy2: 2.9.4 Numpy: 1.18.2
Below are some of the custom functions I wrote for forecast accuracy, grid searching, and residual diagnostics.
def MAPE(y_true, y_pred):
    """
    %Error compares the true value with the predicted value. Lower is better. Use this along with rmse(). If the series has
    outliers, compare/select the model using MAPE instead of rmse().
    """
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
def HWGrid(train, test, seasonal_periods):
    """
    Author: Sandeep Pawar  twitter: @PawarBI
    Function returns a dataframe with parameters of the Holt-Winter's method and the corresponding train & test evaluation scores.
    It also does a quick check of the residuals using the Ljung-Box test and the Jarque-Bera test for normality.
    Residuals must be uncorrelated.

    train: (pandas series)
        - Training data

    test: (pandas series)
        - Test data

    seasonal_periods: int
        - No of seasons in the time period. e.g. 4 for Quarterly, 12 for Monthly, 52 for Weekly data
    """
    trend      = ['add','mul']
    seasonal   = ['add','mul']
    damped     = [False, True]
    use_boxcox = [False, True, 'log']

    params = itertools.product(trend, seasonal, damped, use_boxcox)

    result_df = pd.DataFrame(columns=['Trend', 'Seasonal', 'Damped', 'BoxCox', 'AICc Train',
                                      '%MAPE_Train', 'RMSE_Train', '%MAPE_Test',
                                      'RMSE_Test', "Resid_LJ", "Resid_Norm", "Resid_mean"])

    for trend, seasonal, damped, use_boxcox in params:
        model = ExponentialSmoothing(train,
                                     trend=trend,
                                     damped=damped,
                                     seasonal=seasonal,
                                     seasonal_periods=seasonal_periods).fit(use_boxcox=use_boxcox)

        mape1 = MAPE(train, model.fittedvalues)
        rmse1 = rmse(train, model.fittedvalues)

        mape2 = MAPE(test, model.forecast(len(test)))
        rmse2 = rmse(test, model.forecast(len(test)))

        aicc1 = model.aicc.round(1)

        lj_p_val   = np.mean(ljung(x=model.resid, lags=10)[1])
        norm_p_val = jb(model.resid)[1]  # shapiro(model.resid)[1]
        lj   = "Uncorrelated" if lj_p_val > 0.05 else "Correlated"
        norm = "Normal" if norm_p_val > 0.05 else "Non-Normal"

        result_df = result_df.append({'Trend': trend,
                                      'Seasonal': seasonal,
                                      'Damped': damped,
                                      'BoxCox': use_boxcox,
                                      '%MAPE_Train': np.round(mape1, 2),
                                      'RMSE_Train': np.round(rmse1, 1),
                                      'AICc Train': aicc1,
                                      '%MAPE_Test': np.round(mape2, 2),
                                      'RMSE_Test': np.round(rmse2, 1),
                                      'Resid_LJ': lj,
                                      'Resid_Norm': norm,
                                      'Resid_mean': np.round(model.resid.mean(), 1)}, ignore_index=True, sort=False)

    return result_df.sort_values(by=["RMSE_Test", "%MAPE_Test", "RMSE_Train", "%MAPE_Train"]).style.format({"%MAPE_Train": "{:20,.2f}%", "%MAPE_Test": "{:20,.2f}%"}).highlight_min(color='lightgreen')
Calculating cross-validation score for Holt-Winter's method in Python
def hw_cv(series, seasonal_periods, initial_train_window, test_window):

    from statsmodels.tools.eval_measures import rmse
    import warnings
    warnings.filterwarnings("ignore")

    """
    Author: Sandeep Pawar
    Date: 4/15/2020
    Ver: 1.0

    Returns Rolling and Expanding cross-validation scores (avg rmse), along with model parameters
    for the Triple Exponential Smoothing method. Expanding expands the training set each time by adding one observation,
    while rolling slides the training and test sets forward by one observation each time.

    Output shows the parameters used and the Rolling & Expanding cv scores, in the order:
    1. Trend 2. Seasonal 3. Damped 4. use_boxcox 5. Rolling cv 6. Expanding cv

    Requirements: Pandas, Numpy, Statsmodels, itertools, rmse

    series: Pandas Series
            Time series

    seasonal_periods: int
            No of seasonal periods in a full cycle (e.g. 4 in quarterly, 12 in monthly, 52 in weekly data)

    initial_train_window: int
            Minimum training set length. Recommended to use at least 2 * seasonal_periods

    test_window: int
            Test set length. Recommended to use a window equal to the forecast horizon

    e.g. hw_cv(ts["Sales"], 4, 12, 6)
    Output: add add False False R: 41.3 ,E: 39.9

    Note: This function can take anywhere from 5-15 min to run the full output
    """

    def expanding_tscv(series, trend, seasonal, seasonal_periods, damped, boxcox, initial_train_window, test_window):
        i = 0
        x = initial_train_window
        t = test_window
        errors_roll = []

        while (i + x + t) < len(series):
            train_ts = series[:(i + x)].values
            test_ts  = series[(i + x):(i + x + t)].values

            model_roll = ExponentialSmoothing(train_ts,
                                              trend=trend,
                                              seasonal=seasonal,
                                              seasonal_periods=seasonal_periods,
                                              damped=damped).fit(use_boxcox=boxcox)

            fcast = model_roll.forecast(t)
            error_roll = rmse(test_ts, fcast)
            errors_roll.append(error_roll)
            i = i + 1

        return np.mean(errors_roll).round(1)

    def rolling_tscv(series, trend, seasonal, seasonal_periods, damped, boxcox, initial_train_window, test_window):
        i = 0
        x = initial_train_window
        t = test_window
        errors_roll = []

        while (i + x + t) < len(series):
            train_ts = series[(i):(i + x)].values
            test_ts  = series[(i + x):(i + x + t)].values

            model_roll = ExponentialSmoothing(train_ts,
                                              trend=trend,
                                              seasonal=seasonal,
                                              seasonal_periods=seasonal_periods,
                                              damped=damped).fit(use_boxcox=boxcox)

            fcast = model_roll.forecast(t)
            error_roll = rmse(test_ts, fcast)
            errors_roll.append(error_roll)
            i = i + 1

        return np.mean(errors_roll).round(1)

    trend      = ['add','mul']
    seasonal   = ['add','mul']
    damped     = [False, True]
    use_boxcox = [False, True, 'log']

    params = itertools.product(trend, seasonal, damped, use_boxcox)

    for trend, seasonal, damped, use_boxcox in params:
        # use the series and windows passed to hw_cv (not the global dataframe)
        r = rolling_tscv(series, trend, seasonal, seasonal_periods, damped, use_boxcox, initial_train_window, test_window)
        e = expanding_tscv(series, trend, seasonal, seasonal_periods, damped, use_boxcox, initial_train_window, test_window)
        result = print(trend, seasonal, damped, use_boxcox, " R:", r, " ,E:", e)

    return result
Function for residual diagnostics
def residcheck(residuals, lags):
    """
    Function to check if the residuals are white noise. Ideally the residuals should be uncorrelated, zero mean,
    constant variance and normally distributed. The first two are a must, while the last two are good to have.
    If the first two are not met, we have not fully captured the information from the data for prediction.
    Consider a different model and/or adding an exogenous variable.

    If the Ljung-Box test shows p > 0.05, the residuals as a group are white noise. Some lags might still be significant.

    Lags should be min(2*seasonal_period, T/5)

    plots from: https://tomaugspurger.github.io/modern-7-timeseries.html
    """
    resid_mean = np.mean(residuals)
    lj_p_val   = np.mean(ljung(x=residuals, lags=lags)[1])
    norm_p_val = jb(residuals)[1]
    adfuller_p = adfuller(residuals)[1]

    fig = plt.figure(figsize=(10,8))
    layout = (2, 2)
    ts_ax  = plt.subplot2grid(layout, (0, 0), colspan=2);
    acf_ax = plt.subplot2grid(layout, (1, 0));
    kde_ax = plt.subplot2grid(layout, (1, 1));

    residuals.plot(ax=ts_ax)
    plot_acf(residuals, lags=lags, ax=acf_ax);
    sns.kdeplot(residuals, ax=kde_ax);
    #[ax.set_xlim(1.5) for ax in [acf_ax, kde_ax]]
    sns.despine()
    plt.tight_layout();

    print("** Mean of the residuals: ", np.around(resid_mean, 2))
    print("\n** Ljung Box Test, p-value:", np.around(lj_p_val, 3), "(>0.05, Uncorrelated)" if (lj_p_val > 0.05) else "(<0.05, Correlated)")
    print("\n** Jarque Bera Normality Test, p_value:", np.around(norm_p_val, 3), "(>0.05, Normal)" if (norm_p_val > 0.05) else "(<0.05, Not-normal)")
    print("\n** AD Fuller, p_value:", np.around(adfuller_p, 3), "(>0.05, Non-stationary)" if (adfuller_p > 0.05) else "(<0.05, Stationary)")

    return ts_ax, acf_ax, kde_ax
Function for calculating RMSE & %MAPE
def accuracy(y1, y2):
    accuracy_df = pd.DataFrame()
    rms_error = np.round(rmse(y1, y2), 1)
    map_error = np.round(np.mean(np.abs((np.array(y1) - np.array(y2)) / np.array(y1))) * 100, 1)
    accuracy_df = accuracy_df.append({"RMSE": rms_error, "%MAPE": map_error}, ignore_index=True)
    return accuracy_df
path = 'https://raw.githubusercontent.com/pawarbi/datasets/master/timeseries/ts_frenchretail.csv'
#Sales numbers are in thousands, so I am dividing by 1000 to make it easier to work with numbers, especially squared errors
data = pd.read_csv(path, parse_dates=True, index_col="Date").div(1_000)
data.index.freq='Q'
data.head()
Date | Sales
---|---
2012-03-31 | 362.0 |
2012-06-30 | 385.0 |
2012-09-30 | 432.0 |
2012-12-31 | 341.0 |
2013-03-31 | 382.0 |
data.index
DatetimeIndex(['2012-03-31', '2012-06-30', '2012-09-30', '2012-12-31', '2013-03-31', '2013-06-30', '2013-09-30', '2013-12-31', '2014-03-31', '2014-06-30', '2014-09-30', '2014-12-31', '2015-03-31', '2015-06-30', '2015-09-30', '2015-12-31', '2016-03-31', '2016-06-30', '2016-09-30', '2016-12-31', '2017-03-31', '2017-06-30', '2017-09-30', '2017-12-31'], dtype='datetime64[ns]', name='Date', freq='Q-DEC')
Part 1 on EDA covers this in detail. I will be using both typical train/test split and cross-validation for training & evaluation.
#Split into train and test
train = data.iloc[:-6]
test = data.iloc[-6:]
#forecast horizon
h = 6
train_length = len(train)
print('train_length:',train_length, '\ntest_length:', len(test) )
#Creating BoxCox-transformed train & test sets to be used later
train_bcox, bcox_lam = boxcox(train["Sales"])
print("BoxCox parameter to linearize the series:", bcox_lam.round(2))
test_bcox = boxcox(test["Sales"], lmbda=bcox_lam)
train_log = np.log(train["Sales"])
train_length: 18 test_length: 6 BoxCox parameter to linearize the series: -0.21
#collapse-hide
#Create line chart for Training data. index is reset to use Date column
train_chart=alt.Chart(train.reset_index()).mark_line(point=True).encode(
x='Date',
y='Sales',
tooltip=['Date', 'Sales'])
#Create rolling mean. This is a centered rolling mean
rolling_mean = alt.Chart(train.reset_index()).mark_trail(
color='orange',
size=1
).transform_window(
rolling_mean='mean(Sales)',
frame=[-4,4]
).encode(
x='Date:T',
y='rolling_mean:Q',
size='Sales'
)
#Add data labels
text = train_chart.mark_text(
align='left',
baseline='top',
dx=5 # Moves text to right so it doesn't appear on top of the bar
).encode(
text='Sales:Q'
)
#Add zoom-in/out
scales = alt.selection_interval(bind='scales')
#Combine everything
(train_chart + rolling_mean +text).properties(
width=600,
title="French Retail Sales & 4Q Rolling mean ( in '000)").add_selection(
scales
)
The seasonal naive method uses the observation from the corresponding season of the last period. For example, the forecast for Q3 would be the sales from Q3 of last year. It does not take trend or any earlier history into account. This method, as expected, is not the most accurate, but it creates a baseline: as we explore more complex models, we want them to perform better than this, and we will compare them with the seasonal naive forecast.
This method is not available in the statsmodels library, so I wrote a function for it.
def pysnaive(train_series, seasonal_periods, forecast_horizon):
    '''
    Python implementation of the Seasonal Naive Forecast.
    This should work similar to https://otexts.com/fpp2/simple-methods.html
    Returns two arrays
    > fitted: values fitted to the training dataset
    > fcast: seasonal naive forecast

    Author: Sandeep Pawar
    Date: Apr 9, 2020
    Ver: 1.0

    train_series: Pandas Series
        Training series to be used for forecasting. This should be a valid Pandas Series.
        Length of the training set should be greater than or equal to the number of seasonal periods

    seasonal_periods: int
        No of seasonal periods
        Yearly=1
        Quarterly=4
        Monthly=12
        Weekly=52

    forecast_horizon: int
        Number of values to forecast into the future

    e.g.
    fitted_values = pysnaive(train,12,12)[0]
    fcast_values = pysnaive(train,12,12)[1]
    '''
    if len(train_series) >= seasonal_periods:  # checking if there are enough observations in the training data
        last_season = train_series.iloc[-seasonal_periods:]
        reps = np.int(np.ceil(forecast_horizon / seasonal_periods))
        fcarray = np.tile(last_season, reps)
        fcast = pd.Series(fcarray[:forecast_horizon])
        fitted = train_series.shift(seasonal_periods)
    else:
        print("Length of the training set must be greater than the number of seasonal periods")
        fitted, fcast = None, None

    return fitted, fcast
#Before I create the model, I am going to create a dataframe to store all out-of-sample forecasts and the test set
predictions = test.copy()
#Fitted values
py_snaive_fit = pysnaive(train["Sales"],
seasonal_periods=4,
forecast_horizon=6)[0]
#forecast
py_snaive = pysnaive(train["Sales"],
seasonal_periods=4,
forecast_horizon=6)[1]
#Residuals
py_snaive_resid = (train["Sales"] - py_snaive_fit).dropna()
predictions["py_snaive"] = py_snaive.values
predictions
Date | Sales | py_snaive
---|---|---
2016-09-30 | 773.0 | 681.0 |
2016-12-31 | 592.0 | 557.0 |
2017-03-31 | 627.0 | 628.0 |
2017-06-30 | 725.0 | 707.0 |
2017-09-30 | 854.0 | 681.0 |
2017-12-31 | 661.0 | 557.0 |
pd.plotting.register_matplotlib_converters()
train["Sales"].plot(figsize=(12,8))#, style="--", color="gray", legend=True, label="Train")
py_snaive_fit.plot(color="b", legend=True, label="SNaive_Fitted")
predictions["Sales"].plot(style="--",color="r", legend=True, label="Test")
predictions["py_snaive"].plot(color="b", legend=True, label="Snaive_fc");
#Training score
accuracy(train["Sales"].iloc[-len(py_snaive_fit.dropna()):], py_snaive_fit.dropna())
%MAPE | RMSE
---|---
13.9 | 80.3
#Test score
accuracy(predictions["Sales"], predictions["py_snaive"])
%MAPE | RMSE
---|---
9.4 | 92.0
residcheck(py_snaive_resid.dropna(),12);
** Mean of the residuals: 75.21 ** Ljung Box Test, p-value: 0.339 (>0.05, Uncorrelated) ** Jarque Bera Normality Test, p_value: 0.721 (>0.05, Normal) ** AD Fuller, p_value: 0.071 (>0.05, Non-stationary)
#ACF & PACF of seasonal naive model residuals
plot_acf(py_snaive_resid);
plot_pacf(py_snaive_resid);
Triple Exponential Smoothing (Holt-Winter's method) decomposes the series into level, trend and seasonality. Future values are predicted by combining these systematic factors based on recent history. The intuitive idea is that the future will behave very similarly to the recent past; we just have to find how much of the past is relevant. The three systematic components are:
This method is called "Exponential" because each of the above factors gives exponentially decaying weights to past values.
Additive model = (Level + Trend) + Seasonality
Multiplicative Model = (Level * Trend) * Seasonality
The ExponentialSmoothing() method in statsmodels finds the optimal alpha, beta, gamma and phi by minimizing the errors.
Depending on the temporal structure of the time series, trend and seasonality can show additive, multiplicative or mixed behaviour.
In the case of trend, if the time series has a linear trend, it's additive. If it is increasing exponentially, a multiplicative model might fit better.
Seasonality is calculated relative to the level. If a series has additive seasonality, each season varies by a fixed amount relative to the level. E.g. in a quarterly series, we might add +30 units in Q1, -60 in Q2, +120 in Q3 and +60 in Q4 relative to the level. Thus, the seasonal peaks have a somewhat fixed height relative to the level everywhere. On the other hand, with multiplicative seasonality the quantities vary by percentage, i.e. +5% in Q1, -7% in Q2, +10% in Q3 and +5% in Q4 relative to the trend. As the level increases or decreases, the seasonal swings vary proportionally.
For this time series, we identified during EDA (Part 1) that the trend is exponential. Seasonality can be modeled as additive or multiplicative. Note that we can turn a multiplicative component into an additive one by taking the log or using a power transform (BoxCox). This is often preferred and may perform better, as the sketch below shows.
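Here is a minimal sketch (with a made-up level and made-up seasonal factors, not the retail data) of why the log transform turns multiplicative seasonality into additive seasonality:

```python
import numpy as np

# Hypothetical multiplicative series: the level grows 10% per quarter and the
# seasonal factors scale the level by a fixed percentage each quarter
level    = 100 * 1.10 ** np.arange(8)               # exponential trend
seasonal = np.tile([1.05, 0.93, 1.10, 1.05], 2)     # +5%, -7%, +10%, +5%
y = level * seasonal                                 # multiplicative structure

# On the log scale the same series is additive: log(y) = log(level) + log(seasonal)
print(np.allclose(np.log(y), np.log(level) + np.log(seasonal)))   # True
```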
Refer to Part 1 where I calculated these seasonality factors.
I replicated an illustration from Everette Gardner's paper to show this effect below. I highly recommend reading it if you are interested in exponential smoothing.
Instead of fitting each model individually, I wrote a custom function HWGrid() to perform a grid search over the model parameters. The function also returns statistics on the residuals (Ljung-Box test, normality test and mean), so you get evaluation metrics and residual diagnostics for all 24 models. This may take a while (5-15 min) on your computer.
m=HWGrid(train["Sales"], test["Sales"], seasonal_periods=4)
m
# | Trend | Seasonal | Damped | BoxCox | AICc Train | %MAPE_Train | RMSE_Train | %MAPE_Test | RMSE_Test | Resid_LJ | Resid_Norm | Resid_mean
---|---|---|---|---|---|---|---|---|---|---|---|---
3 | add | add | True | False | 170.6 | 3.55% | 20.4 | 10.79% | 81.3 | Uncorrelated | Normal | 4.4 |
0 | add | add | False | False | 151.8 | 2.41% | 18.2 | 10.98% | 82.6 | Uncorrelated | Normal | 4.9 |
8 | add | mul | False | log | 139.7 | 2.29% | 13 | 11.49% | 84 | Uncorrelated | Normal | 0.8 |
11 | add | mul | True | log | 154.3 | 2.29% | 13 | 11.49% | 84 | Uncorrelated | Normal | 0.8 |
2 | add | add | False | log | 138.9 | 2.20% | 12.7 | 11.69% | 84.8 | Uncorrelated | Normal | 0.8 |
5 | add | add | True | log | 153.4 | 2.20% | 12.7 | 11.69% | 84.8 | Uncorrelated | Normal | 0.8 |
18 | mul | mul | False | False | 137.6 | 2.31% | 12.2 | 11.88% | 86.7 | Uncorrelated | Normal | -0.6 |
12 | mul | add | False | False | 154.7 | 3.09% | 19.7 | 12.09% | 91.1 | Uncorrelated | Normal | 5.8 |
15 | mul | add | True | False | 170.5 | 3.56% | 20.4 | 12.15% | 91.4 | Uncorrelated | Normal | 3.2 |
23 | mul | mul | True | log | 153.4 | 2.32% | 12.7 | 12.52% | 92.1 | Uncorrelated | Normal | 0.8 |
20 | mul | mul | False | log | 139 | 2.24% | 12.7 | 12.61% | 92.4 | Uncorrelated | Normal | 0.6 |
14 | mul | add | False | log | 138.1 | 2.16% | 12.4 | 12.76% | 92.9 | Uncorrelated | Normal | 0.6 |
17 | mul | add | True | log | 152.7 | 2.16% | 12.4 | 12.76% | 92.9 | Uncorrelated | Normal | 0.6 |
1 | add | add | False | True | 139.6 | 2.26% | 13 | 12.85% | 94.6 | Uncorrelated | Normal | 0.8 |
4 | add | add | True | True | 154.2 | 2.26% | 13 | 12.85% | 94.6 | Uncorrelated | Normal | 0.8 |
10 | add | mul | True | True | 154.8 | 2.32% | 13.2 | 12.77% | 94.7 | Uncorrelated | Normal | 0.8 |
7 | add | mul | False | True | 140.3 | 2.32% | 13.2 | 12.78% | 94.7 | Uncorrelated | Normal | 0.8 |
6 | add | mul | False | False | 142 | 1.78% | 13.8 | 13.52% | 98 | Uncorrelated | Normal | 3.4 |
16 | mul | add | True | True | 154.1 | 2.22% | 12.9 | 13.34% | 98.1 | Uncorrelated | Normal | 1.4 |
9 | add | mul | True | False | 156.5 | 1.78% | 13.8 | 13.53% | 98.1 | Uncorrelated | Normal | 3.4 |
19 | mul | mul | False | True | 139.7 | 2.32% | 13 | 13.28% | 98.7 | Uncorrelated | Normal | 0.6 |
13 | mul | add | False | True | 139.4 | 2.25% | 12.9 | 13.45% | 99.2 | Uncorrelated | Normal | 0.7 |
22 | mul | mul | True | True | 154.6 | 2.31% | 13.1 | 13.40% | 99.5 | Uncorrelated | Normal | 0.7 |
21 | mul | mul | True | False | nan | nan% | nan | nan% | nan | Correlated | Non-Normal | 469.6 |
The resulting dataframe above is sorted in ascending order, showing the model with the lowest test RMSE and %MAPE at the top. Cells highlighted in green are the lowest values in their respective columns. Some observations from this result:
metrics. 9. The top and the fourth models have high AICc. The third and fifth have almost the same performance. We want to select a parsimonious and simple model, so I will select the model with additive seasonality and trend, as it has the lowest AICc among the top 5 models.
hw_model = ExponentialSmoothing(train["Sales"],
trend ="add",
seasonal = "add",
seasonal_periods=4,
damped=False).fit(use_boxcox='log')
hw_fitted = hw_model.fittedvalues
hw_resid = hw_model.resid
#Adding the mean of the residuals to correct the bias.
py_hw = hw_model.forecast(len(test["Sales"])) + np.mean(hw_resid)
predictions["py_hw"] = py_hw
#Holt-Winter Parameters
hw_model.params_formatted
 | name | param | optimized
---|---|---|---
smoothing_level | alpha | 7.538529e-01 | True |
smoothing_slope | beta | 3.568007e-09 | True |
smoothing_seasonal | gamma | 0.000000e+00 | True |
initial_level | l.0 | 6.094620e+00 | True |
initial_slope | b.0 | 3.675604e-02 | True |
initial_seasons.0 | s.0 | -2.463472e-01 | True |
initial_seasons.1 | s.1 | -2.036851e-01 | True |
initial_seasons.2 | s.2 | -9.372556e-02 | True |
initial_seasons.3 | s.3 | -3.542245e-01 | True |
Note the optimized alpha, beta and gamma parameters.
alpha: 0.75, i.e. 75% of the weight was given to the last observation.
beta: the learning parameter for the trend. It is very small, so not much weight is given to the recent trend and the trend is obtained from the distant past.
gamma: the seasonal factor is 0. Gamma is usually very small (<0.2) and we want it to be small. If gamma is high, it can lead to overfitting because it means the model is learning too much from the recent data. 0 here indicates the seasonality is learned from the earliest seasons.
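To see what these smoothing parameters mean in practice, here is a small sketch (not part of the fitted model) of how the implied weights on past observations decay for the level recursion l_t = alpha*y_t + (1 - alpha)*l_{t-1}:

```python
import numpy as np

alpha = 0.75   # smoothing_level from the fitted HW model above

# Unrolling l_t = alpha*y_t + (1 - alpha)*l_{t-1} shows that the weight on the
# observation k steps back is alpha * (1 - alpha)**k
weights = alpha * (1 - alpha) ** np.arange(6)
for k, w in enumerate(weights):
    print(f"weight on y[t-{k}]: {w:.4f}")
# With alpha=0.75 the last observation carries 75% of the weight and the weights
# die off quickly; a small alpha would spread the weight over a longer history.
```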
#Plotting
train["Sales"].plot(figsize=(12,8), style="--", color="gray", legend=True, label="Train")
hw_fitted.plot(color="b", legend=True, label="HW_Fitted")
predictions["Sales"].plot(style="--",color="r", legend=True, label="Test")
predictions["py_hw"].plot(color="b", legend=True, label="HW_Forecast");
In the above grid search, the training set size was fixed and we evaluated model performance by comparing the train AICc, RMSE, %MAPE and the test RMSE & %MAPE. Test metrics provide the true forecast accuracy and should always be used for model selection. This is the preferred approach when the data size is large. But when the time series is short, cross-validation should be used to make sure the model does not overfit the data. The two common approaches are:
Expanding window cross-validation: We start with some initial training & test sets and with each iteration we add one observation to the training set. Forecast errors are calculated at each iteration and averaged to compare model performance. This simulates the performance of the model as we add more observations. For the final forecast we will be using all the available history, so the expanding window gives us a good estimate of the forecast accuracy and uncertainty.
Rolling window cross-validation: Similar to expanding, but the training size remains the same; the window slides forward by one observation each time. Training and test lengths remain the same (see the sketch below for how the two split schemes move through the data).
Note that AICc, theoretically, provides the same information because it penalizes complex models that overfit.
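To make the difference concrete, here is a tiny sketch (indices only, mirroring the while-loop logic inside hw_cv above) of how the expanding and rolling splits move through 24 quarterly observations with an initial training window of 15 and a test window of 4:

```python
n, x, t = 24, 15, 4   # total observations, initial train window, test window

print("Expanding window folds:")
i = 0
while (i + x + t) < n:
    print(f"  train idx 0..{i + x - 1},  test idx {i + x}..{i + x + t - 1}")
    i += 1

print("Rolling window folds:")
i = 0
while (i + x + t) < n:
    print(f"  train idx {i}..{i + x - 1},  test idx {i + x}..{i + x + t - 1}")
    i += 1
```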
#I would like to perform 5 fold cross validation, want the training size to be at
#least 12 and test window = forecast horizon 24 - 4 - 5 = 15. Initial training size should be min 12, max 15.
#I will choose 15
hw_cv(data["Sales"], seasonal_periods=4, initial_train_window=15, test_window=4)
add add False False R: 39.9 ,E: 41.3 add add False True R: 43.4 ,E: 51.0 add add False log R: 40.9 ,E: 36.8 add add True False R: 40.7 ,E: 45.9 add add True True R: 38.2 ,E: 45.4 add add True log R: 33.9 ,E: 39.6 add mul False False R: 35.4 ,E: 39.5 add mul False True R: 42.6 ,E: 50.4 add mul False log R: 44.1 ,E: 40.7 add mul True False R: 38.9 ,E: 40.8 add mul True True R: 37.1 ,E: 45.6 add mul True log R: 37.4 ,E: 41.1 mul add False False R: 44.7 ,E: 43.8 mul add False True R: 47.6 ,E: 49.7 mul add False log R: 46.0 ,E: 39.1 mul add True False R: 163.9 ,E: 90.7 mul add True True R: 292.8 ,E: 70.0 mul add True log R: 487.5 ,E: 39.1 mul mul False False R: 42.6 ,E: 38.9 mul mul False True R: 48.9 ,E: 52.3 mul mul False log R: 43.0 ,E: 39.2 mul mul True False R: 78.4 ,E: nan mul mul True True R: nan ,E: 71.2 mul mul True log R: 3186.8 ,E: 39.9
residcheck(hw_resid, 12);
** Mean of the residuals: 0.84 ** Ljung Box Test, p-value: 0.3 (>0.05, Uncorrelated) ** Jarque Bera Normality Test, p_value: 0.419 (>0.05, Normal) ** AD Fuller, p_value: 0.0 (<0.05, Stationary)
accuracy(predictions.Sales,predictions["py_hw"] )
%MAPE | RMSE
---|---
11.8 | 85.7
ETS stands for Error, Trend, Seasonality (I have also seen some refer to it as ExponenTial Smoothing). It is similar to the Holt-Winter's model above, but within a state space statistical framework. In Holt-Winter's, the time series is decomposed into trend and seasonality and exponential weights are used to make the forecast. In a state space approach, the underlying statistical process is identified and the errors are factored in to make the forecast. Holt's method, single exponential smoothing, Holt-Winter's and certain ARIMA models can all be categorised as ETS-class models.
ETS models follow a taxonomy of ETS(XYZ) where:
Thus, ETS(ANN) is an exponential model with additive error, no trend and no seasonality (i.e. single exponential smoothing), and ETS(MAM) is analogous to the Holt-Winter's method described above. There can be 24 different ETS models based on the above combinations, but not all of them are stable, especially when the error is multiplicative.
statsmodels has a state space implementation of exponential smoothing in its tsa.statespace module. It only has additive error, additive & damped trend models for now (as of 4/20/2020). Recall that we can convert a multiplicative model into an additive one with a power transform, so this is not a problem.
Using the state space (ETS) implementation can often work better than Holt-Winter's because of how the parameters are initialized and optimized. This can make a huge difference in the results, as we will see. The ETS framework also returns a 95% prediction interval, which the HW implementation does not.
Refer to ch.7 of Hyndman's book for a quick reference on ETS models and this for a more detailed explanation.
For this series, we already know that the trend is exponential, so I will use a logged version of the training set and model ETS(AAA) and ETS(AAdA).
#https://www.statsmodels.org/stable/statespace.html#
ets_LAdA=sm.tsa.statespace.ExponentialSmoothing(train_log,
trend=True,
initialization_method= 'heuristic',
seasonal=4,
damped_trend=True).fit()
fc_LAdA = np.exp(ets_LAdA.forecast(6)) #inverting the Log
predictions["LAdA"]=fc_LAdA
#Plotting
train["Sales"].plot(figsize=(12,8), style="--", color="gray", legend=True, label="Train")
np.exp(ets_LAdA.fittedvalues).plot(color="b", legend=True, label="Log-AAdA_Fitted")
predictions["Sales"].plot(style="--",color="r", legend=True, label="Test")
predictions["LAdA"].plot(color="b", legend=True, label="Log-AAdA_Forecast");
accuracy(predictions["Sales"],predictions["LAdA"])
%MAPE | RMSE
---|---
5.1 | 40.9
residcheck(ets_LAdA.resid,12);
** Mean of the residuals: 0.0 ** Ljung Box Test, p-value: 0.346 (>0.05, Uncorrelated) ** Jarque Bera Normality Test, p_value: 0.266 (>0.05, Normal) ** AD Fuller, p_value: 0.712 (>0.05, Non-stationary)
ets_AAdA=sm.tsa.statespace.ExponentialSmoothing(train,
trend=True,
initialization_method= 'concentrated',
seasonal=4,
damped_trend=True).fit()
fc_AAdA=ets_AAdA.forecast(len(test))
predictions["AAdA"]=fc_AAdA
#Plotting
train["Sales"].plot(figsize=(12,8), style="--", color="gray", legend=True, label="Train")
ets_AAdA.fittedvalues.plot(color="b", legend=True, label="AAdA_Fitted")
predictions["Sales"].plot(style="--",color="r", legend=True, label="Test")
predictions["AAdA"].plot(color="b", legend=True, label="AAdA_Forecast");
accuracy(predictions["Sales"],predictions["AAdA"])
%MAPE | RMSE
---|---
4.9 | 43.4
residcheck(ets_AAdA.resid,12);
** Mean of the residuals: 0.4 ** Ljung Box Test, p-value: 0.165 (>0.05, Uncorrelated) ** Jarque Bera Normality Test, p_value: 0.582 (>0.05, Normal) ** AD Fuller, p_value: 1.0 (>0.05, Non-stationary)
SARIMA (Seasonal ARIMA) is a classical statistical forecasting method that predicts future values based on past values, i.e. lagged values (AR) and lagged errors (MA). Unlike Holt-Winter's (or ETS), it needs the time series to be stationary before it can be used. That's where the "Integrated" part comes from: "integration" means differencing the series to remove the trend and make it stationary. You can learn more about the fundamentals by watching the video below. Prof. Shmueli also has an excellent book on forecasting that I highly recommend.
I also recommend the free Coursera course Practical Time Series Analysis if you want to gain a practical and intuitive understanding of ARIMA models.
I will share my main take-aways from these and other resources I have used.
youtube: https://youtu.be/0xHf-SJ9Z9U
The video below explains the AR process really well
youtube: https://youtu.be/5-2C4eO4cPQ
Thus, ARIMA(p,d,q) = constant + (weighted sum of the last p values) + (weighted sum of the last q forecast errors), after differencing the series d times.
Below are simulated MA and AR processes. If you run this cell a few times and observe the plots, you will note that it's not possible to distinguish an AR from an MA process by just looking at the plots. You need to study the ACF & PACF to know the difference.
#Simulating AR and MA processes
from statsmodels.tsa.arima_process import ArmaProcess
ar = np.array([1,-0.9])
ma = np.array([1, 0.9])
AR = ArmaProcess(ar=ar, ma=None)
MA = ArmaProcess(ar=None, ma=ma)
simulated_AR= AR.generate_sample(125)
simulated_MA= MA.generate_sample(125)
fig, (ax1, ax2) = plt.subplots(1, 2)
# fig = plt.figure(figsize=(10,8))
fig.suptitle('Simulated AR & MA Processes')
ax1.plot(simulated_AR);
ax2.plot(simulated_MA);
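As a quick check of that claim, here is a minimal sketch that re-simulates an AR(1) and an MA(1) and plots their ACF/PACF: the AR process shows a gradually decaying ACF with a PACF that cuts off after lag 1, while the MA process shows the opposite pattern.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

np.random.seed(42)
ar1 = ArmaProcess(ar=[1, -0.9], ma=[1]).generate_sample(500)    # AR(1), phi = 0.9
ma1 = ArmaProcess(ar=[1], ma=[1, 0.9]).generate_sample(500)     # MA(1), theta = 0.9

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
plot_acf(ar1, lags=20, ax=axes[0, 0], title="AR(1) ACF: decays gradually")
plot_pacf(ar1, lags=20, ax=axes[0, 1], title="AR(1) PACF: cuts off after lag 1")
plot_acf(ma1, lags=20, ax=axes[1, 0], title="MA(1) ACF: cuts off after lag 1")
plot_pacf(ma1, lags=20, ax=axes[1, 1], title="MA(1) PACF: decays gradually")
plt.tight_layout();
```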
Finding the parameters of the ARIMA process (p,d,q) is an art and a science. Generally, p+q <= 3. Similar to ARIMA, the seasonal part of SARIMA has (P,D,Q) parameters, so SARIMA is (p,d,q)(P,D,Q), and generally p+d+q+P+D+Q <= 6.
Instead of finding the above parameters manually by studying the ACF and PACF, we usually use grid searching, just like the HW method above. pmdarima is a great library for SARIMA forecasting in Python. It returns the parameters that minimize the AICc and also has cross-validation tools. statsmodels has arma_order_select_ic() for identifying the order of an ARMA model, but not for SARIMA.
Let's first take a look at the ACF and PACF to identify potential order of the SARIMA model
The ACF plot shows that the 1st lag is significant (outside the blue band), and the ACF and PACF both decrease gradually. We will need at least 1 differencing to make the series stationary. When the ACF and PACF plots do not have sharp cut-offs and have significant lags at higher orders, it's an indication of an ARMA process with seasonality.
plot_acf(train["Sales"]);
plot_pacf(train["Sales"]);
Using pmdarima to find the SARIMA order with the lowest AICc. This may take a few minutes to run.
(auto_arima(train["Sales"],
seasonal=True,
m=4, #seasonality_order 4
d=1, #ACF plot showed we need at least 1 differencing
information_criterion='aicc'). #You can choose AIC, BIC. AICc is corrected AIC
summary())
Dep. Variable: | y | No. Observations: | 18 |
---|---|---|---|
Model: | SARIMAX(0, 1, 1)x(1, 1, [], 4) | Log Likelihood | -55.434 |
Date: | Mon, 20 Apr 2020 | AIC | 118.867 |
Time: | 12:57:38 | BIC | 121.127 |
Sample: | 0 | HQIC | 118.403 |
- 18 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
intercept | 7.0286 | 1.992 | 3.529 | 0.000 | 3.125 | 10.932 |
ma.L1 | -0.9992 | 158.481 | -0.006 | 0.995 | -311.617 | 309.618 |
ar.S.L4 | -0.7492 | 0.305 | -2.457 | 0.014 | -1.347 | -0.152 |
sigma2 | 176.2645 | 2.78e+04 | 0.006 | 0.995 | -5.43e+04 | 5.47e+04 |
Ljung-Box (Q): | 8.66 | Jarque-Bera (JB): | 1.55 |
---|---|---|---|
Prob(Q): | 0.73 | Prob(JB): | 0.46 |
Heteroskedasticity (H): | 1.17 | Skew: | -0.84 |
Prob(H) (two-sided): | 0.88 | Kurtosis: | 3.24 |
pmdarima has identified the training set as a (0, 1, 1)x(1, 1, 0, 4) process: an MA(1) with a seasonal AR(1) and d=D=1.
The summary also shows that the Ljung-Box p-value (Prob(Q)) and the JB p-value (Prob(JB)) are > 0.05, thus the residuals are uncorrelated and normally distributed. The coefficient table shows the seasonal AR term is significant at lag 4, while the MA(1) coefficient is close to -1 but not statistically significant (p = 0.995).
#Creating SARIMA model in Python using statsmodels
sarima_model=(SARIMAX(endog=train["Sales"],
order=(0,1,1),
seasonal_order=(1,1,0,4),
trend='c',
enforce_invertibility=False))
sarima_fit=sarima_model.fit()
start = len(train)
end = len(train) +len(test) -1
sarima_fitted = sarima_fit.fittedvalues
sarima_resid = sarima_fit.resid
py_sarima = sarima_fit.predict(start, end, dynamic=False)
predictions["py_sarima"] = py_sarima
sarima_fit.plot_diagnostics();
SARIMAX() has its own residual diagnostics (shown above). It shows the residuals are normally distributed and uncorrelated. The Q-Q plot shows an outlier in the lower left, but otherwise everything looks good.
accuracy(predictions.Sales,py_sarima)
%MAPE | RMSE
---|---
12.6 | 94.4
Recall that one of the observations from the HW method was that the log models performed better, so I will try the log of the training set. The forecast will be in logged values, so I will invert it with np.exp()
#Fitting model to log of train
(auto_arima(np.log(train["Sales"]),
seasonal=True,
m=4, #seasonality_order 4
information_criterion='aicc'). #You can choose AIC, BIC. AICc is corrected AIC
summary())
Dep. Variable: | y | No. Observations: | 18 |
---|---|---|---|
Model: | SARIMAX(1, 1, 0, 4) | Log Likelihood | 26.993 |
Date: | Mon, 20 Apr 2020 | AIC | -47.986 |
Time: | 12:57:49 | BIC | -46.068 |
Sample: | 0 | HQIC | -48.163 |
- 18 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
intercept | 0.2848 | 0.032 | 8.979 | 0.000 | 0.223 | 0.347 |
ar.S.L4 | -0.8106 | 0.163 | -4.966 | 0.000 | -1.131 | -0.491 |
sigma2 | 0.0009 | 0.001 | 1.357 | 0.175 | -0.000 | 0.002 |
Ljung-Box (Q): | 20.24 | Jarque-Bera (JB): | 0.47 |
---|---|---|---|
Prob(Q): | 0.09 | Prob(JB): | 0.79 |
Heteroskedasticity (H): | 0.40 | Skew: | -0.24 |
Prob(H) (two-sided): | 0.34 | Kurtosis: | 2.23 |
It's a SARIMA model with no non-seasonal (p,d,q) terms; (P,D,Q,m) = (1,1,0,4).
Prob(Q) > 0.05 and Prob(JB) > 0.05, thus the residuals are uncorrelated and normal!
sarima_logmodel=(SARIMAX(np.log(train["Sales"]),
order=(0,0,0),
seasonal_order=(1,1,0,4),
trend='c',
enforce_invertibility=False)).fit()
sarima_log = np.exp(sarima_logmodel.predict(start, end))
predictions["sarima_log"] = sarima_log
slog_fitted = np.exp(sarima_logmodel.fittedvalues)
sarima_logmodel.plot_diagnostics();
accuracy(predictions.Sales,sarima_log )
%MAPE | RMSE
---|---
11.1 | 81.7
RMSE & %MAPE are now slightly better than the HW model! The plot below shows the model did not fit well on the early data but performs well on the validation set.
#Plotting
train["Sales"].plot(figsize=(12,8), style="--", color="gray", legend=True, label="Train")
slog_fitted.plot(color="b", legend=True, label="LogSARIMA_Fitted")
predictions["Sales"].plot(style="--",color="r", legend=True, label="Test")
predictions["sarima_log"].plot(color="b", legend=True, label="LogSARIMA_forecast");
note: There is a misconception that ARIMA is a more accurate method than ETS/Holt-Winter's. That's not true. In this example, ARIMA worked better, but that may not always be the case, and you won't know until you experiment.
Facebook described Prophet library as below in their documentation:
" Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well."
The video below gives a great overview of the package.
youtube: https://youtu.be/95-HMzxsghY
My main take-aways are:
Prophet requires the dates in a column named ds and the time series observations in a column named y
from fbprophet import Prophet
data_fb = data.reset_index()
data_fb.columns=['ds','y'] #create new df with columns ds & y
train_fb, test_fb = data_fb.iloc[:-len(test)], data_fb.iloc[-len(test):] #create train & test df
#Fit the model to train
fb1_model=Prophet(weekly_seasonality=False,
daily_seasonality=False,
n_changepoints=10,
seasonality_mode="multiplicative").fit(train_fb) #I tried "additive too", it was slightly worse
#Prophet forecasts onto a future dataframe created with make_future_dataframe()
fb1_df=fb1_model.make_future_dataframe(6, freq='Q') #set the freq argument to 'Q' for quarterly data
#We only need "ds" and "yhat" columns.. "ds" is the date column and "yhat" are predictions
fb1_fc_df=fb1_model.predict(fb1_df)[["ds","yhat"]]
fb1_fc__=fb1_model.predict(fb1_df)
#Residuals
fb1_resid = train["Sales"].values - fb1_fc_df['yhat'].iloc[:len(train)]
fb1_fc = fb1_fc_df.iloc[-len(test):]
predictions["fb1"] = fb1_fc["yhat"].values
fb1_fc_df.head()
 | ds | yhat
---|---|---
0 | 2012-03-31 | 362.0 |
1 | 2012-06-30 | 385.0 |
2 | 2012-09-30 | 432.0 |
3 | 2012-12-31 | 341.0 |
4 | 2013-03-31 | 382.0 |
accuracy(test["Sales"],fb1_fc["yhat"].values)
%MAPE | RMSE
---|---
8.3 | 65.8
Prophet performed significantly better than HW & SARIMA, and that too on quarterly data! I didn't expect that, given how extensively it has been shown that Prophet does not work well on low-frequency data. It is still not as good as the ETS models, though.
train["Sales"].plot(figsize=(12,8), style="--", color="gray", legend=True, label="Train")
fb1_fc_df.set_index('ds')["yhat"].iloc[:-len(test)].plot(color="b", legend=True, label="Prophet_Fitted")
predictions["Sales"].plot(style="--",color="r", legend=True, label="Test")
fb1_fc_df.set_index('ds')["yhat"].iloc[-len(test):].plot(color="b", legend=True, label="Prophet_forecast");
Prophet has its own plotting method. As you can see, the main drawback of this model is how wide the uncertainty interval is.
fb1_model.plot(fb1_fc__);
residcheck(fb1_resid,12);
** Mean of the residuals: -0.0 ** Ljung Box Test, p-value: 0.402 (>0.05, Uncorrelated) ** Jarque Bera Normality Test, p_value: 0.003 (<0.05, Not-normal) ** AD Fuller, p_value: 0.969 (>0.05, Non-stationary)
We evaluated four different forecasting methods:
ETS gave the most accurate forecast, followed by Prophet. Here's how the point forecasts compare with each other and with the test set:
predictions.round(0)
Date | Sales | py_snaive | py_hw | LAdA | AAdA | py_sarima | sarima_log | fb1
---|---|---|---|---|---|---|---|---
2016-09-30 | 773.0 | 681.0 | 813.0 | 761.0 | 725.0 | 783.0 | 797.0 | 717.0 |
2016-12-31 | 592.0 | 557.0 | 650.0 | 620.0 | 613.0 | 678.0 | 650.0 | 518.0 |
2017-03-31 | 627.0 | 628.0 | 751.0 | 699.0 | 687.0 | 755.0 | 743.0 | 654.0 |
2017-06-30 | 725.0 | 707.0 | 813.0 | 781.0 | 724.0 | 811.0 | 803.0 | 716.0 |
2017-09-30 | 854.0 | 681.0 | 941.0 | 845.0 | 785.0 | 911.0 | 933.0 | 762.0 |
2017-12-31 | 661.0 | 557.0 | 753.0 | 688.0 | 671.0 | 799.0 | 762.0 | 572.0 |
forecasts = predictions.copy()
fc_melt=pd.melt(forecasts.reset_index(),
id_vars='Date',
value_vars=forecasts.columns,
var_name="Model",
value_name="Forecasts").round(0)
fc_melt.head()
 | Date | Model | Forecasts
---|---|---|---
0 | 2016-09-30 | Sales | 773.0 |
1 | 2016-12-31 | Sales | 592.0 |
2 | 2017-03-31 | Sales | 627.0 |
3 | 2017-06-30 | Sales | 725.0 |
4 | 2017-09-30 | Sales | 854.0 |
Interactive Chart
#Ref:https://altair-viz.github.io/gallery/multiline_tooltip.html
# Create a selection that chooses the nearest point & selects based on x-value
nearest = alt.selection(type='single', nearest=True, on='mouseover',
fields=['Date'], empty='none')
# The basic line
line = alt.Chart(fc_melt).mark_line(point=True).encode(
x='Date',
y=alt.Y('Forecasts:Q',scale=alt.Scale(domain=[500,1000], clamp=True)),
color='Model:N',
tooltip=['Date','Forecasts','Model']
)
# Transparent selectors across the chart. This is what tells us
# the x-value of the cursor
selectors = alt.Chart(fc_melt).mark_point().encode(
x='Date',
opacity=alt.value(0),
).add_selection(
nearest
)
# Draw points on the line, and highlight based on selection
points = line.mark_point().encode(
opacity=alt.condition(nearest, alt.value(1), alt.value(0))
)
# Draw text labels near the points, and highlight based on selection
text = line.mark_text(align='left', baseline='top', dx=5, dy=-5).encode(
text=alt.condition(nearest, 'Forecasts:Q', alt.value(' '))
)
text2 = line.mark_text(align='left', baseline='bottom', dx=5, dy=-5).encode(
text=alt.condition(nearest, 'Model:N', alt.value(' '))
)
# Draw a rule at the location of the selection
rules = alt.Chart(fc_melt).mark_rule(color='gray').encode(
x='Date',
).transform_filter(
nearest
)
# Put the five layers into a chart and bind the data
alt.layer(
line, selectors, points, rules, text, text2
).properties(
width=800, height=500, title="Comparison of various Forecasting Models"
).interactive()
In the above chart, the red line is the test set and the rest are forecasts. You can zoom in/out and pan to inspect the fit over the test set.
Although there are many different ways to combine forecasts, simple averaging often works as well as more complex methods and is easier to implement/monitor/debug. As we saw above, some forecasts are above the test set and some are below, so hopefully averaging will bring the combination closer to the actual values.
We have 7 different models, which gives 127 possible combinations. Let's compare their RMSE with the fc_combo() function I wrote. This function averages the forecasts and calculates the RMSE.
Note: This function uses mean() to average the forecasts, but you should try median() as well. Depending on the data and the skewness, median() might work better in some cases.
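The fc_combo() definition isn't shown here, but a minimal sketch of the idea — average every subset of the forecast columns and report the RMSE of the averaged forecast against the actuals — might look like this (hypothetical helper named fc_combo_sketch, not the exact function used below):

```python
from itertools import combinations
from statsmodels.tools.eval_measures import rmse

def fc_combo_sketch(fc_df, actual):
    """Average every subset of forecast columns and print its RMSE against the actuals."""
    cols = fc_df.columns
    for r in range(1, len(cols) + 1):
        for combo in combinations(cols, r):
            avg_fc = fc_df[list(combo)].mean(axis=1)   # simple average; try .median(axis=1) too
            print(combo, "RMSE ==>", rmse(actual, avg_fc).round(1))

# e.g. fc_combo_sketch(forecasts.iloc[:, 1:], forecasts["Sales"])
```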
forecasts
Date | Sales | py_snaive | py_hw | LAdA | AAdA | py_sarima | sarima_log | fb1
---|---|---|---|---|---|---|---|---
2016-09-30 | 773.0 | 681.0 | 812.646500 | 760.681119 | 724.566985 | 782.950175 | 797.175115 | 716.954330 |
2016-12-31 | 592.0 | 557.0 | 649.896932 | 620.038936 | 612.542359 | 677.965781 | 649.770583 | 517.539078 |
2017-03-31 | 627.0 | 628.0 | 750.899727 | 698.970622 | 687.029723 | 755.245430 | 743.222703 | 654.222708 |
2017-06-30 | 725.0 | 707.0 | 812.897039 | 780.999108 | 724.246686 | 810.558120 | 802.853193 | 715.815974 |
2017-09-30 | 854.0 | 681.0 | 941.221318 | 845.246326 | 784.520604 | 910.805446 | 932.851318 | 761.781268 |
2017-12-31 | 661.0 | 557.0 | 752.695160 | 687.517889 | 671.296876 | 798.603878 | 762.493309 | 571.860845 |
fc_combo(forecasts.iloc[:,1:])
('py_snaive',) RMSE ==> 92.0 ('py_hw',) RMSE ==> 85.7 ('LAdA',) RMSE ==> 40.9 ('AAdA',) RMSE ==> 43.4 ('py_sarima',) RMSE ==> 94.4 ('sarima_log',) RMSE ==> 81.7 ('fb1',) RMSE ==> 65.8 ('py_snaive', 'py_hw') RMSE ==> 36.1 ('py_snaive', 'LAdA') RMSE ==> 48.6 ('py_snaive', 'AAdA') RMSE ==> 61.8 ('py_snaive', 'py_sarima') RMSE ==> 43.4 ('py_snaive', 'sarima_log') RMSE ==> 36.1 ('py_snaive', 'fb1') RMSE ==> 77.2 ('py_hw', 'LAdA') RMSE ==> 60.3 ('py_hw', 'AAdA') RMSE ==> 49.3 ('py_hw', 'py_sarima') RMSE ==> 89.0 ('py_hw', 'sarima_log') RMSE ==> 83.5 ('py_hw', 'fb1') RMSE ==> 35.1 ('LAdA', 'AAdA') RMSE ==> 37.6 ('LAdA', 'py_sarima') RMSE ==> 65.3 ('LAdA', 'sarima_log') RMSE ==> 58.5 ('LAdA', 'fb1') RMSE ==> 37.1 ('AAdA', 'py_sarima') RMSE ==> 56.8 ('AAdA', 'sarima_log') RMSE ==> 48.4 ('AAdA', 'fb1') RMSE ==> 47.4 ('py_sarima', 'sarima_log') RMSE ==> 87.5 ('py_sarima', 'fb1') RMSE ==> 38.7 ('sarima_log', 'fb1') RMSE ==> 33.5 ('py_snaive', 'py_hw', 'LAdA') RMSE ==> 36.1 ('py_snaive', 'py_hw', 'AAdA') RMSE ==> 37.3 ('py_snaive', 'py_hw', 'py_sarima') RMSE ==> 46.8 ('py_snaive', 'py_hw', 'sarima_log') RMSE ==> 42.0 ('py_snaive', 'py_hw', 'fb1') RMSE ==> 39.3 ('py_snaive', 'LAdA', 'AAdA') RMSE ==> 45.2 ('py_snaive', 'LAdA', 'py_sarima') RMSE ==> 40.9 ('py_snaive', 'LAdA', 'sarima_log') RMSE ==> 35.8 ('py_snaive', 'LAdA', 'fb1') RMSE ==> 52.0 ('py_snaive', 'AAdA', 'py_sarima') RMSE ==> 42.8 ('py_snaive', 'AAdA', 'sarima_log') RMSE ==> 37.8 ('py_snaive', 'AAdA', 'fb1') RMSE ==> 60.9 ('py_snaive', 'py_sarima', 'sarima_log') RMSE ==> 46.4 ('py_snaive', 'py_sarima', 'fb1') RMSE ==> 41.7 ('py_snaive', 'sarima_log', 'fb1') RMSE ==> 39.6 ('py_hw', 'LAdA', 'AAdA') RMSE ==> 46.0 ('py_hw', 'LAdA', 'py_sarima') RMSE ==> 71.0 ('py_hw', 'LAdA', 'sarima_log') RMSE ==> 67.1 ('py_hw', 'LAdA', 'fb1') RMSE ==> 36.0 ('py_hw', 'AAdA', 'py_sarima') RMSE ==> 63.4 ('py_hw', 'AAdA', 'sarima_log') RMSE ==> 58.8 ('py_hw', 'AAdA', 'fb1') RMSE ==> 33.5 ('py_hw', 'py_sarima', 'sarima_log') RMSE ==> 86.5 ('py_hw', 'py_sarima', 'fb1') RMSE ==> 49.5 ('py_hw', 'sarima_log', 'fb1') RMSE ==> 46.0 ('LAdA', 'AAdA', 'py_sarima') RMSE ==> 50.8 ('LAdA', 'AAdA', 'sarima_log') RMSE ==> 45.4 ('LAdA', 'AAdA', 'fb1') RMSE ==> 36.9 ('LAdA', 'py_sarima', 'sarima_log') RMSE ==> 70.0 ('LAdA', 'py_sarima', 'fb1') RMSE ==> 38.9 ('LAdA', 'sarima_log', 'fb1') RMSE ==> 35.0 ('AAdA', 'py_sarima', 'sarima_log') RMSE ==> 62.8 ('AAdA', 'py_sarima', 'fb1') RMSE ==> 37.7 ('AAdA', 'sarima_log', 'fb1') RMSE ==> 33.3 ('py_sarima', 'sarima_log', 'fb1') RMSE ==> 48.5 ('py_snaive', 'py_hw', 'LAdA', 'AAdA') RMSE ==> 36.4 ('py_snaive', 'py_hw', 'LAdA', 'py_sarima') RMSE ==> 45.3 ('py_snaive', 'py_hw', 'LAdA', 'sarima_log') RMSE ==> 41.7 ('py_snaive', 'py_hw', 'LAdA', 'fb1') RMSE ==> 35.3 ('py_snaive', 'py_hw', 'AAdA', 'py_sarima') RMSE ==> 42.7 ('py_snaive', 'py_hw', 'AAdA', 'sarima_log') RMSE ==> 38.7 ('py_snaive', 'py_hw', 'AAdA', 'fb1') RMSE ==> 38.7 ('py_snaive', 'py_hw', 'py_sarima', 'sarima_log') RMSE ==> 53.3 ('py_snaive', 'py_hw', 'py_sarima', 'fb1') RMSE ==> 36.4 ('py_snaive', 'py_hw', 'sarima_log', 'fb1') RMSE ==> 33.6 ('py_snaive', 'LAdA', 'AAdA', 'py_sarima') RMSE ==> 40.4 ('py_snaive', 'LAdA', 'AAdA', 'sarima_log') RMSE ==> 36.5 ('py_snaive', 'LAdA', 'AAdA', 'fb1') RMSE ==> 47.5 ('py_snaive', 'LAdA', 'py_sarima', 'sarima_log') RMSE ==> 44.9 ('py_snaive', 'LAdA', 'py_sarima', 'fb1') RMSE ==> 37.6 ('py_snaive', 'LAdA', 'sarima_log', 'fb1') RMSE ==> 35.3 ('py_snaive', 'AAdA', 'py_sarima', 'sarima_log') RMSE ==> 42.7 ('py_snaive', 'AAdA', 'py_sarima', 'fb1') 
RMSE ==> 41.4 ('py_snaive', 'AAdA', 'sarima_log', 'fb1') RMSE ==> 39.1 ('py_snaive', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 36.2 ('py_hw', 'LAdA', 'AAdA', 'py_sarima') RMSE ==> 57.0 ('py_hw', 'LAdA', 'AAdA', 'sarima_log') RMSE ==> 53.5 ('py_hw', 'LAdA', 'AAdA', 'fb1') RMSE ==> 34.4 ('py_hw', 'LAdA', 'py_sarima', 'sarima_log') RMSE ==> 73.5 ('py_hw', 'LAdA', 'py_sarima', 'fb1') RMSE ==> 46.9 ('py_hw', 'LAdA', 'sarima_log', 'fb1') RMSE ==> 44.1 ('py_hw', 'AAdA', 'py_sarima', 'sarima_log') RMSE ==> 67.4 ('py_hw', 'AAdA', 'py_sarima', 'fb1') RMSE ==> 42.8 ('py_hw', 'AAdA', 'sarima_log', 'fb1') RMSE ==> 39.5 ('py_hw', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 56.6 ('LAdA', 'AAdA', 'py_sarima', 'sarima_log') RMSE ==> 56.5 ('LAdA', 'AAdA', 'py_sarima', 'fb1') RMSE ==> 37.5 ('LAdA', 'AAdA', 'sarima_log', 'fb1') RMSE ==> 34.1 ('LAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 46.2 ('AAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 42.4 ('py_snaive', 'py_hw', 'LAdA', 'AAdA', 'py_sarima') RMSE ==> 42.1 ('py_snaive', 'py_hw', 'LAdA', 'AAdA', 'sarima_log') RMSE ==> 39.0 ('py_snaive', 'py_hw', 'LAdA', 'AAdA', 'fb1') RMSE ==> 35.8 ('py_snaive', 'py_hw', 'LAdA', 'py_sarima', 'sarima_log') RMSE ==> 50.5 ('py_snaive', 'py_hw', 'LAdA', 'py_sarima', 'fb1') RMSE ==> 36.7 ('py_snaive', 'py_hw', 'LAdA', 'sarima_log', 'fb1') RMSE ==> 34.3 ('py_snaive', 'py_hw', 'AAdA', 'py_sarima', 'sarima_log') RMSE ==> 47.2 ('py_snaive', 'py_hw', 'AAdA', 'py_sarima', 'fb1') RMSE ==> 36.5 ('py_snaive', 'py_hw', 'AAdA', 'sarima_log', 'fb1') RMSE ==> 34.0 ('py_snaive', 'py_hw', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 40.2 ('py_snaive', 'LAdA', 'AAdA', 'py_sarima', 'sarima_log') RMSE ==> 42.0 ('py_snaive', 'LAdA', 'AAdA', 'py_sarima', 'fb1') RMSE ==> 38.2 ('py_snaive', 'LAdA', 'AAdA', 'sarima_log', 'fb1') RMSE ==> 36.0 ('py_snaive', 'LAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 36.5 ('py_snaive', 'AAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 36.6 ('py_hw', 'LAdA', 'AAdA', 'py_sarima', 'sarima_log') RMSE ==> 61.3 ('py_hw', 'LAdA', 'AAdA', 'py_sarima', 'fb1') RMSE ==> 42.3 ('py_hw', 'LAdA', 'AAdA', 'sarima_log', 'fb1') RMSE ==> 39.6 ('py_hw', 'LAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 52.8 ('py_hw', 'AAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 48.7 ('LAdA', 'AAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 41.9 ('py_snaive', 'py_hw', 'LAdA', 'AAdA', 'py_sarima', 'sarima_log') RMSE ==> 46.0 ('py_snaive', 'py_hw', 'LAdA', 'AAdA', 'py_sarima', 'fb1') RMSE ==> 36.5 ('py_snaive', 'py_hw', 'LAdA', 'AAdA', 'sarima_log', 'fb1') RMSE ==> 34.3 ('py_snaive', 'py_hw', 'LAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 40.2 ('py_snaive', 'py_hw', 'AAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 38.5 ('py_snaive', 'LAdA', 'AAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 36.5 ('py_hw', 'LAdA', 'AAdA', 'py_sarima', 'sarima_log', 'fb1') RMSE ==> 47.1
For the final model, we use all the data and not just the training set. If we want a single best model, we will choose the Log ETS(A,Ad,A). The best combination model is ('AAdA', 'sarima_log', 'fb1'), which has the lowest RMSE. The difference between them is not significant when accuracy is compared. In the next blog, I am going to demonstrate how to deploy this forecasting model in PowerBI & AzureML, so to keep things simple, I will use the Log ETS(A,Ad,A) model. Also, ETS allows simulations, which can be used for aggregate forecasting and communicating uncertainty. Forecast combination is a fascinating topic and deserves a separate blog post.
ets_model=sm.tsa.statespace.ExponentialSmoothing(np.log(data),
trend=True,
initialization_method= 'heuristic',
seasonal=4,
damped_trend=True).fit()
results = ets_model.get_forecast(4)
results_df=results.summary_frame(alpha=0.05).apply(np.exp)[["mean","mean_ci_lower","mean_ci_upper"]]
results_df.round(1)
Date | mean | mean_ci_lower | mean_ci_upper
---|---|---|---
2018-03-31 | 692.5 | 599.8 | 799.5 |
2018-06-30 | 799.1 | 692.1 | 922.6 |
2018-09-30 | 939.4 | 813.6 | 1084.6 |
2018-12-31 | 725.6 | 628.5 | 837.8 |
fig, ax = plt.subplots()
results_df['mean'].plot(ax=ax, legend=True, label="Forecast" )
ax.fill_between(results_df.index, results_df['mean_ci_lower'],
results_df['mean_ci_upper'], alpha=0.2, color='gray')
data["Sales"].plot(legend=True, label="Data", figsize=(12,8));
Forecasts are based on history and are probabilistic. When creating and communicating forecasts, it is important that the stakeholders who will be using them understand the uncertainty in the estimates. Forecasts have four sources of uncertainty:
While it's not possible to estimate all sources of uncertainty, the forecaster should provide enough information/tools for the stakeholders to understand and gauge the risk in using the forecast. Stochastic simulation can be used for risk modeling.
Imagine that the stakeholders are interested in the aggregate sales for 2018, i.e. the total sales and not just the quarterly sales. The obvious way is to calculate the quarterly forecasts, as we did, and then sum them up to get the annual sales. While this may provide a good estimate, it will not be correct. Since the quarterly forecasts are distributions, we need to account for the variance in those distributions when calculating the aggregate. If the forecasts were based on the median or a percentile, we cannot simply add them up. Also, it's possible that not all forecasts come from normal distributions, and so they cannot simply be added.
Monte Carlo simulation can be used to overcome these challenges. Below we will generate 5000 future forecasts based on the ETS model we have and then calculate the aggregates, prediction intervals and cumulative distribution function for it.
#Simulating 5000 futures
sim_frame=ets_model.simulate(4,
anchor='2018-03-31',
repetitions=5000 ).T.reset_index().iloc[:,2:].apply(np.exp).round(1)
sim_frame["aggregate"] = sim_frame.sum(axis=1)
sim_frame.tail()
 | 2018-03-31 00:00:00 | 2018-06-30 00:00:00 | 2018-09-30 00:00:00 | 2018-12-31 00:00:00 | aggregate
---|---|---|---|---|---
4995 | 632.1 | 858.9 | 868.6 | 804.4 | 3164.0 |
4996 | 732.0 | 741.2 | 1019.0 | 698.6 | 3190.8 |
4997 | 721.6 | 914.4 | 938.7 | 785.9 | 3360.6 |
4998 | 627.2 | 868.6 | 882.2 | 728.8 | 3106.8 |
4999 | 585.7 | 905.0 | 873.0 | 765.6 | 3129.3 |
mean_2018_sales = np.mean(sim_frame["aggregate"])
print("25th percntile Sales for FY2018:",np.quantile(sim_frame["aggregate"], 0.25).round(0))
print("\nAvg Sales for FY2018, Simulation:",mean_2018_sales.round(0))
print("\n75th percntile Sales for FY2018:",np.quantile(sim_frame["aggregate"], 0.75).round(0))
print("\nFY2018 Based on Quarterly Forecasts:",results_df["mean"].sum().round(0))
25th percntile Sales for FY2018: 3084.0 Avg Sales for FY2018, Simulation: 3163.0 75th percntile Sales for FY2018: 3241.0 FY2018 Based on Quarterly Forecasts: 3157.0
As you can see, if we just summed up the quarterly forecasts, the aggregate for 2018 would be 3157, and if we used the simulations it would be 3163. In this case the difference is not much and is within the forecast error but that may not always be the case.
When communicating forecasts (or any other uncertainty), providing a range is a better solution than providing point estimates. E.g. we calculated the 25th & 75th percentiles for the aggregate forecast. Thus, we could say that we are 50% confident that the FY2018 sales will be between 3084 and 3241, i.e. P50 [3084, 3241]. This can also be interpreted as: there is only a 25% chance that sales will be above 3241 next year. This conveys the uncertainty far better than point estimates.
It's common to calculate and show a 95% confidence interval around the mean forecast. 95% is a conventional choice and doesn't have any particular significance. A confidence interval (CI) is not the same as a prediction interval (PI). The CI is often confused with a measure of forecast uncertainty and used to show the upper and lower bounds for the forecast - it's not!
The confidence interval is around a population point estimate. E.g. if we calculate the mean, we compute the CI to convey the error in estimating it. In our case the mean for Q1 is 692.5 and the lower and upper CI bounds are [599.8, 799.5]. We are 95% confident that the mean will be within these bounds. This doesn't say the forecast will be between these bounds with 95% confidence. Think of the CI as the error around a population point estimate due to limited sample size. The CI will be narrower if we increase the sample size or lower the alpha. A point estimate means calculating a single number, e.g. a mean, median or percentile, about the population/sample. Also note whether the CI is calculated using the parameters of the model (residuals).
To calculate the upper and lower bounds on the forecast, we need the prediction interval. The PI can be calculated using the forecast errors or simulations. The PI gives us the upper and lower bounds based on the model we have created. By definition PI > CI. In our example, the 97.5th percentile is 923 and the 2.5th percentile is 691, so we can be 95% confident that the forecast for Q1 2018 will be between 691 and 923!
Many years ago, at the end of my Business Simulations class in grad school, Prof. Nelson said the main take-away from his class should be the difference between PI and CI and the correct use/interpretation of the CI. I hope after reading this blog, among other things, you will also learn to interpret the CI correctly, especially in the context of forecasting. The CI is useless for forecasting.
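A small self-contained sketch (simulated normal draws, not the sales model) of the difference: the 95% CI of the mean shrinks as the sample grows, while the 95% PI for a new observation stays roughly as wide as the underlying distribution.

```python
import numpy as np

np.random.seed(0)
draws = np.random.normal(loc=700, scale=60, size=5000)   # pretend these are simulated forecasts

mean = draws.mean()
se   = draws.std(ddof=1) / np.sqrt(len(draws))

ci = (mean - 1.96 * se, mean + 1.96 * se)                           # 95% CI of the *mean*
pi = (np.quantile(draws, 0.025), np.quantile(draws, 0.975))         # 95% PI for a *new draw*

print(f"95% CI of the mean: {ci[0]:.1f} to {ci[1]:.1f}")   # very narrow
print(f"95% PI:             {pi[0]:.1f} to {pi[1]:.1f}")   # roughly mean +/- 2 std devs, much wider
```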
#2.5th and 97.5th percentiles of the simulated values give a 95% prediction interval
q1_95=np.quantile(sim_frame.iloc[:,1],0.975)
q1_05=np.quantile(sim_frame.iloc[:,1],0.025)
print("PI_95%:", q1_95.round(1))
print("PI_5%:", q1_05.round(1))
PI_95%: 923.3 PI_5%: 691.3
#Reference: http://stanford.edu/~raejoon/blog/2017/05/16/python-recipes-for-cdfs.html
num_bins = 100
counts, bin_edges = np.histogram (sim_frame["aggregate"], bins=num_bins, normed=True)
cdf = np.cumsum (counts)
cdf_x = bin_edges[1:]
cdf_y = cdf/cdf[-1]
cdf_df = pd.DataFrame({'Sales':cdf_x, 'Percentile':cdf_y})
cdf_df.head()
 | Sales | Percentile
---|---|---
0 | 2798.317 | 0.0002 |
1 | 2806.334 | 0.0002 |
2 | 2814.351 | 0.0006 |
3 | 2822.368 | 0.0008 |
4 | 2830.385 | 0.0010 |
#collapse-hide
# Create a selection that chooses the nearest point & selects based on x-value
nearest = alt.selection(type='single', nearest=True, on='mouseover',
fields=['Sales'], empty='none')
# The basic line
line = alt.Chart(cdf_df).mark_line(interpolate='basis').encode(
x='Sales:Q',
y='Percentile:Q', tooltip = ["Sales", "Percentile"])
# Transparent selectors across the chart. This is what tells us
# the x-value of the cursor
selectors = alt.Chart(cdf_df).mark_point().encode(
x='Sales:Q',
opacity=alt.value(0),
).add_selection(
nearest
)
# Draw points on the line, and highlight based on selection
points = line.mark_point().encode(
opacity=alt.condition(nearest, alt.value(1), alt.value(0))
)
# Draw text labels near the points, and highlight based on selection
text1 = line.mark_text(align='left', baseline="bottom",dx=5, dy=-5).encode(
text=alt.condition(nearest, 'Percentile:Q', alt.value(' '))
)
text2 = line.mark_text(align='left', baseline="top", dx=5, dy=-5).encode(
text=alt.condition(nearest, 'Sales:Q', alt.value(' '))
)
# Draw a rule at the location of the selection
rules = alt.Chart(cdf_df).mark_rule(color='gray').encode(
x='Sales:Q',
).transform_filter(
nearest
)
# Put the five layers into a chart and bind the data
alt.layer(
line, selectors, points, rules, text1, text2
).properties(
width=600, height=300, title ="FY2018 Sales Probability "
).configure_axis(grid=False)
This is a dynamic and interactive way for the stakeholders to understand forecast uncertainty. For example, if this French retailer has a revenue target of at least 3300, they have only a 12% (1 - 0.88) chance of achieving it based on this forecasting model. The stakeholders will find this more beneficial for decision making than the typical static forecast chart. You could create a similar chart for all quarters and make it more interactive. Power BI can be used for that.
Import rpy2.
Use pandas2ri.activate() to convert pandas dataframes to R dataframes.
Use %load_ext rpy2.ipython to enable the %%R magic command for running R in a cell.
import rpy2
import warnings
warnings.filterwarnings('ignore')
from rpy2.robjects import pandas2ri
import rpy2.rinterface as rinterface
pandas2ri.activate()
%load_ext rpy2.ipython
%%R -i data -o fets,fc_ets
library(fpp2)
r_train <- ts(data$Sales, start=c(2012,01), frequency=4)
fets <- r_train %>% ets()
fc_ets<-r_train %>% ets() %>% forecast(h=4) %>% summary()
r_train %>% ets() %>% forecast(h=4) %>% autoplot()
print(fets,end="")
ETS(M,A,M) Call: ets(y = .) Smoothing parameters: alpha = 0.1564 beta = 1e-04 gamma = 2e-04 Initial states: l = 339.3571 b = 16.5403 s = 0.8781 1.1272 1.0301 0.9646 sigma: 0.05 AIC AICc BIC 241.8271 254.6843 252.4296
ets() returned ETS(M,A,M) as the best model based on AICc, i.e. "multiplicative" error, "additive" trend and "multiplicative" seasonality. We obtained similar results in Python, but because we used a log transform, the multiplicative components became additive.
fc_ets
 | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95
---|---|---|---|---|---
0 | 737.458963 | 690.235833 | 784.682093 | 665.237417 | 809.680508 |
1 | 804.569111 | 752.445770 | 856.692452 | 724.853340 | 884.284882 |
2 | 899.154431 | 840.263229 | 958.045633 | 809.088110 | 989.220753 |
3 | 714.953180 | 667.641498 | 762.264862 | 642.596206 | 787.310153 |
%%R -i train -o fsarima
r_train <- ts(train$Sales, start=c(2012,01), frequency=4)
fsarima <- r_train %>% auto.arima(stepwise=FALSE)
%%R -i data -o fc_arima
r_data <- ts(data$Sales, start=c(2012,01), frequency=4)
fc_arima<- r_data %>% auto.arima(stepwise=FALSE) %>% forecast(h=4) %>% summary()
r_data %>% auto.arima(stepwise=FALSE) %>% forecast(h=4) %>% autoplot()
print(fsarima, end="")
Series: . ARIMA(0,1,0)(1,1,0)[4] Coefficients: sar1 -0.6495 s.e. 0.2501 sigma^2 estimated as 433: log likelihood=-58.48 AIC=120.96 AICc=122.16 BIC=122.09
R's auto.arima found ARIMA(0,1,0)(1,1,0)[4], the same seasonal structure that Python's auto_arima identified (Python additionally included a non-significant MA(1) term).
We explored various time series forecasting methods and model selection procedures, and created an ensemble forecast that can be used to improve forecast accuracy. We also briefly covered communicating uncertainty in forecasts. I did not include any deep-learning methods as they generally do well on large datasets with multiple features. Classical methods such as HW/ETS & ARIMA provide more interpretable results, which can prove useful for understanding the behaviour of the time series and making business decisions. In the next blog I will cover deploying this model in PowerBI.