These notes are a Python-centered read-along of the excellent Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos $^1$. I'm Mike Richman, the author of these notes. I am about six months into my first data science job, which comes on the heels of about a decade as a Ph.D. student and postdoc in neutrino astrophysics.
In industry, we have to deal with timeseries data in a way that has little in common with my past work as an academic. Prior to this exercise, my only experience with forecasting was an interview take-home for a trading-related job I ultimately did not take.
Deep learning methods are all the rage right now, and rightly so. See here for a recent overview. However, until relatively recently, it was less clear that deep learning could significantly improve on mathematically straightforward models implemented by experienced forecasters.
Given an opportunity for some personal research, I decided to take the time to go through Forecasting: Principles and Practice in detail. This serves two purposes. First, as they say, you must walk before you can run; I should make sure I have the basics down before I get too deep into the latest research architectures. Second, I did not easily come across a comparable resource based on Python.
These notes are in no way a substitute for the original source material; think of them as a companion to it. As noted above, I am new to this stuff, so I may do some things in silly ways. But if you follow the original text with these notes, you should rarely, if ever, feel like you've reached a dead end.
Similarly, the tools available in Python are not (or do not currently appear to be) a complete substitute for those available in R. There are cells in these notebooks that look ridiculous compared to the R code they mimic, and I think only some of this is my fault. Timeseries analysis is a core strength of R. Python has different strengths. If you love Python, or you have no choice but to use it in your work, then tolerating apparent shortcomings in its ecosystem is worth it.
I will include some comments on the code, but only links to the source text for deeper exposition.
The notes start with Chapter 2, as there's essentially no code in Chapter 1.
These notes make use of the following libraries: numpy, pandas, matplotlib, seaborn, scipy, statsmodels, pmdarima, rdatasets, and tqdm.
To keep things streamlined, we standardize the working environment in utils.py, from which I will `import *` in the per-chapter notebooks. ~Because I love manually syncing files~ For convenience, the contents of that file are listed below.
```python
import itertools
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pmdarima as pmd
import rdatasets
from tqdm import tqdm

# plotting defaults: left-aligned medium titles, no top/right spines
plt.rc('axes', titlesize='medium')
plt.rc('axes', titlelocation='left')
plt.rc('axes.spines', right=False)
plt.rc('axes.spines', top=False)
sizets = (8, 4.5)
plt.rc('figure', figsize=sizets)


def summarize(gb, f):
    """Summarize grouped things."""
    return gb.apply(lambda x: pd.Series(f(x)))


def compute(df, f):
    """Compute new (or replacement) columns."""
    newdf = pd.DataFrame(f(df), index=df.index)
    dropcols = [col for col in newdf.columns if col in df.columns]
    if dropcols:
        df = df.drop(columns=dropcols)
    return df.join(newdf)


def set_freq(df, freq=None):
    """Set frequency of DateTimeIndex."""
    if freq is None:
        freq = pd.infer_freq(df.index)
    return df.asfreq(freq)


def extend_timeseries(df, tmax=None, tmin=None, dt=None):
    """Extend timeseries data to later or earlier times."""
    freq = df.index.freq
    if tmax is tmin is dt is None:
        dt = 1
    if tmin is None:
        tmin = df.index.min()
    if tmax is None:
        tmax = df.index.max()
    if dt is not None:
        if isinstance(dt, int):
            if dt > 0:
                tmax += dt * freq
            elif dt < 0:
                tmin += dt * freq
        else:
            dt = pd.to_timedelta(dt)
            if dt > pd.to_timedelta('0d'):
                tmax += dt
            else:
                # a negative dt extends to earlier times
                tmin += dt
    index = pd.date_range(tmin, tmax, freq=freq)
    return df.reindex(index)


def suptitle(fig, text=None, **kw):
    """Add a nice left-aligned suptitle."""
    if text is None:
        # allow calling as suptitle('text') on the current figure
        fig, text = plt.gcf(), fig
    fig = fig.figure or fig
    fig.text(fig.subplotpars.left, .99, text,
             ha='left', va='top', size='large', **kw)


def rlabel(ax, label=None, **kw):
    """Add a right-side axis title."""
    if label is None:
        ax, label = plt.gca(), ax
    bbox = kw.pop('bbox', dict(facecolor='.9', alpha=0.2))
    ax.text(1, .5, label,
            rotation=-90, ha='left', va='center', transform=ax.transAxes,
            bbox=bbox, **kw)


def xdate(ax, fmt=None, freq=None):
    """Tweak x-axis date formatting."""
    dates = plt.matplotlib.dates
    if fmt is None:
        ax, fmt = plt.gca(), ax
    if freq:
        t1, t2 = dates.num2date(ax.get_xticks()[[0, -1]])
        ticks = pd.date_range(t1, t2, freq=freq)
        ax.set_xticks(ticks)
    ax.xaxis.set_major_formatter(dates.DateFormatter(fmt))


def plot_tsresiduals(Y, y, acf_lags=np.r_[1:26]):
    """Plot timeseries residuals for ground truth Y and estimate y."""
    fig = plt.figure()
    gs = plt.GridSpec(3, 2, figure=fig)
    ts_ax = fig.add_subplot(gs[0, :])
    axs = np.array([ts_ax] + [fig.add_subplot(gs[i, j])
                              for j in (0, 1) for i in (1, 2)])
    ax, rax, hax, acfax, pacfax = axs
    mask = ~(np.isnan(Y) | np.isnan(y))
    Y, y = Y[mask], y[mask]
    # I was surprised by this convention but ok
    dy = Y - y
    ax.plot(Y, color='k')
    ax.plot(y)
    ax.set(title='Time Series')
    lim = 1.1 * max(-dy.min(), dy.max())
    lim = -lim, lim
    rax.plot(dy)
    rax.set(ylim=lim, title='Residuals')
    # note: distplot is deprecated as of seaborn 0.11
    sns.distplot(dy, bins=np.linspace(lim[0], lim[1], 22),
                 hist=True, kde=True, rug=True, ax=hax)
    hax.set(title='Residual Distribution')
    sm.graphics.tsa.plot_acf(dy, lags=acf_lags, ax=acfax)
    sm.graphics.tsa.plot_pacf(dy, lags=acf_lags, ax=pacfax)
    for a in axs.ravel():
        a.grid()
    plt.tight_layout()
    return fig, axs


def RMSE(Y, y):
    """Root-mean-square error."""
    return np.sqrt(np.mean((Y - y)**2))


def MAE(Y, y):
    """Mean absolute error."""
    return np.mean(np.abs(Y - y))


def MAPE(Y, y):
    """Mean absolute percent error."""
    return 100 * np.mean(np.abs((Y - y) / Y))


def MASE(Y, y):
    """TODO"""
    return np.nan  # TODO


def tsaccuracy(Ytest, models):
    """Gather some metrics for a few models."""
    fs = RMSE, MAE, MAPE, MASE
    return pd.DataFrame({
        label: [f(Ytest, model.predict(Ytest.index.min(), Ytest.index.max()))
                for f in fs]
        for (label, model) in models.items()
    }, index=[f.__name__ for f in fs]).T


def ciclean(ci_df):
    """Clean up conf_int() result column names."""
    ci_df = ci_df.copy()
    ci_df.columns = 'lower', 'upper'
    return ci_df


legend_right = dict(loc='center left', bbox_to_anchor=[1, .5])
```
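One gap worth flagging: MASE is a TODO stub in utils.py. For reference, the standard definition (from Hyndman & Koehler) scales the test-set MAE by the in-sample MAE of a naive forecast on the training series, so a real implementation needs the training data as an extra argument. Here is a sketch of what a filled-in version might look like; the signature, argument names, and toy numbers are my own, not from the original file:

```python
import numpy as np

def MASE(Ytrain, Ytest, y, m=1):
    """Mean absolute scaled error (hypothetical signature, not the stub above).

    Scales the test-set MAE by the in-sample MAE of the naive forecast
    that repeats the value from m steps earlier (m=1 for non-seasonal data).
    """
    Ytrain, Ytest, y = map(np.asarray, (Ytrain, Ytest, y))
    scale = np.mean(np.abs(Ytrain[m:] - Ytrain[:-m]))
    return np.mean(np.abs(Ytest - y)) / scale

# toy numbers (mine, not from the book):
Ytrain = np.array([1., 3., 2., 4.])   # naive errors 2, -1, 2 -> scale 5/3
Ytest = np.array([5., 6.])
y = np.array([4., 6.])                # test-set MAE = 0.5
print(MASE(Ytrain, Ytest, y))         # 0.5 / (5/3), i.e. approximately 0.3
```

Values below 1 mean the model beats the in-sample naive forecast on average, which is what makes MASE comparable across series with different scales.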
To make sure I could easily access the example data used in the book, I exported the datasets using this R notebook. In hindsight (not to be confused with Hyndsight!), I probably could have made more use of rdatasets.
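As a closing footnote on the utilities above: `set_freq` and `extend_timeseries` mostly wrap a handful of pandas calls (`infer_freq`, `asfreq`, `date_range`, `reindex`). A minimal sketch of those calls on toy data, so the helpers aren't magic; the toy values are mine, not from the book:

```python
import pandas as pd

# a toy monthly series with the frequency deliberately not set
idx = pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01'])
df = pd.DataFrame({'y': [1.0, 2.0, 3.0]}, index=idx)
assert df.index.freq is None

# what set_freq(df) does: infer the frequency and attach it via asfreq
df = df.asfreq(pd.infer_freq(df.index))  # freqstr is now 'MS' (month start)

# what extend_timeseries(df, dt=2) does: reindex onto a longer date_range;
# the two new rows are NaN, ready to hold forecasts
index = pd.date_range(df.index.min(), df.index.max() + 2 * df.index.freq,
                      freq=df.index.freq)
df = df.reindex(index)  # now 5 rows, Jan-May 2020
```

Attaching an explicit frequency up front pays off later: statsmodels' forecasting models rely on the index frequency to label out-of-sample predictions.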