In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))

from formats import load_style
load_style(css_style='custom2.css', plot_style=False)
In [2]:

# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import minimize
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import TimeSeriesSplit

%watermark -a 'Ethen' -d -t -v -p numpy,scipy,pandas,sklearn,matplotlib
Ethen 2018-08-22 12:37:16 

CPython 3.6.4
IPython 6.4.0

numpy 1.14.1
scipy 1.1.0
pandas 0.23.0
sklearn 0.19.1
matplotlib 2.2.2

Getting Started with Time Series Analysis

The gist behind time series analysis is that we are given some quantitative measures about the past and we wish to use these informations to predict the future to enable better planning, decision-making and so on. The main difference between time series problem and traditional prediction problems is that: in traditional prediction problems such as image classification, the data points there are assumed to be independent of one another. Whereas, time series analysis' data points have a temporal nature in them, i.e. The time dimension adds an explicit ordering to our data points that should be preserved because they can provide additional/important information to the learning algorithms.

As an example, we will look at a real mobile game data that depicts ads watched per hour.

In [3]:
ads_col = 'Ads'
time_col = 'Time'
input_path = os.path.join('data', 'ads.csv')
ads = pd.read_csv(input_path, index_col=time_col, parse_dates=[time_col])
print('dimension: ', ads.shape)
dimension:  (216, 1)
2017-09-13 00:00:00 80115
2017-09-13 01:00:00 79885
2017-09-13 02:00:00 89325
2017-09-13 03:00:00 101930
2017-09-13 04:00:00 121630
In [4]:
plt.figure(figsize=(15, 7))
plt.title('Ads watched (hourly data)')


To make a prediction of the next point given this time series, one of the most naive method that we can use is the arithmetic average of all the previously observed data points. We take all the values we know, calculate the average and bet that that's going to be the next value. Of course it won't be it exactly, but it probably will be somewhere in the ballpark (e.g. your final school grade may be the average of all the previous grades).

\begin{align} \hat{y}_{x+1} = \dfrac{1}{x}\sum_{i=1}^{x}y_i \end{align}
  • Where $\hat{y}_{x+1}$ refers to the predicted value at time $x + 1$
In [5]:
def average(series):
    return np.mean(series)

series = ads[ads_col]

An improvement over simple arithmetic average is the average of $n$ last points. The rationale here is that only recent values matter and adding previous data points into consideration would only be adding noise. Calculation of the moving average involves what is sometimes called a "sliding window" of size $n$.

\begin{align} \hat{y}_{x+1} = \dfrac{1}{x} \sum_{i=x - n}^{x}y_i \end{align}
In [6]:
def moving_average(series, n):
    """Calculate average of last n observations"""
    return np.mean(series[-n:])

# prediction for the last observed day (past 24 hours)
moving_average(series, 24)

Although we can't really use this method for making predictions really far out into the future (because in order to get the value for the next step, we need the previous values to be actually observed), the moving average method can be used to smooth the original time series for spotting trend. As we'll soon see, the wider the window, the smoother the trend.

In [7]:
def plot_moving_avg(series, window):
    rolling_mean = series.rolling(window=window).mean()
    plt.title('Moving average\n window size = {}'.format(window))
    plt.plot(rolling_mean, label='Rolling mean trend')

    plt.plot(series[window:], label='Actual values')

plot_moving_avg(ads, window=4)

While it might not seem that useful when we set the window size to be 4, if we were to apply the smoothing on a 24 hour window, we would get a daily trend that shows us a more interesting and perhaps expected pattern. That is: during the weekends, the values are higher (more time to play on the weekends?) while fewer ads are watched on weekdays.

In [8]:
plot_moving_avg(ads, window=24)

Side note, the following code chunk shows an implementation of moving average without using pandas' rolling functionality.

In [9]:
def moving_average(arr, window):
    ret = np.cumsum(arr)
    ret[window:] = ret[window:] - ret[:-window]
    return ret[window - 1:] / window

arr = np.arange(20)
print(moving_average(arr, window=4))
[ 1.5  2.5  3.5  4.5  5.5  6.5  7.5  8.5  9.5 10.5 11.5 12.5 13.5 14.5
 15.5 16.5 17.5]

Exponential Smoothing

Now let's see what happens if, instead of only weighting the time series' last $k$ values, we would weight all available observations while exponentially decreasing the weights as we move further back in time.

A weighted moving average is a moving average where within the sliding window values are given different weights, typically so that more recent points matter more. Instead of only weighting the time series' last $k$ values, however, we could instead consider all of the data points, while assigning exponentially smaller weights as we go back in time. This method is so called Exponential Smoothing. The mathematical notation for this method is:

\begin{align} \hat{y}_x = \alpha \cdot y_x + (1 - \alpha) \cdot \hat{y}_{x-1} \end{align}

To compute the formula, we pick an $0 < \alpha < 1$ and a starting value $\hat{y}_{0}$ (i.e. the first value of the observed data), and then calculate $\hat{y}_x$ recursively for $x = 1, 2, 3, \dots$. As we'll see in later sections, $\hat{y}_x$ is also referred to as levels.

We can think of $\alpha$ as the smoothing factor or memory decay rate, it defines how quickly we will "forget" the last available true observation. The smaller $\alpha$ is, the more influence the previous observations have and the smoother the series is. In other words, the higher the $\alpha$, the faster the method "forgets" about the past.

In [10]:
def exponential_smoothing(series, alpha):
    """given a series and alpha, return series of expoentially smoothed points"""
    results = np.zeros_like(series)

    # first value remains the same as series,
    # as there is no history to learn from
    results[0] = series[0] 
    for t in range(1, series.shape[0]):
        results[t] = alpha * series[t] + (1 - alpha) * results[t - 1]

    return results
In [11]:
def plot_exponential_smoothing(series, alphas):
    """Plots exponential smoothing with different alphas."""  
    plt.figure(figsize=(15, 7))
    for alpha in alphas:
        plt.plot(exponential_smoothing(series, alpha), label='Alpha {}'.format(alpha))

    plt.plot(series, label='Actual')
    plt.title('Exponential Smoothing')

plot_exponential_smoothing(series.values, [0.3, 0.05])

Keep in mind that each series has its best $\alpha$ value. The process of finding the best $\alpha$ is referred to as fitting and we will discuss it in later section.

One very important characteristic of exponential smoothing is that remarkably, they can only forecast the current level. If we look at the formula again $\hat{y}_x = \alpha \cdot y_x + (1-\alpha) \cdot \hat{y}_{x-1}$, we can see that in order to make the prediction for $\hat{y}_x$ we also need to have the observed value $y_x$. In some other software like R, if we were to use it to predict the future, it will simply assign all future prediction the last value of the time series.

Double Exponential Smoothing - Holt Method

The idea behind Double Exponential Smoothing (a.k.a the Holt Method) is exponential smoothing applied to both level and trend. The basic idea is saying if our time series has a trend, we can incorporate that information to do better than just estimating the current level and using that to forecast the future observations. To achieve this, we will introduce two new notations: the current "trend", denoted by $T$ (we can think of it as the slope of the time series), as well as the current "level", denoted by $\ell$.

To express this in mathematical notation we now need three equations: one for level, one for the trend and one to combine the level and trend to get the expected $\hat{y}$.

\begin{align} \ell_x &= \alpha y_x + (1-\alpha)(\ell_{x-1} + T_{x-1})& \mbox{level} \nonumber \\ T_x &= \beta(\ell_x - \ell_{x-1}) + (1-\beta)T_{x-1} & \mbox{trend} \nonumber \\ \hat{y}_{x+1} &= \ell_x + T_x & \mbox{1 step forecast}\\ \end{align}
  • $\ell$, level is simply predicted point. But because now it's going to be only part of calculation of the forecast (our forecast is a combination of predicted point and trend), we can no longer refer to it as $\hat{y}$
  • The second equation introduces $0 < \beta< 1$, the trend coefficient. As with $\alpha$, some values of $\beta$ work better than others depending on the series. When $\beta$ is big, we won't give too much weight to the past trends when estimating current trend
  • Similar to exponential smoothing, where we used the first observed value as the first expected value, we can use the first observed trend as the first expected trend, i.e. we'll use the first two points to compute the initial trend, i.e. $(y_x - y_{x-1}) / 1$.

Side note: Additive vs Multiplicative

Another thing to know about trend is that instead of subtracting $y_{x−1}$ from $y_x$ to estimate its initial value, we could instead divide one by the other thereby getting a ratio. The difference between these two approaches is similar to how we can say something costs $20 more or 5% more. The variant of the method based on subtraction is known as additive, while the one based on division is known as multiplicative. The additive method, is more straightforward to understand. Thus, we will stick with the additive method here. In practice, we can always try both to see which one is better.

To perform k-step-ahead forecast, we can use linear extrapolation:

\begin{align} \hat{y}_{x+1} &= \ell_x + k T_x \end{align}
In [12]:
def double_exponential_smoothing(series, alpha, beta, n_preds=2):
    Given a series, alpha, beta and n_preds (number of
    forecast/prediction steps), perform the prediction.
    n_record = series.shape[0]
    results = np.zeros(n_record + n_preds)

    # first value remains the same as series,
    # as there is no history to learn from;
    # and the initial trend is the slope/difference
    # between the first two value of the series
    level = series[0]
    results[0] = series[0]
    trend = series[1] - series[0]
    for t in range(1, n_record + 1):
        if t >= n_record:
            # forecasting new points
            value = results[t - 1]
            value = series[t]

        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend 
        results[t] = level + trend

    # for forecasting beyond the first new point,
    # the level and trend is all fixed
    if n_preds > 1:
        results[n_record + 1:] = level + np.arange(2, n_preds + 1) * trend

    return results
In [13]:
def plot_double_exponential_smoothing(series, alphas, betas):
    """Plots double exponential smoothing with different alphas and betas."""    
    plt.figure(figsize=(20, 8))
    for alpha, beta in zip(alphas, betas):
        results = double_exponential_smoothing(series, alpha, beta)
        plt.plot(results, label='Alpha {}, beta {}'.format(alpha, beta))

    plt.plot(series, label='Actual')
    plt.title('Double Exponential Smoothing')

plot_double_exponential_smoothing(series.values, alphas=[0.9, 0.9], betas=[0.1, 0.9])