from collections import Counter
from pprint import pprint
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy
from scipy.stats import pearsonr
import watermark
%load_ext watermark
%matplotlib inline
We start by printing out the versions of the libraries we're using, for future reference
%watermark -n -v -m -g -iv
Python implementation: CPython
Python version : 3.8.5
IPython version : 7.19.0

Compiler : Clang 10.0.0
OS : Darwin
Release : 21.2.0
Machine : x86_64
Processor : i386
CPU cores : 16
Architecture: 64bit

Git hash: 81e218c0b3e3e3def5db50d4abcfeed7b0527d92

matplotlib : 3.3.2
statsmodels: 0.12.0
watermark : 2.1.0
json : 2.0.9
pandas : 1.1.3
scipy : 1.6.1
numpy : 1.19.2
Load default figure style
plt.style.use('./d4sci.mplstyle')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
Let us load up one of the datasets from the Anscombe quartet
data = pd.read_csv('data/Anscombe1.dat', sep=' ', header=None)
data.columns = ['x1', 'y']
And make a quick plot
ax = data.plot(x='x1', y='y', kind='scatter', s=300)
ax.legend(['Data'], loc='upper left')
We'll use statsmodels to perform a quick OLS linear regression fit. While we won't be using linear regression explicitly in this lecture, many time series analysis algorithms rely on it, so we include it here for ease of reference. We start by explicitly adding a column of 1s for the intercept:
data['const'] = 1
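Equivalently, statsmodels provides a helper that appends the intercept column for us; a minimal sketch of the same setup:

# Equivalent setup using statsmodels' helper to add the intercept column
X = sm.add_constant(data[['x1']])           # adds a 'const' column of 1s
model_alt = sm.OLS(endog=data['y'], exog=X)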
Perform the fit
model = sm.OLS(endog=data['y'], exog=data[['const', 'x1']])
results = model.fit()
And get a summary table of the results
results.summary()
/Users/bgoncalves/opt/anaconda3/lib/python3.8/site-packages/scipy/stats/stats.py:1603: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11
  warnings.warn("kurtosistest only valid for n>=20 ... continuing "
| Dep. Variable:    | y                | R-squared:          | 0.667   |
|-------------------|------------------|---------------------|---------|
| Model:            | OLS              | Adj. R-squared:     | 0.629   |
| Method:           | Least Squares    | F-statistic:        | 17.99   |
| Date:             | Mon, 07 Mar 2022 | Prob (F-statistic): | 0.00217 |
| Time:             | 15:09:23         | Log-Likelihood:     | -16.841 |
| No. Observations: | 11               | AIC:                | 37.68   |
| Df Residuals:     | 9                | BIC:                | 38.48   |
| Df Model:         | 1                |                     |         |
| Covariance Type:  | nonrobust        |                     |         |

|       | coef   | std err | t     | P>\|t\| | [0.025 | 0.975] |
|-------|--------|---------|-------|---------|--------|--------|
| const | 3.0001 | 1.125   | 2.667 | 0.026   | 0.456  | 5.544  |
| x1    | 0.5001 | 0.118   | 4.241 | 0.002   | 0.233  | 0.767  |

| Omnibus:       | 0.082  | Durbin-Watson:    | 3.212 |
|----------------|--------|-------------------|-------|
| Prob(Omnibus): | 0.960  | Jarque-Bera (JB): | 0.289 |
| Skew:          | -0.122 | Prob(JB):         | 0.865 |
| Kurtosis:      | 2.244  | Cond. No.         | 29.1  |
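Beyond the summary table, the individual quantities are also exposed as attributes of the results object, for example:

print(results.params)      # fitted intercept and slope
print(results.rsquared)    # coefficient of determination
print(results.conf_int())  # 95% confidence intervals for the coefficients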
And calculate the model prediction for each point
Y1 = results.predict(data[['const', 'x1']])
And finally plot the fitted line:
ax = data.plot(x='x1', y='y', kind='scatter', s=300)
ax.plot(data['x1'], Y1, color=colors[1])
ax.legend(['Fit', 'Data'])
Let's load a time series to use as an example: the yearly sunspot activity series
Sun = pd.read_csv('data/sun.csv')
Sun
|     | YEAR   | SUNACTIVITY |
|-----|--------|-------------|
| 0   | 1700.0 | 5.0         |
| 1   | 1701.0 | 11.0        |
| 2   | 1702.0 | 16.0        |
| 3   | 1703.0 | 23.0        |
| 4   | 1704.0 | 36.0        |
| ... | ...    | ...         |
| 304 | 2004.0 | 40.4        |
| 305 | 2005.0 | 29.8        |
| 306 | 2006.0 | 15.2        |
| 307 | 2007.0 | 7.5         |
| 308 | 2008.0 | 2.9         |

309 rows × 2 columns
Sun.set_index('YEAR').plot()
sm.tsa.stattools.acf(Sun['SUNACTIVITY'], nlags=40)
/Users/bgoncalves/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/stattools.py:662: FutureWarning: fft=True will become the default after the release of the 0.12 release of statsmodels. To suppress this warning, explicitly set fft=False.
  warnings.warn(
array([ 1. , 0.82020129, 0.45126849, 0.03957655, -0.27579196, -0.42523943, -0.37659509, -0.15737391, 0.15820254, 0.47309753, 0.65898002, 0.65029082, 0.45666254, 0.16179329, -0.12205105, -0.3161808 , -0.37471125, -0.30605753, -0.1348069 , 0.09158727, 0.2975632 , 0.4207074 , 0.41183954, 0.27020758, 0.04496208, -0.17428715, -0.33045026, -0.37287834, -0.28555061, -0.11794414, 0.08293231, 0.24897507, 0.32752101, 0.28335919, 0.1375272 , -0.05526386, -0.22973205, -0.31338879, -0.29355684, -0.17897285, -0.01769038])
Due to a small issue with the way statsmodels interacts with Jupyter, we terminate the line with a semicolon (or assign the output to a variable) to prevent the plot from being displayed twice.
sm.graphics.tsa.plot_acf(Sun['SUNACTIVITY'], lags=40);
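For intuition, the autocorrelation at lag k is simply the covariance between the series and a copy of itself shifted by k steps, normalized by the overall variance. A quick manual sketch, which should reproduce the values returned by acf above:

# Manual computation of the autocorrelation function, for intuition only
x = Sun['SUNACTIVITY'].values
x = x - x.mean()                 # center the series
denom = (x ** 2).sum()           # overall (scaled) variance

manual_acf = np.array([(x[:len(x) - k] * x[k:]).sum() / denom for k in range(41)])
manual_acf[:5]                   # compare with the first few values above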
The corresponding partial autocorrelation function works in a similar way:
sm.tsa.stattools.pacf(Sun['SUNACTIVITY'])
/Users/bgoncalves/opt/anaconda3/lib/python3.8/site-packages/statsmodels/tsa/stattools.py:1016: FutureWarning: The default number of lags is changing from 40 to min(int(10 * np.log10(nobs)), nobs // 2 - 1) after 0.12 is released. Set the number of lags to an integer to silence this warning.
  warnings.warn(
array([ 1.00000000e+00, 8.22864286e-01, -6.90310216e-01, -1.40572128e-01, 5.60804025e-02, 4.90077270e-03, 1.82247989e-01, 2.19939368e-01, 2.28436387e-01, 2.62692745e-01, -1.64957719e-02, -7.62590556e-03, -1.50968950e-02, -8.28648718e-04, 6.16072896e-02, -8.27061703e-02, -7.88015856e-02, -1.62447551e-01, -9.12918340e-02, 4.47916043e-02, 2.45683035e-03, 1.12617329e-01, -1.31318378e-03, -9.92893352e-02, -5.70561869e-02, 2.29911661e-02, -6.61542747e-02, 9.74598824e-02, 1.11323622e-01, -1.26321199e-01, 2.49774293e-02, -4.76464467e-02, -1.07618291e-02, 5.20777678e-03, 9.02935874e-03, 2.03912400e-02, -4.84266992e-02, 6.88517686e-02, -4.06046146e-02, 4.47681852e-02, 4.77286574e-02])
As before, we terminate the line with a semicolon:
sm.graphics.tsa.plot_pacf(Sun['SUNACTIVITY']);
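The FutureWarning above can be silenced by passing the number of lags explicitly, just as we did for the ACF:

# Specifying the number of lags explicitly avoids the FutureWarning
sm.tsa.stattools.pacf(Sun['SUNACTIVITY'], nlags=40)
sm.graphics.tsa.plot_pacf(Sun['SUNACTIVITY'], lags=40);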
For this example, let's load the GDP dataset
GDP = pd.read_csv('data/GDP.csv', parse_dates=['DATE'], index_col=[0])
GDP
| DATE       | GDP       |
|------------|-----------|
| 1947-01-01 | 2033.061  |
| 1947-04-01 | 2027.639  |
| 1947-07-01 | 2023.452  |
| 1947-10-01 | 2055.103  |
| 1948-01-01 | 2086.017  |
| ...        | ...       |
| 2018-04-01 | 18598.135 |
| 2018-07-01 | 18732.720 |
| 2018-10-01 | 18783.548 |
| 2019-01-01 | 18927.281 |
| 2019-04-01 | 19021.860 |

290 rows × 1 columns
GDP.plot()
If we run the Augmented Dickey-Fuller test on the original series, we find:
adf, pvalue, critical, results = sm.tsa.stattools.adfuller(GDP, regresults=True)
In particular, the p-value makes it clear that we cannot reject the null hypothesis of a unit root: the series is not stationary
print('ADF Statistic:', adf)
print('p-value:', pvalue)
print('Critical Values:')
for key, value in critical.items():
    print('\t%s:' % key, value)
ADF Statistic: 2.9622287233788263
p-value: 1.0
Critical Values:
	1%: -3.4541800885158525
	5%: -2.872031361137725
	10%: -2.5723603999791473
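Since the ADF statistic (2.96) is well above all of the critical values, we cannot reject the null hypothesis of a unit root at any of the usual significance levels. A small hypothetical helper to make that comparison explicit:

# Hypothetical helper: compare the ADF statistic against each critical value.
# H0 (unit root, i.e. non-stationary) is rejected only when the statistic is
# more negative than the critical value.
def adf_report(statistic, critical_values):
    for level, threshold in critical_values.items():
        verdict = 'reject H0 (stationary)' if statistic < threshold else 'cannot reject H0'
        print('\t%s: %.3f -> %s' % (level, threshold, verdict))

adf_report(adf, critical)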
On the other hand, if we look instead at the quarter-over-quarter (QoQ) differences:
QoQ = GDP['GDP'].diff(1).dropna()
adf, pvalue, critical, results = sm.tsa.stattools.adfuller(QoQ, regresults=True)
We find a clearly significant result
QoQ.plot()
pvalue
0.0009325178905257833
Or for a more complete overview:
print('ADF Statistic:', adf)
print('p-value:', pvalue)
print('Critical Values:')
for key, value in critical.items():
    print('\t%s:' % key, value)
ADF Statistic: -4.109942087089415
p-value: 0.0009325178905257833
Critical Values:
	1%: -3.4541800885158525
	5%: -2.872031361137725
	10%: -2.5723603999791473
We also obtain a lot more information in the results object, such as the number of lags used, how many total lags were tested, etc. In particular:
print(results)
print()
for key, value in results.__dict__.items():
    if not key.startswith("_") and key not in ("resols", "critvalues", "autolag_results"):
        print("%s: %s" % (key, value))
    elif key == "critvalues":
        print("%s:" % key)
        for key2, critical_value in value.items():
            print("\t%s:" % key2, critical_value)
Augmented Dickey-Fuller Test Results

maxlag: 16
usedlag: 11
adfstat: -4.109942087089415
critvalues:
	1%: -3.4541800885158525
	5%: -2.872031361137725
	10%: -2.5723603999791473
nobs: 277
H0: The coefficient on the lagged level equals 1 - unit root
HA: The coefficient on the lagged level < 1 - stationary
icbest: 3025.166829576211
And the full results of the OLS linear regression:
results.resols.summary()
| Dep. Variable:    | y                | R-squared:          | 0.374    |
|-------------------|------------------|---------------------|----------|
| Model:            | OLS              | Adj. R-squared:     | 0.346    |
| Method:           | Least Squares    | F-statistic:        | 13.16    |
| Date:             | Mon, 07 Mar 2022 | Prob (F-statistic): | 3.89e-21 |
| Time:             | 15:18:03         | Log-Likelihood:     | -1524.9  |
| No. Observations: | 277              | AIC:                | 3076.    |
| Df Residuals:     | 264              | BIC:                | 3123.    |
| Df Model:         | 12               |                     |          |
| Covariance Type:  | nonrobust        |                     |          |

|       | coef    | std err | t      | P>\|t\| | [0.025 | 0.975] |
|-------|---------|---------|--------|---------|--------|--------|
| x1    | -0.4325 | 0.105   | -4.110 | 0.000   | -0.640 | -0.225 |
| x2    | -0.2529 | 0.109   | -2.326 | 0.021   | -0.467 | -0.039 |
| x3    | -0.0278 | 0.109   | -0.256 | 0.798   | -0.242 | 0.186  |
| x4    | -0.0055 | 0.106   | -0.052 | 0.959   | -0.215 | 0.204  |
| x5    | -0.0105 | 0.100   | -0.105 | 0.917   | -0.208 | 0.187  |
| x6    | -0.0613 | 0.097   | -0.634 | 0.527   | -0.252 | 0.129  |
| x7    | 0.0228  | 0.093   | 0.244  | 0.807   | -0.161 | 0.206  |
| x8    | 0.0006  | 0.088   | 0.007  | 0.995   | -0.173 | 0.174  |
| x9    | -0.1256 | 0.083   | -1.507 | 0.133   | -0.290 | 0.039  |
| x10   | 0.0551  | 0.079   | 0.698  | 0.486   | -0.100 | 0.211  |
| x11   | 0.1870  | 0.074   | 2.539  | 0.012   | 0.042  | 0.332  |
| x12   | 0.1049  | 0.061   | 1.707  | 0.089   | -0.016 | 0.226  |
| const | 26.4032 | 7.156   | 3.689  | 0.000   | 12.312 | 40.494 |

| Omnibus:       | 32.007 | Durbin-Watson:    | 1.997    |
|----------------|--------|-------------------|----------|
| Prob(Omnibus): | 0.000  | Jarque-Bera (JB): | 106.856  |
| Skew:          | -0.420 | Prob(JB):         | 6.26e-24 |
| Kurtosis:      | 5.925  | Cond. No.         | 213.     |
As well as the results for each of the lags considered:
results.autolag_results
{2: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e3ac40>,
 3: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fea782e0df0>,
 4: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fead8f05df0>,
 5: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fea7832bca0>,
 6: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fea684e5a60>,
 7: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fea7832b580>,
 8: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b0a0>,
 9: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b190>,
 10: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b2b0>,
 11: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b3d0>,
 12: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b4c0>,
 13: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b5b0>,
 14: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b6a0>,
 15: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b790>,
 16: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b880>,
 17: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5b970>,
 18: <statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7feac8e5ba60>}
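Each entry in this dictionary is a full RegressionResultsWrapper, so we can inspect any of the candidate fits individually, for example its AIC (using one of the keys shown above):

# AIC of the candidate regression stored under key 12
results.autolag_results[12].aic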
To illustrate the time series decomposition functionality of statsmodels, we're going to use the airline passengers dataset
airline = pd.read_csv('data/international-airline-passengers.csv', sep=';',
                      parse_dates=['Month'], index_col=0)
airline.plot()
To obtain the decomposition, we simply do:
decomposition = sm.tsa.seasonal_decompose(airline, model='multiplicative',
                                          two_sided=False)
We can quickly plot the usual decomposition visualization
decomposition.plot();
Each of the components is provided as a separate pandas Series for easy access and analysis
decomposition.seasonal.tail()
Month
1960-08-01    1.221410
1960-09-01    1.050409
1960-10-01    0.915272
1960-11-01    0.794158
1960-12-01    0.891666
Name: seasonal, dtype: float64
And, of course, visualization
decomposition.resid.plot()
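Since we asked for a multiplicative model, the product of the three components reconstructs the observed series (apart from the initial values lost to the one-sided moving average):

# For a multiplicative decomposition: observed = trend * seasonal * residual
reconstructed = decomposition.trend * decomposition.seasonal * decomposition.resid
reconstructed.tail()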