A Summary of lecture "Time Series Analysis in Python", via datacamp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
coint
function in statsmodels that combines both stepsThe Heating Oil and Natural Gas prices are pre-loaded in DataFrames HO and NG. First, plot both price series, which look like random walks. Then plot the difference between the two series, which should look more like a mean reverting series (to put the two series in the same units, we multiply the heating oil prices, in $/gallon, by 7.25, which converts it to $/millionBTU, which is the same units as Natural Gas).
The data for continuous futures (each contract has to be spliced together in a continuous series as contracts expire) was obtained from Quandl.
HO = pd.read_csv('./dataset/CME_HO1.csv', index_col=0)
NG = pd.read_csv('./dataset/CME_NG1.csv', index_col=0)
HO.index = pd.to_datetime(HO.index, format='%m/%d/%Y')
NG.index = pd.to_datetime(NG.index, format='%m/%d/%Y')
HO = HO.sort_index()
NG = NG.sort_index()
HO.head()
Close | |
---|---|
Date | |
1991-01-02 | 0.7330 |
1991-01-03 | 0.7024 |
1991-01-04 | 0.6830 |
1991-01-07 | 0.7617 |
1991-01-08 | 0.7430 |
NG.head()
Close | |
---|---|
Date | |
1991-01-02 | 1.832 |
1991-01-03 | 1.782 |
1991-01-04 | 1.743 |
1991-01-07 | 1.785 |
1991-01-08 | 1.786 |
# Plot the prices separately
plt.subplot(2, 1, 1)
plt.plot(7.25 * HO, label='Heating Oil');
plt.plot(NG, label='Natural Gas');
plt.legend(loc='best', fontsize='small');
# Plot the spread
plt.subplot(2, 1, 2)
plt.plot(7.25 * HO - NG, label='Spread');
plt.legend(loc='best', fontsize='small');
plt.axhline(y=0, linestyle='--', color='k');
To verify that Heating Oil and Natural Gas prices are cointegrated, First apply the Dickey-Fuller test separately to show they are random walks. Then apply the test to the difference, which should strongly reject the random walk hypothesis.
from statsmodels.tsa.stattools import adfuller
# Compute the ADF for HO and NG
result_HO = adfuller(HO['Close'])
print("The p-value for the ADF test on HO is ", result_HO[1])
result_NG = adfuller(NG['Close'])
print("The p-value for the ADF test on NG is ", result_NG[1])
# Compute the ADF of the spread
result_spread = adfuller(7.25 * HO['Close'] - NG['Close'])
print("The p-value for the ADF test on the spread is ", result_spread[1])
The p-value for the ADF test on HO is 0.956710878501786 The p-value for the ADF test on NG is 0.9008747444676729 The p-value for the ADF test on the spread is 7.019439302142287e-05
Cointegration involves two steps: regressing one time series on the other to get the cointegration vector, and then perform an ADF test on the residuals of the regression. In the last example, there was no need to perform the first step since we implicitly assumed the cointegration vector was (1,−1). In other words, we took the difference between the two series (after doing a units conversion). Here, you will do both steps.
You will regress the value of one cryptocurrency, bitcoin (BTC), on another cryptocurrency, ethereum (ETH). If we call the regression coefficient b, then the cointegration vector is simply (1,−b). Then perform the ADF test on BTC −b ETH.
BTC = pd.read_csv('./dataset/BTC.csv', index_col=0)
ETH = pd.read_csv('./dataset/ETH.csv', index_col=0)
BTC.index = pd.to_datetime(BTC.index, format='%Y-%m-%d')
ETH.index = pd.to_datetime(ETH.index, format='%Y-%m-%d')
import statsmodels.api as sm
# Regress BTC on ETH
ETH = sm.add_constant(ETH)
result = sm.OLS(BTC, ETH).fit()
# Compare ADF
b = result.params[1]
adf_stats = adfuller(BTC['Price'] - b * ETH['Price'])
print("The p-value for the ADF test is ", adf_stats[1])
The p-value for the ADF test is 0.02336900232347285
An ARMA model is a simplistic approach to forecasting climate changes, but it illustrates many of the topics covered in this class.
The DataFrame temp_NY
contains the average annual temperature in Central Park, NY from 1870-2016 (the data was downloaded from the NOAA here). Plot the data and test whether it follows a random walk (with drift).
temp_NY = pd.read_csv('./dataset/NOAA_TAVG.csv', index_col=0)
temp_NY.index = pd.to_datetime(temp_NY.index, format='%Y')
temp_NY.head()
TAVG | |
---|---|
DATE | |
1870-01-01 | 53.8 |
1871-01-01 | 51.3 |
1872-01-01 | 51.3 |
1873-01-01 | 50.9 |
1874-01-01 | 51.3 |
# Plot average temperatures
temp_NY.plot();
# Compute and print ADF p-value
result = adfuller(temp_NY['TAVG'])
print("The p-value for the ADF test is ", result[1])
The p-value for the ADF test is 0.583293898787112
Since the temperature series, temp_NY
, is a random walk with drift, take first differences to make it stationary. Then compute the sample ACF and PACF. This will provide some guidance on the order of the model.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# Take first difference of the temperature Series
chg_temp = temp_NY.diff()
chg_temp = chg_temp.dropna()
# Plot the ACF and PACF on the same page
fig, axes = plt.subplots(2, 1)
# Plot the ACF
plot_acf(chg_temp, lags=20, ax=axes[0]);
# Plot the PACF
plot_pacf(chg_temp, lags=20, ax=axes[1]);
plt.tight_layout()
Recall from Chapter 3 that the Akaike Information Criterion (AIC) can be used to compare models with different numbers of parameters. It measures goodness-of-fit, but places a penalty on models with more parameters to discourage overfitting. Lower AIC scores are better.
Fit the temperature data to an AR(1), AR(2), and ARMA(1,1) and see which model is the best fit, using the AIC criterion. The AR(2) and ARMA(1,1) models have one more parameter than the AR(1) has.
from statsmodels.tsa.arima_model import ARMA
# Fit the data to an AR(1) model and print AIC:
mod_ar1 = ARMA(chg_temp, order=(1, 0))
res_ar1 = mod_ar1.fit()
print("The AIC for an AR(1) is: ", res_ar1.aic)
# Fit the data to an AR(2) model and print AIC
mod_ar2 = ARMA(chg_temp, order=(2, 0))
res_ar2 = mod_ar2.fit()
print("The AIC for an AR(2) is: ", res_ar2.aic)
# fit the data to an ARMA(1, 1) model and print AIC
mod_arma11 = ARMA(chg_temp, order=(1, 1))
res_arma11 = mod_arma11.fit()
print("The AIC for an ARMA(1,1) is: ", res_arma11.aic)
/home/chanseok/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.py:162: ValueWarning: No frequency information was provided, so inferred frequency AS-JAN will be used. % freq, ValueWarning) /home/chanseok/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.py:162: ValueWarning: No frequency information was provided, so inferred frequency AS-JAN will be used. % freq, ValueWarning) /home/chanseok/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.py:162: ValueWarning: No frequency information was provided, so inferred frequency AS-JAN will be used. % freq, ValueWarning)
The AIC for an AR(1) is: 510.5346898313909 The AIC for an AR(2) is: 501.9274123160227 The AIC for an ARMA(1,1) is: 469.0729133043228
Finally, you will forecast the temperature over the next 30 years using an ARMA(1,1) model, including confidence bands around that estimate. Keep in mind that the estimate of the drift will have a much bigger impact on long range forecasts than the ARMA parameters.
Earlier, you determined that the temperature data follows a random walk and you looked at first differencing the data. In this exercise, you will use the ARIMA module on the temperature data (before differencing), which is identical to using the ARMA module on changes in temperature, followed by taking cumulative sums of these changes to get the temperature forecast.
from statsmodels.tsa.arima_model import ARIMA
# Forecast temperatures using an ARIMA(1,1,1) model
mod = ARIMA(temp_NY, order=(1,1,1))
res = mod.fit()
# Plot the original series and the forecasted series
res.plot_predict(start='1872-01-01', end='2046-01-01');
plt.savefig('../images/tavg_predict.png')
/home/chanseok/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.py:162: ValueWarning: No frequency information was provided, so inferred frequency AS-JAN will be used. % freq, ValueWarning) /home/chanseok/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/base/tsa_model.py:162: ValueWarning: No frequency information was provided, so inferred frequency AS-JAN will be used. % freq, ValueWarning)