In this final chapter, you'll learn how to use seasonal ARIMA models to fit more complex data. You'll learn how to decompose this data into seasonal and non-seasonal parts and then you'll get the chance to utilize all your ARIMA tools on one last global forecast challenge. This is the Summary of lecture "ARIMA Models in Python", via datacamp.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (10, 5)
plt.style.use('fivethirtyeight')
You can think of a time series as being composed of trend, seasonal and residual components. This can be a good way to think about the data when you go about modeling it. If you know the period of the time series you can decompose it into these components.
In this exercise you will decompose a time series showing the monthly milk production per cow in the USA. This will give you a clearer picture of the trend and the seasonal cycle. Since the data is monthly you will guess that the seasonality might be 12 time periods, however this won't always be the case.
milk_production = pd.read_csv('./dataset/milk_production.csv', index_col='date', parse_dates=True)
milk_production = milk_production.asfreq('MS')
milk_production.head()
pounds_per_cow | |
---|---|
date | |
1962-01-01 | 589.0 |
1962-02-01 | 561.0 |
1962-03-01 | 640.0 |
1962-04-01 | 656.0 |
1962-05-01 | 727.0 |
from statsmodels.tsa.seasonal import seasonal_decompose
# Perform additive decomposition
decomp = seasonal_decompose(milk_production['pounds_per_cow'], period=12)
# Plot decomposition
decomp.plot();
plt.tight_layout();
Below is a time series showing the estimated number of water consumers in London. By eye you can't see any obvious seasonal pattern, however your eyes aren't the best tools you have.
water = pd.read_csv('./dataset/water.csv', index_col='date', parse_dates=True)
water = water.asfreq('MS')
water.head()
water_consumers | |
---|---|
date | |
1983-01-01 | 24963 |
1983-02-01 | 27380 |
1983-03-01 | 32588 |
1983-04-01 | 25511 |
1983-05-01 | 32313 |
water.plot();
In this exercise you will use the ACF and PACF to test this data for seasonality. You can see from the plot above that the time series isn't stationary, so you should probably detrend it. You will detrend it by subtracting the moving average. Remember that you could use a window size of any value bigger than the likely period.
from statsmodels.graphics.tsaplots import plot_acf
# Create figure and subplot
fig, ax1 = plt.subplots()
# Plot the ACF on ax1
plot_acf(water['water_consumers'], lags=25, zero=False, ax=ax1);
# Subtract the rolling mean
water_2 = water - water.rolling(15).mean()
# Drop the NaN values
water_2 = water_2.dropna()
# Create figure and subplots
fig, ax1 = plt.subplots()
# Plot the ACF
plot_acf(water_2['water_consumers'], lags=25, zero=False, ax=ax1);
Based on this figure, 12 time steps is the time period of the seasonal component.
- Non-seasonal orders
- p: autoregressive order
- d: differencing order
- q: moving average order
- Seasonal orders
- P: seasonal autoregressive order
- D: seasonal differencing order
- Q: seasonal moving average order
- S: Number of time steps per cycle
- $\text{SARIMA}(0,0,0)(2, 0, 1)_7$ model:
$$ y_t = a_7 y_{t-7} + a_{14} y_{t-14} + m_y \epsilon_{t-7} + \epsilon_t $$Fitting SARIMA models is the beginning of the end of this journey into time series modeling.
It is important that you get to know your way around the SARIMA model orders and so that's what you will focus on here.
In this exercise, you will practice fitting different SARIMA models to a set of time series.
df1 = pd.read_csv('./dataset/df1.csv', index_col=0, parse_dates=True)
df2 = pd.read_csv('./dataset/df2.csv', index_col=0, parse_dates=True)
df3 = pd.read_csv('./dataset/df3.csv', index_col=0, parse_dates=True)
df1 = df1.asfreq('d')
df2 = df2.asfreq('d')
df3 = df3.asfreq('d')
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Create a SARIMAX model
model = SARIMAX(df1, order=(1, 0, 0), seasonal_order=(1, 1, 0, 7))
# Fit the model
results = model.fit()
# Print the results summary
results.summary()
Dep. Variable: | Y | No. Observations: | 90 |
---|---|---|---|
Model: | SARIMAX(1, 0, 0)x(1, 1, 0, 7) | Log Likelihood | -556.243 |
Date: | Tue, 16 Jun 2020 | AIC | 1118.486 |
Time: | 15:34:16 | BIC | 1125.742 |
Sample: | 01-01-2013 | HQIC | 1121.401 |
- 03-31-2013 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ar.L1 | 0.1032 | 0.103 | 1.002 | 0.316 | -0.099 | 0.305 |
ar.S.L7 | 0.2781 | 0.104 | 2.665 | 0.008 | 0.074 | 0.483 |
sigma2 | 3.858e+04 | 7224.680 | 5.340 | 0.000 | 2.44e+04 | 5.27e+04 |
Ljung-Box (Q): | 41.76 | Jarque-Bera (JB): | 1.69 |
---|---|---|---|
Prob(Q): | 0.39 | Prob(JB): | 0.43 |
Heteroskedasticity (H): | 1.35 | Skew: | -0.15 |
Prob(H) (two-sided): | 0.43 | Kurtosis: | 2.37 |
# Create a SARIMAX model
model = SARIMAX(df2, order=(2, 1, 1), seasonal_order=(1, 0, 0, 4))
# Fit the model
results = model.fit()
# Print the results summary
results.summary()
Dep. Variable: | Y | No. Observations: | 80 |
---|---|---|---|
Model: | SARIMAX(2, 1, 1)x(1, 0, [], 4) | Log Likelihood | -560.340 |
Date: | Tue, 16 Jun 2020 | AIC | 1130.679 |
Time: | 15:34:17 | BIC | 1142.526 |
Sample: | 01-01-2013 | HQIC | 1135.426 |
- 03-21-2013 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ar.L1 | 0.2701 | 0.162 | 1.672 | 0.095 | -0.047 | 0.587 |
ar.L2 | 0.5015 | 0.110 | 4.560 | 0.000 | 0.286 | 0.717 |
ma.L1 | -0.4271 | 0.178 | -2.401 | 0.016 | -0.776 | -0.078 |
ar.S.L4 | 0.1075 | 0.127 | 0.847 | 0.397 | -0.141 | 0.356 |
sigma2 | 8.45e+04 | 1.63e+04 | 5.178 | 0.000 | 5.25e+04 | 1.16e+05 |
Ljung-Box (Q): | 30.80 | Jarque-Bera (JB): | 0.95 |
---|---|---|---|
Prob(Q): | 0.85 | Prob(JB): | 0.62 |
Heteroskedasticity (H): | 0.60 | Skew: | -0.07 |
Prob(H) (two-sided): | 0.20 | Kurtosis: | 2.48 |
# Create a SARIMAX model
model = SARIMAX(df3, order=(1, 1, 0), seasonal_order=(0, 1, 1, 12))
# Fit the model
results = model.fit()
# Print the results summary
results.summary()
Dep. Variable: | Y | No. Observations: | 100 |
---|---|---|---|
Model: | SARIMAX(1, 1, 0)x(0, 1, [1], 12) | Log Likelihood | -521.376 |
Date: | Tue, 16 Jun 2020 | AIC | 1048.752 |
Time: | 15:34:17 | BIC | 1056.149 |
Sample: | 01-01-2013 | HQIC | 1051.730 |
- 04-10-2013 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ar.L1 | 0.4236 | 0.090 | 4.719 | 0.000 | 0.248 | 0.600 |
ma.S.L12 | -0.0898 | 0.116 | -0.776 | 0.438 | -0.317 | 0.137 |
sigma2 | 9347.1462 | 1407.490 | 6.641 | 0.000 | 6588.516 | 1.21e+04 |
Ljung-Box (Q): | 33.11 | Jarque-Bera (JB): | 0.02 |
---|---|---|---|
Prob(Q): | 0.77 | Prob(JB): | 0.99 |
Heteroskedasticity (H): | 0.77 | Skew: | 0.02 |
Prob(H) (two-sided): | 0.48 | Kurtosis: | 3.05 |
In this exercise you will find the appropriate model order for a new set of time series. This is monthly series of the number of employed persons in Australia (in thousands). The seasonal period of this time series is 12 months.
You will create non-seasonal and seasonal ACF and PACF plots and use the table below to choose the appropriate model orders.
AR(p) | MA(q) | ARMA(p, q) | |
---|---|---|---|
ACF | Tails off | Cuts off after lag q | Tails off |
PACF | Cuts off after lag p | Tails off | Tails off |
aus_employment = pd.read_csv('./dataset/aus_employment.csv', index_col='date', parse_dates=True)
aus_employment = aus_employment.asfreq('MS')
aus_employment.head()
people_employed | |
---|---|
date | |
1978-01-01 | 5985.7 |
1978-02-01 | 6040.6 |
1978-03-01 | 6054.2 |
1978-04-01 | 6038.3 |
1978-05-01 | 6031.3 |
from statsmodels.graphics.tsaplots import plot_pacf
# Take the first and seasonal differences and drop NaNs
aus_employment_diff = aus_employment.diff().diff(12).dropna()
# Create the figure
fig, (ax1, ax2) = plt.subplots(2,1,figsize=(8,6))
# Plot the ACF on ax1
plot_acf(aus_employment_diff, lags=11, zero=False, ax=ax1);
# Plot the PACF on ax2
plot_pacf(aus_employment_diff, lags=11, zero=False, ax=ax2);
plt.tight_layout();
# Make list of lags
lags = [12, 24, 36, 48, 60]
# Create the figure
fig, (ax1, ax2) = plt.subplots(2,1,figsize=(8,6))
# Plot the ACF on ax1
plot_acf(aus_employment_diff, lags=lags, zero=False, ax=ax1);
# Plot the PACF on ax2
plot_pacf(aus_employment_diff, lags=lags, zero=False, ax=ax2);
plt.tight_layout();
The non-seasonal ACF doesn't show any of the usual patterns of MA, AR or ARMA models so we choose none of these. The Seaosnal ACF and PACF look like an MA(1) model. : $\text{SARIMAX}(0,1,0)(0,1,1)_{12}$
In this exercise, you will see the effect of using a SARIMA model instead of an ARIMA model on your forecasts of seasonal time series.
Two models, an ARIMA(3,1,2) and a SARIMA(0,1,1)(1,1,1)12, have been fit to the Wisconsin employment time series. These were the best ARIMA model and the best SARIMA model available according to the AIC.
In the exercise you will use these two models to make dynamic future forecast for 25 months and plot these predictions alongside held out data for this period, wisconsin_test.
wisconsin_test = pd.read_csv('./dataset/wisconsin_test.csv', index_col='date', parse_dates=True)
wisconsin_test = wisconsin_test.asfreq('MS')
dates = wisconsin_test.index
wisconsin_test.head()
number_in_employment | |
---|---|
date | |
1973-10-01 | 374.5 |
1973-11-01 | 380.2 |
1973-12-01 | 384.6 |
1974-01-01 | 360.6 |
1974-02-01 | 354.4 |
model = SARIMAX(wisconsin_test, order=(3, 1, 2), trend='c',
enforce_stationarity=True, enforce_invertibility=True)
arima_results = model.fit()
/home/chanseok/anaconda3/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)
model = SARIMAX(wisconsin_test, order=(0, 1, 1), seasonal_order=(1, 1, 1, 12), trend='c',
enforce_stationarity=True, enforce_invertibility=True)
sarima_results = model.fit()
/home/chanseok/anaconda3/lib/python3.7/site-packages/statsmodels/tsa/statespace/sarimax.py:868: UserWarning: Too few observations to estimate starting parameters for seasonal ARMA. All parameters except for variances will be set to zeros. ' zeros.' % warning_description) /home/chanseok/anaconda3/lib/python3.7/site-packages/statsmodels/base/model.py:568: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals "Check mle_retvals", ConvergenceWarning)
arima_results.specification
{'seasonal_periods': 0, 'measurement_error': False, 'time_varying_regression': False, 'simple_differencing': False, 'enforce_stationarity': True, 'enforce_invertibility': True, 'hamilton_representation': False, 'concentrate_scale': False, 'trend_offset': 1, 'order': (3, 1, 2), 'seasonal_order': (0, 0, 0, 0), 'k_diff': 1, 'k_seasonal_diff': 0, 'k_ar': 3, 'k_ma': 2, 'k_seasonal_ar': 0, 'k_seasonal_ma': 0, 'k_ar_params': 3, 'k_ma_params': 2, 'trend': 'c', 'k_trend': 1, 'k_exog': 0, 'mle_regression': False, 'state_regression': False}
sarima_results.specification
{'seasonal_periods': 12, 'measurement_error': False, 'time_varying_regression': False, 'simple_differencing': False, 'enforce_stationarity': True, 'enforce_invertibility': True, 'hamilton_representation': False, 'concentrate_scale': False, 'trend_offset': 1, 'order': (0, 1, 1), 'seasonal_order': (1, 1, 1, 12), 'k_diff': 1, 'k_seasonal_diff': 1, 'k_ar': 0, 'k_ma': 1, 'k_seasonal_ar': 12, 'k_seasonal_ma': 12, 'k_ar_params': 0, 'k_ma_params': 1, 'trend': 'c', 'k_trend': 1, 'k_exog': 0, 'mle_regression': False, 'state_regression': False}
# Create ARIMA mean forecast
arima_pred = arima_results.get_forecast(25)
arima_mean = arima_pred.predicted_mean
# Create SARIMA mean forecast
sarima_pred = sarima_results.get_forecast(25)
sarima_mean = sarima_pred.predicted_mean
# Plot mean ARIMA and SARIMA predictions and observed
plt.plot(dates, sarima_mean, label='SARIMA');
plt.plot(dates, arima_mean, label='ARIMA');
plt.plot(wisconsin_test, label='observed');
plt.legend();
The pmdarima
package is a powerful tool to help you choose the model orders. You can use the information you already have from the identification step to narrow down the model orders which you choose by automation.
Remember, although automation is powerful, it can sometimes make mistakes that you wouldn't. It is hard to guess how the input data could be imperfect and affect the test scores.
In this exercise you will use the pmdarima package to automatically choose model orders for some time series datasets.
import pmdarima as pm
# Create auto_arima model
model1 = pm.auto_arima(df1,
seasonal=True, m=7,
d=0, D=1,
max_p=2, max_q=2,
trace=True,
error_action='ignore',
suppress_warnings=True)
# Print model summary
print(model1.summary())
Performing stepwise search to minimize aic Fit ARIMA(2,0,2)x(1,1,1,7) [intercept=True]; AIC=1125.412, BIC=1144.763, Time=0.506 seconds Fit ARIMA(0,0,0)x(0,1,0,7) [intercept=True]; AIC=1124.433, BIC=1129.271, Time=0.009 seconds Fit ARIMA(1,0,0)x(1,1,0,7) [intercept=True]; AIC=1120.434, BIC=1130.109, Time=0.119 seconds Fit ARIMA(0,0,1)x(0,1,1,7) [intercept=True]; AIC=1122.218, BIC=1131.893, Time=0.075 seconds Fit ARIMA(0,0,0)x(0,1,0,7) [intercept=False]; AIC=1122.596, BIC=1125.015, Time=0.006 seconds Fit ARIMA(1,0,0)x(0,1,0,7) [intercept=True]; AIC=1125.048, BIC=1132.305, Time=0.016 seconds Fit ARIMA(1,0,0)x(2,1,0,7) [intercept=True]; AIC=1119.909, BIC=1132.004, Time=0.233 seconds Fit ARIMA(1,0,0)x(2,1,1,7) [intercept=True]; AIC=1121.791, BIC=1136.304, Time=0.209 seconds Fit ARIMA(1,0,0)x(1,1,1,7) [intercept=True]; AIC=1120.954, BIC=1133.049, Time=0.157 seconds Fit ARIMA(0,0,0)x(2,1,0,7) [intercept=True]; AIC=1119.066, BIC=1128.742, Time=0.166 seconds Fit ARIMA(0,0,0)x(1,1,0,7) [intercept=True]; AIC=1119.294, BIC=1126.550, Time=0.092 seconds Fit ARIMA(0,0,0)x(2,1,1,7) [intercept=True]; AIC=1120.813, BIC=1132.907, Time=0.173 seconds Fit ARIMA(0,0,0)x(1,1,1,7) [intercept=True]; AIC=1120.054, BIC=1129.729, Time=0.090 seconds Fit ARIMA(0,0,1)x(2,1,0,7) [intercept=True]; AIC=1119.937, BIC=1132.031, Time=0.126 seconds Fit ARIMA(1,0,1)x(2,1,0,7) [intercept=True]; AIC=1120.607, BIC=1135.120, Time=0.397 seconds Near non-invertible roots for order (1, 0, 1)(2, 1, 0, 7); setting score to inf (at least one inverse root too close to the border of the unit circle: 1.000) Total fit time: 2.380 seconds SARIMAX Results =============================================================================== Dep. Variable: y No. Observations: 90 Model: SARIMAX(2, 1, 0, 7) Log Likelihood -555.533 Date: Tue, 16 Jun 2020 AIC 1119.066 Time: 15:34:55 BIC 1128.742 Sample: 0 HQIC 1122.953 - 90 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept -5.3485 22.658 -0.236 0.813 -49.757 39.060 ar.S.L7 0.2326 0.116 2.014 0.044 0.006 0.459 ar.S.L14 0.1716 0.127 1.356 0.175 -0.077 0.420 sigma2 3.79e+04 7066.970 5.363 0.000 2.4e+04 5.18e+04 =================================================================================== Ljung-Box (Q): 33.73 Jarque-Bera (JB): 1.88 Prob(Q): 0.75 Prob(JB): 0.39 Heteroskedasticity (H): 1.37 Skew: -0.24 Prob(H) (two-sided): 0.41 Kurtosis: 2.44 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step).
# Create model
model2 = pm.auto_arima(df2,
seasonal=False,
d=1,
trend='c',
max_p=2, max_q=2,
trace=True,
error_action='ignore',
suppress_warnings=True)
# Print model summary
print(model2.summary())
Performing stepwise search to minimize aic Fit ARIMA(2,1,2)x(0,0,0,0) [intercept=True]; AIC=1131.956, BIC=1146.173, Time=0.092 seconds Fit ARIMA(0,1,0)x(0,0,0,0) [intercept=True]; AIC=1153.963, BIC=1158.702, Time=0.005 seconds Fit ARIMA(1,1,0)x(0,0,0,0) [intercept=True]; AIC=1155.355, BIC=1162.464, Time=0.028 seconds Fit ARIMA(0,1,1)x(0,0,0,0) [intercept=True]; AIC=1155.747, BIC=1162.855, Time=0.011 seconds Fit ARIMA(0,1,0)x(0,0,0,0) [intercept=False]; AIC=1153.963, BIC=1158.702, Time=0.005 seconds Fit ARIMA(1,1,2)x(0,0,0,0) [intercept=True]; AIC=1136.103, BIC=1147.950, Time=0.114 seconds Fit ARIMA(2,1,1)x(0,0,0,0) [intercept=True]; AIC=1130.992, BIC=1142.840, Time=0.058 seconds Fit ARIMA(1,1,1)x(0,0,0,0) [intercept=True]; AIC=1149.020, BIC=1158.498, Time=0.087 seconds Fit ARIMA(2,1,0)x(0,0,0,0) [intercept=True]; AIC=1132.526, BIC=1142.003, Time=0.067 seconds Total fit time: 0.469 seconds SARIMAX Results ============================================================================== Dep. Variable: y No. Observations: 80 Model: SARIMAX(2, 1, 1) Log Likelihood -560.496 Date: Tue, 16 Jun 2020 AIC 1130.992 Time: 15:34:55 BIC 1142.840 Sample: 0 HQIC 1135.739 - 80 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept -9.5411 20.071 -0.475 0.635 -48.879 29.797 ar.L1 0.2744 0.148 1.849 0.064 -0.016 0.565 ar.L2 0.5383 0.098 5.503 0.000 0.347 0.730 ma.L1 -0.4587 0.159 -2.886 0.004 -0.770 -0.147 sigma2 8.558e+04 1.65e+04 5.197 0.000 5.33e+04 1.18e+05 =================================================================================== Ljung-Box (Q): 30.31 Jarque-Bera (JB): 0.85 Prob(Q): 0.87 Prob(JB): 0.66 Heteroskedasticity (H): 0.60 Skew: -0.09 Prob(H) (two-sided): 0.20 Kurtosis: 2.53 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step).
# Create model for SARIMAX(p,1,q)(P,1,Q)7
model3 = pm.auto_arima(df3,
seasonal=True, m=7,
d=1, D=1,
start_p=1, start_q=1,
max_p=1, max_q=1,
max_P=1, max_Q=1,
trace=True,
error_action='ignore',
suppress_warnings=True)
# Print model summary
print(model3.summary())
Performing stepwise search to minimize aic Fit ARIMA(1,1,1)x(1,1,1,7) [intercept=True]; AIC=1566.379, BIC=1581.510, Time=0.427 seconds Fit ARIMA(0,1,0)x(0,1,0,7) [intercept=True]; AIC=1595.826, BIC=1600.870, Time=0.007 seconds Fit ARIMA(1,1,0)x(1,1,0,7) [intercept=True]; AIC=1585.203, BIC=1595.290, Time=0.093 seconds Fit ARIMA(0,1,1)x(0,1,1,7) [intercept=True]; AIC=1572.745, BIC=1582.832, Time=0.218 seconds Near non-invertible roots for order (0, 1, 1)(0, 1, 1, 7); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.999) Fit ARIMA(0,1,0)x(0,1,0,7) [intercept=False]; AIC=1593.897, BIC=1596.419, Time=0.009 seconds Fit ARIMA(1,1,1)x(0,1,1,7) [intercept=True]; AIC=1558.168, BIC=1570.777, Time=0.298 seconds Near non-invertible roots for order (1, 1, 1)(0, 1, 1, 7); setting score to inf (at least one inverse root too close to the border of the unit circle: 1.000) Fit ARIMA(1,1,1)x(1,1,0,7) [intercept=True]; AIC=1565.768, BIC=1578.377, Time=0.259 seconds Near non-invertible roots for order (1, 1, 1)(1, 1, 0, 7); setting score to inf (at least one inverse root too close to the border of the unit circle: 1.000) Fit ARIMA(1,1,1)x(0,1,0,7) [intercept=True]; AIC=1579.028, BIC=1589.116, Time=0.107 seconds Near non-invertible roots for order (1, 1, 1)(0, 1, 0, 7); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.994) Fit ARIMA(0,1,1)x(1,1,1,7) [intercept=True]; AIC=1571.169, BIC=1583.778, Time=0.323 seconds Fit ARIMA(1,1,0)x(1,1,1,7) [intercept=True]; AIC=1570.775, BIC=1583.384, Time=0.260 seconds Near non-invertible roots for order (1, 1, 0)(1, 1, 1, 7); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.993) Fit ARIMA(0,1,0)x(1,1,1,7) [intercept=True]; AIC=1573.132, BIC=1583.219, Time=0.244 seconds Near non-invertible roots for order (0, 1, 0)(1, 1, 1, 7); setting score to inf (at least one inverse root too close to the border of the unit circle: 0.997) Total fit time: 2.248 seconds SARIMAX Results ========================================================================================= Dep. Variable: y No. Observations: 100 Model: SARIMAX(1, 1, 1)x(0, 1, 1, 7) Log Likelihood -774.084 Date: Tue, 16 Jun 2020 AIC 1558.168 Time: 15:34:58 BIC 1570.777 Sample: 0 HQIC 1563.257 - 100 Covariance Type: opg ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ intercept 1.7464 2.134 0.819 0.413 -2.435 5.928 ar.L1 0.4239 0.127 3.342 0.001 0.175 0.672 ma.L1 -1.0000 0.250 -4.004 0.000 -1.489 -0.510 ma.S.L7 -0.9382 0.231 -4.056 0.000 -1.392 -0.485 sigma2 1.008e+06 2.6e-07 3.88e+12 0.000 1.01e+06 1.01e+06 =================================================================================== Ljung-Box (Q): 272.00 Jarque-Bera (JB): 11.30 Prob(Q): 0.00 Prob(JB): 0.00 Heteroskedasticity (H): 1.07 Skew: 0.85 Prob(H) (two-sided): 0.86 Kurtosis: 2.84 =================================================================================== Warnings: [1] Covariance matrix calculated using the outer product of gradients (complex-step). [2] Covariance matrix is singular or near-singular, with condition number 3.73e+29. Standard errors may be unstable.
Once you have gone through the steps of the Box-Jenkins method and arrived at a model you are happy with, you will want to be able to save that model and also to incorporate new measurements when they are available. This is key part of putting the model into production.
In this exercise you will save a freshly trained model to disk, then reload it to update it with new data.
import joblib
# Set model name
filename='candy_model.pkl'
# Pickle it
joblib.dump(model1, filename)
# Load the model back in
loaded_model = joblib.load(filename)
# Update the model
# loaded_model.update(df)
Usually the next step would be to find the order of differencing and other model orders. However, this time it's already been done for you. The time series is best fit by a $\text{SARIMA}(1, 1, 1)(0, 1, 1)_{12}$ model with an added constant.
In this exercise you will make sure that this is a good model by first fitting it using the SARIMAX
class and going through the normal model diagnostics procedure.
co2 = pd.read_csv('./dataset/co2.csv', index_col='date', parse_dates=True)
co2 = co2.asfreq('MS')
co2.head()
CO2_ppm | |
---|---|
date | |
1958-03-01 | 315.71 |
1958-04-01 | 317.45 |
1958-05-01 | 317.50 |
1958-06-01 | 317.10 |
1958-07-01 | 315.86 |
# Create model object
model = SARIMAX(co2, order=(1, 1, 1), seasonal_order=(0, 1, 1, 12), trend='c')
# Fit model
results = model.fit()
results.summary()
Dep. Variable: | CO2_ppm | No. Observations: | 727 |
---|---|---|---|
Model: | SARIMAX(1, 1, 1)x(0, 1, 1, 12) | Log Likelihood | -179.718 |
Date: | Tue, 16 Jun 2020 | AIC | 369.437 |
Time: | 15:35:01 | BIC | 392.291 |
Sample: | 03-01-1958 | HQIC | 378.263 |
- 09-01-2018 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
intercept | 0.0017 | 0.001 | 2.809 | 0.005 | 0.001 | 0.003 |
ar.L1 | 0.2411 | 0.078 | 3.072 | 0.002 | 0.087 | 0.395 |
ma.L1 | -0.6079 | 0.065 | -9.296 | 0.000 | -0.736 | -0.480 |
ma.S.L12 | -0.8820 | 0.019 | -45.520 | 0.000 | -0.920 | -0.844 |
sigma2 | 0.0944 | 0.005 | 20.750 | 0.000 | 0.085 | 0.103 |
Ljung-Box (Q): | 47.00 | Jarque-Bera (JB): | 6.19 |
---|---|---|---|
Prob(Q): | 0.21 | Prob(JB): | 0.05 |
Heteroskedasticity (H): | 1.13 | Skew: | 0.00 |
Prob(H) (two-sided): | 0.33 | Kurtosis: | 3.46 |
# Plot common diagnostics
results.plot_diagnostics(figsize=(20, 15));
plt.tight_layout();
In the previous exercise you confirmed that a SARIMA (1,1,1) x (0,1,1)12 model was a good fit to the CO2 time series by using diagnostic checking.
Now its time to put this model into practice to make future forecasts. Climate scientists tell us that we have until 2030 to drastically reduce our CO2 emissions or we will face major societal challenges.
In this exercise, you will forecast the CO2 time series up to the year 2030 to find the CO2 levels if we continue emitting as usual.
# Create forecast object
forecast_object = results.get_forecast(136)
# Extract predicted mean attribute
mean = forecast_object.predicted_mean
# Calculate the confidence intervals
conf_int = forecast_object.conf_int()
# Extract the forecast dates
dates = mean.index
plt.figure()
# Plot past CO2 levels
plt.plot(co2.index, co2, label='past');
# Plot the prediction means as line
plt.plot(dates, mean, label='predicted');
# Shade between the confidence intervals
plt.fill_between(dates, conf_int.loc[:, 'lower CO2_ppm'],
conf_int.loc[:, 'upper CO2_ppm'], alpha=0.4);
# Plot legend
plt.legend();
plt.savefig('../images/co2_predicted.png')
# Print last predicted mean
print(mean.iloc[-1])
# Print last confidence interval
print(conf_int.iloc[-1])
438.2001450227593 lower CO2_ppm 432.118155 upper CO2_ppm 444.282135 Name: 2030-01-01 00:00:00, dtype: float64
S - seasonal
AR - AutoRegressive
I - Integrated
MA - Moving Average
X - Exogenous