In the 10x series of notebooks, we will look at time series modeling in pycaret using univariate data with no exogenous variables. We will use the famous airline dataset for illustration.
# Only enable critical logging (Optional)
import os
os.environ["PYCARET_CUSTOM_LOGGING_LEVEL"] = "CRITICAL"
def what_is_installed():
    from pycaret import show_versions
    show_versions()

try:
    what_is_installed()
except ModuleNotFoundError:
    !pip install pycaret
    what_is_installed()
System:
    python: 3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]
    executable: C:\Users\Nikhil\.conda\envs\pycaret_dev_sktime_16p1\python.exe
    machine: Windows-10-10.0.19044-SP0

PyCaret required dependencies:
    pip: 22.3.1
    setuptools: 65.6.3
    pycaret: 3.0.0rc9
    IPython: 8.10.0
    ipywidgets: 8.0.4
    tqdm: 4.64.1
    numpy: 1.23.5
    pandas: 1.5.3
    jinja2: 3.1.2
    scipy: 1.10.0
    joblib: 1.2.0
    sklearn: 1.2.1
    pyod: 1.0.8
    imblearn: 0.10.1
    category_encoders: 2.6.0
    lightgbm: 3.3.5
    numba: 0.56.4
    requests: 2.28.2
    matplotlib: 3.7.0
    scikitplot: 0.3.7
    yellowbrick: 1.5
    plotly: 5.13.0
    kaleido: 0.2.1
    statsmodels: 0.13.5
    sktime: 0.16.1
    tbats: 1.1.2
    pmdarima: 2.0.2
    psutil: 5.9.4

PyCaret optional dependencies:
    shap: 0.41.0
    interpret: Not installed
    umap: Not installed
    pandas_profiling: Not installed
    explainerdashboard: Not installed
    autoviz: Not installed
    fairlearn: Not installed
    xgboost: Not installed
    catboost: Not installed
    kmodes: Not installed
    mlxtend: Not installed
    statsforecast: Not installed
    tune_sklearn: Not installed
    ray: Not installed
    hyperopt: Not installed
    optuna: Not installed
    skopt: Not installed
    mlflow: 2.1.1
    gradio: Not installed
    fastapi: Not installed
    uvicorn: Not installed
    m2cgen: Not installed
    evidently: Not installed
    fugue: 0.8.0
    streamlit: Not installed
    prophet: 1.1.2
import time
import numpy as np
import pandas as pd
from pycaret.datasets import get_data
from pycaret.time_series import TSForecastingExperiment
y = get_data('airline', verbose=False)
# We want to forecast the next 12 months of data and we will use 3-fold cross-validation to test the models.
fh = 12  # or alternatively fh = np.arange(1, 13)
fold = 3
# Global Figure Settings for notebook ----
# Depending on whether you are using jupyter notebook, jupyter lab, Google Colab, you may have to set the renderer appropriately
# NOTE: Setting to a static renderer here so that the notebook saved size is reduced.
fig_kwargs = {
# "renderer": "notebook",
"renderer": "png",
"width": 1000,
"height": 600,
}
The pycaret Time Series Forecasting module provides a convenient interface for performing exploratory analysis using plot_model.

NOTE:
- When plot_model is used without an estimator, it will plot using the original dataset. We will cover this in the current notebook.
- When an estimator (model) is passed to plot_model, the plots are made using the model data (e.g. future forecasts, or analysis of in-sample residuals). We will cover this in a subsequent notebook.

Let's see how this works next.
First, we will plot the original dataset.
eda = TSForecastingExperiment()
eda.setup(data=y, fh=fh, fig_kwargs=fig_kwargs)
| Description | Value |
---|---|---
0 | session_id | 4531 |
1 | Target | Number of airline passengers |
2 | Approach | Univariate |
3 | Exogenous Variables | Not Present |
4 | Original data shape | (144, 1) |
5 | Transformed data shape | (144, 1) |
6 | Transformed train set shape | (132, 1) |
7 | Transformed test set shape | (12, 1) |
8 | Rows with missing values | 0.0% |
9 | Fold Generator | ExpandingWindowSplitter |
10 | Fold Number | 3 |
11 | Enforce Prediction Interval | False |
12 | Splits used for hyperparameters | all |
13 | User Defined Seasonal Period(s) | None |
14 | Ignore Seasonality Test | False |
15 | Seasonality Detection Algo | auto |
16 | Max Period to Consider | 60 |
17 | Seasonal Period(s) Tested | [12, 24, 36, 11, 48] |
18 | Significant Seasonal Period(s) | [12, 24, 36, 11, 48] |
19 | Significant Seasonal Period(s) without Harmonics | [48, 36, 11] |
20 | Remove Harmonics | False |
21 | Harmonics Order Method | harmonic_max |
22 | Num Seasonalities to Use | 1 |
23 | All Seasonalities to Use | [12] |
24 | Primary Seasonality | 12 |
25 | Seasonality Present | True |
26 | Target Strictly Positive | True |
27 | Target White Noise | No |
28 | Recommended d | 1 |
29 | Recommended Seasonal D | 1 |
30 | Preprocess | False |
31 | CPU Jobs | -1 |
32 | Use GPU | False |
33 | Log Experiment | False |
34 | Experiment Name | ts-default-name |
35 | USI | 964b |
<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x1a586ae8670>
# NOTE: This is the same as `eda.plot_model(plot="ts")`
eda.plot_model()
Before we explore the data further, there are a few minor things to know about how PyCaret prepares a modeling pipeline under the hood. The data being modeled is usually fed through an internal pipeline that has optional steps in the following order:
Data Input (by user) >> Imputation >> Transformation & Scaling >> Model
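As a conceptual sketch of that ordering, the data could flow through something like the following. The helpers below are hypothetical stand-ins, not PyCaret's internals, and the example values are made up for illustration.

```python
import numpy as np

# Hypothetical helpers illustrating the step order above; PyCaret builds
# this pipeline internally and the actual implementations differ.
def impute(y):
    y = y.copy()
    for i in range(1, len(y)):   # naive forward-fill of missing values
        if np.isnan(y[i]):
            y[i] = y[i - 1]
    return y

def transform(y):
    return np.log(y)             # e.g. an optional log transform / scaling step

raw = np.array([112.0, np.nan, 132.0])   # Data Input (by user)
model_input = transform(impute(raw))     # Imputation >> Transformation >> Model
print(model_input)
```

When a step is not needed (no missing values, no transformations), that stage is simply an identity mapping, which is the situation for our airline data below.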
While some models such as Prophet can work with missing data, the need to run statistical tests to extract useful information from the data for default model settings necessitates imputation when the data has missing values. We will discuss imputation and transformations in more detail in another notebook. For now, our data does not have any missing values or any transformations, so Data Input (by user), i.e. Original data = Imputed data = Transformed data = Data fed to models. We can verify this by plotting the internal datasets, specifying the plot_data_type in data_kwargs.
NOTE: If plot_data_type is not provided, each plot type has its own default data type that is determined internally, but users can always override it using plot_data_type.
eda.plot_model(data_kwargs={"plot_data_type": ["original", "imputed", "transformed"]})
Let's explore the standard ACF and PACF plots for the dataset next.
# ACF and PACF for the original dataset
eda.plot_model(plot="acf")
# NOTE: you can customize the plots with kwargs - e.g. number of lags, figure size (width, height), etc.
# data_kwargs such as `nlags` are passed to the underlying function that computes the ACF/PACF values
# figure kwargs such as `height`, `width` & `template` are passed to plotly and can take any value that plotly accepts
eda.plot_model(plot="pacf", data_kwargs={'nlags':36}, fig_kwargs={'height': 500, "width": 800})
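Under the hood, the ACF values are just sample autocorrelations. A minimal numpy reimplementation (a sketch, not pycaret's code, run on a synthetic period-12 series standing in for the airline data) shows the seasonal spikes that plot="acf" reveals:

```python
import numpy as np

def acf(x, nlags):
    # Sample autocorrelation function - the quantity plotted by plot="acf"
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([1.0] + [np.dot(x[:-k], x[k:]) / denom for k in range(1, nlags + 1)])

# Synthetic monthly series with a period-12 seasonal pattern (illustration only)
t = np.arange(144)
y = 30 * np.sin(2 * np.pi * t / 12)

r = acf(y, nlags=24)
print(r[12], r[24])  # strong positive spikes at the seasonal lags 12 and 24
```

For a series with a 12-month cycle, the autocorrelation is strongly positive at lags 12 and 24 and strongly negative at half-period lags such as 6 - exactly the pattern to look for in the plots above.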
Users may also wish to explore the periodogram or the FFT, which are very useful for studying the frequency components in the time series.
For example:
eda.plot_model(plot="periodogram")
eda.plot_model(plot="fft")
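To make the connection concrete, here is a numpy-only sketch of the quantity plot="fft" surfaces, using a synthetic trend-plus-seasonal series (hypothetical values) instead of the airline data: after removing the trend, the spectrum peaks at the seasonal frequency of 1/12 cycles per month.

```python
import numpy as np

# Synthetic monthly series: linear trend + period-12 seasonality (illustration only)
n = 144
t = np.arange(n)
y = 100 + 2.0 * t + 30 * np.sin(2 * np.pi * t / 12)

# Remove the linear trend first, otherwise low-frequency trend energy dominates
coef = np.polyfit(t, y, deg=1)
detrended = y - np.polyval(coef, t)

spectrum = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(n, d=1.0)           # cycles per observation (month)

peak = freqs[1 + np.argmax(spectrum[1:])]   # skip the DC bin
print(f"dominant frequency ~ {peak:.4f} cycles/month, period ~ {1 / peak:.1f} months")
```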
In the plots above, we notice a strong low-frequency component (consistent with the trend) along with peaks at the seasonal frequency and its harmonics (a period of 12 months corresponds to a frequency of about 0.0833).

Alternatively, the diagnostics plot provides all these details in a single convenient call.
eda.plot_model(plot="diagnostics", fig_kwargs={"height": 800, "width": 1000})
Our diagnostic plots indicated the need to difference and the presence of a seasonal period of 12. Let's see what happens when we remove these from the data. What other characteristics remain that would need to be taken care of?
This can be achieved through the difference plots. Along with the difference plots, we will plot the corresponding ACF, PACF, and Periodogram for further diagnostics.
# Row 1: Original
# Row 2: d = 1
# Row 3: First (d = 1) then (D = 1, s = 12)
# - Corresponds to applying a standard first difference to handle trend, and
# followed by a seasonal difference (at lag 12) to attempt to account for
# seasonal dependence.
# Ref: https://www.sktime.org/en/stable/api_reference/auto_generated/sktime.transformations.series.difference.Differencer.html
eda.plot_model(
plot="diff",
data_kwargs={"lags_list": [[1], [1, 12]], "acf": True, "pacf": True, "periodogram": True},
fig_kwargs={"height": 800, "width": 1500}
)
# ## NOTE: Another way to specify differences is using order_list
# # Row 1: Original
# # Row 2: d = 1
# # Row 3: d = 2
# eda.plot_model(
# plot="diff",
# data_kwargs={
# "order_list": [1, 2],
# "acf": True, "pacf": True, "periodogram": True
# },
# fig_kwargs={"height": 600, "width": 1200}
# )
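The effect of the two differencing steps in Row 3 can be sketched with plain numpy on a synthetic trend-plus-seasonal series (hypothetical values): the first difference removes the linear trend, and the additional lag-12 seasonal difference removes the yearly pattern.

```python
import numpy as np

# Synthetic monthly series: linear trend + period-12 seasonality (illustration only)
t = np.arange(144)
y = 100 + 2.0 * t + 30 * np.sin(2 * np.pi * t / 12)

d1 = np.diff(y, n=1)        # d = 1: removes the linear trend
d1_D1 = d1[12:] - d1[:-12]  # D = 1 at lag 12: removes the seasonal pattern

# Each step shrinks the remaining variation in this noise-free example
print(np.std(y), np.std(d1), np.std(d1_D1))
```

On real data the doubly-differenced series will not collapse to zero like this, but the same shrinking of trend and seasonal structure is what the Row 3 panels show.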
Observations: the first difference removes the trend, and the additional seasonal difference at lag 12 removes the yearly pattern, leaving a series that looks much closer to stationary.

Conclusion: the pycaret time series module will take care of this differencing internally.

Let's plot the Time Series Decomposition next (another classical diagnostic plot).
# First, classical decomposition
# By default the seasonal period is the one detected during setup - 12 in this case.
eda.plot_model(plot="decomp", fig_kwargs={"height": 500})
# Users can change the seasonal period to explore what is best for this model.
eda.plot_model(plot="decomp", data_kwargs={'seasonal_period': 24}, fig_kwargs={"height": 500})
# Users may wish to customize the decomposition. For example, in this case multiplicative seasonality
# probably makes more sense since the magnitude of the seasonality increases as time progresses
eda.plot_model(plot="decomp", data_kwargs={'type': 'multiplicative'}, fig_kwargs={"height": 500})
# Users can also plot STL decomposition
# Reference: https://otexts.com/fpp2/stl.html
eda.plot_model(plot="decomp_stl", fig_kwargs={"height": 500})
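The idea behind the additive decomposition can be sketched by hand with numpy (a toy version on synthetic, hypothetical data; the classical method uses a moving-average trend rather than the linear fit used here): estimate the trend, average the detrended values by month to get the seasonal component, and call whatever is left the residual.

```python
import numpy as np

# Synthetic monthly series: linear trend + period-12 seasonality (illustration only)
t = np.arange(144)
y = 100 + 2.0 * t + 30 * np.sin(2 * np.pi * t / 12)

# Trend: a simple linear fit here (classical decomposition uses a moving average)
trend = np.polyval(np.polyfit(t, y, deg=1), t)

# Seasonal: average the detrended values at each position within the year
detrended = y - trend
seasonal = np.tile([detrended[m::12].mean() for m in range(12)], 12)

# Residual: whatever trend + seasonal does not explain
residual = y - trend - seasonal
print(np.abs(seasonal).max(), np.abs(residual).max())
```

For this nearly noise-free series the residual is small relative to the seasonal amplitude; on real data the residual panel is where the unexplained structure shows up.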
Let us look at the various splits of the data used for modeling next.
# Show the train-test splits on the dataset
# Internally split - last len(fh) observations used as the test set, remainder used as the train set
eda.plot_model(plot="train_test_split", fig_kwargs={"height": 400, "width": 900})
# Show the Cross Validation splits inside the train set
# The blue dots represent the training data for each fold.
# The orange dots represent the validation data for each fold
eda.plot_model(plot="cv", fig_kwargs={"height": 400, "width": 900})
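A rough sketch of the expanding-window scheme shown in the "cv" plot (the exact window sizes are computed internally by PyCaret/sktime; this only illustrates the idea): each fold trains on everything up to a cutoff and validates on the next fh points, with the cutoff advancing by fh per fold.

```python
# Expanding-window CV sketch using the numbers from the setup output above
n_train, fh, n_folds = 132, 12, 3
initial = n_train - n_folds * fh    # size of the first (smallest) training window

splits = []
for k in range(n_folds):
    cutoff = initial + k * fh
    # Train on [0, cutoff), validate on the next fh observations
    splits.append((list(range(0, cutoff)), list(range(cutoff, cutoff + fh))))
    print(f"Fold {k + 1}: train [0, {cutoff}), validate [{cutoff}, {cutoff + fh})")
```

The training window expands with each fold, mirroring the growing runs of blue dots in the plot, while each orange validation window has length fh.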
Statistical testing is another important part of time series modeling. This can be achieved easily in pycaret using check_stats.

Options include 'summary', 'white_noise', 'stationarity', 'adf', 'kpss', and 'normality', all of which appear below.

By default the statistics are run on the "transformed" data, but similar to plots, the user has the ability to set this to another data type using the data_type argument, e.g. eda.check_stats(test="summary", data_type="original").
# Summary Statistics
eda.check_stats(test="summary")
| Test | Test Name | Data | Property | Setting | Value |
---|---|---|---|---|---|---
0 | Summary | Statistics | Transformed | Length | | 144.000000
1 | Summary | Statistics | Transformed | # Missing Values | | 0.000000
2 | Summary | Statistics | Transformed | Mean | | 280.298611
3 | Summary | Statistics | Transformed | Median | | 265.500000
4 | Summary | Statistics | Transformed | Standard Deviation | | 119.966317
5 | Summary | Statistics | Transformed | Variance | | 14391.917201
6 | Summary | Statistics | Transformed | Kurtosis | | -0.364942
7 | Summary | Statistics | Transformed | Skewness | | 0.583160
8 | Summary | Statistics | Transformed | # Distinct Values | | 118.000000
# Stationarity tests (ADF and KPSS)
# NOTE: Users can also just run a single test by passing either 'adf' or 'kpss' to `check_stats`
eda.check_stats(test='stationarity')
| Test | Test Name | Data | Property | Setting | Value |
---|---|---|---|---|---|---
0 | Stationarity | ADF | Transformed | Stationarity | {'alpha': 0.05} | False |
1 | Stationarity | ADF | Transformed | p-value | {'alpha': 0.05} | 0.99188 |
2 | Stationarity | ADF | Transformed | Test Statistic | {'alpha': 0.05} | 0.815369 |
3 | Stationarity | ADF | Transformed | Critical Value 1% | {'alpha': 0.05} | -3.481682 |
4 | Stationarity | ADF | Transformed | Critical Value 5% | {'alpha': 0.05} | -2.884042 |
5 | Stationarity | ADF | Transformed | Critical Value 10% | {'alpha': 0.05} | -2.57877 |
6 | Stationarity | KPSS | Transformed | Trend Stationarity | {'alpha': 0.05} | True |
7 | Stationarity | KPSS | Transformed | p-value | {'alpha': 0.05} | 0.1 |
8 | Stationarity | KPSS | Transformed | Test Statistic | {'alpha': 0.05} | 0.09615 |
9 | Stationarity | KPSS | Transformed | Critical Value 10% | {'alpha': 0.05} | 0.119 |
10 | Stationarity | KPSS | Transformed | Critical Value 5% | {'alpha': 0.05} | 0.146 |
11 | Stationarity | KPSS | Transformed | Critical Value 2.5% | {'alpha': 0.05} | 0.176 |
12 | Stationarity | KPSS | Transformed | Critical Value 1% | {'alpha': 0.05} | 0.216 |
The ADF test shows that the data is not stationary, and we saw this in the plots as well (clear trend and seasonal behavior).
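The ADF and KPSS tests have opposite null hypotheses, which is why "False" for ADF stationarity and "True" for KPSS trend stationarity are not contradictory. Applying the decision rules to the p-values reported in the table above:

```python
alpha = 0.05

# ADF: null hypothesis = unit root (non-stationary); reject only if p < alpha
adf_p = 0.99188                     # p-value from the output above
adf_stationary = adf_p < alpha      # fail to reject -> treated as non-stationary

# KPSS: null hypothesis = trend-stationary; reject only if p < alpha
kpss_p = 0.1                        # p-value from the output above
kpss_trend_stationary = not (kpss_p < alpha)  # fail to reject -> trend-stationary

print(adf_stationary, kpss_trend_stationary)
```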
# Ljung-Box test for white noise (i.e. whether the data is uncorrelated)
eda.check_stats(test='white_noise')
| Test | Test Name | Data | Property | Setting | Value |
---|---|---|---|---|---|---
0 | White Noise | Ljung-Box | Transformed | Test Statictic | {'alpha': 0.05, 'K': 24} | 1606.083817 |
1 | White Noise | Ljung-Box | Transformed | Test Statictic | {'alpha': 0.05, 'K': 48} | 1933.155822 |
2 | White Noise | Ljung-Box | Transformed | p-value | {'alpha': 0.05, 'K': 24} | 0.0 |
3 | White Noise | Ljung-Box | Transformed | p-value | {'alpha': 0.05, 'K': 48} | 0.0 |
4 | White Noise | Ljung-Box | Transformed | White Noise | {'alpha': 0.05, 'K': 24} | False |
5 | White Noise | Ljung-Box | Transformed | White Noise | {'alpha': 0.05, 'K': 48} | False |
The Ljung-Box test indicates that the data is not white noise - again, something that was clearly visible in the plots.
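For intuition, here is a minimal numpy implementation of the Ljung-Box Q statistic (a sketch, not pycaret's implementation): Q stays near its expected value of K for white noise but blows up for autocorrelated data such as a seasonal series, which is why it is then compared against a chi-squared critical value.

```python
import numpy as np

def ljung_box_q(x, K):
    # Ljung-Box statistic: Q = n(n+2) * sum_{k=1}^{K} r_k^2 / (n - k),
    # where r_k is the lag-k sample autocorrelation. Large Q => not white noise.
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    denom = np.dot(x, x)
    q = sum((np.dot(x[:-k], x[k:]) / denom) ** 2 / (n - k) for k in range(1, K + 1))
    return n * (n + 2) * q

rng = np.random.default_rng(42)
noise = rng.normal(size=144)                             # white noise
seasonal = 30 * np.sin(2 * np.pi * np.arange(144) / 12)  # strongly autocorrelated

print(ljung_box_q(noise, K=24), ljung_box_q(seasonal, K=24))
```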
# Users have the option to customize the tests, e.g. by changing the alpha value.
eda.check_stats(test='kpss', alpha = 0.2)
| Test | Test Name | Data | Property | Setting | Value |
---|---|---|---|---|---|---
0 | Stationarity | KPSS | Transformed | Trend Stationarity | {'alpha': 0.2} | False |
1 | Stationarity | KPSS | Transformed | p-value | {'alpha': 0.2} | 0.1 |
2 | Stationarity | KPSS | Transformed | Test Statistic | {'alpha': 0.2} | 0.09615 |
3 | Stationarity | KPSS | Transformed | Critical Value 10% | {'alpha': 0.2} | 0.119 |
4 | Stationarity | KPSS | Transformed | Critical Value 5% | {'alpha': 0.2} | 0.146 |
5 | Stationarity | KPSS | Transformed | Critical Value 2.5% | {'alpha': 0.2} | 0.176 |
6 | Stationarity | KPSS | Transformed | Critical Value 1% | {'alpha': 0.2} | 0.216 |
Alternatively, all of the above tests can be run in one shot by not passing any test type.
eda.check_stats()
| Test | Test Name | Data | Property | Setting | Value |
---|---|---|---|---|---|---
0 | Summary | Statistics | Transformed | Length | | 144.0
1 | Summary | Statistics | Transformed | # Missing Values | | 0.0
2 | Summary | Statistics | Transformed | Mean | | 280.298611
3 | Summary | Statistics | Transformed | Median | | 265.5
4 | Summary | Statistics | Transformed | Standard Deviation | | 119.966317
5 | Summary | Statistics | Transformed | Variance | | 14391.917201
6 | Summary | Statistics | Transformed | Kurtosis | | -0.364942
7 | Summary | Statistics | Transformed | Skewness | | 0.58316
8 | Summary | Statistics | Transformed | # Distinct Values | | 118.0
9 | White Noise | Ljung-Box | Transformed | Test Statictic | {'alpha': 0.05, 'K': 24} | 1606.083817 |
10 | White Noise | Ljung-Box | Transformed | Test Statictic | {'alpha': 0.05, 'K': 48} | 1933.155822 |
11 | White Noise | Ljung-Box | Transformed | p-value | {'alpha': 0.05, 'K': 24} | 0.0 |
12 | White Noise | Ljung-Box | Transformed | p-value | {'alpha': 0.05, 'K': 48} | 0.0 |
13 | White Noise | Ljung-Box | Transformed | White Noise | {'alpha': 0.05, 'K': 24} | False |
14 | White Noise | Ljung-Box | Transformed | White Noise | {'alpha': 0.05, 'K': 48} | False |
15 | Stationarity | ADF | Transformed | Stationarity | {'alpha': 0.05} | False |
16 | Stationarity | ADF | Transformed | p-value | {'alpha': 0.05} | 0.99188 |
17 | Stationarity | ADF | Transformed | Test Statistic | {'alpha': 0.05} | 0.815369 |
18 | Stationarity | ADF | Transformed | Critical Value 1% | {'alpha': 0.05} | -3.481682 |
19 | Stationarity | ADF | Transformed | Critical Value 5% | {'alpha': 0.05} | -2.884042 |
20 | Stationarity | ADF | Transformed | Critical Value 10% | {'alpha': 0.05} | -2.57877 |
21 | Stationarity | KPSS | Transformed | Trend Stationarity | {'alpha': 0.05} | True |
22 | Stationarity | KPSS | Transformed | p-value | {'alpha': 0.05} | 0.1 |
23 | Stationarity | KPSS | Transformed | Test Statistic | {'alpha': 0.05} | 0.09615 |
24 | Stationarity | KPSS | Transformed | Critical Value 10% | {'alpha': 0.05} | 0.119 |
25 | Stationarity | KPSS | Transformed | Critical Value 5% | {'alpha': 0.05} | 0.146 |
26 | Stationarity | KPSS | Transformed | Critical Value 2.5% | {'alpha': 0.05} | 0.176 |
27 | Stationarity | KPSS | Transformed | Critical Value 1% | {'alpha': 0.05} | 0.216 |
28 | Normality | Shapiro | Transformed | Normality | {'alpha': 0.05} | False |
29 | Normality | Shapiro | Transformed | p-value | {'alpha': 0.05} | 0.000068 |
That's it for this notebook. In the next notebook, we will see how we can start to model this data.