In this set of notebooks, we will cover modeling with exogenous variables. In this notebook, we load and clean the data, set up an exploratory experiment, and visualize the target along with the exogenous variables.
# Only enable critical logging (Optional)
import os
os.environ["PYCARET_CUSTOM_LOGGING_LEVEL"] = "CRITICAL"
def what_is_installed():
    from pycaret import show_versions

    show_versions()


try:
    what_is_installed()
except ModuleNotFoundError:
    !pip install pycaret
    what_is_installed()
System:
    python: 3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]
    executable: C:\Users\Nikhil\.conda\envs\pycaret_dev_sktime_16p1\python.exe
    machine: Windows-10-10.0.19044-SP0

PyCaret required dependencies:
    pip: 22.3.1
    setuptools: 65.6.3
    pycaret: 3.0.0rc9
    IPython: 8.10.0
    ipywidgets: 8.0.4
    tqdm: 4.64.1
    numpy: 1.23.5
    pandas: 1.5.3
    jinja2: 3.1.2
    scipy: 1.10.0
    joblib: 1.2.0
    sklearn: 1.2.1
    pyod: 1.0.8
    imblearn: 0.10.1
    category_encoders: 2.6.0
    lightgbm: 3.3.5
    numba: 0.56.4
    requests: 2.28.2
    matplotlib: 3.7.0
    scikitplot: 0.3.7
    yellowbrick: 1.5
    plotly: 5.13.0
    kaleido: 0.2.1
    statsmodels: 0.13.5
    sktime: 0.16.1
    tbats: 1.1.2
    pmdarima: 2.0.2
    psutil: 5.9.4

PyCaret optional dependencies:
    shap: 0.41.0
    interpret: Not installed
    umap: Not installed
    pandas_profiling: Not installed
    explainerdashboard: Not installed
    autoviz: Not installed
    fairlearn: Not installed
    xgboost: Not installed
    catboost: Not installed
    kmodes: Not installed
    mlxtend: Not installed
    statsforecast: Not installed
    tune_sklearn: Not installed
    ray: Not installed
    hyperopt: Not installed
    optuna: Not installed
    skopt: Not installed
    mlflow: 2.1.1
    gradio: Not installed
    fastapi: Not installed
    uvicorn: Not installed
    m2cgen: Not installed
    evidently: Not installed
    fugue: 0.8.0
    streamlit: Not installed
    prophet: 1.1.2
import numpy as np
import pandas as pd
from pycaret.datasets import get_data
from pycaret.time_series import TSForecastingExperiment
# Global Figure Settings for notebook ----
# Depending on whether you are using jupyter notebook, jupyter lab, Google Colab, you may have to set the renderer appropriately
# NOTE: Setting to a static renderer here so that the notebook saved size is reduced.
global_fig_settings = {
    # "renderer": "notebook",
    "renderer": "png",
    "width": 1000,
    "height": 600,
}
data = get_data("airquality")
data["index"] = pd.to_datetime(data["Date"] + " " + data["Time"])
data.drop(columns=["Date", "Time"], inplace=True)
target = "CO(GT)"
|   | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2004-03-10 | 18:00:00 | 2.6 | 1360 | 150 | 11.9 | 1046 | 166 | 1056 | 113 | 1692 | 1268 | 13.6 | 48.9 | 0.7578 |
| 1 | 2004-03-10 | 19:00:00 | 2.0 | 1292 | 112 | 9.4 | 955 | 103 | 1174 | 92 | 1559 | 972 | 13.3 | 47.7 | 0.7255 |
| 2 | 2004-03-10 | 20:00:00 | 2.2 | 1402 | 88 | 9.0 | 939 | 131 | 1140 | 114 | 1555 | 1074 | 11.9 | 54.0 | 0.7502 |
| 3 | 2004-03-10 | 21:00:00 | 2.2 | 1376 | 80 | 9.2 | 948 | 172 | 1092 | 122 | 1584 | 1203 | 11.0 | 60.0 | 0.7867 |
| 4 | 2004-03-10 | 22:00:00 | 1.6 | 1272 | 51 | 6.5 | 836 | 131 | 1205 | 116 | 1490 | 1110 | 11.2 | 59.6 | 0.7888 |
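Before dealing with missing values, it is worth confirming that the combined datetime column is well behaved. The quick check below is a minimal optional sketch using plain pandas (nothing pycaret-specific); for hourly data we expect a dominant 1-hour step between consecutive timestamps.
# Sanity check the combined datetime column ----
print(data.shape)
print(data["index"].min(), "->", data["index"].max())
# Spacing between consecutive timestamps; expect mostly 1 hour for hourly data
print(data["index"].diff().value_counts().head())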
The dataset tags missing values as -200 (see the dataset reference). We should replace these values with NaN and let pycaret handle the imputation appropriately, which prevents data leakage during training.
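To get a sense of how widespread these sentinels are before replacing them, we can count the -200 values per column. This is a small optional sketch using standard pandas; only numeric columns are checked so that the datetime column is skipped.
# Count the -200 sentinel values per numeric column ----
sentinel_counts = data.select_dtypes("number").eq(-200).sum()
print(sentinel_counts.sort_values(ascending=False))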
data[data[target] == -200].head()
|   | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | -200.0 | 1011 | 14 | 1.3 | 527 | 21 | 1818 | 34 | 1197 | 445 | 10.1 | 60.5 | 0.7465 | 2004-03-11 04:00:00 |
| 34 | -200.0 | 831 | 10 | 1.1 | 506 | 21 | 1893 | 32 | 1134 | 384 | 6.1 | 65.9 | 0.6248 | 2004-03-12 04:00:00 |
| 39 | -200.0 | 1545 | -200 | 22.1 | 1353 | -200 | 767 | -200 | 2058 | 1588 | 9.2 | 56.2 | 0.6561 | 2004-03-12 09:00:00 |
| 58 | -200.0 | 1147 | 56 | 6.2 | 821 | 109 | 1132 | 83 | 1412 | 992 | 7.0 | 71.1 | 0.7158 | 2004-03-13 04:00:00 |
| 82 | -200.0 | 1130 | 56 | 5.2 | 773 | 70 | 1130 | 82 | 1452 | 1051 | 12.1 | 61.1 | 0.8603 | 2004-03-14 04:00:00 |
data.replace(-200, np.nan, inplace=True)
data[data[target] == -200]
| CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Now, let's move on to the EDA and modeling.
# Create an EDA experiment ----
eda = TSForecastingExperiment()
eda.setup(
    data=data,
    target=target,
    index="index",
    fh=48,
    numeric_imputation_target="ffill",
    numeric_imputation_exogenous="ffill",
    # Set defaults for the plots ----
    fig_kwargs=global_fig_settings,
    session_id=42,
)
  | Description | Value |
---|---|---|
0 | session_id | 42 |
1 | Target | CO(GT) |
2 | Approach | Univariate |
3 | Exogenous Variables | Present |
4 | Original data shape | (9357, 13) |
5 | Transformed data shape | (9357, 13) |
6 | Transformed train set shape | (9309, 13) |
7 | Transformed test set shape | (48, 13) |
8 | Rows with missing values | 91.2% |
9 | Fold Generator | ExpandingWindowSplitter |
10 | Fold Number | 3 |
11 | Enforce Prediction Interval | False |
12 | Splits used for hyperparameters | all |
13 | User Defined Seasonal Period(s) | None |
14 | Ignore Seasonality Test | False |
15 | Seasonality Detection Algo | auto |
16 | Max Period to Consider | 60 |
17 | Seasonal Period(s) Tested | [24, 48, 25, 23, 47, 49, 12, 36, 60] |
18 | Significant Seasonal Period(s) | [24, 48, 25, 23, 47, 49, 12, 36, 60] |
19 | Significant Seasonal Period(s) without Harmonics | [48, 25, 23, 47, 49, 60, 36] |
20 | Remove Harmonics | False |
21 | Harmonics Order Method | harmonic_max |
22 | Num Seasonalities to Use | 1 |
23 | All Seasonalities to Use | [24] |
24 | Primary Seasonality | 24 |
25 | Seasonality Present | True |
26 | Target Strictly Positive | True |
27 | Target White Noise | No |
28 | Recommended d | 1 |
29 | Recommended Seasonal D | 0 |
30 | Preprocess | True |
31 | Numerical Imputation (Target) | ffill |
32 | Transformation (Target) | None |
33 | Scaling (Target) | None |
34 | Feature Engineering (Target) - Reduced Regression | False |
35 | Numerical Imputation (Exogenous) | ffill |
36 | Transformation (Exogenous) | None |
37 | Scaling (Exogenous) | None |
38 | CPU Jobs | -1 |
39 | Use GPU | False |
40 | Log Experiment | False |
41 | Experiment Name | ts-default-name |
42 | USI | 37d7 |
<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x203109abac0>
Even before proceeding, the setup summary already reveals some useful information: the dataset contains exogenous variables, 91.2% of the rows have at least one missing value (hence the imputation settings above), the primary seasonal period detected is 24 (hourly data with a daily cycle), and the recommended differencing order d is 1.
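If you want to cross-check the "Rows with missing values" figure from the setup summary directly against the data, a pandas one-liner suffices. Note this is a rough sketch computed on the full dataset (target and exogenous columns together), so it should roughly match the 91.2% reported above.
# Cross-check the percentage of rows containing at least one missing value ----
pct_rows_missing = data.isna().any(axis=1).mean() * 100
print(f"{pct_rows_missing:.1f}% of rows contain at least one missing value")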
# Plot the target variable along with any exogenous variables ----
# Because the dataset is quite large, an interactive plot can slow down the notebook.
# Hence, we revert to a static renderer for this plot.
eda.plot_model(fig_kwargs={"renderer": "png", "width": 1000, "height": 1200})
# However, when you use the display_format "plotly-dash" or "plotly-widget",
# the plotly-resampler toolkit is used as the visualization back-end.
# This keeps the plot snappy and interactive by aggregating the data:
# when you zoom in/out, the data points are re-rendered for the region of interest.
# NOTE: Uncomment display_format below to use "plotly-dash" (or set it to "plotly-widget").
eda.plot_model(
    plot="ts",
    fig_kwargs={
        "height": 1200,
        # resampler_kwargs configures the constructor of the plotly-resampler object.
        "resampler_kwargs": {
            "default_n_shown_samples": 1500,
            # show_dash holds the kwargs passed to the show_dash (render) method.
            "show_dash": {"mode": "inline", "port": 8055},
        },
    },
    # display_format="plotly-dash",
)
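Beyond the raw time-series plot, plot_model offers other diagnostic plots that are useful during EDA. The calls below are an optional sketch; the plot names ("acf", "diagnostics") come from pycaret's standard time-series plotting options, and we reuse the static renderer settings defined earlier.
# Optional: additional EDA plots ----
# Autocorrelation of the target; useful to eyeball the detected seasonal period of 24
eda.plot_model(plot="acf", fig_kwargs=global_fig_settings)
# General diagnostics of the target (distribution, QQ plot, ACF, etc.)
eda.plot_model(plot="diagnostics", fig_kwargs=global_fig_settings)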