Now let's go on to our modeling step. As a reminder, our plan of action was as follows:
# Only enable critical logging (Optional)
import os
os.environ["PYCARET_CUSTOM_LOGGING_LEVEL"] = "CRITICAL"

def what_is_installed():
    from pycaret import show_versions
    show_versions()

try:
    what_is_installed()
except ModuleNotFoundError:
    !pip install pycaret
    what_is_installed()
System:
    python: 3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)]
    executable: C:\Users\Nikhil\.conda\envs\pycaret_dev_sktime_0p11_2\python.exe
    machine: Windows-10-10.0.19044-SP0

PyCaret required dependencies:
    pip: 21.2.2, setuptools: 61.2.0, pycaret: 3.0.0, ipython: Not installed,
    ipywidgets: 7.7.0, numpy: 1.21.6, pandas: 1.4.2, jinja2: 3.1.2, scipy: 1.8.0,
    joblib: 1.1.0, sklearn: 1.0.2, pyod: Installed but version unavailable,
    imblearn: 0.9.0, category_encoders: 2.4.1, lightgbm: 3.3.2, numba: 0.55.1,
    requests: 2.27.1, matplotlib: 3.5.2, scikitplot: 0.3.7, yellowbrick: 1.4,
    plotly: 5.8.0, kaleido: 0.2.1, statsmodels: 0.13.2, sktime: 0.11.4,
    tbats: Installed but version unavailable, pmdarima: 1.8.5

PyCaret optional dependencies:
    shap: Not installed, interpret: Not installed, umap: Not installed,
    pandas_profiling: Not installed, explainerdashboard: Not installed,
    autoviz: Not installed, fairlearn: Not installed, xgboost: Not installed,
    catboost: Not installed, kmodes: Not installed, mlxtend: Not installed,
    statsforecast: 0.5.5, tune_sklearn: Not installed, ray: Not installed,
    hyperopt: Not installed, optuna: Not installed, skopt: Not installed,
    mlflow: 1.25.1, gradio: Not installed, fastapi: Not installed,
    uvicorn: Not installed, m2cgen: Not installed, evidently: Not installed,
    nltk: Not installed, pyLDAvis: Not installed, gensim: Not installed,
    spacy: Not installed, wordcloud: Not installed, textblob: Not installed,
    psutil: 5.9.0, fugue: Not installed, streamlit: Not installed,
    prophet: Not installed
import numpy as np
import pandas as pd
from pycaret.datasets import get_data
from pycaret.time_series import TSForecastingExperiment
# Global Figure Settings for notebook ----
global_fig_settings = {"renderer": "notebook", "width": 1000, "height": 600}
data = get_data("airquality")
data["index"] = pd.to_datetime(data["Date"] + " " + data["Time"])
data.drop(columns=["Date", "Time"], inplace=True)
data.replace(-200, np.nan, inplace=True)
target = "CO(GT)"
exclude = ['NMHC(GT)', 'AH']
data.drop(columns=exclude, inplace=True)
data.head()
  | Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2004-03-10 | 18:00:00 | 2.6 | 1360 | 150 | 11.9 | 1046 | 166 | 1056 | 113 | 1692 | 1268 | 13.6 | 48.9 | 0.7578 |
1 | 2004-03-10 | 19:00:00 | 2.0 | 1292 | 112 | 9.4 | 955 | 103 | 1174 | 92 | 1559 | 972 | 13.3 | 47.7 | 0.7255 |
2 | 2004-03-10 | 20:00:00 | 2.2 | 1402 | 88 | 9.0 | 939 | 131 | 1140 | 114 | 1555 | 1074 | 11.9 | 54.0 | 0.7502 |
3 | 2004-03-10 | 21:00:00 | 2.2 | 1376 | 80 | 9.2 | 948 | 172 | 1092 | 122 | 1584 | 1203 | 11.0 | 60.0 | 0.7867 |
4 | 2004-03-10 | 22:00:00 | 1.6 | 1272 | 51 | 6.5 | 836 | 131 | 1205 | 116 | 1490 | 1110 | 11.2 | 59.6 | 0.7888 |
  | CO(GT) | PT08.S1(CO) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | index |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.6 | 1360.0 | 11.9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13.6 | 48.9 | 2004-03-10 18:00:00 |
1 | 2.0 | 1292.0 | 9.4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13.3 | 47.7 | 2004-03-10 19:00:00 |
2 | 2.2 | 1402.0 | 9.0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11.9 | 54.0 | 2004-03-10 20:00:00 |
3 | 2.2 | 1376.0 | 9.2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11.0 | 60.0 | 2004-03-10 21:00:00 |
4 | 1.6 | 1272.0 | 6.5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11.2 | 59.6 | 2004-03-10 22:00:00 |
data_uni = data.copy()
data_uni.set_index("index", inplace=True)
data_uni = data_uni[target]
exp_uni = TSForecastingExperiment()
exp_uni.setup(
data=data_uni, fh=48,
numeric_imputation_target="ffill", numeric_imputation_exogenous="ffill",
fig_kwargs=global_fig_settings, session_id=42
)
  | Description | Value |
---|---|---|
0 | session_id | 42 |
1 | Target | CO(GT) |
2 | Approach | Univariate |
3 | Exogenous Variables | Not Present |
4 | Original data shape | (9357, 1) |
5 | Transformed data shape | (9357, 1) |
6 | Transformed train set shape | (9309, 1) |
7 | Transformed test set shape | (48, 1) |
8 | Rows with missing values | 18.0% |
9 | Fold Generator | ExpandingWindowSplitter |
10 | Fold Number | 3 |
11 | Enforce Prediction Interval | False |
12 | Seasonal Period(s) Tested | 24 |
13 | Seasonality Present | True |
14 | Seasonalities Detected | [24] |
15 | Primary Seasonality | 24 |
16 | Target Strictly Positive | True |
17 | Target White Noise | No |
18 | Recommended d | 1 |
19 | Recommended Seasonal D | 0 |
20 | Preprocess | True |
21 | Numerical Imputation (Target) | ffill |
22 | Transformation (Target) | None |
23 | Scaling (Target) | None |
24 | CPU Jobs | -1 |
25 | Use GPU | False |
26 | Log Experiment | False |
27 | Experiment Name | ts-default-name |
28 | USI | 968c |
<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x1b8dae6dc10>
model = exp_uni.create_model("arima", order=(0,1,0), seasonal_order=(0,1,0,24))
  | cutoff | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 |
---|---|---|---|---|---|---|---|---|
0 | 2005-03-27 14:00 | 1.2101 | 1.1315 | 1.0340 | 1.4835 | 0.5751 | 0.9044 | -1.3625 |
1 | 2005-03-29 14:00 | 2.2086 | 1.6277 | 1.8839 | 2.1316 | 1.5146 | 0.7456 | -2.8068 |
2 | 2005-03-31 14:00 | 1.1332 | 0.8553 | 0.9652 | 1.1183 | 1.1796 | 1.2402 | -5.1529 |
Mean | nan | 1.5173 | 1.2048 | 1.2944 | 1.5778 | 1.0898 | 0.9634 | -3.1074 |
SD | nan | 0.4899 | 0.3196 | 0.4178 | 0.4190 | 0.3888 | 0.2062 | 1.5620 |
exp_uni.plot_model(model)
Zooming in on the forecasts, we can see that the model captures some of the trends (spikes) in the dataset, but not all. The mean MASE across the CV folds is 1.52, which is not great: any value > 1 means the model performs worse than a naive one-step-ahead forecast. This model clearly needs improvement. Let's see if adding exogenous variables can help.
exp_exo = TSForecastingExperiment()
exp_exo.setup(
data=data, target=target, index="index", fh=48,
numeric_imputation_target="ffill", numeric_imputation_exogenous="ffill",
fig_kwargs=global_fig_settings, session_id=42
)
  | Description | Value |
---|---|---|
0 | session_id | 42 |
1 | Target | CO(GT) |
2 | Approach | Univariate |
3 | Exogenous Variables | Present |
4 | Original data shape | (9357, 11) |
5 | Transformed data shape | (9357, 11) |
6 | Transformed train set shape | (9309, 11) |
7 | Transformed test set shape | (48, 11) |
8 | Rows with missing values | 25.8% |
9 | Fold Generator | ExpandingWindowSplitter |
10 | Fold Number | 3 |
11 | Enforce Prediction Interval | False |
12 | Seasonal Period(s) Tested | 24 |
13 | Seasonality Present | True |
14 | Seasonalities Detected | [24] |
15 | Primary Seasonality | 24 |
16 | Target Strictly Positive | True |
17 | Target White Noise | No |
18 | Recommended d | 1 |
19 | Recommended Seasonal D | 0 |
20 | Preprocess | True |
21 | Numerical Imputation (Target) | ffill |
22 | Transformation (Target) | None |
23 | Scaling (Target) | None |
24 | Numerical Imputation (Exogenous) | ffill |
25 | Transformation (Exogenous) | None |
26 | Scaling (Exogenous) | None |
27 | CPU Jobs | -1 |
28 | Use GPU | False |
29 | Log Experiment | False |
30 | Experiment Name | ts-default-name |
31 | USI | 5f62 |
<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x1b8b54cfe20>
model_exo = exp_exo.create_model("arima", order=(0,1,0), seasonal_order=(0,1,0,24))
  | cutoff | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 |
---|---|---|---|---|---|---|---|---|
0 | 2005-03-27 14:00 | 0.1473 | 0.1281 | 0.1259 | 0.1680 | 0.0825 | 0.0824 | 0.9697 |
1 | 2005-03-29 14:00 | 0.1931 | 0.1628 | 0.1647 | 0.2132 | 0.1043 | 0.1112 | 0.9619 |
2 | 2005-03-31 14:00 | 0.3009 | 0.2474 | 0.2563 | 0.3235 | 0.2767 | 0.3375 | 0.4852 |
Mean | nan | 0.2138 | 0.1794 | 0.1823 | 0.2349 | 0.1545 | 0.1770 | 0.8056 |
SD | nan | 0.0644 | 0.0501 | 0.0547 | 0.0653 | 0.0869 | 0.1141 | 0.2266 |
exp_exo.plot_model(model_exo)
Not bad! We have managed to improve the mean MASE to ~0.21, which is much better than the univariate model and a large improvement over a naive forecast. Let's finalize the model by training it on the entire dataset so we can make true future forecasts.
final_model_exo = exp_exo.finalize_model(model_exo)
def safe_predict(exp, model):
    """Prediction wrapper for demo purposes."""
    try:
        exp.predict_model(model)
    except ValueError as exception:
        print(exception)
        exo_vars = exp.exogenous_variables
        print(
            f"{len(exo_vars)} exogenous variables (X) needed in order "
            f"to make future predictions:\n{exo_vars}"
        )
safe_predict(exp_exo, final_model_exo)
Model was trained with exogenous variables but you have not passed any for predictions. Please pass exogenous variables to make predictions. 10 exogenous variables (X) needed in order to make future predictions: ['PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH']
As we can see, this approach does not come without side effects. The problem is that we have 10 exogenous variables. Hence, to forecast any unknown future values of the CO concentration, we will also need the future values of all these exogenous variables. These are generally obtained by forecasting the exogenous variables themselves, but each of those forecasts carries its own error, and these errors can compound when there are many exogenous variables.
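To see why these errors compound, consider a toy setup (purely illustrative, not this dataset) in which the target's forecast error is simply the sum of independent errors from each exogenous forecast: the resulting error grows with the number of exogenous variables.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_rmse(n_exog, exog_error_sd=0.3, n_sim=20_000):
    # If each exogenous forecast contributes an independent error with
    # standard deviation `exog_error_sd`, the target's error is their sum,
    # so its RMSE grows like sqrt(n_exog) * exog_error_sd
    errors = rng.normal(scale=exog_error_sd, size=(n_sim, n_exog)).sum(axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))

print(target_rmse(3))   # close to sqrt(3)  * 0.3 ≈ 0.52
print(target_rmse(10))  # close to sqrt(10) * 0.3 ≈ 0.95
```

Fewer exogenous variables therefore means fewer upstream forecasts whose errors can leak into the final CO forecast.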
Let's see if we can trim down these exogenous variables to a handful of useful variables without compromising on forecasting performance.
From the CCF analysis, we found that many of the exogenous variables show a very similar correlation structure to the CO concentration. For example, PT08.S1(CO), NOx(GT), C6H6(GT), and PT08.S2(NMHC) values from 24 hours before (lag = 24) show a high positive correlation with the CO concentration. Instead of keeping all of them, let's pick the one with the highest positive correlation at lag 24, which is NOx(GT).

Similarly, PT08.S3(NOx) values from 24 hours ago show the highest negative correlation with the CO concentration. Let's keep this variable as well.

Finally, in daily cycles, what happens 12 hours back can also impact the current value (e.g. values from last night can impact the next day, and vice versa). The variable with the highest correlation with the CO concentration at lag = 12 is RH. We will keep this as well.
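The selection above relies on lagged correlations from the earlier CCF analysis. The underlying computation can be sketched with plain pandas on a synthetic series (the names and numbers below are stand-ins, not the actual air-quality data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in: the target tracks a candidate exogenous series
# with a 24-step delay, plus noise
candidate = pd.Series(rng.normal(size=500))
target_series = 0.8 * candidate.shift(24) + rng.normal(scale=0.3, size=500)

def lagged_corr(y, x, lag):
    # Correlation between the target and the candidate shifted back by `lag` steps
    return y.corr(x.shift(lag))

print(round(lagged_corr(target_series, candidate, 24), 2))  # strong positive
print(round(lagged_corr(target_series, candidate, 0), 2))   # near zero
```

In the actual analysis, one would compute `lagged_corr` for each exogenous column at lags 12 and 24 and keep the strongest positive and negative candidates, as described above.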
exp_slim = TSForecastingExperiment()
keep = [target, "index", 'NOx(GT)', "PT08.S3(NOx)", "RH"]
data_slim = data[keep]
exp_slim.setup(
data=data_slim, target=target, index="index", fh=48,
numeric_imputation_target="ffill", numeric_imputation_exogenous="ffill",
fig_kwargs=global_fig_settings, session_id=42
)
  | Description | Value |
---|---|---|
0 | session_id | 42 |
1 | Target | CO(GT) |
2 | Approach | Univariate |
3 | Exogenous Variables | Present |
4 | Original data shape | (9357, 4) |
5 | Transformed data shape | (9357, 4) |
6 | Transformed train set shape | (9309, 4) |
7 | Transformed test set shape | (48, 4) |
8 | Rows with missing values | 25.8% |
9 | Fold Generator | ExpandingWindowSplitter |
10 | Fold Number | 3 |
11 | Enforce Prediction Interval | False |
12 | Seasonal Period(s) Tested | 24 |
13 | Seasonality Present | True |
14 | Seasonalities Detected | [24] |
15 | Primary Seasonality | 24 |
16 | Target Strictly Positive | True |
17 | Target White Noise | No |
18 | Recommended d | 1 |
19 | Recommended Seasonal D | 0 |
20 | Preprocess | True |
21 | Numerical Imputation (Target) | ffill |
22 | Transformation (Target) | None |
23 | Scaling (Target) | None |
24 | Numerical Imputation (Exogenous) | ffill |
25 | Transformation (Exogenous) | None |
26 | Scaling (Exogenous) | None |
27 | CPU Jobs | -1 |
28 | Use GPU | False |
29 | Log Experiment | False |
30 | Experiment Name | ts-default-name |
31 | USI | e941 |
<pycaret.time_series.forecasting.oop.TSForecastingExperiment at 0x1b884a208b0>
model_slim = exp_slim.create_model("arima", order=(0,1,0), seasonal_order=(0,1,0,24))
  | cutoff | MASE | RMSSE | MAE | RMSE | MAPE | SMAPE | R2 |
---|---|---|---|---|---|---|---|---|
0 | 2005-03-27 14:00 | 0.2174 | 0.1891 | 0.1857 | 0.2479 | 0.1339 | 0.1230 | 0.9340 |
1 | 2005-03-29 14:00 | 0.2644 | 0.2209 | 0.2255 | 0.2893 | 0.1358 | 0.1524 | 0.9299 |
2 | 2005-03-31 14:00 | 0.2366 | 0.1972 | 0.2015 | 0.2579 | 0.2503 | 0.3316 | 0.6729 |
Mean | nan | 0.2395 | 0.2024 | 0.2043 | 0.2650 | 0.1733 | 0.2023 | 0.8456 |
SD | nan | 0.0193 | 0.0135 | 0.0164 | 0.0177 | 0.0544 | 0.0922 | 0.1221 |
exp_slim.plot_model(model_slim)
Not bad. The mean MASE has only increased from ~0.21 to ~0.24, but we have cut the exogenous variables down from 10 to 3. This will help when we make "true" future predictions, since we will only need the unknown future values of these three exogenous variables to forecast the CO concentration.
final_slim_model = exp_slim.finalize_model(model_slim)
exp_slim.save_model(final_slim_model, "final_slim_model")
Transformation Pipeline and Model Successfully Saved
(ForecastingPipeline(steps=[('transformer_exogenous', TransformerPipeline(steps=[('numerical_imputer', Imputer(method='ffill', random_state=42))])), ('forecaster', TransformedTargetForecaster(steps=[('transformer_target', TransformerPipeline(steps=[('numerical_imputer', Imputer(method='ffill', random_state=42))])), ('model', ARIMA(order=(0, 1, 0), seasonal_order=(0, 1, 0, 24)))]))]), 'final_slim_model.pkl')
safe_predict(exp_slim, final_slim_model)
Model was trained with exogenous variables but you have not passed any for predictions. Please pass exogenous variables to make predictions. 3 exogenous variables (X) needed in order to make future predictions: ['NOx(GT)', 'PT08.S3(NOx)', 'RH']
So we still need future values for the 3 remaining exogenous variables. We will obtain these in the next part using forecasting techniques.