This notebook provides a reference for writing forecasting results in a format that can be incorporated into a workflow with modelskill.
import pandas as pd
import modelskill as ms
def path_to_file(filename):
    return f"../tests/testdata/forecast_skill/{filename}"
def melt_df_by_date(df, name):
    # Unpivot the horizon representation: one row per (date, lead_time) pair.
    melted_df = pd.melt(df.reset_index(), id_vars=["date"], var_name="lead_time", value_name=name)
    melted_df["lead_time"] = melted_df["lead_time"].astype(int)
    return melted_df
def wide_to_long_representation(forecasts):
    def get_horizon_series(row):
        # Horizon (in days) between the forecast issue date and the target date.
        new_index = (pd.to_datetime(row.name) - pd.to_datetime(row.index)).days
        return pd.Series(row.values, index=new_index)

    # Transpose so rows are target dates, then re-index each row by horizon.
    predictions_by_horizon = forecasts.T.apply(
        get_horizon_series, axis=1
    ).dropna(axis=1, how="all")
    predictions_by_horizon.index.name = "date"
    predictions_by_horizon.columns = predictions_by_horizon.columns.astype(int)
    return predictions_by_horizon
Let's assume we are developing a model to forecast the next 7 days of a variable X. There are multiple strategies for storing the results of such a model; the following dataframe shows one arbitrary method. Every row in this dataframe represents a week of forecasts issued on the date denoted by the row index, so each row contains only 7 values: the forecasts for the next 7 days. Let's call this the wide representation.
Note that this strategy is only reasonable when the dataset is small enough, since this dataframe is very sparse. Still, one can follow the same structure with a different storage method, e.g., a folder tree with a subfolder for every day, where every subfolder contains a file with the forecast values.
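To make the structure concrete, here is a minimal sketch that builds a toy wide-representation frame with synthetic values and a 3-day horizon (the dates and numbers are illustrative, not taken from the dataset below):

```python
import pandas as pd

# Issue dates (rows) and target dates (columns) for a 3-day-ahead toy model.
issue_dates = pd.date_range("2023-01-01", periods=3, freq="D")
target_dates = pd.date_range("2023-01-02", periods=5, freq="D")

wide = pd.DataFrame(index=issue_dates, columns=target_dates, dtype=float)
for issue in issue_dates:
    for h in range(1, 4):  # horizons 1..3 days ahead
        # Synthetic value; a real model would put its forecast here.
        wide.loc[issue, issue + pd.Timedelta(days=h)] = 100.0 + h

# Each row holds exactly 3 values; everything else is NaN, hence the sparsity.
print(wide)
```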
forecast_model_1 = pd.read_csv(path_to_file("forecast_model_1.csv"), parse_dates=True, index_col=0)
forecast_model_1.head(10).round(2)
2023-01-01 | 2023-01-02 | 2023-01-03 | 2023-01-04 | 2023-01-05 | 2023-01-06 | 2023-01-07 | 2023-01-08 | 2023-01-09 | 2023-01-10 | ... | 2024-01-12 | 2024-01-13 | 2024-01-14 | 2024-01-15 | 2024-01-16 | 2024-01-17 | 2024-01-18 | 2024-01-19 | 2024-01-20 | 2024-01-21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2022-12-31 | 15521.82 | 14296.53 | 13821.23 | 13312.48 | 13241.17 | 12847.62 | 13003.76 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-01 | NaN | 13195.95 | 13037.03 | 12753.71 | 12843.03 | 12563.93 | 12801.62 | 13731.58 | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-02 | NaN | NaN | 13207.27 | 12875.01 | 12929.46 | 12625.52 | 12845.50 | 13762.85 | 13145.25 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-03 | NaN | NaN | NaN | 13342.34 | 13262.45 | 12862.78 | 13014.56 | 13883.31 | 13231.08 | 13000.59 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-04 | NaN | NaN | NaN | NaN | 13998.87 | 13387.50 | 13388.44 | 14149.71 | 13420.90 | 13135.85 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-05 | NaN | NaN | NaN | NaN | NaN | 13267.53 | 13302.96 | 14088.80 | 13377.50 | 13104.92 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-06 | NaN | NaN | NaN | NaN | NaN | NaN | 13804.10 | 14445.89 | 13631.93 | 13286.21 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14813.45 | 13893.84 | 13472.83 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 14173.23 | 13671.91 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2023-01-09 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 13939.92 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
10 rows × 386 columns
When evaluating forecasting results, it is often interesting to compare them at different horizons. We can write the previous dataframe more efficiently by sorting it by how long in advance each value was forecast. We can call this the horizon representation.
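As a sanity check, the transform can be verified on a toy frame; the helper from above is restated so the snippet is self-contained:

```python
import pandas as pd

def wide_to_long_representation(forecasts):
    def get_horizon_series(row):
        # Horizon (in days) between the forecast issue date and the target date.
        new_index = (pd.to_datetime(row.name) - pd.to_datetime(row.index)).days
        return pd.Series(row.values, index=new_index)

    out = forecasts.T.apply(get_horizon_series, axis=1).dropna(axis=1, how="all")
    out.index.name = "date"
    out.columns = out.columns.astype(int)
    return out

# Two issue dates, each forecasting 2 days ahead.
wide = pd.DataFrame(
    {
        pd.Timestamp("2023-01-02"): [10.0, None],
        pd.Timestamp("2023-01-03"): [11.0, 20.0],
        pd.Timestamp("2023-01-04"): [None, 21.0],
    },
    index=[pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-02")],
)
horizon = wide_to_long_representation(wide)
# The row for 2023-01-03 holds the 1-day-ahead value 20.0
# (issued 2023-01-02) and the 2-day-ahead value 11.0 (issued 2023-01-01).
```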
results_model_1 = wide_to_long_representation(forecast_model_1).dropna()
results_model_1.head(10).round(2)
1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|
date | |||||||
2023-01-07 | 13804.10 | 13302.96 | 13388.44 | 13014.56 | 12845.50 | 12801.62 | 13003.76 |
2023-01-08 | 14813.45 | 14445.89 | 14088.80 | 14149.71 | 13883.31 | 13762.85 | 13731.58 |
2023-01-09 | 14173.23 | 13893.84 | 13631.93 | 13377.50 | 13420.90 | 13231.08 | 13145.25 |
2023-01-10 | 13939.92 | 13671.91 | 13472.83 | 13286.21 | 13104.92 | 13135.85 | 13000.59 |
2023-01-11 | 14669.68 | 13872.50 | 13681.53 | 13539.68 | 13406.71 | 13277.53 | 13299.57 |
2023-01-12 | 13441.08 | 14314.18 | 13746.16 | 13610.09 | 13509.02 | 13414.27 | 13322.23 |
2023-01-13 | 14470.25 | 13174.05 | 13796.16 | 13391.43 | 13294.47 | 13222.46 | 13154.95 |
2023-01-14 | 13611.65 | 14044.00 | 13120.41 | 13563.69 | 13275.30 | 13206.22 | 13154.91 |
2023-01-15 | 15539.59 | 14242.69 | 14550.75 | 13892.66 | 14208.51 | 14003.03 | 13953.80 |
2023-01-16 | 14242.70 | 14503.05 | 13578.96 | 13798.46 | 13329.56 | 13554.61 | 13408.19 |
Notice that, even though we are working with a univariate time series, we need two dimensions to evaluate the forecast results. One dimension (the rows) marks the date for which we aim to forecast the value of X; this dimension is sometimes called the analysis time. The other dimension (the columns) represents the horizon of our forecast, i.e., how many days in advance we are forecasting X. The horizon of a forecast often has a significant impact on the model results, since it is normally easier to predict values that are closer in time.
In order to compare these results with those from a different model using modelskill, we need to write the horizon representation in long format, so that we can use the lead time as an auxiliary feature. We can call this the long representation. Finally, we need to include the actual observations of X during the analysis time to assess the performance of the model.
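Before running it on the real data, the melt-and-merge step can be sketched on a toy horizon frame (the column name model_1 mirrors the one used below; the numbers are made up):

```python
import pandas as pd

# Toy horizon representation: rows are analysis dates, columns are lead times.
horizon = pd.DataFrame(
    {1: [10.0, 12.0], 2: [11.0, 13.0]},
    index=pd.to_datetime(["2023-01-02", "2023-01-03"]),
)
horizon.index.name = "date"

# Unpivot: one row per (date, lead_time) pair.
long = pd.melt(horizon.reset_index(), id_vars=["date"],
               var_name="lead_time", value_name="model_1")

# Attach the observed values so each forecast can be scored.
obs = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-02", "2023-01-03"]),
    "observation": [10.5, 12.5],
})
long = pd.merge(left=long, right=obs, on="date")
```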
melted_results_model_1 = melt_df_by_date(results_model_1, name="model_1")
observations = pd.read_csv(path_to_file("observations.csv"), parse_dates=["date"])
melted_results_model_1 = pd.merge(left=melted_results_model_1, right=observations, on="date")
melted_results_model_1.head().round(2)
date | lead_time | model_1 | observation | |
---|---|---|---|---|
0 | 2023-01-07 | 1 | 13804.10 | 14319.96 |
1 | 2023-01-08 | 1 | 14813.45 | 15205.57 |
2 | 2023-01-09 | 1 | 14173.23 | 14549.37 |
3 | 2023-01-10 | 1 | 13939.92 | 15058.72 |
4 | 2023-01-11 | 1 | 14669.68 | 13444.34 |
The following image shows a schematic of the previous workflow, where the labels A and B denote the wide representation and the horizon representation, respectively. The label C represents the matched representation, which is described in the following section. The green box to the right contains the actual observations of X during the analysis time.
(For the sake of brevity, the image shows an example where the horizon is 3 time-steps, instead of 7.)
After writing the data in the long format, we can call modelskill to evaluate the performance of the model. As mentioned above, we separate the forecast results by lead time (or horizon). As expected, all metrics get worse as the lead time increases.
cmp = ms.from_matched(melted_results_model_1, mod_items=["model_1"], aux_items=["lead_time"], obs_item="observation")
cmp.skill(by=["model", "lead_time"]).sort_index().round(2).style()
observation | n | bias | rmse | urmse | mae | cc | si | r2 | ||
---|---|---|---|---|---|---|---|---|---|---|
model | lead_time | |||||||||
model_1 | 1 | observation | 374 | -172.710 | 811.960 | 793.380 | 635.330 | 0.900 | 0.050 | 0.800 |
2 | observation | 374 | -296.600 | 996.070 | 950.890 | 794.870 | 0.850 | 0.060 | 0.690 | |
3 | observation | 374 | -384.350 | 1184.790 | 1120.720 | 950.580 | 0.780 | 0.070 | 0.560 | |
4 | observation | 374 | -446.930 | 1323.600 | 1245.860 | 1080.360 | 0.720 | 0.080 | 0.460 | |
5 | observation | 374 | -491.980 | 1412.620 | 1324.180 | 1166.210 | 0.680 | 0.090 | 0.380 | |
6 | observation | 374 | -523.710 | 1475.760 | 1379.710 | 1221.850 | 0.650 | 0.090 | 0.320 | |
7 | observation | 374 | -545.460 | 1521.410 | 1420.270 | 1267.010 | 0.630 | 0.090 | 0.280 |
Typically we will be interested in comparing different model candidates. In order to do that, we need to load the results from a different model and combine the results of the two models into a single dataframe object.
forecast_model_2 = pd.read_csv(path_to_file("forecast_model_2.csv"), parse_dates=True, index_col=0)
results_model_2 = wide_to_long_representation(forecast_model_2).dropna()
melted_results_model_2 = melt_df_by_date(results_model_2, name="model_2")
melted_results_model_2 = pd.merge(left=melted_results_model_2, right=observations, on="date")
matched_model_results = melted_results_model_1.merge(
melted_results_model_2,
how="inner",
on=["date","lead_time", "observation"]
).sort_values(
by=["date", "lead_time"]
).set_index(
"date"
)
matched_model_results = matched_model_results.reindex(
sorted(matched_model_results.columns),
axis=1
)
# We save the results as well for later use
matched_model_results.to_csv(path_to_file("matched_model_results.csv"))
matched_model_results.head().round(2)
lead_time | model_1 | model_2 | observation | |
---|---|---|---|---|
date | ||||
2023-01-07 | 1 | 13804.10 | 12663.04 | 14319.96 |
2023-01-07 | 2 | 13302.96 | 12898.16 | 14319.96 |
2023-01-07 | 3 | 13388.44 | 13505.24 | 14319.96 |
2023-01-07 | 4 | 13014.56 | 13444.72 | 14319.96 |
2023-01-07 | 5 | 12845.50 | 13086.09 | 14319.96 |
Let's call this format the matched representation, as introduced earlier.
With the previous dataframe object, we can easily compare the performance of our two candidates.
cmp = ms.from_matched(matched_model_results, mod_items=["model_1", "model_2"], aux_items=["lead_time"], obs_item="observation")
cmp.skill(by=["model", "lead_time"]).sort_index().round(2).style()
observation | n | bias | rmse | urmse | mae | cc | si | r2 | ||
---|---|---|---|---|---|---|---|---|---|---|
model | lead_time | |||||||||
model_1 | 1 | observation | 374 | -172.710 | 811.960 | 793.380 | 635.330 | 0.900 | 0.050 | 0.800 |
2 | observation | 374 | -296.600 | 996.070 | 950.890 | 794.870 | 0.850 | 0.060 | 0.690 | |
3 | observation | 374 | -384.350 | 1184.790 | 1120.720 | 950.580 | 0.780 | 0.070 | 0.560 | |
4 | observation | 374 | -446.930 | 1323.600 | 1245.860 | 1080.360 | 0.720 | 0.080 | 0.460 | |
5 | observation | 374 | -491.980 | 1412.620 | 1324.180 | 1166.210 | 0.680 | 0.090 | 0.380 | |
6 | observation | 374 | -523.710 | 1475.760 | 1379.710 | 1221.850 | 0.650 | 0.090 | 0.320 | |
7 | observation | 374 | -545.460 | 1521.410 | 1420.270 | 1267.010 | 0.630 | 0.090 | 0.280 | |
model_2 | 1 | observation | 374 | -1281.480 | 2020.670 | 1562.340 | 1512.510 | 0.500 | 0.100 | -0.270 |
2 | observation | 374 | -1186.980 | 1967.200 | 1568.750 | 1459.470 | 0.490 | 0.100 | -0.200 | |
3 | observation | 374 | -506.580 | 1669.490 | 1590.770 | 1256.730 | 0.470 | 0.100 | 0.140 | |
4 | observation | 374 | -773.380 | 1810.800 | 1637.340 | 1323.620 | 0.420 | 0.110 | -0.020 | |
5 | observation | 374 | -1268.830 | 2069.900 | 1635.410 | 1510.860 | 0.420 | 0.110 | -0.330 | |
6 | observation | 374 | -770.870 | 1832.630 | 1662.620 | 1337.590 | 0.390 | 0.110 | -0.040 | |
7 | observation | 374 | -1176.300 | 2048.960 | 1677.660 | 1488.200 | 0.370 | 0.110 | -0.300 |