This notebook presents different methods to evaluate and visualize the skill of a forecasting model.
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
import seaborn as sns
import modelskill as ms
my_color_dict = {
    "dummy": "#768082",
    "model_1": "#ff6700",
    "model_2": "#004E98",
    "observation": "#55a630",
    "other": "#2d2d2c",
}
def path_to_file(filename):
    return f"../tests/testdata/forecast_skill/{filename}"
matched_model_results = pd.read_csv(path_to_file("matched_model_results.csv"), parse_dates=True, index_col=0)
matched_model_results.round(2).head(7)
| date | lead_time | model_1 | model_2 | observation |
|---|---|---|---|---|
| 2023-01-07 | 1 | 13804.10 | 12663.04 | 14319.96 |
| 2023-01-07 | 2 | 13302.96 | 12898.16 | 14319.96 |
| 2023-01-07 | 3 | 13388.44 | 13505.24 | 14319.96 |
| 2023-01-07 | 4 | 13014.56 | 13444.72 | 14319.96 |
| 2023-01-07 | 5 | 12845.50 | 13086.09 | 14319.96 |
| 2023-01-07 | 6 | 12801.62 | 13413.75 | 14319.96 |
| 2023-01-07 | 7 | 13003.76 | 13204.51 | 14319.96 |
We start from a dataframe containing matched results from two models; each model forecasts the value of an arbitrary variable X up to 7 days in advance. The `lead_time` column indicates how many days in advance the value of X was forecast. To see how such a dataframe is generated, see the data wrangling notebook.
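For orientation, a quick look at the structure of this dataframe (plain pandas introspection, nothing specific to `modelskill`):

```python
# The index is the date at which X was observed; each date has one row per lead time,
# i.e. per number of days in advance at which that value was forecast.
print(matched_model_results.index.name, matched_model_results.shape)
print(sorted(matched_model_results["lead_time"].unique()))  # expected: 1..7 at this stage
```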
Our goal is to assess which of the two candidate models (`model_1` and `model_2`) better represents the underlying observed system. When comparing models, we often want a simple benchmark to make sure that the candidates improve on naive prediction methods based on simple heuristics. A commonly used benchmark is, for instance, a model that always predicts the mean of the available observations. `modelskill` provides the `DummyModelResult` class to generate such a naive model, as shown below.
observations = matched_model_results["observation"].drop_duplicates().to_frame()
obs = ms.PointObservation(observations)
# Naive benchmark: a dummy model that always predicts the mean of the observations
mr = ms.DummyModelResult(strategy='mean')
pmr = mr.extract(obs).to_dataframe()
fig, ax = plt.subplots(1, 1, figsize=(12, 3))
pmr.join(observations, on="time").plot(ax=ax, color=my_color_dict)
ax.set_xlabel("")
ax.legend(frameon=False)
plt.show()
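As a quick sanity check (a sketch, assuming that the `'mean'` strategy simply predicts the overall mean of the matched observations), we can verify that the dummy series is constant and equal to the observation mean:

```python
# Sketch: the dummy benchmark should be a constant series equal to the mean of the observations.
print(pmr["dummy"].round(2).unique())
print(round(observations["observation"].mean(), 2))  # expected to match the value above
```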
Now, we proceed to merge the results of this dummy model with our matched results.
pmr.index.name = "date"
matched_model_results = matched_model_results.join(pmr, on="date", how="inner")
Finally, we include the results of the models at lead time zero to better visualize how skill changes as the horizon increases. Notice that, for all models except the dummy, we copy the values of the observations, since we assume that the value of X has already been observed at the time of the forecast. The dummy model is excluded from this operation since it is only used as a benchmark.
observations_ = observations["observation"]
dummy_ = matched_model_results[["dummy"]].reset_index().drop_duplicates().set_index("date").sort_index()["dummy"]
lead_time_zero = pd.DataFrame(
    {
        "lead_time": 0,
        "model_1": observations_,
        "model_2": observations_,
        "observation": observations_,
        "dummy": dummy_,
    },
    index=observations.index,
)
lead_time_zero.index.name = "date"
matched_model_results = pd.concat([matched_model_results, lead_time_zero]).sort_values(["date", "lead_time"])
matched_model_results.round(2).head(8)
| date | lead_time | model_1 | model_2 | observation | dummy |
|---|---|---|---|---|---|
| 2023-01-07 | 0 | 14319.96 | 14319.96 | 14319.96 | 15237.57 |
| 2023-01-07 | 1 | 13804.10 | 12663.04 | 14319.96 | 15237.57 |
| 2023-01-07 | 2 | 13302.96 | 12898.16 | 14319.96 | 15237.57 |
| 2023-01-07 | 3 | 13388.44 | 13505.24 | 14319.96 | 15237.57 |
| 2023-01-07 | 4 | 13014.56 | 13444.72 | 14319.96 | 15237.57 |
| 2023-01-07 | 5 | 12845.50 | 13086.09 | 14319.96 | 15237.57 |
| 2023-01-07 | 6 | 12801.62 | 13413.75 | 14319.96 | 15237.57 |
| 2023-01-07 | 7 | 13003.76 | 13204.51 | 14319.96 | 15237.57 |
With the previous dataframe, it is straightforward to generate the following skill table. Recall that we want to analyze how the models perform at different horizons (from 1 to 7 days ahead); for this reason we pass the lead-time column as `aux_items`. The results can then be filtered by model and by lead time (see the sketch after the table).
cmp = ms.from_matched(matched_model_results, mod_items=["model_1", "model_2", "dummy"], aux_items=["lead_time"], obs_item="observation")
metrics = ["bias", "rmse", "mae", "mape"]
sk = cmp.skill(by=["model", "lead_time"], metrics=metrics).sort_index()
sk.round(2).style()
| model | lead_time | observation | n | bias | rmse | mae | mape |
|---|---|---|---|---|---|---|---|
| dummy | 0 | observation | 374 | 0.00 | 1796.13 | 1348.01 | 8.50 |
| dummy | 1 | observation | 374 | 0.00 | 1796.13 | 1348.01 | 8.50 |
| dummy | 2 | observation | 374 | 0.00 | 1796.13 | 1348.01 | 8.50 |
| dummy | 3 | observation | 374 | 0.00 | 1796.13 | 1348.01 | 8.50 |
| dummy | 4 | observation | 374 | 0.00 | 1796.13 | 1348.01 | 8.50 |
| dummy | 5 | observation | 374 | 0.00 | 1796.13 | 1348.01 | 8.50 |
| dummy | 6 | observation | 374 | 0.00 | 1796.13 | 1348.01 | 8.50 |
| dummy | 7 | observation | 374 | 0.00 | 1796.13 | 1348.01 | 8.50 |
| model_1 | 0 | observation | 374 | 0.00 | 0.00 | 0.00 | 0.00 |
| model_1 | 1 | observation | 374 | -172.71 | 811.96 | 635.33 | 4.10 |
| model_1 | 2 | observation | 374 | -296.60 | 996.07 | 794.87 | 5.06 |
| model_1 | 3 | observation | 374 | -384.35 | 1184.79 | 950.58 | 6.03 |
| model_1 | 4 | observation | 374 | -446.93 | 1323.60 | 1080.36 | 6.85 |
| model_1 | 5 | observation | 374 | -491.98 | 1412.62 | 1166.21 | 7.39 |
| model_1 | 6 | observation | 374 | -523.71 | 1475.76 | 1221.85 | 7.75 |
| model_1 | 7 | observation | 374 | -545.46 | 1521.41 | 1267.01 | 8.04 |
| model_2 | 0 | observation | 374 | 0.00 | 0.00 | 0.00 | 0.00 |
| model_2 | 1 | observation | 374 | -1281.48 | 2020.67 | 1512.51 | 9.26 |
| model_2 | 2 | observation | 374 | -1186.98 | 1967.20 | 1459.47 | 8.91 |
| model_2 | 3 | observation | 374 | -506.58 | 1669.49 | 1256.73 | 7.79 |
| model_2 | 4 | observation | 374 | -773.38 | 1810.80 | 1323.62 | 8.10 |
| model_2 | 5 | observation | 374 | -1268.83 | 2069.90 | 1510.86 | 9.17 |
| model_2 | 6 | observation | 374 | -770.87 | 1832.63 | 1337.59 | 8.18 |
| model_2 | 7 | observation | 374 | -1176.30 | 2048.96 | 1488.20 | 9.04 |
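As mentioned above, the skill table can also be filtered by model and by lead time. A minimal sketch, assuming `SkillTable.to_dataframe()` is available (as in recent `modelskill` versions), so that only standard pandas indexing is needed afterwards:

```python
# Sketch: filter the skill table by model and by lead time.
# sk_df is an ordinary pandas DataFrame with a (model, lead_time) MultiIndex.
sk_df = sk.to_dataframe()
print(sk_df.xs("model_1", level="model").round(2))   # one model, all lead times
print(sk_df.query("lead_time == 3").round(2))        # all models at a 3-day lead time
```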
Now we can use the plotting functionality from `modelskill` directly, with minimal extra formatting, to visualize the skill table as follows.
def _format_ax(ax, xlims, ylims):
    ax.set_xlim(*xlims)
    ax.set_ylim(*ylims)
    ax.set_xlabel("Lead time [days]", fontsize=12)
    ax.set_ylabel("MAPE [%]", fontsize=12)
    ax.legend(frameon=False, fontsize=12, bbox_to_anchor=(1.25, 0.7))
    ax.tick_params(axis="both", which="major", labelsize=12)
    ax.set_title("Skill by lead time")
    ax.figure.autofmt_xdate(rotation=0, ha='center')
fig, ax = plt.subplots(1, 1, figsize=(8, 5))
sk.mape.plot.line(ax=ax, color=my_color_dict)
_format_ax(ax, xlims=(0, 7), ylims=(0, 10))
plt.show()
Notice that, in this case, we have chosen MAPE as the skill measure.
The previous figure suggests that model 2 is no better than the naive approach, whereas model 1 is a clear improvement. Still, the skill of model 1 decreases with lead time and, as we approach the 7-day horizon, the three models perform very similarly. The trend seen for model 1 is typical of forecasting models, since the most recent observations carry the most information about future values of X.
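For reference, the MAPE values reported above are mean absolute percentage errors. A quick cross-check with plain pandas (a sketch, assuming the standard definition `100 * mean(|observation - model| / |observation|)`):

```python
# Sketch: reproduce one MAPE entry of the skill table with plain pandas.
subset = matched_model_results.query("lead_time == 1")
manual_mape = (100 * (subset["observation"] - subset["model_1"]).abs() / subset["observation"].abs()).mean()
print(round(manual_mape, 2))  # expected to be close to the 4.10 reported for model_1 at lead time 1
```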
The results in the previous section were generated by aggregating the errors per lead time into a single skill value. However, we might instead be interested in how the errors are distributed rather than in a single aggregate per lead time. To do this, we define the following function, which extends the matched results with three new columns: `error`, `absolute_error` and `percentage_error`.
def generate_melted_error_df(matched_results):
    # Reshape to long format: one row per (date, lead_time, model), keeping the observation alongside
    melted_model_results = pd.melt(matched_results.reset_index(), id_vars=["date", "lead_time", "observation"], var_name="model")
    melted_model_results["error"] = melted_model_results["observation"] - melted_model_results["value"]
    melted_model_results["absolute_error"] = abs(melted_model_results["error"])
    melted_model_results["percentage_error"] = 100 * melted_model_results["absolute_error"] / melted_model_results["observation"]
    return melted_model_results
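Before plotting, it can be useful to glance at the extended (melted) results; a minimal usage sketch:

```python
# One row per (date, lead_time, model), with the three error columns appended.
generate_melted_error_df(matched_model_results).head(3)
```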
Next, we define a function that, given the extended matched results, plots the distribution of errors by model and lead time.
def plot_error_distribution_array(mmr, my_color_dict, diag_ylims=(0, 20)):
    # Diagonal panels: percentage-error quantile bands (25/50/75%) vs lead time for a single model
    def plot_skill_quantile_by_leadtime(df, ax, color=None):
        for interval in [25, 50, 75]:
            sns.lineplot(
                df,
                x="lead_time",
                y="percentage_error",
                estimator="median",
                errorbar=("pi", interval),
                ax=ax,
                linewidth=2.5,
                err_kws=dict(linewidth=0, alpha=0.15),
                color=color,
            )
        ax.legend([], [], frameon=False)
        ax.set_xlabel("Lead time [days]", fontsize=12)
        ax.set_ylabel(ylabel_dict[errorcol], fontsize=12)

    # Upper-triangle panels: pairwise model comparison with split violin plots per lead time
    def plot_model_comparison_with_violinplot(df, ax):
        sns.violinplot(
            data=df,
            x="lead_time",
            y=errorcol,
            hue="model",
            split=True,
            gap=0.15,
            inner="quarts",
            ax=ax,
            alpha=0.9,
            saturation=0.7,
            bw_adjust=0.6,
            common_norm=True,
            density_norm="area",
            cut=0,
            palette=my_color_dict,
        )
        ax.spines[['right', 'top']].set_visible(False)
        ax.legend(frameon=False, ncols=2)
        ax.set_xlabel("")
        ax.set_ylabel("")

    # Empty panel holding only the legend that explains the quantile shading
    def create_empty_plot_with_legend(ax):
        color = my_color_dict["other"]
        rect1 = Patch(facecolor=color, alpha=0.45, label='25%')
        rect2 = Patch(facecolor=color, alpha=0.30, label='50%')
        rect3 = Patch(facecolor=color, alpha=0.15, label='75%')
        line = Line2D([0], [0], color=color, label='Median')
        ax.legend(handles=[rect1, rect2, rect3, line], loc='upper right', title='Error quantiles:', frameon=False)
        ax.set_axis_off()

    ylabel_dict = {"error": "Error", "absolute_error": "Absolute error", "percentage_error": "Percentage error [%]"}
    errorcol = "percentage_error"
    models = mmr["model"].unique().tolist()
    fig, axes = plt.subplots(3, 3, figsize=(12, 10))
    for i, model_i in enumerate(models):
        # Diagonal: single-model error distribution by lead time
        df_ii = mmr.loc[mmr["model"] == model_i, :].copy()
        plot_skill_quantile_by_leadtime(df_ii, axes[i, i], color=my_color_dict[model_i])
        axes[i, i].tick_params(axis="both", which="major", labelsize=12)
        axes[i, i].set_xlim(1, 7)
        axes[i, i].set_ylim(*diag_ylims)
        other_models = [m for m in models if m != model_i]
        for model_j in other_models:
            j = models.index(model_j)
            if j > i:
                # Upper triangle: pairwise comparison of model_i and model_j
                comb = (model_i, model_j)
                df_ij = mmr.loc[mmr["model"].isin(comb), :].copy()
                plot_model_comparison_with_violinplot(df_ij, axes[i, j])
            elif j == (i - 1):
                # Cell just below the diagonal: shared legend for the quantile bands
                create_empty_plot_with_legend(axes[i, j])
            else:
                # Remaining lower-triangle cells stay empty
                axes[i, j].set_axis_off()
            axes[i, j].tick_params(axis="both", which="major", labelsize=12)
    plt.tight_layout()
    plt.show()
Lastly, we call the previous function on the matched results:
melted_results = generate_melted_error_df(matched_model_results)
plot_error_distribution_array(melted_results, my_color_dict)
The previous figure shows how the errors spread out at different horizons, and the upper-right subplots compare the models directly. Model 2 clearly presents a much wider error distribution, even though its median error is reasonably low. These results suggest that model 2 will miss the target more often than model 1.
The violin plots confirm this: the error distribution of model 2 consistently has a longer tail than that of model 1. Interestingly, even though model 2's distribution is wider, beyond a lead time of 5 days there is a shift and its median error becomes slightly lower than that of model 1.
All previous results have been computed using all available data. However, we might be interested in fine-tuning the models for specific situations. For example, let's assume it is critical that our model performs well when the value of X is above a certain threshold. We can then filter the comparison with `modelskill` by doing:
threshold = 15000
skill_high_values = cmp.where(cmp.data.Observation > threshold).skill(by=["model", "lead_time"], metrics=metrics).sort_index()
The observed values actually used in this filtered skill assessment are highlighted in the figure below.
fig, ax = plt.subplots(1, 1, figsize=(14, 3))
observations_.plot(ax=ax, color=my_color_dict["other"], zorder=0, label="")
observations_[observations_ > threshold].to_frame().reset_index().plot(kind="scatter", x="date", y="observation", label="Used in Comparer", ax=ax, color=my_color_dict["observation"])
ax.axhline(y=threshold, alpha=0.7, color="grey", linestyle="--")
ax.set_ylabel("Observations")
ax.set_ylim(12000, 23000)
ax.legend(frameon=False)
plt.show()
Now, it is straightforward to repeat the previous analysis.
fig, ax = plt.subplots(1, 1, figsize=(8, 5))
skill_high_values.mape.plot.line(ax=ax, color=my_color_dict)
_format_ax(ax, xlims=(0, 7), ylims=(0, 17))
plt.show()
filtered_matched_results = matched_model_results.loc[matched_model_results["observation"] > threshold,:].copy()
melted_results = generate_melted_error_df(filtered_matched_results)
plot_error_distribution_array(melted_results, my_color_dict, diag_ylims=(0, 25))
In this case, we observe that model 1 underperforms the dummy model at longer lead times, although the tails of model 1's error distribution remain the shortest. Depending on the application, we might want to go back and improve model 1 to ensure better performance across all horizons.
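As a final illustration (a sketch, not a prescribed workflow), the same `cmp.where` pattern can combine several conditions, for example restricting the assessment to high observed values and short lead times. This assumes the auxiliary item is exposed on `cmp.data` under its original name `lead_time`; the names `cond` and `skill_high_short` are just illustrative.

```python
# Sketch: combine conditions on the comparer's underlying dataset
# (assumes the auxiliary item is available as cmp.data.lead_time).
cond = (cmp.data.Observation > threshold) & (cmp.data.lead_time <= 3)
skill_high_short = cmp.where(cond).skill(by=["model", "lead_time"], metrics=metrics).sort_index()
skill_high_short.round(2)
```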