The FIRM score is a scoring/verification framework for multicategorical forecasts and warnings. The framework is tied to a fixed risk threshold (or probabilistic decision threshold). This is in contrast to most other verification scores for multicategorical forecasts, where the optimal probability/risk threshold depends on the sample climatological frequency. Many of those other scores would encourage forecasters to over-warn/over-forecast the more extreme, rarer events, causing many false alarms, particularly at longer lead days, which would erode the confidence that users have in forecasts.
Lower FIRM scores are better (zero is best), similar to how lower MSE scores are better.
A user needs to specify the following:

- the category thresholds that separate the forecast categories
- the relative weight of each decision threshold
- the risk threshold of the service
- whether (and by how much) to discount the penalty for near misses and near false alarms
Let's say that we have a wind warning service with three categories: no warning, gale warning and storm warning. This means that our category thresholds are [34, 48], where values correspond to wind magnitude forecasts (knots). It's twice as important to get the warning vs no warning decision threshold correct as it is to get the gale vs storm warning decision correct, so our weights are [2, 1]. The risk threshold of the service is 0.7, which reflects the cost of a miss relative to the combined cost of a miss and a false alarm. To optimise the expected FIRM score, one must forecast the highest category that has at least a 30% chance of occurring. This is calculated as $P(\text{event}) \ge 1 - \text{risk threshold}$. In this wind warning service we don't discount the penalty for near misses or near false alarms.
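To make the directive concrete, here is a minimal sketch (illustrative only, not part of scores; the exceedance probabilities are made-up values) of how a forecaster would pick the warning category implied by the risk threshold:

# Illustrative decision rule: forecast the highest category whose probability
# of occurring is at least 1 - risk threshold. The probabilities are made up.
RISK_THRESHOLD = 0.7
DECISION_THRESHOLD = 1 - RISK_THRESHOLD  # i.e. a 30% chance
prob_exceed = {"gale": 0.45, "storm": 0.20}  # P(wind >= 34 kt), P(wind >= 48 kt)

if prob_exceed["storm"] >= DECISION_THRESHOLD:
    warning = "storm warning"
elif prob_exceed["gale"] >= DECISION_THRESHOLD:
    warning = "gale warning"
else:
    warning = "no warning"

print(warning)  # "gale warning" for these example probabilities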
To summarise our FIRM framework parameters:
- category thresholds: [34, 48] (knots)
- weights: [2, 1]
- risk threshold: 0.7
Our scoring matrix based on these parameters is:

|           | None | Gale | Storm |
|-----------|------|------|-------|
| None      | 0    | 1.4  | 2.1   |
| Gale      | 0.6  | 0    | 0.7   |
| Storm     | 0.9  | 0.3  | 0     |
where the column headers correspond to the observed category and the row headers correspond to the forecast category. We can see that each missed decision threshold (an underforecast) is penalised by its weight multiplied by the risk threshold of 0.7, while each falsely exceeded decision threshold (an overforecast) is penalised by its weight multiplied by 1 - 0.7 = 0.3. For a detailed explanation of the calculations and various extensions, please read the paper.
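As a rough illustration (a sketch, not the implementation inside scores), this construction of the matrix can be written out directly from the weights and the risk threshold:

import numpy as np

RISK_THRESHOLD = 0.7
WEIGHTS = [2, 1]  # warning/no-warning threshold, gale/storm threshold
n_categories = len(WEIGHTS) + 1

# rows = forecast category, columns = observed category
matrix = np.zeros((n_categories, n_categories))
for fcst_cat in range(n_categories):
    for obs_cat in range(n_categories):
        for k, weight in enumerate(WEIGHTS):
            if fcst_cat <= k < obs_cat:  # threshold k missed (underforecast)
                matrix[fcst_cat, obs_cat] += RISK_THRESHOLD * weight
            elif obs_cat <= k < fcst_cat:  # threshold k falsely exceeded (overforecast)
                matrix[fcst_cat, obs_cat] += (1 - RISK_THRESHOLD) * weight

print(matrix)
# [[0.  1.4 2.1]
#  [0.6 0.  0.7]
#  [0.9 0.3 0. ]]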
Let's verify two synthetic forecasts for the hypothetical wind warning service described above. Forecast A's wind forecasts are unbiased compared to the observations, while Forecast B's have been biased up by 5%. This lets us see whether we can improve our forecast score by biasing our forecasts up slightly, since the service directive is to forecast the highest category that has at least a 30% chance of occurring.
We assume that our categorical warning service is derived by converting continuous obs/forecasts into our 3 categories. This is handled within the implementation of FIRM within scores.
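Conceptually (a sketch only; the exact conversion is handled internally by firm), this amounts to binning the continuous values against the category thresholds:

import numpy as np

CATEGORY_THRESHOLDS = [34, 48]  # knots
wind_speeds = np.array([20.0, 40.0, 55.0])

# 0 = no warning, 1 = gale warning, 2 = storm warning
categories = np.digitize(wind_speeds, CATEGORY_THRESHOLDS)
print(categories)  # [0 1 2]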
import xarray as xr
import numpy as np
import pandas as pd
from scores.categorical import firm
from scipy.stats import norm
# Multiplicative bias factor
BIAS_FACTOR = 1.05
# Create observations for 100 dates
obs = 50 * np.random.random((100, 100))
obs = xr.DataArray(
    data=obs,
    dims=["time", "x"],
    coords={"time": pd.date_range("2023-01-01", "2023-04-10"), "x": np.arange(0, 100)}
)
# Create forecasts for 7 lead days
fcst = xr.DataArray(data=[1]*7, dims="lead_day", coords={"lead_day": np.arange(1, 8)})
fcst = fcst * obs
# Create two forecasts. Forecast A has no bias compared to the observations, but
# Forecast B is biased upwards to align better with the forecast directive of
# forecasting the highest category that has at least a 30% chance of occurring
noise = 4 * norm.rvs(size=(7, 100, 100))
fcst_a = fcst + noise
fcst_b = BIAS_FACTOR * fcst + noise
# Calculate FIRM scores for both forecasts
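# The positional arguments after the forecast and obs are the risk threshold (0.7),
# the category thresholds ([34, 48] knots) and the decision threshold weights ([2, 1]);
# preserve_dims="lead_day" returns a score for each lead day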
fcst_a_firm = firm(fcst_a, obs, 0.7, [34, 48], [2, 1], preserve_dims="lead_day")
fcst_b_firm = firm(fcst_b, obs, 0.7, [34, 48], [2, 1], preserve_dims="lead_day")
# View results for lead day 1
fcst_a_firm.sel(lead_day=1)
<xarray.Dataset>
Dimensions:                ()
Coordinates:
    lead_day               int64 1
Data variables:
    firm_score             float64 0.08464
    overforecast_penalty   float64 0.02808
    underforecast_penalty  float64 0.05656
fcst_b_firm.sel(lead_day=1)
<xarray.Dataset>
Dimensions:                ()
Coordinates:
    lead_day               int64 1
Data variables:
    firm_score             float64 0.07648
    overforecast_penalty   float64 0.04722
    underforecast_penalty  float64 0.02926
We can see that the FIRM score for Forecast B was lower (better) than for Forecast A. This is because we biased our forecast up slightly to align better with the service directive of "forecast the highest category that has at least a 30% chance of occurring".

We can also see the overforecast and underforecast penalties in the output. This allows us to see the relative contribution of over- and under-forecasting to the overall score.
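As a follow-up, the scores and their penalty decompositions can be compared side by side for all lead days. This sketch assumes the imports and the fcst_a_firm and fcst_b_firm Datasets from the code above are still in memory:

# Stack the two results along a new "forecast" dimension for comparison
comparison = xr.concat(
    [fcst_a_firm, fcst_b_firm],
    dim=pd.Index(["A", "B"], name="forecast"),
)
# FIRM score and penalties for every forecast and lead day as a table
print(comparison.to_dataframe())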