The FIRM score is a scoring/verification framework for multicategorical forecasts and warnings. The framework is tied to a fixed risk threshold (or probabilistic decision threshold). This is in contrast to most other verification scores for multicategorical forecasts, where the optimal probability/risk threshold depends on the sample climatological frequency. Many of those other scores would encourage forecasters to over-warn/over-forecast the more extreme, rarer events, causing many false alarms, particularly at longer lead days, which would erode the confidence that users have in forecasts.
Lower FIRM scores are better (zero is best), similar to how lower MSE scores are better.
A user needs to specify the following:

- the category thresholds that separate the forecast categories
- the relative weight of each decision threshold
- the risk threshold of the service
- whether (and by how much) to discount the penalty for near misses and near false alarms
Let's say that we have a wind warning service with three categories: no warning, gale warning and storm warning. This means that our category thresholds are [34, 48], where values correspond to wind magnitude forecasts (knots). It's twice as important to get the warning vs no warning decision threshold correct as it is to get the gale vs storm warning decision correct, so our weights are [2, 1]. The risk threshold of the service is 0.7, which reflects the cost of a miss relative to the combined cost of a miss and a false alarm. To optimise the expected FIRM score, one must forecast the highest category that has at least a 30% chance of occurring. This is calculated as $P(\text{event}) \ge 1 - \text{risk threshold}$. In this wind warning service we don't discount the penalty for near misses or near false alarms.
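To make the directive concrete, here is a minimal sketch (illustrative only, not part of scores; the exceedance probabilities are made-up values) of how a forecaster would pick the warning category implied by the risk threshold:

# Illustrative decision rule: forecast the highest category whose probability
# of occurring is at least 1 - risk threshold. The probabilities are made up.
RISK_THRESHOLD = 0.7
DECISION_THRESHOLD = 1 - RISK_THRESHOLD  # i.e. a 30% chance
prob_exceed = {"gale": 0.45, "storm": 0.20}  # P(wind >= 34 kt), P(wind >= 48 kt)

if prob_exceed["storm"] >= DECISION_THRESHOLD:
    warning = "storm warning"
elif prob_exceed["gale"] >= DECISION_THRESHOLD:
    warning = "gale warning"
else:
    warning = "no warning"

print(warning)  # "gale warning" for these example probabilities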
To summarise our FIRM framework parameters:
- category thresholds: [34, 48] (knots)
- weights: [2, 1]
- risk threshold: 0.7
Our scoring matrix based on these parameters is:

|           | None | Gale | Storm |
|-----------|------|------|-------|
| None      | 0    | 1.4  | 2.1   |
| Gale      | 0.6  | 0    | 0.7   |
| Storm     | 0.9  | 0.3  | 0     |
where the column headers correspond to the observed category and the row headers correspond to the forecast category. We can see that each missed decision threshold (an underforecast) is penalised by its weight multiplied by the risk threshold of 0.7, while each falsely exceeded decision threshold (an overforecast) is penalised by its weight multiplied by 1 - 0.7 = 0.3. For a detailed explanation of the calculations and various extensions, please read the paper.
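As a rough illustration (a sketch, not the implementation inside scores), this construction of the matrix can be written out directly from the weights and the risk threshold:

import numpy as np

RISK_THRESHOLD = 0.7
WEIGHTS = [2, 1]  # warning/no-warning threshold, gale/storm threshold
n_categories = len(WEIGHTS) + 1

# rows = forecast category, columns = observed category
matrix = np.zeros((n_categories, n_categories))
for fcst_cat in range(n_categories):
    for obs_cat in range(n_categories):
        for k, weight in enumerate(WEIGHTS):
            if fcst_cat <= k < obs_cat:  # threshold k missed (underforecast)
                matrix[fcst_cat, obs_cat] += RISK_THRESHOLD * weight
            elif obs_cat <= k < fcst_cat:  # threshold k falsely exceeded (overforecast)
                matrix[fcst_cat, obs_cat] += (1 - RISK_THRESHOLD) * weight

print(matrix)
# [[0.  1.4 2.1]
#  [0.6 0.  0.7]
#  [0.9 0.3 0. ]]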
Let's verify two synthetic forecasts for the hypothetical wind warning service described above. Forecast A's wind forecasts are unbiased compared to the observations, while Forecast B's have been biased up by 5%. This lets us see whether we can improve our forecast score by biasing our forecasts up slightly, since the service directive is to forecast the highest category that has at least a 30% chance of occurring.
We assume that our categorical warning service is derived by converting continuous obs/forecasts into our 3 categories. This is handled within the implementation of FIRM within scores.
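Conceptually (a sketch only; the exact conversion is handled internally by firm), this amounts to binning the continuous values against the category thresholds:

import numpy as np

CATEGORY_THRESHOLDS = [34, 48]  # knots
wind_speeds = np.array([20.0, 40.0, 55.0])

# 0 = no warning, 1 = gale warning, 2 = storm warning
categories = np.digitize(wind_speeds, CATEGORY_THRESHOLDS)
print(categories)  # [0 1 2]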
import xarray as xr
import numpy as np
import pandas as pd
from scores.categorical import firm
from scipy.stats import norm
# Multiplicative bias factor
BIAS_FACTOR = 1.05
# Create observations for 100 dates
obs = 50 * np.random.random((100, 100))
obs = xr.DataArray(
    data=obs,
    dims=["time", "x"],
    coords={"time": pd.date_range("2023-01-01", "2023-04-10"), "x": np.arange(0, 100)}
)
# Create forecasts for 7 lead days
fcst = xr.DataArray(data=[1]*7, dims="lead_day", coords={"lead_day": np.arange(1, 8)})
fcst = fcst * obs
# Create two forecasts. Forecast A has no bias compared to the observations, but
# Forecast B is biased upwards to align better with the forecast directive of
# forecasting the highest category that has at least a 30% chance of occurring
noise = 4 * norm.rvs(size=(7, 100, 100))
fcst_a = fcst + noise
fcst_b = BIAS_FACTOR * fcst + noise
# Calculate FIRM scores for both forecasts
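# The positional arguments after the forecast and obs are the risk threshold (0.7),
# the category thresholds ([34, 48] knots) and the decision threshold weights ([2, 1]);
# preserve_dims="lead_day" returns a score for each lead day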
fcst_a_firm = firm(fcst_a, obs, 0.7, [34, 48], [2, 1], preserve_dims="lead_day")
fcst_b_firm = firm(fcst_b, obs, 0.7, [34, 48], [2, 1], preserve_dims="lead_day")
# View results for lead day 1
fcst_a_firm.sel(lead_day=1)
<xarray.Dataset>
Dimensions:                ()
Coordinates:
    lead_day               int64 1
Data variables:
    firm_score             float64 0.08464
    overforecast_penalty   float64 0.02808
    underforecast_penalty  float64 0.05656
fcst_b_firm.sel(lead_day=1)
<xarray.Dataset>
Dimensions:                ()
Coordinates:
    lead_day               int64 1
Data variables:
    firm_score             float64 0.07648
    overforecast_penalty   float64 0.04722
    underforecast_penalty  float64 0.02926
We can see that the FIRM score for Forecast B was lower (better) than for Forecast A. This is because we biased our forecast up slightly to align better with the service directive of "forecast the highest category that has at least a 30% chance of occurring".

We can also see the overforecast and underforecast penalties in the output. This allows us to see the relative contribution of over- and under-forecasting to the overall score.
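As a follow-up, the scores and their penalty decompositions can be compared side by side for all lead days. This sketch assumes the imports and the fcst_a_firm and fcst_b_firm Datasets from the code above are still in memory:

# Stack the two results along a new "forecast" dimension for comparison
comparison = xr.concat(
    [fcst_a_firm, fcst_b_firm],
    dim=pd.Index(["A", "B"], name="forecast"),
)
# FIRM score and penalties for every forecast and lead day as a table
print(comparison.to_dataframe())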