The Brier score is the most commonly used verification metric for evaluating probabilistic forecasts of binary outcomes, such as a "chance of rainfall" forecast.
Probabilistic forecasts can be expressed as values between 0 and 1, or as an ensemble. We can use scores to evaluate both kinds of forecasts.
Probabilistic forecasts of binary events can be expressed with values between 0 and 1, and observations are exactly 0 (event did not occur), or 1 (event occurred).
The metric is then calculated the same way as MSE and is defined as

$$s(x, y) = (x - y)^2$$

Where:
- $x$ is the forecast probability
- $y$ is the binary observation
The Brier score is a strictly proper scoring rule that is negatively oriented, meaning lower values are better; a perfect score is 0 and the worst possible score is 1.
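For example, the score for a single forecast case follows directly from the definition:

# Hand check of the definition: a confident forecast of 0.9 scores well when
# the event occurs (y=1) and poorly when it does not (y=0)
print((0.9 - 1) ** 2)  # 0.01
print((0.9 - 0) ** 2)  # 0.81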
from scores.probability import brier_score
from scipy.stats import beta, binom
import numpy as np
import xarray as xr
np.random.seed(100)
# To learn more about the implementation of the Brier score, uncomment the following
# help(brier_score)
We generate two synthetic forecasts. By design, fcst1 is a good forecast, while fcst2 is a poor forecast. We measure the difference in skill by calculating and comparing their Brier scores.
# fcst1 is reliable: each observation is drawn with probability equal to the forecast
fcst1 = beta.rvs(2, 1, size=1000)
obs = binom.rvs(1, fcst1)
# fcst2 is drawn from a different distribution and is unrelated to the observations
fcst2 = beta.rvs(0.5, 1, size=1000)
fcst1 = xr.DataArray(data=fcst1, dims="time", coords={"time": np.arange(0, 1000)})
fcst2 = xr.DataArray(data=fcst2, dims="time", coords={"time": np.arange(0, 1000)})
obs = xr.DataArray(data=obs, dims="time", coords={"time": np.arange(0, 1000)})
brier_fcst1 = brier_score(fcst1, obs)
brier_fcst2 = brier_score(fcst2, obs)
print(f"Brier score for fcst1 = {brier_fcst1.item():.2f}")
print(f"Brier score for fcst2 = {brier_fcst2.item():.2f}")
Brier score for fcst1 = 0.17
Brier score for fcst2 = 0.43
As expected, fcst1 has the lower Brier score, quantifying the degree to which it is better than fcst2.
By default, brier_score checks that the forecast values lie in the unit interval [0, 1] and that the observations are exactly 0 or 1. If you are confident that your data meets these requirements, you can skip these checks by setting the check_args arg to False in brier_score.
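For example, a sketch reusing fcst1 and obs from above:

# Assumes fcst1 is known to lie in [0, 1] and obs is known to be exactly
# 0 or 1, so the input validity checks can be skipped
brier_score(fcst1, obs, check_args=False)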
scores can also be used to evaluate ensembles using the Brier score. By default, it calculates the fair Brier score, which is defined as

$$s_{i,y} = \left(\frac{i}{m} - y\right)^2 - \frac{i(m - i)}{m^2(m - 1)}$$

Where:
- $i$ is the number of ensemble members that predict the event
- $m$ is the total number of ensemble members
- $y$ is the binary observation
It is possible to calculate the Brier score without the fair correction (by setting fair_correction=False). In this case, the Brier score is calculated as

$$s_{i,y} = \left(\frac{i}{m} - y\right)^2$$
Using the Brier score without the fair adjustment may favour ensembles whose members behave as if they were sampled from a different distribution than the observations. For more information, see Ferro (2014).
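As a quick hand check of the two formulas above, consider a 3-member ensemble in which 2 members predict the event and the event occurs:

i, m, y = 2, 3, 1  # 2 of 3 members predict the event; the event occurred
unadjusted = (i / m - y) ** 2  # (2/3 - 1)^2, approximately 0.111
fair = unadjusted - i * (m - i) / (m**2 * (m - 1))  # correction term is 2/18, approximately 0.111
print(round(unadjusted, 3), round(fair, 3))  # 0.111 0.0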
Let's first do a synthetic experiment with 10,000 timesteps to demonstrate the impact of the fair correction using the scores.probability.brier_score_for_ensemble function. Suppose the observations are randomly drawn from the distribution Pr(y=1)=0.5. Let's now create two ensemble forecasts. The first (which we will call fcst_a) is an ideal forecast: it has 2 ensemble members, each randomly drawn from the same distribution as the observations. The second ensemble (called fcst_b) has 20 ensemble members but is biased: each member is randomly drawn from the distribution Pr(y=1)=0.4.
Let's now calculate the Brier score with and without the fair adjustment.
from scores.probability import brier_score_for_ensemble
# Uncomment the following line to learn more about the ensemble Brier score
# help(brier_score_for_ensemble)
N = 10000  # number of timesteps
M1 = 2     # ensemble members in fcst_a
M2 = 20    # ensemble members in fcst_b

# Observations drawn from Pr(y=1) = 0.5
obs = np.random.choice([0, 1], size=N, p=[0.5, 0.5])
time = np.arange(N)
obs = xr.DataArray(obs, dims=["time"], coords={"time": time})

# fcst_a: ideal 2-member ensemble, drawn from the same distribution as the observations
fcst_a = np.random.choice([0, 1], size=(M1, N), p=[0.5, 0.5])
ensemble = np.arange(M1)
fcst_a = xr.DataArray(fcst_a, dims=["ensemble", "time"], coords={"ensemble": ensemble, "time": time})

# fcst_b: biased 20-member ensemble, drawn from Pr(y=1) = 0.4
fcst_b = np.random.choice([0, 1], size=(M2, N), p=[0.6, 0.4])
ensemble = np.arange(M2)
fcst_b = xr.DataArray(fcst_b, dims=["ensemble", "time"], coords={"ensemble": ensemble, "time": time})
First we calculate the fair Brier score for both ensembles. The event_thresholds arg defines the threshold at which an event occurs. By default, events are inclusive of the exact value of the threshold.
brier_score_for_ensemble(fcst_a, obs, event_thresholds=0.5, ensemble_member_dim="ensemble")
<xarray.DataArray (threshold: 1)>
array([0.2532])
Coordinates:
  * threshold  (threshold) float64 0.5
brier_score_for_ensemble(fcst_b, obs, event_thresholds=0.5, ensemble_member_dim="ensemble")
<xarray.DataArray (threshold: 1)>
array([0.26047579])
Coordinates:
  * threshold  (threshold) float64 0.5
As expected, fcst_a had a lower (better) Brier score than fcst_b. Now, let's calculate the Brier score without the fair adjustment.
brier_score_for_ensemble(fcst_a, obs, event_thresholds=0.5, ensemble_member_dim="ensemble", fair_correction=False)
<xarray.DataArray (threshold: 1)>
array([0.37765])
Coordinates:
  * threshold  (threshold) float64 0.5
brier_score_for_ensemble(fcst_b, obs, event_thresholds=0.5, ensemble_member_dim="ensemble", fair_correction=False)
<xarray.DataArray (threshold: 1)>
array([0.27247375])
Coordinates:
  * threshold  (threshold) float64 0.5
We can see that without the fair correction, brier_score_for_ensemble ranked the biased fcst_b as much better than fcst_a. This shows the importance of using fair scores with ensembles to avoid getting misleading results!
Note: if an ensemble with only one member is passed into the function, the unadjusted Brier score is returned in order to avoid a divide-by-zero error.
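We can verify this behaviour with a quick sketch (the isel call below keeps the ensemble dimension, leaving a single member):

fcst_single = fcst_a.isel(ensemble=[0])  # a 1-member ensemble
with_fair = brier_score_for_ensemble(fcst_single, obs, event_thresholds=0.5, ensemble_member_dim="ensemble")
without_fair = brier_score_for_ensemble(fcst_single, obs, event_thresholds=0.5, ensemble_member_dim="ensemble", fair_correction=False)
print(with_fair.equals(without_fair))  # True: the fair correction was not applied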
You can also calculate the Brier score for multiple thresholds:
brier_score_for_ensemble(fcst_a, obs, event_thresholds=[-1, 0.5, 1, 3], ensemble_member_dim="ensemble")
<xarray.DataArray (threshold: 4)>
array([0.    , 0.2532, 0.2532, 0.    ])
Coordinates:
  * threshold  (threshold) float64 -1.0 0.5 1.0 3.0
Pass a list into the event_thresholds arg if you want to calculate the Brier score for multiple thresholds, as shown above. Set event_threshold_mode=operator.gt if you don't want the exact threshold value to be included as an event.
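For example, a short sketch using the event_threshold_mode arg (with a threshold of 1, members exactly equal to 1 count as events under the default inclusive mode, but not under operator.gt):

import operator

brier_score_for_ensemble(fcst_a, obs, event_thresholds=1, ensemble_member_dim="ensemble", event_threshold_mode=operator.gt)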