This notebook shows how to use probabilistic metrics in a typical data science task where the data is a pandas.DataFrame.
The metric Continuous Ranked Probability Score (CRPS) is used to verify multiple forecasts for the same target.
import xarray as xr
import pandas as pd
import numpy as np
import xskillscore as xs
Use the same data as in 01_Deterministic.ipynb
stores = np.arange(4)
skus = np.arange(3)
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "QUANTITY_SOLD": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)
df.head()
DATE | STORE | SKU | y
---|---|---|---
2020-01-01 | 0 | 0 | 6
2020-01-01 | 0 | 1 | 9
2020-01-01 | 0 | 2 | 2
2020-01-01 | 1 | 0 | 6
2020-01-01 | 1 | 1 | 8
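The nested loops above enumerate the Cartesian product of dates, stores and SKUs. As a side note, `pandas.MultiIndex.from_product` can build the same index directly; a minimal sketch of that equivalent construction:

```python
import numpy as np
import pandas as pd

stores = np.arange(4)
skus = np.arange(3)
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")

# Cartesian product of the three levels, named like the loop-built index
idx = pd.MultiIndex.from_product(
    [dates, stores, skus], names=["DATE", "STORE", "SKU"]
)
df = pd.DataFrame({"y": np.random.randint(1, 10, size=len(idx))}, index=idx)
print(df.shape)  # (60, 1): 5 dates x 4 stores x 3 SKUs
```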
Instead of making a single prediction as in 01_Deterministic.ipynb we will make multiple forecasts (an ensemble forecast). This is akin to more complex methods such as bagging, boosting and stacking.
Make 6 forecasts and append them to the pandas.DataFrame using an extra field called member. This will be saved in a new pandas.DataFrame called df_yhat:
tmp = df.copy()
forecasts = []
for i in range(1, 7):
    tmp['member'] = i
    noise = np.random.uniform(-1, 1, size=len(df['y']))
    tmp['yhat'] = (df['y'] + (df['y'] * noise)).astype(int)
    forecasts.append(tmp.copy())
# DataFrame.append was removed in pandas 2.0; collect and concatenate instead
df_yhat = pd.concat(forecasts)
df_yhat
DATE | STORE | SKU | y | member | yhat
---|---|---|---|---|---
2020-01-01 | 0 | 0 | 6 | 1 | 4
2020-01-01 | 0 | 1 | 9 | 1 | 7
2020-01-01 | 0 | 2 | 2 | 1 | 1
2020-01-01 | 1 | 0 | 6 | 1 | 1
2020-01-01 | 1 | 1 | 8 | 1 | 5
... | ... | ... | ... | ... | ...
2020-01-05 | 2 | 1 | 3 | 6 | 0
2020-01-05 | 2 | 2 | 1 | 6 | 1
2020-01-05 | 3 | 0 | 9 | 6 | 7
2020-01-05 | 3 | 1 | 3 | 6 | 3
2020-01-05 | 3 | 2 | 2 | 6 | 0
360 rows × 3 columns
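As an aside, `pd.concat` can also create the member level directly from dictionary keys, which avoids the separate drop/set_index step. A sketch on a smaller hypothetical frame (names `forecasts`, `df_yhat` chosen for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["STORE", "SKU"])
df = pd.DataFrame({"y": np.random.randint(1, 10, size=len(idx))}, index=idx)

rng = np.random.default_rng(0)

# One perturbed forecast per member; the dict keys become the `member` level
forecasts = {
    m: (df["y"] * (1 + rng.uniform(-1, 1, len(df)))).astype(int).rename("yhat")
    for m in range(1, 7)
}
df_yhat = pd.concat(forecasts, names=["member"]).to_frame()
print(list(df_yhat.index.names))  # ['member', 'STORE', 'SKU']
```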
Drop the y column from df_yhat and add member to the MultiIndex:
df_yhat.drop('y', axis=1, inplace=True)
df_yhat.set_index(['member'], append=True, inplace=True)
df_yhat
DATE | STORE | SKU | member | yhat
---|---|---|---|---
2020-01-01 | 0 | 0 | 1 | 4
2020-01-01 | 0 | 1 | 1 | 7
2020-01-01 | 0 | 2 | 1 | 1
2020-01-01 | 1 | 0 | 1 | 1
2020-01-01 | 1 | 1 | 1 | 5
... | ... | ... | ... | ...
2020-01-05 | 2 | 1 | 6 | 0
2020-01-05 | 2 | 2 | 6 | 1
2020-01-05 | 3 | 0 | 6 | 7
2020-01-05 | 3 | 1 | 6 | 3
2020-01-05 | 3 | 2 | 6 | 0
360 rows × 1 columns
Convert the target pandas.DataFrame (df) to an xarray.Dataset:
ds = df.to_xarray()
ds
DATE: array(['2020-01-01T00:00:00.000000000', '2020-01-02T00:00:00.000000000', '2020-01-03T00:00:00.000000000', '2020-01-04T00:00:00.000000000', '2020-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
STORE: array([0, 1, 2, 3])
SKU: array([0, 1, 2])
y: array([[[6, 9, 2], [6, 8, 8], [2, 3, 8], [8, 2, 7]], [[2, 4, 9], [3, 4, 6], [7, 2, 2], [2, 9, 1]], [[6, 5, 8], [3, 1, 9], [6, 1, 9], [3, 9, 8]], [[4, 3, 7], [1, 1, 3], [8, 5, 7], [3, 7, 3]], [[1, 5, 6], [9, 2, 9], [7, 3, 1], [9, 3, 2]]])
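to_xarray turns each MultiIndex level into a named dimension, so the 60-row column becomes a 5 × 4 × 3 cube. A toy sketch of the same conversion (smaller sizes; names df_small/ds_small are hypothetical):

```python
import numpy as np
import pandas as pd

# A tiny MultiIndex frame: 2 dates x 2 stores x 3 SKUs
idx = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", periods=2), [0, 1], [0, 1, 2]],
    names=["DATE", "STORE", "SKU"],
)
df_small = pd.DataFrame({"y": np.arange(12)}, index=idx)

# Each index level becomes a dimension of the resulting Dataset
ds_small = df_small.to_xarray()
print(ds_small["y"].dims, ds_small["y"].shape)  # ('DATE', 'STORE', 'SKU') (2, 2, 3)
```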
Now add the predicted pandas.DataFrame (df_yhat) as an xarray.DataArray called yhat to the xarray.Dataset:
ds['yhat'] = df_yhat.to_xarray()['yhat']
ds
DATE: array(['2020-01-01T00:00:00.000000000', '2020-01-02T00:00:00.000000000', '2020-01-03T00:00:00.000000000', '2020-01-04T00:00:00.000000000', '2020-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
STORE: array([0, 1, 2, 3])
SKU: array([0, 1, 2])
member: array([1, 2, 3, 4, 5, 6])
y: array([[[6, 9, 2], [6, 8, 8], [2, 3, 8], [8, 2, 7]], [[2, 4, 9], [3, 4, 6], [7, 2, 2], [2, 9, 1]], [[6, 5, 8], [3, 1, 9], [6, 1, 9], [3, 9, 8]], [[4, 3, 7], [1, 1, 3], [8, 5, 7], [3, 7, 3]], [[1, 5, 6], [9, 2, 9], [7, 3, 1], [9, 3, 2]]])
yhat: array([[[[ 4, 3, 3, 5, 2, 9], [ 7, 16, 14, 14, 7, 15], [ 1, 1, 0, 1, 0, 3]], [[ 1, 9, 11, 6, 10, 3], [ 5, 4, 1, 14, 5, 14], [ 0, 4, 12, 4, 8, 0]], [[ 3, 2, 0, 3, 3, 2], [ 5, 1, 0, 0, 2, 4], [ 7, 4, 1, 7, 9, 1]], [[ 6, 3, 4, 9, 12, 12], [ 3, 2, 1, 3, 2, 1], [ 6, 6, 2, 7, 10, 1]]], [[[ 1, 0, 1, 3, 1, 0], [ 0, 7, 3, 6, 3, 2], [11, 12, 11, 11, 6, 9]], [[ 3, 0, 3, 4, 5, 1], [ 5, 4, 7, 0, 6, 0], [ 1, 8, 2, 1, 2, 3]], [[ 4, 1, 9, 5, 12, 11], [ 1, 0, 3, 2, 1, 0], [ 2, 0, 1, 2, 2, 3]], [[ 3, 0, 0, 3, 2, 0], [14, 13, 5, 9, 17, 11], [ 1, 1, 1, 1, 0, 1]]], [[[10, 4, 0, 1, 7, 3], [ 3, 7, 3, 2, 2, 2], [ 4, 8, 11, 5, 8, 4]], [[ 1, 0, 3, 0, 0, 4], [ 1, 0, 0, 1, 1, 0], [13, 6, 7, 6, 3, 4]], [[ 3, 8, 2, 2, 11, 9], [ 0, 0, 0, 1, 0, 1], [14, 5, 5, 9, 4, 12]], [[ 4, 2, 3, 4, 0, 5], [15, 9, 6, 6, 16, 2], [ 7, 13, 0, 9, 4, 12]]], [[[ 5, 0, 1, 7, 3, 6], [ 3, 1, 3, 0, 1, 4], [ 1, 8, 4, 9, 4, 0]], [[ 0, 0, 1, 0, 0, 1], [ 0, 0, 0, 0, 1, 1], [ 1, 1, 5, 1, 5, 3]], [[ 4, 4, 3, 1, 7, 2], [ 2, 0, 2, 5, 5, 5], [ 0, 12, 12, 2, 6, 6]], [[ 3, 2, 0, 3, 5, 0], [ 4, 0, 4, 5, 1, 6], [ 1, 3, 0, 1, 4, 2]]], [[[ 0, 0, 1, 0, 0, 1], [ 6, 5, 2, 4, 2, 5], [ 3, 10, 5, 8, 8, 7]], [[ 5, 13, 16, 16, 2, 3], [ 0, 0, 3, 0, 0, 2], [13, 1, 8, 2, 0, 8]], [[10, 8, 11, 5, 1, 0], [ 0, 3, 2, 2, 4, 0], [ 0, 1, 1, 1, 0, 1]], [[ 8, 15, 14, 4, 5, 7], [ 2, 5, 3, 1, 2, 3], [ 1, 1, 3, 2, 2, 0]]]])
Notice how an xarray.Dataset can handle data variables which have different shapes but share some dimensions.
The Continuous Ranked Probability Score (CRPS) can be thought of as the probabilistic analogue of the Mean Absolute Error. It compares the empirical distribution of an ensemble forecast to a scalar observation and is given as:

$$\mathrm{CRPS}(F, o) = \int_{-\infty}^{\infty} \left[ F(f) - H(f - o) \right]^2 \, df$$

where F(f) is the cumulative distribution function (CDF) of the forecast and H is the Heaviside step function, whose value is 1 if the argument is positive (the prediction overestimates the target) and 0 otherwise (the prediction is equal to or lower than the target).
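For a finite ensemble the integral above has a well-known closed form, CRPS = E|X − o| − ½ E|X − X′|, where X and X′ are independent draws from the ensemble. A minimal NumPy sketch (the function name `crps_empirical` is ours, not part of xskillscore), checked against the first (DATE, STORE, SKU) cell of the data above (y = 6, ensemble [4, 3, 3, 5, 2, 9]):

```python
import numpy as np

def crps_empirical(obs, ensemble):
    """CRPS of an ensemble forecast against a scalar observation.

    Uses the identity CRPS = E|X - obs| - 0.5 * E|X - X'|,
    where X, X' are independent draws from the empirical distribution.
    """
    ens = np.asarray(ensemble, dtype=float)
    term1 = np.mean(np.abs(ens - obs))
    term2 = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))
    return term1 - term2

# First cell of the example data: y = 6, six ensemble members
print(crps_empirical(6, [4, 3, 3, 5, 2, 9]))  # -> 1.5
```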
See https://climpred.readthedocs.io/en/stable/metrics.html#continuous-ranked-probability-score-crps for further documentation.
It is not a common verification metric; in most cases the predictions are averaged and then verified using deterministic metrics.
For example, you can see that averaging over the member dimension gives a better prediction than any individual member.
Note: here we use the function itself instead of the accessor method:
avg_member_rmse = xs.rmse(
    ds["y"], ds["yhat"].mean(dim="member"), ["DATE", "STORE", "SKU"]
)
print("avg_member_rmse: ", avg_member_rmse.values)

for i in range(len(ds.coords["member"])):
    print(f"member {i + 1}:")
    ind_member_rmse = xs.rmse(
        ds["y"], ds["yhat"].sel(member=i + 1), ["DATE", "STORE", "SKU"]
    )
    print("ind_member_rmse: ", ind_member_rmse.values)
    assert avg_member_rmse < ind_member_rmse
avg_member_rmse:  1.5715939066461817
member 1:
ind_member_rmse:  2.972092416687835
member 2:
ind_member_rmse:  3.1754264805429417
member 3:
ind_member_rmse:  3.3216461782274966
member 4:
ind_member_rmse:  2.851899951494325
member 5:
ind_member_rmse:  3.286335345030997
member 6:
ind_member_rmse:  3.361547262794322
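This improvement is not specific to the dataset: if each member's error behaves like roughly independent noise, averaging m members shrinks the noise standard deviation by about 1/√m. A self-contained sketch with synthetic data (not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0, 10, size=5000)

# Six "members": the truth plus independent noise of standard deviation 2
members = truth + rng.normal(0.0, 2.0, size=(6, truth.size))

def rmse(pred, obs):
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

individual = [rmse(m, truth) for m in members]
averaged = rmse(members.mean(axis=0), truth)

print(f"best individual member RMSE: {min(individual):.2f}")  # ~2.0
print(f"ensemble-mean RMSE:          {averaged:.2f}")          # ~2/sqrt(6) ~ 0.82
```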
However, you will see CRPS appear in some Kaggle competitions, such as the NFL Big Data Bowl and the Second Annual Data Science Bowl, so it's good to have in your arsenal.
The CRPS is only valid over the member dimension and therefore only takes two arguments:
ds.xs.crps_ensemble('y', 'yhat')
array([[[1.5 , 2.58333333, 0.83333333], [1.27777778, 2.19444444, 2.33333333], [0.30555556, 0.94444444, 1.80555556], [1.33333333, 0.22222222, 1. ]], [[0.83333333, 0.86111111, 1. ], [0.38888889, 0.83333333, 2.69444444], [1.44444444, 0.58333333, 0.16666667], [0.61111111, 1.69444444, 0.02777778]], [[1.58333333, 1.69444444, 0.94444444], [1.16666667, 0.25 , 2.19444444], [1.52777778, 0.44444444, 1.41666667], [0.44444444, 1.55555556, 1.30555556]], [[0.88888889, 0.55555556, 1.83333333], [0.44444444, 0.44444444, 0.72222222], [3.47222222, 0.80555556, 1.5 ], [0.52777778, 2.5 , 0.75 ]], [[0.44444444, 0.5 , 0.91666667], [2.58333333, 0.91666667, 2.44444444], [1.47222222, 0.69444444, 0.11111111], [1.52777778, 0.33333333, 0.30555556]]])
DATE: array(['2020-01-01T00:00:00.000000000', '2020-01-02T00:00:00.000000000', '2020-01-03T00:00:00.000000000', '2020-01-04T00:00:00.000000000', '2020-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
STORE: array([0, 1, 2, 3])
SKU: array([0, 1, 2])
To return an overall CRPS it is recommended to average over all dimensions before using crps:
y = ds['y'].mean(dim=['DATE', 'STORE', 'SKU'])
yhat = ds['yhat'].mean(dim=['DATE', 'STORE', 'SKU'])
xs.crps_ensemble(y, yhat)
array(0.80694444)