This notebook shows how to use probabilistic metrics in a typical data science task where the data is a pandas.DataFrame.
The metric Continuous Ranked Probability Score (CRPS) is used to verify multiple forecasts for the same target.
import xarray as xr
import pandas as pd
import numpy as np
import xskillscore as xs
Use the same data as in 01_Deterministic.ipynb:
stores = np.arange(4)
skus = np.arange(3)
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")

rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "QUANTITY_SOLD": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)
df.head()
| DATE | STORE | SKU | y |
|---|---|---|---|
| 2020-01-01 | 0 | 0 | 9 |
| | | 1 | 2 |
| | | 2 | 4 |
| | 1 | 0 | 1 |
| | | 1 | 5 |
Instead of making a single prediction as in 01_Deterministic.ipynb, we will make multiple forecasts (an ensemble forecast). This is akin to more complex methods such as bagging, boosting and stacking.
Do 6 forecasts and append them to the pandas.DataFrame using an extra field called member. This will be saved in a new pandas.DataFrame called df_yhat:
members = []
for i in range(1, 7):
    tmp = df.copy()
    tmp['member'] = i
    # Multiplicative noise in [-1, 1) perturbs the true quantity for each member
    noise = np.random.uniform(-1, 1, size=len(df['y']))
    tmp['yhat'] = (df['y'] + (df['y'] * noise)).astype(int)
    members.append(tmp)
# DataFrame.append was removed in pandas 2.0, so concatenate the members instead
df_yhat = pd.concat(members)
df_yhat
| DATE | STORE | SKU | y | member | yhat |
|---|---|---|---|---|---|
| 2020-01-01 | 0 | 0 | 9 | 1 | 10 |
| | | 1 | 2 | 1 | 0 |
| | | 2 | 4 | 1 | 2 |
| | 1 | 0 | 1 | 1 | 0 |
| | | 1 | 5 | 1 | 5 |
| ... | ... | ... | ... | ... | ... |
| 2020-01-05 | 2 | 1 | 8 | 6 | 7 |
| | | 2 | 2 | 6 | 0 |
| | 3 | 0 | 8 | 6 | 8 |
| | | 1 | 3 | 6 | 1 |
| | | 2 | 5 | 6 | 9 |
360 rows × 3 columns
Drop the y column from df_yhat and add member to the MultiIndex:
df_yhat.drop('y', axis=1, inplace=True)
df_yhat.set_index(['member'], append=True, inplace=True)
df_yhat
| DATE | STORE | SKU | member | yhat |
|---|---|---|---|---|
| 2020-01-01 | 0 | 0 | 1 | 10 |
| | | 1 | 1 | 0 |
| | | 2 | 1 | 2 |
| | 1 | 0 | 1 | 0 |
| | | 1 | 1 | 5 |
| ... | ... | ... | ... | ... |
| 2020-01-05 | 2 | 1 | 6 | 7 |
| | | 2 | 6 | 0 |
| | 3 | 0 | 6 | 8 |
| | | 1 | 6 | 1 |
| | | 2 | 6 | 9 |
360 rows × 1 columns
Convert the target pandas.DataFrame (df) to an xarray.Dataset:
ds = df.to_xarray()
ds
<xarray.Dataset>
Dimensions:  (DATE: 5, STORE: 4, SKU: 3)
Coordinates:
  * DATE     (DATE) datetime64[ns] 2020-01-01 ... 2020-01-05
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
Data variables:
    y        (DATE, STORE, SKU) int64 9 2 4 1 5 8 6 3 1 ... 7 3 4 8 8 2 8 3 5
Now add the predicted pandas.DataFrame (df_yhat) as an xarray.DataArray called yhat to the xarray.Dataset:
ds['yhat'] = df_yhat.to_xarray()['yhat']
ds
<xarray.Dataset>
Dimensions:  (DATE: 5, STORE: 4, SKU: 3, member: 6)
Coordinates:
  * DATE     (DATE) datetime64[ns] 2020-01-01 ... 2020-01-05
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
  * member   (member) int64 1 2 3 4 5 6
Data variables:
    y        (DATE, STORE, SKU) int64 9 2 4 1 5 8 6 3 1 ... 7 3 4 8 8 2 8 3 5
    yhat     (DATE, STORE, SKU, member) int64 10 10 1 10 16 11 ... 5 4 5 8 1 9
Notice how an xarray.Dataset can hold data variables that have different shapes but share some dimensions.
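For instance, checking the dimensions of each variable (a quick sketch, not in the original notebook):

```python
# y is defined on (DATE, STORE, SKU); yhat carries the extra member dimension
print(ds["y"].dims)     # ('DATE', 'STORE', 'SKU')
print(ds["yhat"].dims)  # ('DATE', 'STORE', 'SKU', 'member')
```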
Continuous Ranked Probability Score (CRPS) can also be considered as the probabilistic Mean Absolute Error. It compares the empirical distribution of an ensemble forecast to a scalar observation and is given as:

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \left[ F(f) - H(f - y) \right]^2 \, df$$

where $F(f)$ is the cumulative distribution function (CDF) of the forecast and $H$ is the Heaviside step function, which is 1 if its argument is positive (the prediction overestimates the target) and 0 otherwise (the prediction is equal to or lower than the target).
See https://climpred.readthedocs.io/en/stable/metrics.html#continuous-ranked-probability-score-crps for further documentation.
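To make the formula concrete, here is a small worked example (not from the original notebook; the observation and ensemble values are invented). For a finite ensemble, the integral reduces to the closed form CRPS = mean(|x_i - y|) - 0.5 * mean(|x_i - x_j|), which xs.crps_ensemble should reproduce:

```python
obs = xr.DataArray(3.0)
ens = xr.DataArray(np.array([1.0, 2.0, 4.0, 5.0]), dims="member")

# Mean absolute error of the members against the observation
term1 = np.abs(ens.values - obs.values).mean()
# Half the mean absolute difference between all member pairs
term2 = 0.5 * np.abs(ens.values[:, None] - ens.values[None, :]).mean()

print(term1 - term2)                      # 0.625
print(xs.crps_ensemble(obs, ens).values)  # should agree with the closed form
```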
It is not a common verification metric; in most cases the predictions are averaged and then verified using deterministic metrics. For example, you can see that averaging over the member dimension gives a better prediction than any individual member.

Note: for this we will use the function itself instead of the accessor method:
avg_member_rmse = xs.rmse(
    ds["y"], ds["yhat"].mean(dim="member"), ["DATE", "STORE", "SKU"]
)
print("avg_member_rmse: ", avg_member_rmse.values)

for i in range(len(ds.coords["member"])):
    print(f"member {i + 1}:")
    ind_member_rmse = xs.rmse(
        ds["y"], ds["yhat"].sel(member=i + 1), ["DATE", "STORE", "SKU"]
    )
    print("ind_member_rmse: ", ind_member_rmse.values)
    assert avg_member_rmse < ind_member_rmse
avg_member_rmse:  1.5302505241517372
member 1:
ind_member_rmse:  2.958039891549808
member 2:
ind_member_rmse:  2.717228980659034
member 3:
ind_member_rmse:  3.361547262794322
member 4:
ind_member_rmse:  2.869378562220979
member 5:
ind_member_rmse:  3.082207001484488
member 6:
ind_member_rmse:  3.119829055146024
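For reference, the equivalent accessor call would first require storing the ensemble mean as a variable in the Dataset; a sketch, assuming ds.xs.rmse mirrors the xs.rmse call above (the variable name yhat_mean is ours):

```python
# Hypothetical accessor form: the accessor only sees variables inside the Dataset
ds["yhat_mean"] = ds["yhat"].mean(dim="member")
ds.xs.rmse("y", "yhat_mean", ["DATE", "STORE", "SKU"])
```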
However, CRPS does appear in some Kaggle competitions, such as the NFL Big Data Bowl and the Second Annual Data Science Bowl, so it is good to have in your arsenal.
The CRPS is only valid over the member dimension and therefore takes only 2 arguments:
ds.xs.crps_ensemble('y', 'yhat')
<xarray.DataArray (DATE: 5, STORE: 4, SKU: 3)>
array([[[1.16666667, 0.72222222, 1.19444444],
        [0.69444444, 1.25      , 2.75      ],
        [1.44444444, 0.75      , 0.25      ],
        [2.22222222, 0.86111111, 1.13888889]],
       ...,
       [[0.36111111, 1.66666667, 1.52777778],
        [1.08333333, 0.36111111, 0.80555556],
        [2.16666667, 1.55555556, 0.30555556],
        [1.94444444, 0.86111111, 0.55555556]]])
Coordinates:
  * DATE     (DATE) datetime64[ns] 2020-01-01 ... 2020-01-05
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
To return an overall CRPS it is recommended to average over all dimensions before applying crps_ensemble:
y = ds['y'].mean(dim=['DATE', 'STORE', 'SKU'])
yhat = ds['yhat'].mean(dim=['DATE', 'STORE', 'SKU'])
xs.crps_ensemble(y, yhat)
array(0.56481481)
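Note that this averages the data before scoring. An alternative summary, sketched here and not shown in the original notebook, is to compute the per-sample CRPS first and then average the resulting field; the two orders of operation generally give different numbers:

```python
# Average the per-sample CRPS field rather than averaging the inputs first
ds.xs.crps_ensemble("y", "yhat").mean().values
```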