This notebook shows how to use probabilistic metrics in a typical data science task where the data is a pandas.DataFrame.
The metric Continuous Ranked Probability Score (CRPS) is used to verify multiple forecasts for the same target.
import xarray as xr
import pandas as pd
import numpy as np
import xskillscore as xs
Use the same data as in 01_Deterministic.ipynb
stores = np.arange(4)
skus = np.arange(3)
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "QUANTITY_SOLD": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)
df.head()
DATE | STORE | SKU | y
---|---|---|---
2020-01-01 | 0 | 0 | 6
2020-01-01 | 0 | 1 | 9
2020-01-01 | 0 | 2 | 2
2020-01-01 | 1 | 0 | 6
2020-01-01 | 1 | 1 | 8
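The nested loops above enumerate the Cartesian product of dates, stores and SKUs. As a side note, `pandas.MultiIndex.from_product` can build the same index directly; a minimal sketch of that equivalent construction:

```python
import numpy as np
import pandas as pd

stores = np.arange(4)
skus = np.arange(3)
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")

# Cartesian product of the three levels, named like the loop-built index
idx = pd.MultiIndex.from_product(
    [dates, stores, skus], names=["DATE", "STORE", "SKU"]
)
df = pd.DataFrame({"y": np.random.randint(1, 10, size=len(idx))}, index=idx)
print(df.shape)  # (60, 1): 5 dates x 4 stores x 3 SKUs
```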
Instead of making a single prediction as in 01_Deterministic.ipynb we will make multiple forecasts (an ensemble forecast). This is akin to more complex methods such as bagging, boosting and stacking.
Make 6 forecasts and append them to the pandas.DataFrame using an extra field called member. This will be saved in a new pandas.DataFrame called df_yhat:
tmp = df.copy()
forecasts = []
for i in range(1, 7):
    tmp['member'] = i
    noise = np.random.uniform(-1, 1, size=len(df['y']))
    tmp['yhat'] = (df['y'] + (df['y'] * noise)).astype(int)
    forecasts.append(tmp.copy())
# DataFrame.append was removed in pandas 2.0; collect and concatenate instead
df_yhat = pd.concat(forecasts)
df_yhat
DATE | STORE | SKU | y | member | yhat
---|---|---|---|---|---
2020-01-01 | 0 | 0 | 6 | 1 | 4
2020-01-01 | 0 | 1 | 9 | 1 | 7
2020-01-01 | 0 | 2 | 2 | 1 | 1
2020-01-01 | 1 | 0 | 6 | 1 | 1
2020-01-01 | 1 | 1 | 8 | 1 | 5
... | ... | ... | ... | ... | ...
2020-01-05 | 2 | 1 | 3 | 6 | 0
2020-01-05 | 2 | 2 | 1 | 6 | 1
2020-01-05 | 3 | 0 | 9 | 6 | 7
2020-01-05 | 3 | 1 | 3 | 6 | 3
2020-01-05 | 3 | 2 | 2 | 6 | 0
360 rows × 3 columns
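As an aside, `pd.concat` can also create the member level directly from dictionary keys, which avoids the separate drop/set_index step. A sketch on a smaller hypothetical frame (names `forecasts`, `df_yhat` chosen for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["STORE", "SKU"])
df = pd.DataFrame({"y": np.random.randint(1, 10, size=len(idx))}, index=idx)

rng = np.random.default_rng(0)

# One perturbed forecast per member; the dict keys become the `member` level
forecasts = {
    m: (df["y"] * (1 + rng.uniform(-1, 1, len(df)))).astype(int).rename("yhat")
    for m in range(1, 7)
}
df_yhat = pd.concat(forecasts, names=["member"]).to_frame()
print(list(df_yhat.index.names))  # ['member', 'STORE', 'SKU']
```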
Drop the y column from df_yhat and add member to the MultiIndex:
df_yhat.drop('y', axis=1, inplace=True)
df_yhat.set_index(['member'], append=True, inplace=True)
df_yhat
DATE | STORE | SKU | member | yhat
---|---|---|---|---
2020-01-01 | 0 | 0 | 1 | 4
2020-01-01 | 0 | 1 | 1 | 7
2020-01-01 | 0 | 2 | 1 | 1
2020-01-01 | 1 | 0 | 1 | 1
2020-01-01 | 1 | 1 | 1 | 5
... | ... | ... | ... | ...
2020-01-05 | 2 | 1 | 6 | 0
2020-01-05 | 2 | 2 | 6 | 1
2020-01-05 | 3 | 0 | 6 | 7
2020-01-05 | 3 | 1 | 6 | 3
2020-01-05 | 3 | 2 | 6 | 0
360 rows × 1 columns
Convert the target pandas.DataFrame (df) to an xarray.Dataset:
ds = df.to_xarray()
ds
DATE: array(['2020-01-01T00:00:00.000000000', '2020-01-02T00:00:00.000000000', '2020-01-03T00:00:00.000000000', '2020-01-04T00:00:00.000000000', '2020-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
STORE: array([0, 1, 2, 3])
SKU: array([0, 1, 2])
y: array([[[6, 9, 2], [6, 8, 8], [2, 3, 8], [8, 2, 7]], [[2, 4, 9], [3, 4, 6], [7, 2, 2], [2, 9, 1]], [[6, 5, 8], [3, 1, 9], [6, 1, 9], [3, 9, 8]], [[4, 3, 7], [1, 1, 3], [8, 5, 7], [3, 7, 3]], [[1, 5, 6], [9, 2, 9], [7, 3, 1], [9, 3, 2]]])
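to_xarray turns each MultiIndex level into a named dimension, so the 60-row column becomes a 5 × 4 × 3 cube. A toy sketch of the same conversion (smaller sizes; names df_small/ds_small are hypothetical):

```python
import numpy as np
import pandas as pd

# A tiny MultiIndex frame: 2 dates x 2 stores x 3 SKUs
idx = pd.MultiIndex.from_product(
    [pd.date_range("2020-01-01", periods=2), [0, 1], [0, 1, 2]],
    names=["DATE", "STORE", "SKU"],
)
df_small = pd.DataFrame({"y": np.arange(12)}, index=idx)

# Each index level becomes a dimension of the resulting Dataset
ds_small = df_small.to_xarray()
print(ds_small["y"].dims, ds_small["y"].shape)  # ('DATE', 'STORE', 'SKU') (2, 2, 3)
```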
Now add the predicted pandas.DataFrame (df_yhat) as an xarray.DataArray called yhat to the xarray.Dataset:
ds['yhat'] = df_yhat.to_xarray()['yhat']
ds
DATE: array(['2020-01-01T00:00:00.000000000', '2020-01-02T00:00:00.000000000', '2020-01-03T00:00:00.000000000', '2020-01-04T00:00:00.000000000', '2020-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
STORE: array([0, 1, 2, 3])
SKU: array([0, 1, 2])
member: array([1, 2, 3, 4, 5, 6])
y: array([[[6, 9, 2], [6, 8, 8], [2, 3, 8], [8, 2, 7]], [[2, 4, 9], [3, 4, 6], [7, 2, 2], [2, 9, 1]], [[6, 5, 8], [3, 1, 9], [6, 1, 9], [3, 9, 8]], [[4, 3, 7], [1, 1, 3], [8, 5, 7], [3, 7, 3]], [[1, 5, 6], [9, 2, 9], [7, 3, 1], [9, 3, 2]]])
yhat: array([[[[ 4, 3, 3, 5, 2, 9], [ 7, 16, 14, 14, 7, 15], [ 1, 1, 0, 1, 0, 3]], [[ 1, 9, 11, 6, 10, 3], [ 5, 4, 1, 14, 5, 14], [ 0, 4, 12, 4, 8, 0]], [[ 3, 2, 0, 3, 3, 2], [ 5, 1, 0, 0, 2, 4], [ 7, 4, 1, 7, 9, 1]], [[ 6, 3, 4, 9, 12, 12], [ 3, 2, 1, 3, 2, 1], [ 6, 6, 2, 7, 10, 1]]], [[[ 1, 0, 1, 3, 1, 0], [ 0, 7, 3, 6, 3, 2], [11, 12, 11, 11, 6, 9]], [[ 3, 0, 3, 4, 5, 1], [ 5, 4, 7, 0, 6, 0], [ 1, 8, 2, 1, 2, 3]], [[ 4, 1, 9, 5, 12, 11], [ 1, 0, 3, 2, 1, 0], [ 2, 0, 1, 2, 2, 3]], [[ 3, 0, 0, 3, 2, 0], [14, 13, 5, 9, 17, 11], [ 1, 1, 1, 1, 0, 1]]], [[[10, 4, 0, 1, 7, 3], [ 3, 7, 3, 2, 2, 2], [ 4, 8, 11, 5, 8, 4]], [[ 1, 0, 3, 0, 0, 4], [ 1, 0, 0, 1, 1, 0], [13, 6, 7, 6, 3, 4]], [[ 3, 8, 2, 2, 11, 9], [ 0, 0, 0, 1, 0, 1], [14, 5, 5, 9, 4, 12]], [[ 4, 2, 3, 4, 0, 5], [15, 9, 6, 6, 16, 2], [ 7, 13, 0, 9, 4, 12]]], [[[ 5, 0, 1, 7, 3, 6], [ 3, 1, 3, 0, 1, 4], [ 1, 8, 4, 9, 4, 0]], [[ 0, 0, 1, 0, 0, 1], [ 0, 0, 0, 0, 1, 1], [ 1, 1, 5, 1, 5, 3]], [[ 4, 4, 3, 1, 7, 2], [ 2, 0, 2, 5, 5, 5], [ 0, 12, 12, 2, 6, 6]], [[ 3, 2, 0, 3, 5, 0], [ 4, 0, 4, 5, 1, 6], [ 1, 3, 0, 1, 4, 2]]], [[[ 0, 0, 1, 0, 0, 1], [ 6, 5, 2, 4, 2, 5], [ 3, 10, 5, 8, 8, 7]], [[ 5, 13, 16, 16, 2, 3], [ 0, 0, 3, 0, 0, 2], [13, 1, 8, 2, 0, 8]], [[10, 8, 11, 5, 1, 0], [ 0, 3, 2, 2, 4, 0], [ 0, 1, 1, 1, 0, 1]], [[ 8, 15, 14, 4, 5, 7], [ 2, 5, 3, 1, 2, 3], [ 1, 1, 3, 2, 2, 0]]]])
Notice how an xarray.Dataset can handle data variables which have different shapes but share some dimensions.
The Continuous Ranked Probability Score (CRPS) can be thought of as the probabilistic analogue of the Mean Absolute Error. It compares the empirical distribution of an ensemble forecast to a scalar observation and is given as:

$$\mathrm{CRPS}(F, o) = \int_{-\infty}^{\infty} \left[ F(f) - H(f - o) \right]^2 \, df$$

where F(f) is the cumulative distribution function (CDF) of the forecast and H is the Heaviside step function, whose value is 1 if the argument is positive (the prediction overestimates the target) and 0 otherwise (the prediction is equal to or lower than the target).
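For a finite ensemble the integral above has a well-known closed form, CRPS = E|X − o| − ½ E|X − X′|, where X and X′ are independent draws from the ensemble. A minimal NumPy sketch (the function name `crps_empirical` is ours, not part of xskillscore), checked against the first (DATE, STORE, SKU) cell of the data above (y = 6, ensemble [4, 3, 3, 5, 2, 9]):

```python
import numpy as np

def crps_empirical(obs, ensemble):
    """CRPS of an ensemble forecast against a scalar observation.

    Uses the identity CRPS = E|X - obs| - 0.5 * E|X - X'|,
    where X, X' are independent draws from the empirical distribution.
    """
    ens = np.asarray(ensemble, dtype=float)
    term1 = np.mean(np.abs(ens - obs))
    term2 = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))
    return term1 - term2

# First cell of the example data: y = 6, six ensemble members
print(crps_empirical(6, [4, 3, 3, 5, 2, 9]))  # -> 1.5
```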
See https://climpred.readthedocs.io/en/stable/metrics.html#continuous-ranked-probability-score-crps for further documentation.
It is not a common verification metric; in most cases the predictions are averaged and then verified using deterministic metrics.
For example, you can see that averaging over the member dimension gives a better prediction than any individual member.
Note: here we use the function itself instead of the accessor method:
avg_member_rmse = xs.rmse(
    ds["y"], ds["yhat"].mean(dim="member"), ["DATE", "STORE", "SKU"]
)
print("avg_member_rmse: ", avg_member_rmse.values)

for i in range(len(ds.coords["member"])):
    print(f"member {i + 1}:")
    ind_member_rmse = xs.rmse(
        ds["y"], ds["yhat"].sel(member=i + 1), ["DATE", "STORE", "SKU"]
    )
    print("ind_member_rmse: ", ind_member_rmse.values)
    assert avg_member_rmse < ind_member_rmse
avg_member_rmse:  1.5715939066461817
member 1:
ind_member_rmse:  2.972092416687835
member 2:
ind_member_rmse:  3.1754264805429417
member 3:
ind_member_rmse:  3.3216461782274966
member 4:
ind_member_rmse:  2.851899951494325
member 5:
ind_member_rmse:  3.286335345030997
member 6:
ind_member_rmse:  3.361547262794322
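This improvement is not specific to the dataset: if each member's error behaves like roughly independent noise, averaging m members shrinks the noise standard deviation by about 1/√m. A self-contained sketch with synthetic data (not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.uniform(0, 10, size=5000)

# Six "members": the truth plus independent noise of standard deviation 2
members = truth + rng.normal(0.0, 2.0, size=(6, truth.size))

def rmse(pred, obs):
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

individual = [rmse(m, truth) for m in members]
averaged = rmse(members.mean(axis=0), truth)

print(f"best individual member RMSE: {min(individual):.2f}")  # ~2.0
print(f"ensemble-mean RMSE:          {averaged:.2f}")          # ~2/sqrt(6) ~ 0.82
```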
However, you will see CRPS appear in some Kaggle competitions, such as the NFL Big Data Bowl and the Second Annual Data Science Bowl, so it's good to have in your arsenal.
The CRPS is only valid over the member dimension and therefore only takes two arguments:
ds.xs.crps_ensemble('y', 'yhat')
array([[[1.5 , 2.58333333, 0.83333333], [1.27777778, 2.19444444, 2.33333333], [0.30555556, 0.94444444, 1.80555556], [1.33333333, 0.22222222, 1. ]], [[0.83333333, 0.86111111, 1. ], [0.38888889, 0.83333333, 2.69444444], [1.44444444, 0.58333333, 0.16666667], [0.61111111, 1.69444444, 0.02777778]], [[1.58333333, 1.69444444, 0.94444444], [1.16666667, 0.25 , 2.19444444], [1.52777778, 0.44444444, 1.41666667], [0.44444444, 1.55555556, 1.30555556]], [[0.88888889, 0.55555556, 1.83333333], [0.44444444, 0.44444444, 0.72222222], [3.47222222, 0.80555556, 1.5 ], [0.52777778, 2.5 , 0.75 ]], [[0.44444444, 0.5 , 0.91666667], [2.58333333, 0.91666667, 2.44444444], [1.47222222, 0.69444444, 0.11111111], [1.52777778, 0.33333333, 0.30555556]]])
DATE: array(['2020-01-01T00:00:00.000000000', '2020-01-02T00:00:00.000000000', '2020-01-03T00:00:00.000000000', '2020-01-04T00:00:00.000000000', '2020-01-05T00:00:00.000000000'], dtype='datetime64[ns]')
STORE: array([0, 1, 2, 3])
SKU: array([0, 1, 2])
To return an overall CRPS it is recommended to average over all dimensions before using crps:
y = ds['y'].mean(dim=['DATE', 'STORE', 'SKU'])
yhat = ds['yhat'].mean(dim=['DATE', 'STORE', 'SKU'])
xs.crps_ensemble(y, yhat)
array(0.80694444)