In this notebook I show how `xskillscore` can be dropped into a typical data science task where the data is a `pandas.DataFrame`. I use the metric RMSE to verify forecasts of items sold. I also show how you can apply weights to the verification and how to handle missing values.
Import the necessary packages:
import xarray as xr
import pandas as pd
import numpy as np
import xskillscore as xs
Let's say you are a data scientist who works for a company that owns four stores, each selling three items (stock keeping units, or SKUs). Set up the `stores` and `skus` arrays:
stores = np.arange(4)
skus = np.arange(3)
and you are tracking the daily performance of items sold between Jan 1st and Jan 5th, 2020. Set up the `dates` array:
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")
Generate a `pandas.DataFrame` showing the number of items sold during this period. The number of items sold will be a random integer between 1 and 9. This may be something you would obtain from querying a database:
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "QUANTITY_SOLD": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)
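For reference, the same frame can be built more compactly with pandas' `MultiIndex.from_product` (a sketch equivalent to the loop above; `df_alt` is a hypothetical name):

idx = pd.MultiIndex.from_product([dates, stores, skus], names=["DATE", "STORE", "SKU"])
# np.random.randint(1, 10) draws integers 1 through 9, matching np.random.randint(9) + 1 above
df_alt = pd.DataFrame(
    {"QUANTITY_SOLD": np.random.randint(1, 10, size=len(idx))}, index=idx
).reset_index()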
Print the first 5 rows of the `pandas.DataFrame`:
df.head()
| | DATE | STORE | SKU | QUANTITY_SOLD |
|---|---|---|---|---|
| 0 | 2020-01-01 | 0 | 0 | 9 |
| 1 | 2020-01-01 | 0 | 1 | 2 |
| 2 | 2020-01-01 | 0 | 2 | 2 |
| 3 | 2020-01-01 | 1 | 0 | 3 |
| 4 | 2020-01-01 | 1 | 1 | 1 |
Your boss has asked you to use this data to predict the number of items sold at each store and SKU level for the next 5 days. The prediction itself is outside the scope of this tutorial, but we will use `xskillscore` to tell us how good our prediction may be. First, rename the target variable to `y`:
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.head()
| | DATE | STORE | SKU | y |
|---|---|---|---|---|
| 0 | 2020-01-01 | 0 | 0 | 9 |
| 1 | 2020-01-01 | 0 | 1 | 2 |
| 2 | 2020-01-01 | 0 | 2 | 2 |
| 3 | 2020-01-01 | 1 | 0 | 3 |
| 4 | 2020-01-01 | 1 | 1 | 1 |
Use a pandas `MultiIndex` to help handle the granularity of the forecast:
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)
This also displays the data in a cleaner format in the notebook:
df.head()
| DATE | STORE | SKU | y |
|---|---|---|---|
| 2020-01-01 | 0 | 0 | 9 |
| | | 1 | 2 |
| | | 2 | 2 |
| | 1 | 0 | 3 |
| | | 1 | 1 |
Time for your prediction! As mentioned, the prediction itself is outside the scope of this tutorial. In our case we are going to generate data that mimics a prediction by taking `y` and perturbing it randomly. This provides a middle ground between a prediction that overfits the data (being very similar to `y`) and the other extreme of pure random numbers, for which the skill would be 0.
The perturbations scale `y` between -100% and 100% using a uniform distribution. For example, a value of 5 in `y` will fall between 0 and 10 in the prediction (`yhat`).
Set up the perturbation array:
noise = np.random.uniform(-1, 1, size=len(df['y']))
Name the prediction `yhat` and append it to the `pandas.DataFrame`. Lastly, convert it to an `int` to match the format of the target (`y`):
df['yhat'] = (df['y'] + (df['y'] * noise)).astype(int)
df.head()
| DATE | STORE | SKU | y | yhat |
|---|---|---|---|---|
| 2020-01-01 | 0 | 0 | 9 | 13 |
| | | 1 | 2 | 3 |
| | | 2 | 2 | 2 |
| | 1 | 0 | 3 | 4 |
| | | 1 | 1 | 0 |
RMSE (root-mean-square error) is the square root of the average of the squared differences between forecasts and verification data:
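$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$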
Because the error is squared, RMSE is sensitive to outliers and is a more conservative metric than mean absolute error.
See https://climpred.readthedocs.io/en/stable/metrics.html#root-mean-square-error-rmse for further documentation.
Most data scientists are familiar with using `scikit-learn` for verifying forecasts, especially if `scikit-learn` was used for the prediction. To obtain RMSE from `scikit-learn`, import `mean_squared_error` and specify `squared=False`:
from sklearn.metrics import mean_squared_error
mean_squared_error(df['y'], df['yhat'], squared=False)
2.932575659723036
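To connect this number with the formula above, the same value falls out of a direct numpy computation (a quick sketch using the `df` from above):

np.sqrt(((df['y'] - df['yhat']) ** 2).mean())  # squared differences, averaged, then rooted; matches the scikit-learn result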
While `scikit-learn` is simple, it doesn't offer the flexibility that `xskillscore` provides.
Note: `xskillscore` does use some of the same metrics as `scikit-learn`, such as `r2_score`, which is called `r2` in `xskillscore`.
To use `xskillscore` you first have to put your data into an `xarray` object. Because `xarray` is part of the PyData stack, it integrates well with other Python data science packages.
`pandas` has a convenient `to_xarray` method which makes going from `pandas` to `xarray` seamless. Use `to_xarray` to convert the `pandas.DataFrame` to an `xarray.Dataset`:
ds = df.to_xarray()
ds
<xarray.Dataset>
Dimensions:  (DATE: 5, STORE: 4, SKU: 3)
Coordinates:
  * DATE     (DATE) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-01-05
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
Data variables:
    y        (DATE, STORE, SKU) int64 9 2 2 3 1 6 4 7 6 ... 2 6 6 8 1 1 3
    yhat     (DATE, STORE, SKU) int64 13 3 2 4 0 7 7 7 9 ... 2 3 0 0 0 0
As seen above, `xarray` has a very nice HTML representation of `xarray.Dataset` objects. Click on the data symbol (the cylinder) to see the data associated with the Coordinates and the Data variables.
You now have one variable (`ds`) which houses the data and the associated metadata. You can also use the Attributes for handling things like units (this is why `xarray` was developed!).
If you would like to know more about `xarray`, check out this overview.
We can use `xskillscore` on this `xarray.Dataset` via `xarray`'s accessor method.
`xskillscore` expects at least 3 arguments for most functions: `y`, the target variable; `yhat`, the predicted variable; and the dimension(s) over which to apply the verification metric.
To replicate the `scikit-learn` metric above, apply RMSE over all the dimensions `[DATE, STORE, SKU]`. RMSE is called `rmse` in `xskillscore`:
rmse = ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU'])
rmse
array(2.93257566)
If you want just the data from the `xarray.DataArray`, you can call `.values` on it:
rmse.values
array(2.93257566)
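As noted earlier, `xskillscore` mirrors some `scikit-learn` metrics, such as `r2_score`. As a quick sanity check (a sketch using the `df` and `ds` from above), the two should agree:

from sklearn.metrics import r2_score

# scikit-learn's r2_score on the flat pandas columns
print(r2_score(df['y'], df['yhat']))

# xskillscore's r2 over all dimensions; should print the same value
print(ds.xs.r2('y', 'yhat', ['DATE', 'STORE', 'SKU']).values)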
`xskillscore` allows you to apply the metric over any combination of dimensions (think of `pandas.groupby().apply()` but faster).
For example, your boss has asked you how good your predictions are at the store level. In this case, apply the metric over the `DATE` and `SKU` dimensions:
rmse = ds.xs.rmse('y', 'yhat', ['DATE', 'SKU'])
rmse
<xarray.DataArray (STORE: 4)>
array([2.17562252, 2.87518115, 3.21455025, 3.32665999])
Coordinates:
  * STORE    (STORE) int64 0 1 2 3
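For comparison, the `pandas` equivalent of this per-store verification is a groupby-apply (a sketch using the `df` from above; it should give the same four numbers, just slower on large data):

# pandas equivalent of the per-store RMSE
per_store = df.reset_index().groupby("STORE").apply(
    lambda g: np.sqrt(((g["y"] - g["yhat"]) ** 2).mean())
)
print(per_store)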
We can use `xarray` a bit further to explore our results. Let's find out which store had the best forecast and which store had the worst forecast:
print('Our forecast performed well for store:')
print(rmse.where(rmse==rmse.min(), drop=True).coords)
print('')
print('Our forecast struggled with store:')
print(rmse.where(rmse==rmse.max(), drop=True).coords)
Our forecast performed well for store:
Coordinates:
  * STORE    (STORE) int64 0

Our forecast struggled with store:
Coordinates:
  * STORE    (STORE) int64 3
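In recent `xarray` versions, `idxmin` and `idxmax` give the same answers in one line each (a sketch; the availability of these methods depends on your `xarray` version):

# One-line equivalents of the where/min and where/max lookups above
print('Best store:', int(rmse.idxmin()))
print('Worst store:', int(rmse.idxmax()))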
You can specify weights when calculating skill metrics. Here I will go through an example demonstrating why you may want to apply weights when verifying your forecast.
Your boss has asked you to create a prediction for the next five days. You will update this prediction every day, and there is a larger focus on the performance of the model for the next day and less of a focus on the fifth day.
In this case you can weight your metric so that the performance on day 1 has a larger influence than day 5. Here we will apply a linear scaling from 1 to 0, with day 1 having a weight of 1 and day 5 having a weight of 0.
Generate weights the same size as the `DATE` dimension and put them into an `xarray.DataArray`:
dim = 'DATE'
np_weights = np.linspace(1, 0, num=len(ds[dim]))
weights = xr.DataArray(np_weights, dims=dim)
print(weights)
<xarray.DataArray (DATE: 5)>
array([1. , 0.75, 0.5 , 0.25, 0. ])
Dimensions without coordinates: DATE
Now simply pass the variable to the `weights` argument:
ds.xs.rmse('y', 'yhat', 'DATE', weights=weights)
<xarray.DataArray (STORE: 4, SKU: 3)>
array([[3.54964787, 1.        , 1.04880885],
       [2.64575131, 3.50713558, 0.89442719],
       [2.04939015, 2.82842712, 2.56904652],
       [4.01248053, 3.46410162, 3.67423461]])
Coordinates:
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
and you can compare this to the result without the weights:
ds.xs.rmse('y', 'yhat', 'DATE')
<xarray.DataArray (STORE: 4, SKU: 3)>
array([[3.46410162, 1.        , 1.09544512],
       [2.75680975, 4.04969135, 0.89442719],
       [2.44948974, 2.86356421, 4.09878031],
       [3.31662479, 2.82842712, 3.76828874]])
Coordinates:
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
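To see exactly what the weights do, you can reproduce a single entry of the weighted result by hand: the weighted RMSE is the square root of the weighted mean of the squared errors. A sketch for STORE=0, SKU=0, using the `ds` and `weights` from above:

# Weighted mean of squared errors, normalized by the sum of the weights
err2 = (ds['y'].sel(STORE=0, SKU=0) - ds['yhat'].sel(STORE=0, SKU=0)) ** 2
print(float(np.sqrt((err2 * weights).sum('DATE') / weights.sum('DATE'))))
# prints 3.54964787..., matching the [STORE=0, SKU=0] entry of the weighted result above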
There may be no purchases of certain items in certain stores on certain dates. These entries will be blank in the query from the database.
To mimic data like this, create the same type of data structure as before but randomly suppress rows. I have created a simple `if` statement that drops a row with a probability of 0.2 (20%):
random_number_threshold = 0.8
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            # Keep the row with a probability of 0.8; drop it otherwise
            if np.random.rand() < random_number_threshold:
                rows.append(
                    {
                        "DATE": date,
                        "STORE": store,
                        "SKU": sku,
                        "QUANTITY_SOLD": np.random.randint(9) + 1,
                    }
                )
df = pd.DataFrame(rows)
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)
df.head(10)
| DATE | STORE | SKU | y |
|---|---|---|---|
| 2020-01-01 | 0 | 0 | 7 |
| | | 2 | 1 |
| | 1 | 0 | 2 |
| | | 1 | 5 |
| | | 2 | 2 |
| | 2 | 0 | 6 |
| | | 2 | 4 |
| | 3 | 0 | 6 |
| | | 2 | 8 |
| 2020-01-02 | 0 | 0 | 3 |
Converting the `pandas.DataFrame` to an `xarray.Dataset` is very handy in this case because it will infer the missing entries as NaNs (as long as all index values are present in the `pandas.DataFrame`):
ds = df.to_xarray()
ds
<xarray.Dataset>
Dimensions:  (DATE: 5, STORE: 4, SKU: 3)
Coordinates:
  * DATE     (DATE) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-01-05
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
Data variables:
    y        (DATE, STORE, SKU) float64 7.0 nan 1.0 2.0 ... nan 3.0 3.0 8.0
Click on the data symbol associated with the `y` data variable to see the NaNs.
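To count the missing entries programmatically (a quick check on the `ds` above):

# Number of NaNs in the target variable
print(int(ds['y'].isnull().sum()), 'missing entries out of', ds['y'].size)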
You can also use this step in your workflow if you simply want to continue working with the `pandas.DataFrame`:
df_with_nans = ds.to_dataframe()
df_with_nans.head(10)
| DATE | SKU | STORE | y |
|---|---|---|---|
| 2020-01-01 | 0 | 0 | 7.0 |
| | | 1 | 2.0 |
| | | 2 | 6.0 |
| | | 3 | 6.0 |
| | 1 | 0 | NaN |
| | | 1 | 5.0 |
| | | 2 | NaN |
| | | 3 | NaN |
| | 2 | 0 | 1.0 |
| | | 1 | 2.0 |
Note: `xarray` returns the index levels in alphabetical order (`SKU` now comes before `STORE`), but the NaNs are preserved.
In most cases you will not know a priori whether there will be no purchases of a particular item in a certain store on a given day. Therefore, your prediction will not contain NaNs, though you would hope the predicted values are low where no purchases occur.
Append a prediction column as was done previously:
df_with_nans['yhat'] = df_with_nans['y'] + (df_with_nans['y'] * noise)
df_with_nans.head()
| DATE | SKU | STORE | y | yhat |
|---|---|---|---|---|
| 2020-01-01 | 0 | 0 | 7.0 | 10.446753 |
| | | 1 | 2.0 | 3.411661 |
| | | 2 | 6.0 | 7.584959 |
| | | 3 | 6.0 | 9.522168 |
| | 1 | 0 | NaN | NaN |
Our prediction contains NaNs, so to mimic a realistic prediction replace them with values:
# Fill the missing predictions with random integers between 1 and 9
mask = df_with_nans['yhat'].isna()
df_with_nans.loc[mask, 'yhat'] = np.random.randint(1, 10, size=mask.sum())
df_with_nans.head()
| DATE | SKU | STORE | y | yhat |
|---|---|---|---|---|
| 2020-01-01 | 0 | 0 | 7.0 | 10.446753 |
| | | 1 | 2.0 | 3.411661 |
| | | 2 | 6.0 | 7.584959 |
| | | 3 | 6.0 | 9.522168 |
| | 1 | 0 | NaN | 2.000000 |
Now if you try using `scikit-learn`:
mean_squared_error(df_with_nans['y'], df_with_nans['yhat'], squared=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-aaf7ba8b32ad> in <module>
----> 1 mean_squared_error(df_with_nans['y'], df_with_nans['yhat'], squared=False)
...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
you get a `ValueError: Input contains NaN`.
In `xskillscore` you don't need to worry about this; simply specify `skipna=True`:
ds = df_with_nans.to_xarray()
ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU'], skipna=True)
array(3.44001822)
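This matches what `scikit-learn` would give you if you dropped the missing rows first (a sketch using `df_with_nans` from above):

# Drop rows where y is missing, then verify with scikit-learn as before
valid = df_with_nans.dropna(subset=['y'])
mean_squared_error(valid['y'], valid['yhat'], squared=False)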
You can specify `weights` and `skipna` together for a powerful analysis:
ds.xs.rmse('y', 'yhat', 'DATE', weights=weights, skipna=True)
<xarray.DataArray (SKU: 3, STORE: 4)>
array([[3.03824381, 1.10544282, 1.46657789, 4.96241146],
       [6.44982206, 1.03585395, 2.0144671 , 0.87421379],
       [1.19256817, 2.41157233, 3.45932053, 3.49296654]])
Coordinates:
  * SKU      (SKU) int64 0 1 2
  * STORE    (STORE) int64 0 1 2 3