• In this notebook I show how xskillscore can be dropped into a typical data science task where the data is a pandas.DataFrame.

  • I use the metric RMSE to verify forecasts of items sold.

  • I also show how you can apply weights to the verification and handle missing values.

Import the necessary packages

In [1]:
import xarray as xr
import pandas as pd
import numpy as np
import xskillscore as xs

Let's say you are a data scientist who works for a company that owns four stores, each of which sells three items (Stock Keeping Units, or SKUs).

Set up the stores and skus arrays:

In [2]:
stores = np.arange(4)
skus = np.arange(3)

and you are tracking the daily performance of items sold between January 1st and January 5th, 2020.

Set up the dates array:

In [3]:
dates = pd.date_range("1/1/2020", "1/5/2020", freq="D")

Generate a pandas.DataFrame to show the number of items that were sold during this period. The number of items sold will be a random integer between 1 and 9.

This may be something you would obtain from querying a database:

In [4]:
rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            rows.append(
                {
                    "DATE": date,
                    "STORE": store,
                    "SKU": sku,
                    "QUANTITY_SOLD": np.random.randint(9) + 1,
                }
            )
df = pd.DataFrame(rows)

Print the first 5 rows of the pandas.DataFrame:

In [5]:
df.head()
Out[5]:
        DATE  STORE  SKU  QUANTITY_SOLD
0 2020-01-01      0    0              9
1 2020-01-01      0    1              2
2 2020-01-01      0    2              2
3 2020-01-01      1    0              3
4 2020-01-01      1    1              1

Your boss has asked you to use this data to predict the number of items sold at each store and SKU level for the next 5 days.

The prediction itself is outside the scope of this tutorial, but we will use xskillscore to tell us how good our prediction is.

First, rename the target variable to y:

In [6]:
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.head()
Out[6]:
        DATE  STORE  SKU  y
0 2020-01-01      0    0  9
1 2020-01-01      0    1  2
2 2020-01-01      0    2  2
3 2020-01-01      1    0  3
4 2020-01-01      1    1  1

Use pandas MultiIndex to help handle the granularity of the forecast:

In [7]:
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)

This also displays the data in a cleaner format in the notebook:

In [8]:
df.head()
Out[8]:
                      y
DATE       STORE SKU
2020-01-01 0     0    9
                 1    2
                 2    2
           1     0    3
                 1    1

Time for your prediction! As mentioned, this is outside of the scope of this tutorial.

In our case we are going to generate data that mimics a prediction by taking y and perturbing it randomly. This provides a middle ground between overfitting the data (a prediction very similar to y) and the other extreme of pure random numbers, for which the skill would be 0.

The perturbations scale y between -100% and 100% using a uniform distribution. For example, a value of 5 in y will map to a value between 0 and 10 in the prediction (yhat).

Set up the perturbation array:

In [9]:
noise = np.random.uniform(-1, 1, size=len(df['y']))

Name the prediction yhat and append it to the pandas.DataFrame.

Lastly, convert it to an int to match the format of the target (y):

In [10]:
df['yhat'] = (df['y'] + (df['y'] * noise)).astype(int)
df.head()
Out[10]:
                      y  yhat
DATE       STORE SKU
2020-01-01 0     0    9    13
                 1    2     3
                 2    2     2
           1     0    3     4
                 1    1     0

Using xskillscore - RMSE

RMSE (root-mean-square error) is the square root of the average of the squared differences between forecasts and verification data:

\begin{align} RMSE = \sqrt{\overline{(f - o)^{2}}} \end{align}

Because the error is squared, RMSE is sensitive to outliers and is a more conservative metric than mean absolute error.

See https://climpred.readthedocs.io/en/stable/metrics.html#root-mean-square-error-rmse for further documentation.
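To make the formula concrete, here is a minimal NumPy sketch that computes RMSE directly from the definition (the toy arrays f and o are made up for illustration, not data from this tutorial):

import numpy as np

# Toy forecast (f) and observations (o); values are illustrative only
f = np.array([3.0, 5.0, 2.0, 8.0])
o = np.array([2.0, 6.0, 2.0, 5.0])

# RMSE is the square root of the mean squared error
rmse_manual = np.sqrt(np.mean((f - o) ** 2))
print(rmse_manual)  # sqrt((1 + 1 + 0 + 9) / 4) = sqrt(2.75) ≈ 1.658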

sklearn

Most data scientists are familiar with using scikit-learn for verifying forecasts, especially if you used scikit-learn for the prediction.

To obtain RMSE from scikit-learn import mean_squared_error and specify squared=False:

In [11]:
from sklearn.metrics import mean_squared_error
mean_squared_error(df['y'], df['yhat'], squared=False)
Out[11]:
2.932575659723036

While scikit-learn is simple, it doesn't offer the flexibility that xskillscore provides.

Note: xskillscore implements some of the same metrics as scikit-learn, such as r2_score, which is called r2 in xskillscore.
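For example, a quick sketch using the df built above (the xs.r2 call assumes the xarray conversion shown in the next section):

from sklearn.metrics import r2_score

# scikit-learn operates on flat arrays
print(r2_score(df['y'], df['yhat']))

# the xskillscore equivalent, once ds exists (see below):
# xs.r2(ds['y'], ds['yhat'], dim=['DATE', 'STORE', 'SKU'])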

xskillscore

To use xskillscore you first have to put your data into an xarray object.

Because xarray is part of the PyData stack, it integrates well with other Python data science packages.

pandas has a convenient to_xarray method which makes going from pandas to xarray seamless.

Use to_xarray to convert the pandas.DataFrame to an xarray.Dataset:

In [12]:
ds = df.to_xarray()
ds
Out[12]:
<xarray.Dataset>
Dimensions:  (DATE: 5, SKU: 3, STORE: 4)
Coordinates:
  * DATE     (DATE) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-01-05
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
Data variables:
    y        (DATE, STORE, SKU) int64 9 2 2 3 1 6 4 7 ... 5 2 6 6 8 1 1 3
    yhat     (DATE, STORE, SKU) int64 13 3 2 4 0 7 7 7 ... 1 2 3 0 0 0 0

In the live notebook, xarray renders a very nice HTML representation of xarray.Dataset objects.

Click on the data symbol (the cylinder) to see the data associated with the Coordinates and the Data variables.

You now have one variable (ds) which houses the data and the associated metadata. You can also use the Attributes for handling things like units (this is why xarray was developed!).

If you would like to know more about xarray check out this overview.

We can use xskillscore on this xarray.Dataset via xarray's accessor interface.

Most xskillscore functions expect at least 3 arguments: y, the target variable; yhat, the predicted variable; and the dimension(s) over which to apply the verification metric.

To replicate the scikit-learn metric above, apply RMSE over all the dimensions [DATE, STORE, SKU]. RMSE is called rmse in xskillscore:

In [13]:
rmse = ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU'])
rmse
Out[13]:
<xarray.DataArray ()>
array(2.93257566)

If you want just the data from the xarray.DataArray you can call .values on it.

In [14]:
rmse.values

Out[14]:
array(2.93257566)

xskillscore allows you to apply the metric over any combination of dimensions (think of pandas.groupby.apply but faster).

For example, your boss has asked how good your predictions are at the store level.

In this case, apply the metric over the DATE and SKU dimensions:

In [15]:
rmse = ds.xs.rmse('y', 'yhat', ['DATE', 'SKU'])
rmse
    
Out[15]:
<xarray.DataArray (STORE: 4)>
array([2.17562252, 2.87518115, 3.21455025, 3.32665999])
Coordinates:
  * STORE    (STORE) int64 0 1 2 3
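As a sanity check, the same per-store numbers can be reproduced with the pandas route the accessor replaces (a sketch using the df built earlier; per_store_rmse is just an illustrative name):

# Equivalent pandas computation: RMSE per store over all dates and SKUs
per_store_rmse = (
    df.reset_index()
      .groupby('STORE')
      .apply(lambda g: np.sqrt(((g['y'] - g['yhat']) ** 2).mean()))
)
print(per_store_rmse)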

We can use xarray a bit further to explore our results.

Let's find out which store had the best forecast and which store had the worst forecast:

In [16]:
print('Our forecast performed well for store:')
print(rmse.where(rmse==rmse.min(), drop=True).coords)
print('')
print('Our forecast struggled with store:')
print(rmse.where(rmse==rmse.max(), drop=True).coords)

Our forecast performed well for store:
Coordinates:
  * STORE    (STORE) int64 0

Our forecast struggled with store:
Coordinates:
  * STORE    (STORE) int64 3
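If you are on a recent enough xarray, idxmin and idxmax return the coordinate label of the extreme value directly (a sketch; check that your xarray version provides these methods):

# Coordinate labels of the smallest and largest per-store RMSE
print('Best store: ', int(rmse.idxmin('STORE')))
print('Worst store:', int(rmse.idxmax('STORE')))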
    

Providing weights to the verification metrics

You can specify weights when calculating skill metrics. Here I will go through an example demonstrating why you may want to apply weights when verifying your forecast.

Your boss has asked you to create a prediction for the next five days. You will update this prediction every day, and there is a larger focus on the performance of the model for the following day and less of a focus on the fifth day.

In this case you can weight your metric so the performance on day 1 has a larger influence than day 5. Here we will apply a linear scaling from 1 to 0, with day 1 having a weight of 1 and day 5 having a weight of 0.

Generate weights the same size as the DATE dimension and put them into an xarray.DataArray:

In [17]:
dim = 'DATE'
np_weights = np.linspace(1, 0, num=len(ds[dim]))
weights = xr.DataArray(np_weights, dims=dim)
print(weights)

<xarray.DataArray (DATE: 5)>
array([1.  , 0.75, 0.5 , 0.25, 0.  ])
Dimensions without coordinates: DATE
    

Now simply pass the DataArray to the weights argument:

In [18]:
ds.xs.rmse('y', 'yhat', 'DATE', weights=weights)
    
Out[18]:
<xarray.DataArray (STORE: 4, SKU: 3)>
array([[3.54964787, 1.        , 1.04880885],
       [2.64575131, 3.50713558, 0.89442719],
       [2.04939015, 2.82842712, 2.56904652],
       [4.01248053, 3.46410162, 3.67423461]])
Coordinates:
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2

You can compare this to the result without the weights:

In [19]:
ds.xs.rmse('y', 'yhat', 'DATE')
    
Out[19]:
<xarray.DataArray (STORE: 4, SKU: 3)>
array([[3.46410162, 1.        , 1.09544512],
       [2.75680975, 4.04969135, 0.89442719],
       [2.44948974, 2.86356421, 4.09878031],
       [3.31662479, 2.82842712, 3.76828874]])
Coordinates:
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
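Under the hood, the weighted RMSE is a weighted mean of the squared errors, i.e. sqrt( sum(w * (f - o)^2) / sum(w) ). A minimal sketch that reproduces one cell of the weighted result (the STORE=0, SKU=0 entry of Out[18]) by hand:

# Squared errors along DATE for a single store/SKU combination
err2 = (ds['y'].sel(STORE=0, SKU=0) - ds['yhat'].sel(STORE=0, SKU=0)) ** 2

# Weighted mean of the squared errors, then the square root
manual = float(np.sqrt((err2 * weights).sum() / weights.sum()))
print(manual)  # matches the STORE=0, SKU=0 entry of the weighted Out[18]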

Handle missing values

There may be no purchases for certain items in certain stores on certain dates. These entries will be blank in the query from the database.

To mimic data like this, create the same type of data structure as before but randomly suppress rows. A simple if statement drops each row with a probability of 0.2 (20%):

In [20]:
random_number_threshold = 0.8

rows = []
for date in dates:
    for store in stores:
        for sku in skus:
            if np.random.rand(1) < random_number_threshold:
                rows.append(
                    {
                        "DATE": date,
                        "STORE": store,
                        "SKU": sku,
                        "QUANTITY_SOLD": np.random.randint(9) + 1,
                    }
                )
df = pd.DataFrame(rows)
df.rename(columns={"QUANTITY_SOLD": "y"}, inplace=True)
df.set_index(['DATE', 'STORE', 'SKU'], inplace=True)
df.head(10)
    
Out[20]:
                      y
DATE       STORE SKU
2020-01-01 0     0    7
                 2    1
           1     0    2
                 1    5
                 2    2
           2     0    6
                 2    4
           3     0    6
                 2    8
2020-01-02 0     0    3

Converting the pandas.DataFrame to an xarray.Dataset is very handy in this case because it will infer the missing entries as nans (as long as all index values appear somewhere in the pandas.DataFrame):

In [21]:
ds = df.to_xarray()
ds
    
Out[21]:
<xarray.Dataset>
Dimensions:  (DATE: 5, SKU: 3, STORE: 4)
Coordinates:
  * DATE     (DATE) datetime64[ns] 2020-01-01 2020-01-02 ... 2020-01-05
  * STORE    (STORE) int64 0 1 2 3
  * SKU      (SKU) int64 0 1 2
Data variables:
    y        (DATE, STORE, SKU) float64 7.0 nan 1.0 2.0 ... nan 3.0 3.0 8.0

Note that y is now float64; the nans mark the missing entries (in the live notebook, click the data symbol next to y to see them all).
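You can also count the missing entries programmatically rather than scanning the repr:

# Number of (DATE, STORE, SKU) combinations that had no row in df
print(int(ds['y'].isnull().sum()))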

You can also use this step in your workflow if you simply want to continue working with the pandas.DataFrame:

In [22]:
df_with_nans = ds.to_dataframe()
df_with_nans.head(10)
    
Out[22]:
                        y
DATE       SKU STORE
2020-01-01 0   0      7.0
               1      2.0
               2      6.0
               3      6.0
           1   0      NaN
               1      5.0
               2      NaN
               3      NaN
           2   0      1.0
               1      2.0

Note: xarray orders the index levels alphabetically, but the nans are still visible.

In most cases you will not know a priori whether there will be no purchases for a particular item in a certain store on a given day. Therefore, your prediction will not contain nans, but you would hope the predicted value is low.

Append a prediction column as was done previously:

In [23]:
df_with_nans['yhat'] = df_with_nans['y'] + (df_with_nans['y'] * noise)
df_with_nans.head()
    
Out[23]:
                        y       yhat
DATE       SKU STORE
2020-01-01 0   0      7.0  10.446753
               1      2.0   3.411661
               2      6.0   7.584959
               3      6.0   9.522168
           1   0      NaN        NaN

Our prediction contains nans, so to mimic a realistic prediction replace these with random values:

In [24]:
yhat = df_with_nans['yhat']

yhat.loc[pd.isna(yhat)] = yhat[pd.isna(yhat)].apply(lambda x: np.random.randint(9) + 1)

df_with_nans['yhat'] = yhat
df_with_nans.head()
    
Out[24]:
                        y       yhat
DATE       SKU STORE
2020-01-01 0   0      7.0  10.446753
               1      2.0   3.411661
               2      6.0   7.584959
               3      6.0   9.522168
           1   0      NaN   2.000000

Now if we try using scikit-learn:

In [25]:
mean_squared_error(df_with_nans['y'], df_with_nans['yhat'], squared=False)
    
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-25-aaf7ba8b32ad> in <module>
----> 1 mean_squared_error(df_with_nans['y'], df_with_nans['yhat'], squared=False)

~/local/bin/anaconda3/envs/xskillscore-tutorial/lib/python3.8/site-packages/sklearn/metrics/_regression.py in mean_squared_error(y_true, y_pred, sample_weight, multioutput, squared)
    249 
    250     """
--> 251     y_type, y_true, y_pred, multioutput = _check_reg_targets(
    252         y_true, y_pred, multioutput)
    253     check_consistent_length(y_true, y_pred, sample_weight)

~/local/bin/anaconda3/envs/xskillscore-tutorial/lib/python3.8/site-packages/sklearn/metrics/_regression.py in _check_reg_targets(y_true, y_pred, multioutput, dtype)
     83     """
     84     check_consistent_length(y_true, y_pred)
---> 85     y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
     86     y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)
     87 

~/local/bin/anaconda3/envs/xskillscore-tutorial/lib/python3.8/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    575 
    576         if force_all_finite:
--> 577             _assert_all_finite(array,
    578                                allow_nan=force_all_finite == 'allow-nan')
    579 

~/local/bin/anaconda3/envs/xskillscore-tutorial/lib/python3.8/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     55                 not allow_nan and not np.isfinite(X).all()):
     56             type_err = 'infinity' if allow_nan else 'NaN, infinity'
---> 57             raise ValueError(
     58                     msg_err.format
     59                     (type_err,

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

You get a ValueError: Input contains NaN.

With xskillscore you don't need to worry about this; simply specify skipna=True:

In [26]:
ds = df_with_nans.to_xarray()
ds.xs.rmse('y', 'yhat', ['DATE', 'STORE', 'SKU'], skipna=True)
    
Out[26]:
<xarray.DataArray ()>
array(3.44001822)
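For comparison, the scikit-learn route requires dropping the missing rows yourself first (a sketch; it should agree with the xskillscore number above, since yhat itself contains no nans):

# scikit-learn workaround: mask out rows where the target is NaN
mask = df_with_nans['y'].notna()
print(mean_squared_error(df_with_nans.loc[mask, 'y'],
                         df_with_nans.loc[mask, 'yhat'],
                         squared=False))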

Handle weights and missing values

You can specify weights and skipna together for a powerful analysis:

In [27]:
ds.xs.rmse('y', 'yhat', 'DATE', weights=weights, skipna=True)

Out[27]:
<xarray.DataArray (SKU: 3, STORE: 4)>
array([[3.03824381, 1.10544282, 1.46657789, 4.96241146],
       [6.44982206, 1.03585395, 2.0144671 , 0.87421379],
       [1.19256817, 2.41157233, 3.45932053, 3.49296654]])
Coordinates:
  * SKU      (SKU) int64 0 1 2
  * STORE    (STORE) int64 0 1 2 3

In [ ]: