In this notebook I verify 12 million forecasts in a matter of seconds using the RMSE metric on a dask.array.
import xarray as xr
import pandas as pd
import numpy as np
import xskillscore as xs
import dask.array as da
from dask.distributed import Client, LocalCluster
By default the dask.distributed.Client uses a LocalCluster:
cluster = LocalCluster()
client = Client(cluster)
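If the defaults don't suit your machine, the local cluster can be sized explicitly. A minimal sketch (the worker counts and memory limit below are illustrative choices, not recommendations):

cluster = LocalCluster(n_workers=4, threads_per_worker=2, memory_limit="4GB")  # illustrative sizes
client = Client(cluster)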
However, this code can easily be adapted to scale to massive datasets using distributed computing, via various methods of deployment or vendor products; see the sketch below.
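Whatever the deployment, once a scheduler is running elsewhere you point the Client at its address instead of a LocalCluster. A sketch (the address here is hypothetical):

client = Client("tcp://scheduler-host:8786")  # hypothetical address of an already-deployed scheduler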
If anyone does run this example on a large cluster, I would be curious how big you can scale nstores and nskus, and how long it takes to run rmse. You are welcome to post your results in the issue section following this link.
Set up the client (i.e. connect to the scheduler):
client = Client()
client
Due to the success of your previous forecast (and verification using xskillscore!) the company you work for has expanded. It has grown to 4,000 stores, each with 3,000 products:
nstores = 4000
nskus = 3000
nforecasts = nstores * nskus
print(f"That's {nforecasts:,d} different forecasts to verify!")
stores = np.arange(nstores)
skus = np.arange(nskus)
That's 12,000,000 different forecasts to verify!
The time period of interest is the same set of dates, but for 2021:
dates = pd.date_range("1/1/2021", "1/5/2021", freq="D")
Set up the data as a dask.array of dates x stores x skus. dask exposes functions similar to those in numpy; in this case, switch np. to da. to generate random integers between 1 and 10:
data = da.random.randint(1, 11, size=(len(dates), len(stores), len(skus)))  # integers in [1, 10]
data
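No numbers have actually been generated yet: dask builds a lazy task graph and splits the array into chunks. A sketch of inspecting the chunking, and of requesting explicit chunk sizes when creating the array (the name and sizes below are illustrative, not a recommendation):

print(data.chunks)  # one tuple of chunk lengths per dimension

data_chunked = da.random.randint(
    1, 11,
    size=(len(dates), len(stores), len(skus)),
    chunks=(len(dates), 1000, 1000),  # illustrative explicit chunk sizes
)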
Put this into an xarray.DataArray and specify the coordinates and dimensions:
y = xr.DataArray(data, coords=[dates, stores, skus], dims=["DATE", "STORE", "SKU"])
y
(output: the xarray.DataArray repr, showing the lazy dask array with DATE, STORE and SKU coordinates)
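Wrapping the dask array in a DataArray keeps it lazy; nothing is computed until you ask for it. A quick sketch to confirm this and to gauge how large the array would be if fully realized:

print(type(y.data))  # still a dask array, not numpy
print(f"{y.nbytes / 1e9:.2f} GB")  # 5 x 4000 x 3000 int64 values, roughly 0.48 GB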
Create a prediction array similar to that in 01_Deterministic.ipynb:
noise = da.random.uniform(-1, 1, size=(len(dates), len(stores), len(skus)))
yhat = y + (y * noise)
yhat
(output: the yhat DataArray repr, with the same DATE, STORE and SKU coordinates)
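Because yhat = y * (1 + noise) with noise drawn from [-1, 1], every prediction lies within ±100% of the truth. A quick sanity check (small reductions like this are cheap to compute):

rel_err = (yhat - y) / y  # relative error, in [-1, 1] by construction
print(float(rel_err.min().compute()), float(rel_err.max().compute()))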
Finally, calculate RMSE at the store and SKU level. Use the .compute() method to return the values:
%time xs.rmse(y, yhat, 'DATE').compute()
CPU times: user 1.52 s, sys: 964 ms, total: 2.49 s
Wall time: 9.63 s
array([[1.53123262, 3.14700356, 3.15132902, ..., 4.94581499, 3.27822252, 4.06088894],
       [1.67277937, 4.08686635, 3.66413793, ..., 2.41712605, 2.57601639, 2.88718054],
       [2.76270445, 3.33921586, 3.06681608, ..., 3.34186527, 1.32741548, 2.08743438],
       ...,
       [2.35981265, 2.1617547 , 4.92081192, ..., 1.98393152, 3.1364395 , 3.59346663],
       [1.26363135, 2.77340328, 2.7967874 , ..., 2.8342274 , 3.01885276, 0.62828305],
       [4.29771572, 3.92254418, 2.15708334, ..., 4.099115  , 3.45980968, 3.73594864]])
(output: a STORE x SKU array of RMSE values, one per forecast)
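If one headline number is more useful than 12 million of them, the same call can reduce over every dimension at once. A sketch (xskillscore accepts a list of dimensions for dim):

overall = xs.rmse(y, yhat, dim=["DATE", "STORE", "SKU"])  # one RMSE across all dates, stores and skus
print(overall.compute())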