Percentile climatology with a window

We'd like to calculate a 90th percentile climatology of a dataset, with the added complexity that rather than just grouping the Jan 1sts of each year together we want to add a window of 15 days either side, so that the Jan 1st result is the 90th percentile of every Dec 16th to Jan 16th across all years.

In [1]:
# Imports
import xarray
In [2]:
# Start a dask client
import climtas.nci
climtas.nci.Client()
Out[2]:

Client

Client-982468aa-0f71-11ec-89ad-fa163eaaaec0

Connection method: Cluster object Cluster type: distributed.LocalCluster
Dashboard: /node/ood-vn8/17103/proxy/8787/status

Cluster Info

Sample data

For the demo I'll use ACCESS CMIP6 surface temperature. I've chosen a chunking in time as that's usually how the data is laid out in the file - you have the whole grid for Jan 1st, then the whole grid for Jan 2nd and so on. It's faster to read data if it's close together in the file, so we don't want to hop all over the file gathering every time step for a single grid point.

The chunk size aims to get the chunk bytes shown in the array table to very roughly 100 MB - 30 MB is fine for this (I chose a time chunk of 300 to start with because that's roughly a year of daily data).
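As a sanity check on that estimate, the chunk byte count is just the product of the chunk shape and the item size (float32 is 4 bytes):

```python
# One chunk: 300 time steps x 145 lats x 192 lons of float32 (4 bytes each)
chunk_bytes = 300 * 145 * 192 * 4
print(f"{chunk_bytes / 1e6:.1f} MB")  # 33.4 MB
```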

In [3]:
ds = xarray.open_mfdataset('/g/data/fs38/publications/CMIP6/CMIP/CSIRO/ACCESS-ESM1-5/amip/r1i1p1f1/day/tas/gn/latest/*.nc', chunks={'time': 300})
ds.tas
Out[3]:
<xarray.DataArray 'tas' (time: 13149, lat: 145, lon: 192)>
dask.array<open_dataset-de3bca054c2ed43774883de8240854a6tas, shape=(13149, 145, 192), dtype=float32, chunksize=(300, 145, 192), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 1979-01-01T12:00:00 ... 2014-12-31T12:00:00
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 ...
Attributes:
    standard_name:  air_temperature
    long_name:      Near-Surface Air Temperature
    comment:        near-surface (usually, 2 meter) air temperature
    units:          K
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2019-11-14T05:08:45Z altered by CMOR: Treated scalar dime...

31 day window

We'll start with the windowing, gathering together the 15 days either side of each day. We can do this with '.rolling()', giving it the total size of the window: 31, for the day itself plus 15 days either side.

In [4]:
ds.tas.rolling(time=31, center=True)
Out[4]:
DataArrayRolling [time->31(center)]

If we liked we could get statistics for each window just by calling '.mean()' etc. here. We want to do something a bit more complex with a climatology, however, so we'll use the '.construct()' function. This rearranges the array to add a new axis containing all of the samples for the window.

In [5]:
ds.tas.rolling(time=31, center=True).construct(time='window')
Out[5]:
<xarray.DataArray 'tas' (time: 13149, lat: 145, lon: 192, window: 31)>
dask.array<sliding_window_view, shape=(13149, 145, 192, 31), dtype=float32, chunksize=(315, 145, 192, 31), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 1979-01-01T12:00:00 ... 2014-12-31T12:00:00
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 ...
Dimensions without coordinates: window
Attributes:
    standard_name:  air_temperature
    long_name:      Near-Surface Air Temperature
    comment:        near-surface (usually, 2 meter) air temperature
    units:          K
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2019-11-14T05:08:45Z altered by CMOR: Treated scalar dime...

See how at the top there is now a 'window' dimension of length 31? That's all of the samples in the window for that (time, lat, lon) location. Calling ds.tas.rolling(time=31, center=True).mean() is equivalent to ds.tas.rolling(time=31, center=True).construct(time='window').mean('window')
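Under the hood this windowing uses numpy's sliding_window_view (you can see the name in the dask array repr above). A minimal numpy sketch of the same idea - note that unlike xarray's construct, which pads the edges with NaN to keep the full time length, sliding_window_view trims the edges:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(10.0)

# Each row of 'windows' holds one centred window of 3 values
windows = sliding_window_view(x, 3)
print(windows.shape)         # (8, 3)

# Reducing over the window axis gives the rolling statistic
print(windows.mean(axis=1))  # [1. 2. 3. 4. 5. 6. 7. 8.]
```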

Let's save this as a variable

In [6]:
tas_window = ds.tas.rolling(time=31, center=True).construct(time='window')

Percentile climatology

To calculate the climatology in xarray we use .groupby('time.dayofyear'). This groups the days by their day-of-year number, so Feb 29th of leap years will be grouped together with Mar 1st of non-leap years.
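You can see this numbering with the standard library - tm_yday follows the same day-of-year convention as 'time.dayofyear':

```python
from datetime import date

print(date(2000, 2, 29).timetuple().tm_yday)  # 60 - leap year
print(date(2001, 3, 1).timetuple().tm_yday)   # 60 - grouped with Feb 29th above
print(date(2000, 3, 1).timetuple().tm_yday)   # 61
```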

In [7]:
tas_window.groupby('time.dayofyear')
Out[7]:
DataArrayGroupBy, grouped over 'dayofyear'
366 groups with labels 1, 2, 3, 4, 5, ..., 363, 364, 365, 366.

If we look at a single day of year here, you can see that we've got two axes we need to aggregate over: 'time' is now the 36 years in the dataset, and 'window' is the nearby days.

In [8]:
for doy, data in tas_window.groupby('time.dayofyear'):
    display(data)
    break
<xarray.DataArray 'tas' (time: 36, lat: 145, lon: 192, window: 31)>
dask.array<getitem, shape=(36, 145, 192, 31), dtype=float32, chunksize=(1, 145, 192, 31), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) datetime64[ns] 1979-01-01T12:00:00 ... 2014-01-01T12:00:00
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 ...
Dimensions without coordinates: window
Attributes:
    standard_name:  air_temperature
    long_name:      Near-Surface Air Temperature
    comment:        near-surface (usually, 2 meter) air temperature
    units:          K
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2019-11-14T05:08:45Z altered by CMOR: Treated scalar dime...

If we just wanted the mean we could do

In [9]:
tas_window.groupby('time.dayofyear').mean(['time', 'window'])
Out[9]:
<xarray.DataArray 'tas' (dayofyear: 366, lat: 145, lon: 192)>
dask.array<stack, shape=(366, 145, 192), dtype=float32, chunksize=(1, 145, 192), chunktype=numpy.ndarray>
Coordinates:
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 ...
  * dayofyear  (dayofyear) int64 1 2 3 4 5 6 7 8 ... 360 361 362 363 364 365 366

If we try the same with quantile (xarray's version of percentile), however, we get an error:

In [10]:
try:
    tas_window.groupby('time.dayofyear').quantile(0.9, ['time', 'window'])
except Exception as e:
    print('ERROR:', e)
ERROR: dimension time on 0th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single dask array chunk along this dimension, i.e., ``.chunk(dict(time=-1))``, or pass ``allow_rechunk=True`` in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.
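The error itself suggests one way out: rechunk so 'time' is a single chunk. A self-contained sketch of that approach with synthetic single-point data (the dataset name and shape here are made up for illustration); on the full grid this holds the whole time axis in memory per spatial chunk, which may not scale:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: four years of daily data at a single point
time = pd.date_range('2000-01-01', '2003-12-31', freq='D')
tas = xr.DataArray(np.random.rand(time.size), coords={'time': time}, dims='time')

window = tas.rolling(time=31, center=True).construct(time='window')

# Rechunking 'time' into a single chunk lets quantile run without the error
clim = (window.chunk({'time': -1})
              .groupby('time.dayofyear')
              .quantile(0.9, ['time', 'window']))
print(clim.sizes)  # one value per day of year, 366 in total
```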

Percentiles are hard to calculate on large distributed datasets - you need all of the values being reduced in memory at once, so you can sort them and then find the value the required fraction of the way along. We need to help Dask out by manually loading the data needed for each calculation, so we'll go back to a loop and call '.load()' for each day of year, storing the result in a list.

In [11]:
%%time

result = []

for doy, data in tas_window.groupby('time.dayofyear'):
    percentile = data.load().quantile(0.9, ['time', 'window'])
    
    result.append(percentile)
CPU times: user 52min 1s, sys: 5min 31s, total: 57min 32s
Wall time: 1h 18min 1s

Once that's finished, the values in the list can be combined with xarray.concat

In [12]:
percentile_climatology = xarray.concat(result, dim='dayofyear')
percentile_climatology
Out[12]:
<xarray.DataArray 'tas' (dayofyear: 366, lat: 145, lon: 192)>
array([[[250.67286682, 250.67286682, 250.67286682, ..., 250.67286682,
         250.67286682, 250.67286682],
        [251.57595825, 251.54759216, 251.49716187, ..., 251.66184998,
         251.64952087, 251.61529541],
        [252.19511414, 252.02455139, 251.96104431, ..., 252.57232666,
         252.41696167, 252.32958984],
        ...,
        [255.79222107, 255.88395691, 256.19177246, ..., 255.33990479,
         255.59410095, 255.73428345],
        [254.41465759, 254.30558777, 254.33512878, ..., 254.36961365,
         254.33735657, 254.32498169],
        [252.45547485, 252.45547485, 252.45547485, ..., 252.45547485,
         252.45547485, 252.45547485]],

       [[250.6728241 , 250.6728241 , 250.6728241 , ..., 250.6728241 ,
         250.6728241 , 250.6728241 ],
        [251.59952393, 251.54962463, 251.53194733, ..., 251.68025208,
         251.64960327, 251.61642151],
        [252.20888824, 252.06634064, 251.96010895, ..., 252.58323059,
         252.44898682, 252.3343277 ],
...
        [255.82539368, 255.98866272, 256.29229736, ..., 255.67448425,
         255.68865967, 255.80763245],
        [254.47499084, 254.69308472, 254.74490356, ..., 254.39610291,
         254.50894165, 254.38670349],
        [252.46173096, 252.46173096, 252.46173096, ..., 252.46173096,
         252.46173096, 252.46173096]],

       [[251.06691589, 251.06691589, 251.06691589, ..., 251.06691589,
         251.06691589, 251.06691589],
        [251.23405151, 251.20076904, 251.19150391, ..., 251.38855896,
         251.34030457, 251.29734802],
        [251.88744507, 251.86525269, 251.8250824 , ..., 252.6361908 ,
         252.57929688, 252.17391968],
        ...,
        [254.95679626, 255.5537323 , 255.38959045, ..., 254.92901306,
         254.91663208, 254.88747559],
        [254.23662109, 254.21036072, 254.17580261, ..., 253.96891174,
         254.10254822, 254.2250061 ],
        [252.87672119, 252.87672119, 252.87672119, ..., 252.87672119,
         252.87672119, 252.87672119]]])
Coordinates:
  * lat       (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon       (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    quantile  float64 0.9
Dimensions without coordinates: dayofyear