The NEX-GDDP-CMIP6 dataset offers global downscaled climate scenarios derived from General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 6 (CMIP6), across two of the four "Tier 1" greenhouse gas emissions scenarios known as Shared Socioeconomic Pathways (SSPs). The purpose of this dataset is to provide a set of global, high-resolution, bias-corrected climate change projections that can be used to evaluate climate change impacts on processes that are sensitive to finer-scale climate gradients and the effects of local topography on climate conditions.
This dataset uses the Bias-Correction Spatial Disaggregation (BCSD) method to downscale the original GCM runs to a finer 0.25° resolution. See the tech note on the product homepage for more details.
The NEX-GDDP-CMIP6 files are stored as NetCDF in Azure Blob Storage. Each STAC Item in this collection describes a single year for one scenario for one model.
import planetary_computer
import xarray as xr
import fsspec
import pystac_client
The datasets hosted by the Planetary Computer are available from Azure Blob Storage. We'll use pystac-client to search the Planetary Computer's STAC API for the subset of the data that we care about, and then we'll load the data directly from Azure Blob Storage. We'll specify a modifier so that we can access the data stored in the Planetary Computer's private Blob Storage Containers. See Reading from the STAC API and Using tokens for data access for more.
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
The STAC metadata on the collection, items, and assets provides information about what data is available.
collection = catalog.get_collection("nasa-nex-gddp-cmip6")
As usual, the collection object contains information about the dataset, including its spatio-temporal extent, license, and so on. We also have information unique to CMIP6. The collection is organized by {model}-{scenario}-{year}: there is a single STAC Item for each (valid) combination (data is not available for some; see Table 1 in the tech note for more). The valid values for each of these are stored in the collection's summaries:
# List the models. There are ~30 in total.
collection.summaries.get_list("cmip6:model")[:5]
['ACCESS-CM2', 'ACCESS-ESM1-5', 'BCC-CSM2-MR', 'CESM2', 'CESM2-WACCM']
# List the scenarios
collection.summaries.get_list("cmip6:scenario")
['historical', 'ssp245', 'ssp585']
The "historical" scenario covers the years 1950 - 2014 (inclusive). The "ssp245" and "ssp585" cover the years 2015 - 2100 (inclusive).
Each item includes a handful of assets, one per variable, where each asset is a single NetCDF file with the data for that variable for that model-scenario-year.
# list the variables
collection.summaries.get_list("cmip6:variable")
['hurs', 'huss', 'pr', 'rlds', 'rsds', 'sfcWind', 'tas', 'tasmax', 'tasmin']
Each STAC item covers the same spatial region, so when using the STAC API you're likely filtering on some combination of time, model, and scenario. For example, we can get the STAC items for the "ACCESS-CM2" model for the years 1950 - 1960.
search = catalog.search(
collections=["nasa-nex-gddp-cmip6"],
datetime="1950/1960",
query={"cmip6:model": {"eq": "ACCESS-CM2"}},
)
items = search.item_collection()
len(items)
11
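To confirm exactly what the search matched, you can list the item IDs, one per year:
# Eleven items, covering 1950 through 1960 inclusive
sorted(item.id for item in items)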
Each of these items has nine assets, one per variable, which point to the NetCDF files in Azure Blob Storage:
item = items[0]
list(item.assets)
['pr', 'tas', 'hurs', 'huss', 'rlds', 'rsds', 'tasmax', 'tasmin', 'sfcWind']
Once you have a STAC item or items, you can load the data directly from Blob Storage using xarray.
hurs = xr.open_dataset(fsspec.open(item.assets["hurs"].href).open())
hurs
&lt;xarray.Dataset&gt;
Dimensions:  (time: 366, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 1960-01-01T12:00:00 ... 1960-12-31T12:00:00
  * lat      (lat) float64 -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
Data variables:
    hurs     (time, lat, lon) float32 ...
Attributes: (12/22)
    activity:              NEX-GDDP-CMIP6
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    Conventions:           CF-1.7
    creation_date:         2021-10-04T13:59:24.689966+00:00
    frequency:             day
    institution:           NASA Earth Exchange, NASA Ames Research Center, Mo...
    ...                    ...
    history:               2021-10-04T13:59:24.689966+00:00: install global a...
    disclaimer:            This data is considered provisional and subject to...
    external_variables:    areacella
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
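As a small usage sketch (the grid-cell indices here are arbitrary, chosen only for illustration), you can plot a time series straight from this single-variable dataset:
# Sketch: daily relative humidity for 1960 at one (arbitrary) grid cell
hurs["hurs"].isel(lat=300, lon=700).plot()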
Or you can use xarray.open_mfdataset to load all the variables for an item and combine them into a single dataset.
%%time
ds = xr.open_mfdataset(
[fsspec.open(asset.href).open() for asset in item.assets.values()]
)
ds
CPU times: user 1.41 s, sys: 379 ms, total: 1.79 s
Wall time: 6.26 s
&lt;xarray.Dataset&gt;
Dimensions:  (time: 366, lat: 600, lon: 1440)
Coordinates:
  * time     (time) datetime64[ns] 1960-01-01T12:00:00 ... 1960-12-31T12:00:00
  * lat      (lat) float64 -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
Data variables:
    hurs     (time, lat, lon) float32 dask.array&lt;chunksize=(366, 600, 1440), meta=np.ndarray&gt;
    huss     (time, lat, lon) float32 dask.array&lt;chunksize=(366, 600, 1440), meta=np.ndarray&gt;
    pr       (time, lat, lon) float32 dask.array&lt;chunksize=(366, 600, 1440), meta=np.ndarray&gt;
    rlds     (time, lat, lon) float32 dask.array&lt;chunksize=(366, 600, 1440), meta=np.ndarray&gt;
    rsds     (time, lat, lon) float32 dask.array&lt;chunksize=(366, 600, 1440), meta=np.ndarray&gt;
    sfcWind  (time, lat, lon) float32 dask.array&lt;chunksize=(366, 600, 1440), meta=np.ndarray&gt;
    tas      (time, lat, lon) float32 dask.array&lt;chunksize=(366, 600, 1440), meta=np.ndarray&gt;
    tasmax   (time, lat, lon) float32 dask.array&lt;chunksize=(366, 600, 1440), meta=np.ndarray&gt;
    tasmin   (time, lat, lon) float32 dask.array&lt;chunksize=(366, 600, 1440), meta=np.ndarray&gt;
Attributes: (12/22)
    activity:              NEX-GDDP-CMIP6
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    Conventions:           CF-1.7
    creation_date:         2021-10-04T13:59:24.689966+00:00
    frequency:             day
    institution:           NASA Earth Exchange, NASA Ames Research Center, Mo...
    ...                    ...
    history:               2021-10-04T13:59:24.689966+00:00: install global a...
    disclaimer:            This data is considered provisional and subject to...
    external_variables:    areacella
    cmip6_source_id:       ACCESS-CM2
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
Note that opening all those variables is relatively slow. See below for an alternative.
We can plot all the variables for a single day with xarray, matplotlib, and cartopy.
import warnings
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import pandas as pd
# Silence some noisy warnings raised while plotting with cartopy
warnings.filterwarnings("ignore", message="__len__")
warnings.filterwarnings("ignore", message="Iteration")
fig, axes = plt.subplots(
    figsize=(16, 9),
    ncols=3,
    nrows=3,
    subplot_kw=dict(projection=ccrs.Robinson()),
    sharex=True,
    sharey=True,
)

day = ds.isel(time=0)

# One panel per variable for the first day
for i, (v, data) in enumerate(day.data_vars.items()):
    ax = axes.ravel()[i]
    data.plot(ax=ax, transform=ccrs.PlateCarree(), add_colorbar=False)
    ax.set(title=v)

fig.suptitle(pd.to_datetime(day.time.data).strftime("%Y-%m-%d"))
plt.tight_layout()
The dataset has an internal index for each file, stored as kerchunk metadata. These indices provide a large speedup in both loading (~100x) and processing (~5x) large datasets in Azure Storage. One entire CMIP6 dataset is ~135 GB in size. By using the kerchunk indices, the files do not need to be downloaded locally.
Here is a sample of the metadata from the kerchunk indices:
refs = items[0].properties["kerchunk:indices"]["refs"]
# Print every hundredth reference entry as a sample
for i, (key, value) in enumerate(refs.items()):
    if i % 100 == 1:
        print(key, value)
lon/0 base64:eF7ty2lI1GEQBnDNDjssMzM103Vd13VdVzMzr3TzzszUzLzyytTUzLwQkYiIiIiQEAkRiZAQkYgQiRCRiIgIEYmQiIiQiBCREAmR6A3m03xwPvyJJJ/fMAzDzGNnBwAAAAAAAAAAAAAAAADr1TXGNrF66coefNbSzqFZjV1Pphb+1XR0N0Wm5de037o/+Oz1zLef623fuMPVy2AJj03KyC2urG/pvHGnu29geGTs5dt3H2fnFldw/7/vGzZv2+ni5umtN5qtYRHRcQkp6Zk5eYUlFVW1DU1tHVev37x9915Pb//DR0OPn44+H3/x6s3k9PsPn758/T7/Y2n5F/J/Nz8xMzv358/eYdOWrduddjnvdtnjutdtn7uH536vA94+Op2vr17v52cw+PsbjQEBJlNgoNkcFGSxBAdbrSEhocpBJUw5pIQrh5UI5YgSqUQp0UqMEqscVeKUeMVGjpEEkkiSSDJJIakkjRwn6eQEySAnSSY5RbKYbCaHOc3kMmeYPOYsk88UMIVMEVPMnGNKmFKmTFAuqBCcF1QKLgiqBNWCGsFFQa2gTlAvuCRoEFwWNGp0RaMmjZo1atGoVaM2QTyxAQDAmvIbfhZI4w==
pr/96.0.0 ['{{c}}', 44581952, 457852]
pr/164.0.0 ['{{c}}', 78301605, 562268]
pr/264.0.0 ['{{c}}', 130063094, 507995]
pr/364.0.0 ['{{c}}', 178726659, 479519]
tas/88.0.0 ['{{g}}', 57697985, 665911]
hurs/98.0.0 ['{{a}}', 74742402, 754652]
rlds/14.0.0 ['{{d}}', 11142792, 795029]
rsds/24.0.0 ['{{e}}', 17124177, 736305]
tas/122.0.0 ['{{g}}', 80307877, 663300]
tas/222.0.0 ['{{g}}', 146405536, 660577]
tas/322.0.0 ['{{g}}', 213314582, 668735]
hurs/154.0.0 ['{{a}}', 117098519, 757833]
hurs/254.0.0 ['{{a}}', 193230570, 762747]
hurs/354.0.0 ['{{a}}', 269434676, 767670]
huss/186.0.0 ['{{b}}', 159631784, 846847]
huss/286.0.0 ['{{b}}', 244697078, 857822]
rlds/118.0.0 ['{{d}}', 93069136, 778260]
rlds/218.0.0 ['{{d}}', 170107139, 764772]
rlds/318.0.0 ['{{d}}', 247717736, 787726]
rsds/150.0.0 ['{{e}}', 111892540, 743668]
rsds/250.0.0 ['{{e}}', 186968605, 760755]
rsds/350.0.0 ['{{e}}', 261700052, 690231]
tasmax/62.0.0 ['{{h}}', 41816906, 673342]
tasmin/72.0.0 ['{{i}}', 49262774, 681570]
sfcWind/82.0.0 ['{{f}}', 68838219, 838208]
tasmax/180.0.0 ['{{h}}', 120718611, 666940]
tasmax/280.0.0 ['{{h}}', 187351196, 666895]
tasmin/112.0.0 ['{{i}}', 76346255, 673218]
tasmin/212.0.0 ['{{i}}', 143290240, 667705]
tasmin/312.0.0 ['{{i}}', 210341912, 677588]
sfcWind/144.0.0 ['{{f}}', 120769791, 838465]
sfcWind/244.0.0 ['{{f}}', 204792155, 840131]
sfcWind/344.0.0 ['{{f}}', 288844276, 839500]
%%time
from kerchunk.combine import MultiZarrToZarr

single_ref_sets = []
# Reuse the SAS token from a signed asset href so the referenced blobs are readable
sas_token = items[0].assets["pr"].href.split("?")[1]

for d in [item.properties["kerchunk:indices"] for item in items]:
    # Append the token to each URL template in the reference set
    for key in d["templates"]:
        d["templates"][key] = d["templates"][key] + "?" + sas_token
    single_ref_sets.append(d)

# Combine the per-year reference sets into one, concatenated along time
mzz = MultiZarrToZarr(
    single_ref_sets, concat_dims=["time"], identical_dims=["lat", "lon"]
)
d = mzz.translate()
m = fsspec.get_mapper("reference://", fo=d)
m.fs.clear_instance_cache()
ds = xr.open_dataset(
    m, engine="zarr", consolidated=False, decode_times=True, chunks="auto"
)
# Normalize the model's native calendar to Gregorian; days inserted to fill
# calendar gaps get the fill value -99 (masked out before plotting below)
ds = ds.convert_calendar(calendar="gregorian", align_on="date", missing=-99)
ds
CPU times: user 1.01 s, sys: 88.5 ms, total: 1.1 s
Wall time: 1.09 s
&lt;xarray.Dataset&gt;
Dimensions:  (time: 4018, lat: 600, lon: 1440)
Coordinates:
  * lat      (lat) float64 -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[ns] 1950-01-01T12:00:00 ... 1960-12-31T12:00:00
Data variables:
    hurs     (time, lat, lon) float32 dask.array&lt;chunksize=(38, 600, 1440), meta=np.ndarray&gt;
    huss     (time, lat, lon) float32 dask.array&lt;chunksize=(38, 600, 1440), meta=np.ndarray&gt;
    pr       (time, lat, lon) float32 dask.array&lt;chunksize=(38, 600, 1440), meta=np.ndarray&gt;
    rlds     (time, lat, lon) float32 dask.array&lt;chunksize=(38, 600, 1440), meta=np.ndarray&gt;
    rsds     (time, lat, lon) float32 dask.array&lt;chunksize=(38, 600, 1440), meta=np.ndarray&gt;
    sfcWind  (time, lat, lon) float32 dask.array&lt;chunksize=(38, 600, 1440), meta=np.ndarray&gt;
    tas      (time, lat, lon) float32 dask.array&lt;chunksize=(38, 600, 1440), meta=np.ndarray&gt;
    tasmax   (time, lat, lon) float32 dask.array&lt;chunksize=(38, 600, 1440), meta=np.ndarray&gt;
    tasmin   (time, lat, lon) float32 dask.array&lt;chunksize=(38, 600, 1440), meta=np.ndarray&gt;
Attributes: (12/22)
    Conventions:           CF-1.7
    activity:              NEX-GDDP-CMIP6
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    cmip6_source_id:       ACCESS-CM2
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    ...                    ...
    scenario:              historical
    source:                BCSD
    title:                 ACCESS-CM2, r1i1p1f1, historical, global downscale...
    tracking_id:           cefd5411-1f81-4f48-b9bf-8b38c3ecceb1
    variant_label:         r1i1p1f1
    version:               1.0
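Lazy reductions on the combined dataset work as you'd expect with dask-backed xarray. As a small illustrative sketch (an area-unweighted mean, so not a proper global average):
# Sketch: annual mean near-surface air temperature (K) over the full grid,
# one value per year of the combined 1950-1960 dataset
annual_tas = ds.tas.mean(["lat", "lon"]).groupby("time.year").mean()
annual_tas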
import matplotlib.pyplot as plt
from dask.diagnostics import ProgressBar

with ProgressBar():
    plt.figure(figsize=(30, 5))
    # Select one grid cell and mask the -99 fill values added by convert_calendar
    point = ds.isel(lat=22, lon=13)
    point.hurs.where(point.hurs > 0).plot()
    plt.ylabel("Relative Humidity")
    plt.xlabel("Date")
    plt.title(
        "Relative Humidity vs. Time @ ({},{}) for model {}".format(
            ds.lat.values[22], ds.lon.values[13], items[0].properties["cmip6:model"]
        )
    )
    plt.show()
[########################################] | 100% Completed | 40.93 s
Note: the approach described here is experimental and may change without warning.
In the previous section, we created a single xarray.Dataset from many NetCDF files (either many variables, or a timeseries for a single variable). Reading the metadata of a NetCDF / HDF5 file over the network is somewhat slow, making the open_mfdataset operation take about 20-30 seconds just to read the metadata.
So in addition to the NetCDF files, we provide a reference file which stores the positions of each variable in each NetCDF file's binary stream. This reference file can be opened with fsspec and zarr and used normally with xarray.
import requests
references = requests.get(collection.assets["ACCESS-CM2.historical"].href).json()
references = planetary_computer.sign(references)
This reference file contains links to the original NetCDF files, along with the positions and lengths of each variable in the files. As usual, we need to sign the URLs to include short-lived tokens so that we can read the data.
We can pass that set of references to fsspec's ReferenceFileSystem:
reference_filesystem = fsspec.filesystem("reference", fo=references)
And (quickly!) open the referenced files with xarray and Zarr.
%%time
ds = xr.open_dataset(
reference_filesystem.get_mapper("/"),
engine="zarr",
backend_kwargs={"consolidated": False},
chunks={},
)
ds
CPU times: user 726 ms, sys: 32.4 ms, total: 759 ms
Wall time: 773 ms
&lt;xarray.Dataset&gt;
Dimensions:  (time: 23741, lat: 600, lon: 1440)
Coordinates:
  * lat      (lat) float64 -59.88 -59.62 -59.38 -59.12 ... 89.38 89.62 89.88
  * lon      (lon) float64 0.125 0.375 0.625 0.875 ... 359.1 359.4 359.6 359.9
  * time     (time) datetime64[us] 1950-01-01T12:00:00 ... 2014-12-31T12:00:00
Data variables:
    hurs     (time, lat, lon) float32 dask.array&lt;chunksize=(1, 600, 1440), meta=np.ndarray&gt;
    huss     (time, lat, lon) float32 dask.array&lt;chunksize=(1, 600, 1440), meta=np.ndarray&gt;
    pr       (time, lat, lon) float32 dask.array&lt;chunksize=(1, 600, 1440), meta=np.ndarray&gt;
    rlds     (time, lat, lon) float32 dask.array&lt;chunksize=(1, 600, 1440), meta=np.ndarray&gt;
    rsds     (time, lat, lon) float32 dask.array&lt;chunksize=(1, 600, 1440), meta=np.ndarray&gt;
    sfcWind  (time, lat, lon) float32 dask.array&lt;chunksize=(1, 600, 1440), meta=np.ndarray&gt;
    tas      (time, lat, lon) float32 dask.array&lt;chunksize=(1, 600, 1440), meta=np.ndarray&gt;
    tasmax   (time, lat, lon) float32 dask.array&lt;chunksize=(1, 600, 1440), meta=np.ndarray&gt;
    tasmin   (time, lat, lon) float32 dask.array&lt;chunksize=(1, 600, 1440), meta=np.ndarray&gt;
Attributes: (12/22)
    Conventions:           CF-1.7
    activity:              NEX-GDDP-CMIP6
    cmip6_institution_id:  CSIRO-ARCCSS
    cmip6_license:         CC-BY-SA 4.0
    cmip6_source_id:       ACCESS-CM2
    contact:               Dr. Rama Nemani: rama.nemani@nasa.gov, Dr. Bridget...
    ...                    ...
    scenario:              historical
    source:                BCSD
    title:                 ACCESS-CM2, r1i1p1f1, historical, global downscale...
    tracking_id:           16d27564-470f-41ea-8077-f4cc3efa5bfe
    variant_label:         r1i1p1f1
    version:               1.0
This reference file system includes all the variables for all the years covered by the specific scenario.
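As an illustrative sketch (with an arbitrarily chosen location), you can lazily build a monthly climatology for the full 1950 - 2014 record at a single grid cell. Keep in mind that the data is chunked by day, so even a single-cell reduction touches many chunks:
# Sketch: monthly climatology of daily-max temperature at one grid cell
point = ds.tasmax.sel(lat=47.625, lon=237.375, method="nearest")
climatology = point.groupby("time.month").mean()
climatology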
For more on the NEX-GDDP-CMIP6 dataset, visit the dataset's homepage. If you're working with large subsets of this data, you might want to scale with Dask.
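A minimal sketch of that last suggestion, assuming you have a few local cores available (on the Planetary Computer Hub you could instead connect to a Dask Gateway cluster):
# Sketch: a local Dask cluster so the lazy reductions above run in parallel
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)
print(client.dashboard_link)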