#!/usr/bin/env python
# coding: utf-8

# ## Accessing NEX-GDDP-CMIP6 data with the Planetary Computer STAC API
#
# The [NEX-GDDP-CMIP6 dataset](https://planetarycomputer.microsoft.com/dataset/nasa-nex-gddp-cmip6) offers global downscaled climate scenarios derived from the General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 6 (CMIP6), across two of the four “Tier 1” greenhouse gas emissions scenarios known as Shared Socioeconomic Pathways (SSPs). The purpose of this dataset is to provide a set of global, high-resolution, bias-corrected climate change projections that can be used to evaluate climate change impacts on processes that are sensitive to finer-scale climate gradients and the effects of local topography on climate conditions.
#
# This dataset uses a Bias-Correction Spatial Disaggregation method to downscale the original General Circulation Model runs to the finer 0.25° resolution. See the [tech note](https://www.nccs.nasa.gov/sites/default/files/NEX-GDDP-CMIP6-Tech_Note.pdf) from the [product homepage](https://www.nccs.nasa.gov/services/data-collections/land-based-products/nex-gddp-cmip6) for more details.
#
# The NEX-GDDP-CMIP6 files are stored as NetCDF in Azure Blob Storage. Each STAC Item in this collection describes a single year for one scenario for one model.

# In[1]:

import planetary_computer
import xarray as xr
import fsspec
import pystac_client


# ### Data access
#
# The datasets hosted by the Planetary Computer are available from [Azure Blob Storage](https://docs.microsoft.com/en-us/azure/storage/blobs/). We'll use [pystac-client](https://pystac-client.readthedocs.io/) to search the Planetary Computer's [STAC API](https://planetarycomputer.microsoft.com/api/stac/v1/docs) for the subset of the data that we care about, and then we'll load the data directly from Azure Blob Storage. We'll specify a `modifier` so that we can access the data stored in the Planetary Computer's private Blob Storage Containers. See [Reading from the STAC API](https://planetarycomputer.microsoft.com/docs/quickstarts/reading-stac/) and [Using tokens for data access](https://planetarycomputer.microsoft.com/docs/concepts/sas/) for more.

# In[2]:

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)


# ### Understanding the metadata
#
# The STAC metadata on the Collection, items, and assets provides information on what data is available.

# In[3]:

collection = catalog.get_collection("nasa-nex-gddp-cmip6")


# As usual, the collection object contains information about the dataset, including its spatio-temporal extent, license, and so on. We also have information unique to CMIP6. The collection is organized by `{model}-{scenario}-{year}`: there is a single STAC item for each (valid) combination (data is not available for some; see Table 1 in the [tech note](https://www.nccs.nasa.gov/sites/default/files/NEX-GDDP-CMIP6-Tech_Note.pdf) for more). The valid values for each of these are stored in the collection's summaries:

# In[4]:

# List the models. There are ~30 in total.
collection.summaries.get_list("cmip6:model")[:5]


# In[5]:

# List the scenarios
collection.summaries.get_list("cmip6:scenario")


# The "historical" scenario covers the years 1950 - 2014 (inclusive). The "ssp245" and "ssp585" scenarios cover the years 2015 - 2100 (inclusive).
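# Because the valid models and scenarios are listed in the collection summaries, and the year
# ranges per scenario are fixed, a quick local check can rule out impossible combinations before
# issuing a search. The sketch below is illustrative only: the dictionary and helper function are
# not part of the collection metadata, and it cannot detect the model/scenario pairs that are
# genuinely missing (see Table 1 in the tech note).

# In[ ]:

models = collection.summaries.get_list("cmip6:model")
scenarios = collection.summaries.get_list("cmip6:scenario")

# Year ranges from the tech note: historical is 1950-2014, the SSPs are 2015-2100.
scenario_years = {
    "historical": range(1950, 2015),
    "ssp245": range(2015, 2101),
    "ssp585": range(2015, 2101),
}


def is_plausible(model, scenario, year):
    """Cheap local check before issuing a STAC search."""
    return (
        model in models
        and scenario in scenarios
        and year in scenario_years.get(scenario, ())
    )


is_plausible("ACCESS-CM2", "historical", 1950)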
#
# Each item includes a handful of assets, one per variable, where each asset is a single NetCDF file with the data for that variable for that model-scenario-year.

# In[6]:

# List the variables
collection.summaries.get_list("cmip6:variable")


# ### Querying the STAC API
#
# Each STAC item covers the same spatial region, so when using the STAC API you're likely filtering on some combination of time, model, and scenario. For example, we can get the STAC items for the "ACCESS-CM2" model for the years 1950 - 1960.

# In[7]:

search = catalog.search(
    collections=["nasa-nex-gddp-cmip6"],
    datetime="1950/1960",
    query={"cmip6:model": {"eq": "ACCESS-CM2"}},
)
items = search.item_collection()
len(items)


# Each of these items has nine assets, one per variable, which point to the NetCDF files in Azure Blob Storage:

# In[8]:

item = items[0]
list(item.assets)


# ### Loading data
#
# Once you have a STAC item or items, you can load the data directly from Blob Storage using xarray.

# In[9]:

hurs = xr.open_dataset(fsspec.open(item.assets["hurs"].href).open())
hurs


# Or you can use `xarray.open_mfdataset` to load all the variables for an item, which combines the variables into a single dataset.

# In[10]:

get_ipython().run_cell_magic('time', '', 'ds = xr.open_mfdataset(\n    [fsspec.open(asset.href).open() for asset in item.assets.values()]\n)\nds\n')


# *Note that opening all those variables is relatively slow. See [below](#Using-a-Reference-File) for an alternative.*
#
# We can plot all the variables for a single day with xarray, matplotlib, and cartopy.

# In[11]:

import warnings

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import pandas as pd

warnings.filterwarnings("ignore", message="__len__")
warnings.filterwarnings("ignore", message="Iteration")

fig, axes = plt.subplots(
    figsize=(16, 9),
    ncols=3,
    nrows=3,
    subplot_kw=dict(projection=ccrs.Robinson()),
    sharex=True,
    sharey=True,
)

day = ds.isel(time=0)

for i, (v, data) in enumerate(day.data_vars.items()):
    ax = axes.ravel()[i]
    data.plot(ax=ax, transform=ccrs.PlateCarree(), add_colorbar=False)
    ax.set(title=v)

fig.suptitle(pd.to_datetime(day.time.data).strftime("%Y-%m-%d"))
plt.tight_layout()


# ### Load timeseries in a cloud-optimized fashion with kerchunk indices
#
# The dataset includes an internal index for each file, stored as kerchunk metadata. These indices provide a large speed-up when loading (~100x) and processing (~5x) large datasets in Azure Storage. A single complete CMIP6 dataset is ~135 GB in size; by using the kerchunk indices, files do not need to be downloaded locally.
#
# Here is a sample of the metadata from the kerchunk indices:

# In[12]:

# Print every 100th reference to show how the index maps Zarr chunk keys to
# byte ranges within the NetCDF files.
refs = items[0].properties["kerchunk:indices"]["refs"]
for i, key in enumerate(refs):
    if i % 100 == 1:
        print(key, refs[key])
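# A single item's index is already enough to open that year's data without downloading the file.
# The following is a minimal sketch, assuming the `kerchunk:indices` property follows the fsspec
# reference spec used throughout this notebook; it works on a deep copy so the item properties
# consumed by the next cell are left untouched, and appends the SAS query string from a signed
# asset href so the underlying reads are authorized. The variable names are illustrative.

# In[ ]:

import copy

single_refs = copy.deepcopy(items[0].properties["kerchunk:indices"])
sas_token = items[0].assets["pr"].href.split("?")[1]
for key in single_refs["templates"]:
    single_refs["templates"][key] = single_refs["templates"][key] + "?" + sas_token

single_mapper = fsspec.get_mapper("reference://", fo=single_refs)
ds_single = xr.open_dataset(
    single_mapper, engine="zarr", consolidated=False, chunks="auto"
)
ds_single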
# In[13]:

get_ipython().run_cell_magic('time', '', 'from kerchunk.combine import MultiZarrToZarr\n\nsingle_ref_sets = []\nsas_token = items[0].assets["pr"].href.split("?")[1]\nfor d in [item.properties["kerchunk:indices"] for item in items]:\n    for key in d["templates"]:\n        d["templates"][key] = d["templates"][key] + "?" + sas_token\n    single_ref_sets.append(d)\nmzz = MultiZarrToZarr(\n    single_ref_sets, concat_dims=["time"], identical_dims=["lat", "lon"]\n)\nd = mzz.translate()\nm = fsspec.get_mapper("reference://", fo=d)\nm.fs.clear_instance_cache()\nds = xr.open_dataset(\n    m, engine="zarr", consolidated=False, decode_times=True, chunks="auto"\n)\nds = ds.convert_calendar(calendar="gregorian", align_on="date", missing=-99)\nds\n')


# In[14]:

import matplotlib.pyplot as plt
from dask.diagnostics import ProgressBar

with ProgressBar():
    plt.figure(figsize=(30, 5))
    ds.isel(lat=22, lon=13).where(ds.isel(lat=22, lon=13).hurs > 0).hurs.plot()
    plt.ylabel("Relative Humidity")
    plt.xlabel("Date")
    plt.title(
        "Relative Humidity vs. Time @ ({},{}) for model {}".format(
            ds.lat.values[22], ds.lon.values[13], items[0].properties["cmip6:model"]
        )
    )
    plt.show()


# #### Using a Reference File
#
# *Note: the approach described here is experimental and may change without warning.*
#
# In the previous section, we created a single `xarray.Dataset` from many NetCDF files (either many variables, or a timeseries for a single variable). Reading the metadata of a NetCDF / HDF5 file over the network is somewhat slow, making the `open_mfdataset` operation take about 20-30 seconds, *just to read the metadata*.
#
# So in addition to the NetCDF files, we provide a **reference file** which stores the positions of each variable in each NetCDF file's binary stream. This reference file can be opened with `fsspec` and `zarr` and used normally with xarray.

# In[15]:

import requests

references = requests.get(collection.assets["ACCESS-CM2.historical"].href).json()
references = planetary_computer.sign(references)


# This reference file contains links to the original NetCDF files, along with the positions and lengths of each variable in the files. As usual, we need to sign the URLs to include short-lived tokens so that we can read the data.
#
# We can pass that set of references to fsspec's `ReferenceFileSystem`:

# In[16]:

reference_filesystem = fsspec.filesystem("reference", fo=references)


# And (quickly!) open the referenced files with xarray and Zarr.

# In[17]:

get_ipython().run_cell_magic('time', '', 'ds = xr.open_dataset(\n    reference_filesystem.get_mapper("/"),\n    engine="zarr",\n    backend_kwargs={"consolidated": False},\n    chunks={},\n)\nds\n')


# This reference file system includes all the variables for all the years covered by the specific scenario.

# ### Next Steps
#
# For more on the NEX-GDDP-CMIP6 dataset, visit the [dataset's homepage](https://www.nccs.nasa.gov/services/data-collections/land-based-products/nex-gddp-cmip6). If you're working with large subsets of this data, you might want to [scale with Dask](https://planetarycomputer.microsoft.com/docs/quickstarts/scale-with-dask/).
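# As a starting point for working with larger subsets, the sketch below builds on the
# reference-file dataset `ds` opened above: it lazily selects near-surface air temperature
# (`tas`, one of the variables listed in the collection's `cmip6:variable` summary) at a single
# grid cell and computes its annual mean, so only the chunks covering that point are read. The
# coordinates are arbitrary example values, not a recommended location.

# In[ ]:

# Select the nearest grid cell to an (illustrative) point and reduce lazily;
# calling .compute() triggers the reads, which a Dask cluster could also execute.
tas_point = ds["tas"].sel(lat=47.5, lon=238.0, method="nearest")
annual_mean = tas_point.groupby("time.year").mean()
annual_mean.compute()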