NASA JPL PODAAC has put the entire MUR SST dataset on the AWS cloud, but as ~7000 individual netCDF files. Accessing one file works well, but accessing multiple files is very slow because the metadata for each file has to be queried separately. Here, we create fast access by consolidating the metadata so that the entire dataset can be read rapidly via Zarr. More background on this project: medium article and in this repo. We need help developing documentation and more test datasets. If you want to help, we are working in the Pangeo Gitter.
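The consolidated metadata file read below was generated with kerchunk (the fsspec reference maker). A minimal sketch of that kind of workflow, assuming the kerchunk package is installed (the bucket path, file list, output name, and MultiZarrToZarr arguments are illustrative; the exact recipe used for MUR may differ, and the MultiZarrToZarr signature varies between kerchunk versions):

import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem("s3")
# hypothetical file list; the real MUR collection has ~7000 daily files
urls = ["s3://" + p for p in fs.glob("s3://example-bucket/mur/2005*.nc")]

# 1) extract byte-range references from each netCDF (HDF5) file
singles = []
for u in urls:
    with fs.open(u) as f:
        singles.append(SingleHdf5ToZarr(f, u, inline_threshold=300).translate())

# 2) combine the per-file references along the time dimension
mzz = MultiZarrToZarr(
    singles,
    remote_protocol="s3",
    concat_dims=["time"],
    identical_dims=["lat", "lon"],
)

# 3) write one consolidated JSON that points at every chunk in every file
with open("mur_consolidated.json", "w") as f:
    json.dump(mzz.translate(), f)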
To run this code:
aws configure --profile esip-qhub
Authors:
Credit:
Accessing the entire dataset is substantially faster because we have already consolidated the metadata into a single file. Accessing one year of data is also substantially faster because our single metadata file points to exactly where the data are, rather than first reading the metadata for each day and then finding the data. Accessing one day takes roughly the same amount of time, because both approaches only read metadata for a single file. The sketch after the table below shows what these consolidated references look like.
Test | Kerchunk | Single netCDF | Improvement |
---|---|---|---|
Access entire dataset | 1:41 min | 50 min* | 30x |
Plot 1 year at point | 20 sec | 3:08 min | 12x |
Plot 1 day | 50 sec | 1:15 min | 1.5x |
*Extrapolated from 2:48 min for 1 year to 18 years of data
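Under the hood, the consolidated file is just a JSON mapping from Zarr chunk keys to byte ranges inside the original netCDF files, so a reader can fetch exactly the bytes it needs without opening each file first. Roughly like the following (the offsets, lengths, chunk shapes, and .zarray contents are illustrative, not actual values from the MUR JSON):

# illustrative shape of a kerchunk reference set; offsets/lengths are made up
references = {
    ".zgroup": '{"zarr_format": 2}',
    "analysed_sst/.zarray": '{"chunks": [1, 1023, 2047], "dtype": "<i2", "shape": [7065, 17999, 36000]}',
    # chunk key -> [url, byte offset, byte length] in one of the daily netCDF files
    "analysed_sst/0.0.0": [
        "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfgd-MUR-GLOB-v02.0-fv04.1.nc",
        27527,
        1245439,
    ],
    # the next day along the time axis points into a different file
    "analysed_sst/1.0.0": [
        "s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020602090000-JPL-L4_GHRSST-SSTfgd-MUR-GLOB-v02.0-fv04.1.nc",
        27431,
        1248112,
    ],
}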
This compares the consolidated-metadata access with two Zarr versions of MUR SST currently stored on AWS. Zarr-v1 was re-chunked for general use, balancing time and space access: it is faster for timeseries analysis because there are fewer chunks to read in the Zarr store than in the netCDF files, and slightly faster for plotting one day, likely because even though there are more chunks, Zarr offers faster access. The second Zarr store was re-chunked for timeseries analysis: it is also faster for timeseries analysis for the same reason, but substantially slower for imaging the globe on a single day, because with this chunking, accessing a single day requires reading the entire 16 TB dataset. The rough chunk-count sketch after the table below illustrates the tradeoff.
Test | Consolidated-metadata | Zarr-v1 | Zarr |
---|---|---|---|
Access entire dataset | 1:41 min | 3 sec | 2 sec |
Plot 1 year at point | 20 sec | 5 sec | 3 sec |
Plot 1 day | 50 sec | 40 sec | a long time |
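A rough chunk-count calculation, using the chunk shapes xarray reports for the two stores below ((5, 1799, 3600) for Zarr-v1 and (6443, 100, 100) for the timeseries-optimized Zarr), shows why one day's global map is so expensive with the timeseries chunking:

import math

def chunks_touched(chunk_sizes, sel_sizes):
    # number of chunks a selection touches: product over dimensions of
    # ceil(selection length / chunk length)
    return math.prod(math.ceil(s / c) for s, c in zip(sel_sizes, chunk_sizes))

v1_chunks = (5, 1799, 3600)    # Zarr-v1: balanced time/space chunking
ts_chunks = (6443, 100, 100)   # Zarr: each chunk spans the entire time axis

# global map for a single day (1 x 17999 x 36000 points)
print(chunks_touched(v1_chunks, (1, 17999, 36000)))  # 1 * 11 * 10 = 110 chunks
print(chunks_touched(ts_chunks, (1, 17999, 36000)))  # 180 * 360 = 64800 chunks

# one-year time series at a single point (365 x 1 x 1 points)
print(chunks_touched(v1_chunks, (365, 1, 1)))        # 73 chunks
print(chunks_touched(ts_chunks, (365, 1, 1)))        # 1 chunk

Because every timeseries-optimized chunk spans all 6443 days, reading a single global day effectively pulls in the whole 16 TB store, which matches the "a long time" entry in the table above.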
import os
import sys
from datetime import datetime
from urllib import request
import fsspec
import hvplot.xarray
import requests
import s3fs
import xarray as xr
from earthdata import Auth  # , DataCollections, DataGranules, Accessor
auth = Auth().login()
def begin_s3_direct_access():
    # Request temporary S3 credentials from the PO.DAAC Earthdata endpoint and
    # return an s3fs filesystem for direct in-region (us-west-2) access.
    url = "https://archive.podaac.earthdata.nasa.gov/s3credentials"
    response = requests.get(url).json()
    return s3fs.S3FileSystem(
        key=response["accessKeyId"],
        secret=response["secretAccessKey"],
        token=response["sessionToken"],
        client_kwargs={"region_name": "us-west-2"},
    )
url = "https://archive.podaac.earthdata.nasa.gov/s3credentials"
response = requests.get(url).json()
%%time
# consolidated kerchunk reference file for the full MUR v4.1 collection
json_consolidated = "s3://esip-qhub-public/nasa/mur/murv41_consolidated_20211011.json"
# options for reading the reference JSON itself (requester-pays bucket)
s_opts = {"requester_pays": True, "skip_instance_cache": True}
# credentials for reading the referenced netCDF bytes on the PO.DAAC S3 archive
r_opts = {
    "key": response["accessKeyId"],
    "secret": response["secretAccessKey"],
    "token": response["sessionToken"],
    "client_kwargs": {"region_name": "us-west-2"},
}
# open the consolidated references as a virtual filesystem ("reference" protocol)
fs = fsspec.filesystem(
    "reference",
    fo=json_consolidated,
    ref_storage_args=s_opts,
    remote_protocol="s3",
    remote_options=r_opts,
    simple_templates=True,
)
m = fs.get_mapper("")  # zarr-compatible mapping over the reference filesystem
ds = xr.open_dataset(m, decode_times=False, engine="zarr", consolidated=False)
ds.close()
ds
CPU times: user 1min 15s, sys: 5.55 s, total: 1min 20s Wall time: 1min 40s
<xarray.Dataset>
Dimensions:           (time: 7065, lat: 17999, lon: 36000)
Coordinates:
  * lat               (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99
  * lon               (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time              (time) datetime64[us] 2002-06-01T09:00:00 ... 2021-10-0...
Data variables:
    analysed_sst      (time, lat, lon) float32 ...
    analysis_error    (time, lat, lon) float32 ...
    mask              (time, lat, lon) float32 ...
    sea_ice_fraction  (time, lat, lon) float32 ...
Attributes: (12/47)
    Conventions:            CF-1.5
    Metadata_Conventions:   Unidata Observation Dataset v1.0
    acknowledgment:         Please acknowledge the use of these data with...
    cdm_data_type:          grid
    comment:                MUR = "Multi-scale Ultra-high Reolution"
    creator_email:          ghrsst@podaac.jpl.nasa.gov
    ...                     ...
    summary:                A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:      20020601T210000Z
    time_coverage_start:    20020531T210000Z
    title:                  Daily MUR SST, Final product
    uuid:                   27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:  -180.0
%%time
# test getting a random value
ds["analysed_sst"].sel(time="2005-12-20", lat=0, lon=0, method="nearest")
CPU times: user 8.49 ms, sys: 365 µs, total: 8.85 ms Wall time: 9.72 ms
<xarray.DataArray 'analysed_sst' (time: 1)>
array([301.13998], dtype=float32)
Coordinates:
    lat      float32 0.0
    lon      float32 0.0
  * time     (time) datetime64[us] 2005-12-20T09:00:00
Attributes:
    comment:        "Final" version using Multi-Resolution Variational Analys...
    long_name:      analysed sea surface temperature
    source:         AMSRE-REMSS, AVHRR_Pathfinder-PFV5.2-NODC_day, AVHRR_Path...
    standard_name:  sea_surface_foundation_temperature
    units:          kelvin
    valid_max:      32767
    valid_min:      -32767
%%time
# extract and plot a time series at a point
ts = ds["analysed_sst"].sel(
lat=0.01, lon=0.01, time=slice("2005-01-01T09", "2006-06-01T09")
)
ts.plot()
CPU times: user 13.5 s, sys: 894 ms, total: 14.4 s Wall time: 20.4 s
[<matplotlib.lines.Line2D at 0x7fa32a1cdfd0>]
%%time
now = datetime.now()
dy = ds["analysed_sst"].sel(time="2005-01-01", method="nearest")
dy.hvplot.quadmesh(x="lon", y="lat", geo=True, rasterize=True, cmap="turbo")
CPU times: user 3.96 s, sys: 485 ms, total: 4.45 s Wall time: 6.55 s
then = datetime.now()
print(then - now)
0:00:49.748207
%%time
from os.path import join
fs = begin_s3_direct_access()
files = fs.glob(
join("podaac-ops-cumulus-protected/", "MUR-JPL-L4-GLOB-v4.1", "2005*.nc")
)
ds2 = xr.open_mfdataset(
paths=[fs.open(f) for f in files],
combine="by_coords",
mask_and_scale=True,
decode_cf=True,
chunks={"time": 1}, # analysis.
)
ds2.close()
CPU times: user 38.4 s, sys: 3.59 s, total: 42 s Wall time: 2min 48s
%%time
# time series at a point; selecting the full year at once hits memory
# limits and times out, so break it up by month instead of:
# ts = ds2['analysed_sst'].sel(lat=0.01, lon=0.01, time=slice('2005-01-01T09','2006-01-01T09'))
tem2 = []
for imon in range(12):
    tem = (
        ds2["analysed_sst"]
        .sel(lat=0.01, lon=0.01, time="2005-" + str(imon + 1).zfill(2))
        .load()
    )
    tem2.append(tem)
    # print(imon + 1)
ts = xr.concat(tem2, dim="time")
CPU times: user 37.6 s, sys: 6.93 s, total: 44.5 s Wall time: 3min 8s
%%time
ts.plot()
CPU times: user 149 ms, sys: 13.7 ms, total: 162 ms Wall time: 542 ms
[<matplotlib.lines.Line2D at 0x7fb485359fa0>]
%%time
now = datetime.now()
dy = ds2["analysed_sst"].sel(time="2005-01-01")
dy.hvplot.quadmesh(x="lon", y="lat", geo=True, rasterize=True, cmap="turbo")
CPU times: user 3.18 s, sys: 306 ms, total: 3.49 s Wall time: 6.7 s
then = datetime.now()
print(then - now)
0:01:15.796176
import warnings
from datetime import datetime

import fsspec
import hvplot.xarray
import numpy as np
import pandas as pd
import xarray as xr
warnings.simplefilter("ignore") # filter some warning messages
xr.set_options(display_style="html") # display dataset nicely
<xarray.core.options.set_options at 0x7fb3ccb77370>
%%time
ds_sst = xr.open_zarr(
"https://mur-sst.s3.us-west-2.amazonaws.com/zarr-v1", consolidated=True
)
ds_sst
CPU times: user 2.01 s, sys: 86.3 ms, total: 2.1 s Wall time: 2.43 s
<xarray.Dataset>
Dimensions:           (time: 6443, lat: 17999, lon: 36000)
Coordinates:
  * lat               (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99
  * lon               (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time              (time) datetime64[ns] 2002-06-01T09:00:00 ... 2020-01-2...
Data variables:
    analysed_sst      (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
    analysis_error    (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
    mask              (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float32 dask.array<chunksize=(5, 1799, 3600), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7
    Metadata_Conventions:   Unidata Observation Dataset v1.0
    acknowledgment:         Please acknowledge the use of these data with...
    cdm_data_type:          grid
    comment:                MUR = "Multi-scale Ultra-high Resolution"
    creator_email:          ghrsst@podaac.jpl.nasa.gov
    ...                     ...
    summary:                A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:      20200116T210000Z
    time_coverage_start:    20200115T210000Z
    title:                  Daily MUR SST, Final product
    uuid:                   27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:  -180.0
%%time
ts = ds_sst["analysed_sst"].sel(
lat=0.01, lon=0.01, time=slice("2005-01-01T09", "2006-01-01T09")
)
ts.plot()
CPU times: user 2.22 s, sys: 2.3 s, total: 4.52 s Wall time: 5.17 s
[<matplotlib.lines.Line2D at 0x7fb3cc4a4880>]
%%time
now = datetime.now()
dy = ds_sst["analysed_sst"].sel(time="2005-01-01")
dy.hvplot.quadmesh(x="lon", y="lat", geo=True, rasterize=True, cmap="turbo")
CPU times: user 14.4 ms, sys: 310 µs, total: 14.7 ms Wall time: 14 ms
then = datetime.now()
print(then - now)
0:00:40.159789
%%time
ds_sst = xr.open_zarr(
"https://mur-sst.s3.us-west-2.amazonaws.com/zarr", consolidated=True
)
ds_sst
CPU times: user 1.22 s, sys: 65.3 ms, total: 1.28 s Wall time: 1.5 s
<xarray.Dataset>
Dimensions:           (time: 6443, lat: 17999, lon: 36000)
Coordinates:
  * lat               (lat) float32 -89.99 -89.98 -89.97 ... 89.97 89.98 89.99
  * lon               (lon) float32 -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
  * time              (time) datetime64[ns] 2002-06-01T09:00:00 ... 2020-01-2...
Data variables:
    analysed_sst      (time, lat, lon) float32 dask.array<chunksize=(6443, 100, 100), meta=np.ndarray>
    analysis_error    (time, lat, lon) float32 dask.array<chunksize=(6443, 100, 100), meta=np.ndarray>
    mask              (time, lat, lon) int8 dask.array<chunksize=(6443, 100, 100), meta=np.ndarray>
    sea_ice_fraction  (time, lat, lon) float32 dask.array<chunksize=(6443, 100, 100), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7
    Metadata_Conventions:   Unidata Observation Dataset v1.0
    acknowledgment:         Please acknowledge the use of these data with...
    cdm_data_type:          grid
    comment:                MUR = "Multi-scale Ultra-high Resolution"
    creator_email:          ghrsst@podaac.jpl.nasa.gov
    ...                     ...
    summary:                A merged, multi-sensor L4 Foundation SST anal...
    time_coverage_end:      20200116T210000Z
    time_coverage_start:    20200115T210000Z
    title:                  Daily MUR SST, Final product
    uuid:                   27665bc0-d5fc-11e1-9b23-0800200c9a66
    westernmost_longitude:  -180.0
%%time
ts = ds_sst["analysed_sst"].sel(
lat=0.01, lon=0.01, time=slice("2005-01-01T09", "2006-01-01T09")
)
ts.plot()
CPU times: user 316 ms, sys: 127 ms, total: 443 ms Wall time: 2.84 s
[<matplotlib.lines.Line2D at 0x7fb3cc82ec40>]
#%%time
# now = datetime.now()
# dy = ds_sst["analysed_sst"].sel(time="2005-01-01")
# dy.hvplot.quadmesh(x='lon', y='lat', geo=True, rasterize=True, cmap='turbo' )
# then = datetime.now()
# print(then-now)