kerchunk

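The original NetCDF file here (ROMS model output) has float32 variables with _FillValue set to 1e37.

Reading the NetCDF file with Xarray (with decode_cf=True) correctly sets these values to NaN.
Reading the corresponding kerchunk JSON with Xarray does not. The Zarr store written by Xarray does decode to NaN, but its stored fill_value is 9.999999933815813e+36 rather than 1e37.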

In [1]:

import fsspec
import xarray as xr

Read NetCDF file directly with Xarray

_FillValue is correctly converted to NaN.

In [2]:

fs = fsspec.filesystem('s3', anon=True, client_kwargs={'endpoint_url': 'https://mghp.osn.xsede.org'})
url = 's3://rsignellbucket1/COAWST/coawst_us_20220101_01.nc'

In [3]:

ds  = xr.open_dataset(fs.open(url), decode_cf=True)

In [4]:

ds.temp[0,0,0,0].values

Out[4]:

array(nan)

In [5]:

ds  = xr.open_dataset(fs.open(url), decode_cf=False)

In [6]:

ds.temp[0,0,0,0].values

Out[6]:

array(1.e+37, dtype=float32)

In [7]:

ds.temp._FillValue

Out[7]:

1e+37

Write to Zarr, then read the resulting Zarr with Xarray

The user sees NaN in Xarray, but fill_value (the key Zarr uses to store the fill value in .zarray) is 9.999999933815813e+36 instead of 1e+37.

In [8]:

%%time
ds[['temp','salt']].isel(ocean_time=slice(0,2)).to_zarr('foo.zarr', compute=True, mode='w')

CPU times: user 736 ms, sys: 136 ms, total: 871 ms
Wall time: 9.74 s

Out[8]:

<xarray.backends.zarr.ZarrStore at 0x7fc955646200>

In [9]:

ds2 = xr.open_dataset('foo.zarr', engine='zarr', decode_cf=False)

In [10]:

ds2.temp._FillValue

Out[10]:

1e+37

In [11]:

ds2.temp[0,0,0,0].values

Out[11]:

array(1.e+37, dtype=float32)

In [12]:

ds2 = xr.open_dataset('foo.zarr', engine='zarr', decode_cf=True)
ds2.temp[0,0,0,0].values

Out[12]:

array(nan)

In [13]:

! cat ./foo.zarr/temp/.zarray

{
    "chunks": [
        1,
        4,
        84,
        448
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "<f4",
    "fill_value": 9.999999933815813e+36,
    "filters": null,
    "order": "C",
    "shape": [
        2,
        16,
        336,
        896
    ],
    "zarr_format": 2
}
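The fill_value of 9.999999933815813e+36 in .zarray looks different from 1e37, but it may simply be the float32 value nearest to 1e37 written out at float64 precision. The following is a minimal sketch of that idea using only the standard library; the helper name as_float32 is hypothetical and is not part of Zarr or Xarray.

```python
import struct


def as_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32, returned as a float.

    Hypothetical helper for illustration; it mimics storing x in a "<f4" array.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]


# 1e37 as it would be held in a float32 variable
stored = as_float32(1e37)

print(repr(stored))            # shortest float64 repr of the float32 value
print(format(stored, ".60g"))  # exact decimal expansion of that value
print(stored == 1e37)          # comparison at float64 precision
```

If the printed repr matches the fill_value shown in .zarray, the two numbers are the same float32 and differ only in how many digits are written.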

Read Kerchunk JSON representation of the above NetCDF file

Here the user doesn't get NaN values in the masked regions, but a value close to (but not exactly) 1e37.

In [14]:

json_url = 's3://rsignellbucket1/COAWST/jsons/coawst_us_20220101_01.nc.json'

Try with decode_cf=True:

In [15]:

s_opts = dict(skip_instance_cache=True, anon=True, client_kwargs={'endpoint_url': 'https://mghp.osn.xsede.org'})  # options for reading the reference JSON
r_opts = dict(anon=True, client_kwargs={'endpoint_url': 'https://mghp.osn.xsede.org'})  # options for reading the referenced data

fs = fsspec.filesystem("reference", fo=json_url, ref_storage_args=s_opts,
                       remote_protocol='s3', remote_options=r_opts)
m = fs.get_mapper("")

ds = xr.open_dataset(m, engine="zarr", chunks={}, 
                     backend_kwargs=dict(consolidated=False), decode_cf=True)

In [16]:

ds.temp[0,0,0,0].values

Out[16]:

array(9.99999993e+36)

Print with full precision:

In [17]:

format(ds.temp[0,0,0,0].values, '.60g') 

Out[17]:

'9999999933815812510711506376257961984'

So did these come in as non-NaN values because they differ from the fill_value of 9.999999933815813e+36?
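One way to probe this question is to compare the exact decimal value printed in Out[17] with the exact float64 value of the fill value literal. This is a sketch of the arithmetic only, under the assumption that both numbers end up as float64; it says nothing about how Xarray performs the comparison internally. The variable names are illustrative.

```python
from decimal import Decimal

# Exact digits of the decoded data value, copied from Out[17]
decoded_digits = "9999999933815812510711506376257961984"

# Fill value literal as it appears in the stored metadata
attr_fill = 9.999999933815813e+36

# Decimal(float) gives the exact value of the float64, with no rounding
exact_attr = Decimal(attr_fill)

same_as_decoded = exact_attr == Decimal(decoded_digits)
same_as_1e37 = exact_attr == Decimal(1e37)

print("fill value literal equals decoded value:", same_as_decoded)
print("fill value literal equals float64 1e37: ", same_as_1e37)
```

If the first comparison is true, the decoded value and the fill value literal are the same float64, which would suggest looking at something other than a plain numeric difference, for example the dtype at which the comparison is made.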

Try with decode_cf=False:

In [18]:

ds = xr.open_dataset(m, engine="zarr", chunks={}, 
                     backend_kwargs=dict(consolidated=False), decode_cf=False)

In [19]:

ds.temp[0,0,0,0].values

Out[19]:

array(1.e+37, dtype=float32)

The kerchunk-generated JSON of course reflects what the Zarr file has:

In [20]:

fs.download('temp/.zattrs', 'foo')

In [21]:

!more foo

{
    "_ARRAY_DIMENSIONS": [
        "ocean_time",
        "s_rho",
        "eta_rho",
        "xi_rho"
    ],
    "_FillValue": 9.999999933815813e+36,
    "coordinates": "lon_rho lat_rho s_rho ocean_time",
    "field": "temperature, scalar, series",
    "grid": "grid",
    "location": "face",
    "long_name": "potential temperature",
    "time": "ocean_time",
    "units": "Celsius"
}
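The same attribute can also be inspected without downloading it to a file, by parsing the reference JSON directly. The sketch below assumes the kerchunk version 1 reference layout, in which metadata entries such as temp/.zattrs are stored as JSON strings under a top-level "refs" key. The reference dict is a hand-written stand-in, not the real file, and read_zattrs is a hypothetical helper.

```python
import json

# Hand-written stand-in for a kerchunk reference file (assumed version 1 layout).
# The attribute values are copied from the .zattrs output above.
reference = {
    "version": 1,
    "refs": {
        "temp/.zattrs": json.dumps(
            {"_FillValue": 9.999999933815813e+36, "units": "Celsius"}
        ),
    },
}


def read_zattrs(reference: dict, var: str) -> dict:
    """Return the decoded .zattrs metadata for one variable (hypothetical helper)."""
    raw = reference["refs"][f"{var}/.zattrs"]
    # Metadata is assumed to be stored as a JSON string; accept a dict as well.
    return json.loads(raw) if isinstance(raw, str) else raw


attrs = read_zattrs(reference, "temp")
print(attrs["_FillValue"])
```

For the real file, the reference dict would come from loading the JSON at json_url instead of the stand-in above.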