Hi all,
I'm looking for guidance / best practices on writing an xarray object to (a collection of) COGs. Let's start with a common case of a DataArray that's indexed by (time, band, y, x)
. Let's also assume that it's a chunked DataArray, with a chunksize of 1 for time
and band
, and it might be chunked along y
and x
as well.
My high-level questions:
.rio.to_raster(path, driver="COG")
have the right defaults? Anything special we should do to make sure we write "good" COGs for a single chunk?I'm particularly interested in item 2. My proposed naming convention is
<prefix>/time=<time>/band=<band>-y=<y-coord>-x=<x-coord>.tif
This works well for xarray: we have coordinate information available when writing the chunk, so we can safely generate a unique name for a chunk using the (time, band, y, x)
coordinates of, say, the top-left value in the chunk.
Here's a small example:
!pip install -q -U --no-deps git+https://github.com/TomAugspurger/xcog
We'll mock up some data that has the right structure for pystac / rioxarray to do their thing.
import xarray as xr
import numpy as np
import dask.array as da
import stackstac
import rioxarray
import pystac
import pandas as pd
values = da.random.uniform(size=(2, 3, 10980, 10980), chunks=(1, 1, 5490, 5490))
x = np.arange(399960, 509751, step=10.)
y = np.arange(4800000, 4690210 - 1, step=-10.)
band = np.array(["B02", "B03", "B04"])
time = pd.to_datetime(["2021-01-01T17:07:19.024000000", "2021-01-04T17:17:19.024000000"])
data = xr.DataArray(
values,
dims=("time", "band", "y", "x"),
coords={
"time": xr.DataArray(time, name="time", dims="time"),
"band": xr.DataArray(band, name="band", dims="band"),
"y": xr.DataArray(y, name="y", dims="y"),
"x": xr.DataArray(x, name="x", dims="x"),
"common_name": xr.DataArray(['blue', 'green', 'red'], dims="band", name="common_name"),
"center_wavelength":xr.DataArray([0.49 , 0.56 , 0.665], dims="band", name="center_wavelength"),
"full_width_half_max": xr.DataArray([0.098, 0.045, 0.038], dims="band", name="full_width_half_max"),
},
attrs={
"crs": "epsg:32615",
},
)
data
<xarray.DataArray 'uniform-c58b35bf81df37c21a92878e04d325df' (time: 2, band: 3, y: 10980, x: 10980)> dask.array<uniform, shape=(2, 3, 10980, 10980), dtype=float64, chunksize=(1, 1, 5490, 5490), chunktype=numpy.ndarray> Coordinates: * time (time) datetime64[ns] 2021-01-01T17:07:19.024000 202... * band (band) <U3 'B02' 'B03' 'B04' * y (y) float64 4.8e+06 4.8e+06 ... 4.69e+06 4.69e+06 * x (x) float64 4e+05 4e+05 4e+05 ... 5.097e+05 5.098e+05 common_name (band) <U5 'blue' 'green' 'red' center_wavelength (band) float64 0.49 0.56 0.665 full_width_half_max (band) float64 0.098 0.045 0.038 Attributes: crs: epsg:32615
|
array(['2021-01-01T17:07:19.024000000', '2021-01-04T17:17:19.024000000'], dtype='datetime64[ns]')
array(['B02', 'B03', 'B04'], dtype='<U3')
array([4800000., 4799990., 4799980., ..., 4690230., 4690220., 4690210.])
array([399960., 399970., 399980., ..., 509730., 509740., 509750.])
array(['blue', 'green', 'red'], dtype='<U5')
array([0.49 , 0.56 , 0.665])
array([0.098, 0.045, 0.038])
We're using xcog
here, a simple little library with some utilities for writing out chunks. We'll write to local disk, but we should be able to use any fsspec-compatible file-system.
from pathlib import Path
import xcog
dst = Path("/tmp/cogs/")
dst.mkdir(parents=True, exist_ok=True)
template = xcog.make_template(data)
r = data.map_blocks(
xcog.write_block,
kwargs=dict(
prefix=str(dst),
storage_options=dict(auto_mkdir=True),
),
template=template
)
r
<xarray.DataArray 'uniform-c58b35bf81df37c21a92878e04d325df' (time: 2, band: 3, y: 2, x: 2)> dask.array<<this-array>-write_block, shape=(2, 3, 2, 2), dtype=object, chunksize=(1, 1, 1, 1), chunktype=numpy.ndarray> Coordinates: * time (time) datetime64[ns] 2021-01-01T17:07:19.024000 202... * band (band) object 'B02' 'B03' 'B04' * y (y) float64 4.8e+06 4.745e+06 * x (x) float64 4e+05 4.549e+05 common_name (band) <U5 dask.array<chunksize=(1,), meta=np.ndarray> center_wavelength (band) float64 dask.array<chunksize=(1,), meta=np.ndarray> full_width_half_max (band) float64 dask.array<chunksize=(1,), meta=np.ndarray> Attributes: crs: epsg:32615
|
array(['2021-01-01T17:07:19.024000000', '2021-01-04T17:17:19.024000000'], dtype='datetime64[ns]')
array(['B02', 'B03', 'B04'], dtype=object)
array([4800000., 4745100.])
array([399960., 454860.])
|
|
|
%time result = r.compute()
ERROR 4: `/vsimem/0333842e-13d1-4d4c-bae8-91078f97cfea/0333842e-13d1-4d4c-bae8-91078f97cfea.tif' not recognized as a supported file format. ERROR 4: `/vsimem/3277a413-6d34-4fa4-bd6b-ffe4f84e24a7/3277a413-6d34-4fa4-bd6b-ffe4f84e24a7.tif' not recognized as a supported file format. ERROR 4: `/vsimem/25cbb39e-c42d-4371-8ec3-87635548454e/25cbb39e-c42d-4371-8ec3-87635548454e.tif' not recognized as a supported file format. ERROR 4: `/vsimem/58437de5-77dd-4328-b7c3-a40fcc9f5473/58437de5-77dd-4328-b7c3-a40fcc9f5473.tif' not recognized as a supported file format. ERROR 4: `/vsimem/97ca0b27-49a3-4408-a59f-cf5d84cb6514/97ca0b27-49a3-4408-a59f-cf5d84cb6514.tif' not recognized as a supported file format. ERROR 4: `/vsimem/6746481a-aa5e-4a7b-ba60-9188f8c0cb3d/6746481a-aa5e-4a7b-ba60-9188f8c0cb3d.tif' not recognized as a supported file format. ERROR 4: `/vsimem/a5fc1e2f-05da-40e1-b426-d6ca0cdfd1a2/a5fc1e2f-05da-40e1-b426-d6ca0cdfd1a2.tif' not recognized as a supported file format. ERROR 4: `/vsimem/5b58cf03-9dbf-481d-b906-394c0c6a35a0/5b58cf03-9dbf-481d-b906-394c0c6a35a0.tif' not recognized as a supported file format. ERROR 4: `/vsimem/1ef50cf2-16e8-4b8a-a996-79702c169547/1ef50cf2-16e8-4b8a-a996-79702c169547.tif' not recognized as a supported file format. ERROR 4: `/vsimem/939a1672-6ccf-45e0-9fad-f533e8e49cd2/939a1672-6ccf-45e0-9fad-f533e8e49cd2.tif' not recognized as a supported file format. ERROR 4: `/vsimem/735c91ac-2ce6-470d-9989-4872b1c407b3/735c91ac-2ce6-470d-9989-4872b1c407b3.tif' not recognized as a supported file format. ERROR 4: `/vsimem/321482dd-8af7-48a1-ab3d-26ba6a2b3a26/321482dd-8af7-48a1-ab3d-26ba6a2b3a26.tif' not recognized as a supported file format. ERROR 4: `/vsimem/fafb6251-03f2-4515-88e4-5c51e6ea2924/fafb6251-03f2-4515-88e4-5c51e6ea2924.tif' not recognized as a supported file format. ERROR 4: `/vsimem/d944aa38-3947-4503-ae1a-5e384132078f/d944aa38-3947-4503-ae1a-5e384132078f.tif' not recognized as a supported file format. ERROR 4: `/vsimem/544de44e-5574-48f6-afcb-18b2bd8d5e75/544de44e-5574-48f6-afcb-18b2bd8d5e75.tif' not recognized as a supported file format. ERROR 4: `/vsimem/ef7dc61a-97e9-4258-9f77-77061824677a/ef7dc61a-97e9-4258-9f77-77061824677a.tif' not recognized as a supported file format. ERROR 4: `/vsimem/f2b2489a-5382-4284-93ae-f84bffeb327b/f2b2489a-5382-4284-93ae-f84bffeb327b.tif' not recognized as a supported file format. ERROR 4: `/vsimem/f131c431-965e-4b46-8ce7-62797c293e3c/f131c431-965e-4b46-8ce7-62797c293e3c.tif' not recognized as a supported file format. ERROR 4: `/vsimem/c14c0381-be15-4955-9ad7-94294fe883e6/c14c0381-be15-4955-9ad7-94294fe883e6.tif' not recognized as a supported file format. ERROR 4: `/vsimem/472be3d2-fc92-450c-ad9f-5b87b07b5c17/472be3d2-fc92-450c-ad9f-5b87b07b5c17.tif' not recognized as a supported file format. ERROR 4: `/vsimem/17fd26c8-a442-4be3-ae04-ef8a8b443f13/17fd26c8-a442-4be3-ae04-ef8a8b443f13.tif' not recognized as a supported file format. ERROR 4: `/vsimem/417c54a8-39e1-4faa-9c9a-b4ae598ec1fa/417c54a8-39e1-4faa-9c9a-b4ae598ec1fa.tif' not recognized as a supported file format. ERROR 4: `/vsimem/603a2f00-8dd8-45aa-8f5f-31d3640b7821/603a2f00-8dd8-45aa-8f5f-31d3640b7821.tif' not recognized as a supported file format. ERROR 4: `/vsimem/381bd740-f95f-4f20-88aa-bbe7a660e5e7/381bd740-f95f-4f20-88aa-bbe7a660e5e7.tif' not recognized as a supported file format.
CPU times: user 1min 48s, sys: 28.5 s, total: 2min 16s Wall time: 2min 11s
Here's paths of the COGs we wrote out:
!tree /tmp/cogs/
/tmp/cogs/ ├── time=2021-01-01T17:07:19.024000 │ ├── band=B02-y=4745100.0-x=399960.0.tif │ ├── band=B02-y=4745100.0-x=454860.0.tif │ ├── band=B02-y=4800000.0-x=399960.0.tif │ ├── band=B02-y=4800000.0-x=454860.0.tif │ ├── band=B03-y=4745100.0-x=399960.0.tif │ ├── band=B03-y=4745100.0-x=454860.0.tif │ ├── band=B03-y=4800000.0-x=399960.0.tif │ ├── band=B03-y=4800000.0-x=454860.0.tif │ ├── band=B04-y=4745100.0-x=399960.0.tif │ ├── band=B04-y=4745100.0-x=454860.0.tif │ ├── band=B04-y=4800000.0-x=399960.0.tif │ └── band=B04-y=4800000.0-x=454860.0.tif └── time=2021-01-04T17:17:19.024000 ├── band=B02-y=4745100.0-x=399960.0.tif ├── band=B02-y=4745100.0-x=454860.0.tif ├── band=B02-y=4800000.0-x=399960.0.tif ├── band=B02-y=4800000.0-x=454860.0.tif ├── band=B03-y=4745100.0-x=399960.0.tif ├── band=B03-y=4745100.0-x=454860.0.tif ├── band=B03-y=4800000.0-x=399960.0.tif ├── band=B03-y=4800000.0-x=454860.0.tif ├── band=B04-y=4745100.0-x=399960.0.tif ├── band=B04-y=4745100.0-x=454860.0.tif ├── band=B04-y=4800000.0-x=399960.0.tif └── band=B04-y=4800000.0-x=454860.0.tif 2 directories, 24 files
Our result
DataArray is a bunch of STAC items (one per original chunk).
result[0, 0, 0, 0].item()
<Item id=time=2021-01-01T17:07:19.024000/y=4800000.0-x=399960.0.tif>
We can group those together (all the assets with the same ID are merged into a single item)
new_items = xcog.collate(result)
new_items[:5]
[<Item id=time=2021-01-01T17:07:19.024000/y=4745100.0-x=399960.0.tif>, <Item id=time=2021-01-01T17:07:19.024000/y=4745100.0-x=454860.0.tif>, <Item id=time=2021-01-01T17:07:19.024000/y=4800000.0-x=399960.0.tif>, <Item id=time=2021-01-01T17:07:19.024000/y=4800000.0-x=454860.0.tif>, <Item id=time=2021-01-04T17:17:19.024000/y=4745100.0-x=399960.0.tif>]
And those can be fed back to stackstac, so kind of a round-trip from DataArray -> {STAC + COG} -> DataArray
stackstac.stack([x.to_dict() for x in new_items], chunksize=5490).groupby("time").apply(stackstac.mosaic)
<xarray.DataArray 'stackstac-0b7bcb1640ad2a27ee326c47a1ebb2d5' (time: 2, band: 3, y: 10981, x: 10981)> dask.array<concatenate, shape=(2, 3, 10981, 10981), dtype=float64, chunksize=(1, 1, 5490, 5490), chunktype=numpy.ndarray> Coordinates: * band (band) <U3 'B02' 'B03' 'B04' * x (x) float64 4e+05 4e+05 4e+05 ... 5.097e+05 5.097e+05 5.098e+05 * y (y) float64 4.8e+06 4.8e+06 4.8e+06 ... 4.69e+06 4.69e+06 proj:epsg int64 32615 epsg int64 32615 * time (time) datetime64[ns] 2021-01-01T17:07:19.024000 2021-01-04T17... Attributes: spec: RasterSpec(epsg=32615, bounds=(399950.0, 4690200.0, 509760.0... crs: epsg:32615 transform: | 10.00, 0.00, 399950.00|\n| 0.00,-10.00, 4800010.00|\n| 0.0... resolution: 10.0
|
array(['B02', 'B03', 'B04'], dtype='<U3')
array([399950., 399960., 399970., ..., 509730., 509740., 509750.])
array([4800010., 4800000., 4799990., ..., 4690230., 4690220., 4690210.])
array(32615)
array(32615)
array(['2021-01-01T17:07:19.024000000', '2021-01-04T17:17:19.024000000'], dtype='datetime64[ns]')
(We'll ignore the fact that we're off by one on the size).