In addition to its STAC API, the Planetary Computer also provides access to STAC items as geoparquet datasets. These parquet datasets can be used for "bulk" workloads, where a search might return a very large number of items or would otherwise require many separate queries to get the desired result. In general, these parquet datasets are produced with a lag relative to what's available through the STAC API, so most use cases, including those that need recently added assets, should use our STAC API.
This example shows how to load STAC items from a Parquet dataset into a geopandas GeoDataFrame. A similar workflow would be possible with R's geoarrow package, or any other library that can read GeoParquet.
import dask.dataframe as dd
import geopandas
import planetary_computer
import pystac_client
import pandas as pd
pd.options.display.max_columns = 8
Each STAC collection providing a geoparquet dataset has a collection-level asset under the `geoparquet-items` key.
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1/",
modifier=planetary_computer.sign_inplace,
)
asset = catalog.get_collection("io-lulc-9-class").assets["geoparquet-items"]
df = geopandas.read_parquet(
asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
df.head()
|  | type | stac_version | stac_extensions | id | ... | end_datetime | proj:transform | start_datetime | io:supercell_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 12Q-2017 | ... | 2018-01-01 00:00:00+00:00 | [10.0, 0.0, 178910.0, 0.0, -10.0, 2657470.0] | 2017-01-01 00:00:00+00:00 | 12Q |
| 1 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 15R-2017 | ... | 2018-01-01 00:00:00+00:00 | [10.0, 0.0, 194773.70566898846, 0.0, -10.0, 35... | 2017-01-01 00:00:00+00:00 | 15R |
| 2 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 16M-2017 | ... | 2018-01-01 00:00:00+00:00 | [10.0, 0.0, 166023.6435927535, 0.0, -10.0, 999... | 2017-01-01 00:00:00+00:00 | 16M |
| 3 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 20L-2022 | ... | 2023-01-01 00:00:00+00:00 | [10.0, 0.0, 169256.89710350422, 0.0, -10.0, 91... | 2022-01-01 00:00:00+00:00 | 20L |
| 4 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 20M-2019 | ... | 2020-01-01 00:00:00+00:00 | [10.0, 0.0, 166023.6435927521, 0.0, -10.0, 999... | 2019-01-01 00:00:00+00:00 | 20M |

5 rows × 18 columns
Now we can do things like look at the count of each `proj:epsg` code.
df["proj:epsg"].value_counts()
proj:epsg
32616    60
32638    60
32650    60
32648    60
32617    60
         ..
32714    12
32710    11
32708    10
32711     6
32727     6
Name: count, Length: 120, dtype: int64
Or filter the items to a specific code and plot the footprints.
subset = df.loc[df["proj:epsg"] == 32651, ["io:tile_id", "geometry"]]
import contextily
ax = subset.plot(figsize=(4, 20), color="none", edgecolor="yellow")
contextily.add_basemap(
ax, crs=df.crs.to_string(), source=contextily.providers.Esri.NatGeoWorldMap
)
ax.set_axis_off()
Each parquet dataset has a unique schema, reflecting the unique properties captured in each collection. But there are some general patterns:

- The top-level STAC item fields are present as columns (`type`, `stac_version`, `stac_extensions`, `id`, `geometry`, `bbox`, `links`, `assets`, and `collection`).
- Fields from `properties` are lifted to the top level, including datetime-related fields like `datetime`, `start_datetime`, and `end_datetime`, common metadata (e.g. `platform`), and extension fields (e.g. `proj:bbox`, ...).
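To make that split concrete, you can partition the columns yourself. A minimal sketch (the first group of names is fixed by the STAC spec; whatever remains was lifted from `properties`):

stac_fields = {
    "type", "stac_version", "stac_extensions", "id",
    "geometry", "bbox", "links", "assets", "collection",
}
# Everything that isn't a top-level STAC field came from `properties`.
property_columns = [c for c in df.columns if c not in stac_fields]
print(property_columns)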
Depending on the number of STAC items in the collection and whether or not new items are being added, the Parquet dataset may be split into multiple files by time.
For example, the `io-lulc-9-class` collection is not partitioned and has just a single file:
import adlfs
fs = adlfs.AzureBlobFileSystem(**asset.extra_fields["table:storage_options"])
fs.ls("items/io-lulc-9-class.parquet") # Not partitioned, single result
['items/io-lulc-9-class.parquet']
Compare that to `sentinel-2-l2a`, which is partitioned by week.
fs.ls("items/sentinel-2-l2a.parquet")[:5]
['items/sentinel-2-l2a.parquet/part-0001_2015-06-29T10:25:31+00:00_2015-07-06T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0002_2015-07-06T10:25:31+00:00_2015-07-13T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0003_2015-07-13T10:25:31+00:00_2015-07-20T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0004_2015-07-20T10:25:31+00:00_2015-07-27T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0005_2015-07-27T10:25:31+00:00_2015-08-03T10:25:31+00:00.parquet']
To work with a partitioned dataset, you can use a library like dask or dask-geopandas.
asset = catalog.get_collection("sentinel-2-l2a").assets["geoparquet-items"]
s2l2a = dd.read_parquet(
asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
s2l2a.head()
|  | type | stac_version | stac_extensions | id | ... | s2:high_proba_clouds_percentage | s2:reflectance_conversion_factor | s2:medium_proba_clouds_percentage | s2:saturated_defective_pixel_percentage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T35XQA_2021041... | ... | 92.546540 | 0.967449 | 4.807670 | 0.0 |
| 1 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T32TMM_2021041... | ... | 0.048035 | 0.967449 | 0.051376 | 0.0 |
| 2 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T32TMN_2021041... | ... | 0.011238 | 0.967449 | 0.022928 | 0.0 |
| 3 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T36WWC_2021041... | ... | 65.812266 | 0.967449 | 19.050561 | 0.0 |
| 4 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T36WWD_2021041... | ... | 97.629422 | 0.967449 | 1.861097 | 0.0 |

5 rows × 42 columns
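If you want the geometries as real geometry objects rather than raw bytes, dask-geopandas offers an analogous reader. A sketch, assuming dask-geopandas is installed; the call mirrors dd.read_parquet:

import dask_geopandas

# Same asset and credentials as above, but the result is a lazy
# GeoDataFrame with a proper geometry column.
s2l2a_geo = dask_geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)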
You can perform filtering operations on the entire collection.
mask = (s2l2a["eo:cloud_cover"] < 10) & (s2l2a["s2:nodata_pixel_percentage"] > 90)
keep = s2l2a[mask]
keep.head()
|  | type | stac_version | stac_extensions | id | ... | s2:high_proba_clouds_percentage | s2:reflectance_conversion_factor | s2:medium_proba_clouds_percentage | s2:saturated_defective_pixel_percentage |
|---|---|---|---|---|---|---|---|---|---|
| 27 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T32RMS_2021041... | ... | 0.000000 | 0.967449 | 0.000000 | 0.0 |
| 56 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T31PFP_2021041... | ... | 2.169701 | 0.967449 | 1.014810 | 0.0 |
| 68 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T31QGU_2021041... | ... | 0.211487 | 0.967449 | 0.171659 | 0.0 |
| 77 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T31SGT_2021041... | ... | 1.710171 | 0.967449 | 1.712733 | 0.0 |
| 80 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T31QHC_2021041... | ... | 0.000000 | 0.967449 | 0.000000 | 0.0 |

5 rows × 42 columns
When you compute the results, the computation will run in parallel. See Scale with Dask for more.
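For example, materializing the filtered result scans every partition in parallel (a small sketch; on a single machine this reads a lot of data, so a Dask cluster helps):

# len() forces the computation across all partitions.
n_matching = len(keep)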
As mentioned earlier, the different collections have different properties, and so have different columns in the DataFrame.
s2l2a.columns.tolist()
['type', 'stac_version', 'stac_extensions', 'id', 'geometry', 'bbox', 'links', 'assets', 'collection', 'datetime', 'platform', 'proj:epsg', 'instruments', 's2:mgrs_tile', 'constellation', 's2:granule_id', 'eo:cloud_cover', 's2:datatake_id', 's2:product_uri', 's2:datastrip_id', 's2:product_type', 'sat:orbit_state', 's2:datatake_type', 's2:generation_time', 'sat:relative_orbit', 's2:water_percentage', 's2:mean_solar_zenith', 's2:mean_solar_azimuth', 's2:processing_baseline', 's2:snow_ice_percentage', 's2:vegetation_percentage', 's2:thin_cirrus_percentage', 's2:cloud_shadow_percentage', 's2:nodata_pixel_percentage', 's2:unclassified_percentage', 's2:dark_features_percentage', 's2:not_vegetated_percentage', 's2:degraded_msi_data_percentage', 's2:high_proba_clouds_percentage', 's2:reflectance_conversion_factor', 's2:medium_proba_clouds_percentage', 's2:saturated_defective_pixel_percentage']
Different collections are partitioned at different frequencies, depending on the update cadence, the number of STAC items, and the size of each STAC item. Look for an `msft:partition_info` property on the asset to check whether the dataset is partitioned. The `partition_frequency` is a pandas offset alias.
asset.extra_fields["msft:partition_info"]
{'is_partitioned': True, 'partition_frequency': 'W-MON'}
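Because the frequency is a standard pandas offset alias, it plugs directly into pandas' datetime tooling. A small sketch, reproducing the weekly (Monday-anchored) partition boundaries seen in the file names above:

freq = asset.extra_fields["msft:partition_info"]["partition_frequency"]  # 'W-MON'
# Offset aliases work anywhere pandas accepts a frequency string:
pd.date_range("2015-06-29", periods=4, freq=freq)
# DatetimeIndex(['2015-06-29', '2015-07-06', '2015-07-13', '2015-07-20'], ...)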
STAC items are highly nested data structures, while libraries like pandas were mostly designed for working with non-nested data types. Consider a column like `assets`, which is a dictionary mapping asset keys to asset objects (which include an `href` and other properties).
df["assets"].head()
0    {'data': {'file:size': 53208880, 'file:values'...
1    {'data': {'file:size': 114187155, 'file:values...
2    {'data': {'file:size': 53981476, 'file:values'...
3    {'data': {'file:size': 165601021, 'file:values...
4    {'data': {'file:size': 97175834, 'file:values'...
Name: assets, dtype: object
The `pd.json_normalize` function can be used to expand this single column of nested data into many columns, one per nested field:
import pandas as pd
assets = pd.json_normalize(df["assets"].head())
assets
|  | data.file:size | data.file:values | data.href | data.raster:bands | ... | tilejson.href | tilejson.roles | tilejson.title | tilejson.type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 53208880 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |
| 1 | 114187155 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |
| 2 | 53981476 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |
| 3 | 165601021 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |
| 4 | 97175834 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |

5 rows × 15 columns
And the `explode` method will transform each element of a list-like value into its own row:
assets["data.file:values"]
0    [{'summary': 'No Data', 'values': [0]}, {'summ...
1    [{'summary': 'No Data', 'values': [0]}, {'summ...
2    [{'summary': 'No Data', 'values': [0]}, {'summ...
3    [{'summary': 'No Data', 'values': [0]}, {'summ...
4    [{'summary': 'No Data', 'values': [0]}, {'summ...
Name: data.file:values, dtype: object
assets["data.file:values"].explode()
0    {'summary': 'No Data', 'values': [0]}
0    {'summary': 'Water', 'values': [1]}
0    {'summary': 'Trees', 'values': [2]}
0    {'summary': 'Flooded vegetation', 'values': [4]}
0    {'summary': 'Crops', 'values': [5]}
0    {'summary': 'Built area', 'values': [7]}
0    {'summary': 'Bare ground', 'values': [8]}
0    {'summary': 'Snow/ice', 'values': [9]}
0    {'summary': 'Clouds', 'values': [10]}
0    {'summary': 'Rangeland', 'values': [11]}
1    {'summary': 'No Data', 'values': [0]}
1    {'summary': 'Water', 'values': [1]}
1    {'summary': 'Trees', 'values': [2]}
1    {'summary': 'Flooded vegetation', 'values': [4]}
1    {'summary': 'Crops', 'values': [5]}
1    {'summary': 'Built area', 'values': [7]}
1    {'summary': 'Bare ground', 'values': [8]}
1    {'summary': 'Snow/ice', 'values': [9]}
1    {'summary': 'Clouds', 'values': [10]}
1    {'summary': 'Rangeland', 'values': [11]}
2    {'summary': 'No Data', 'values': [0]}
2    {'summary': 'Water', 'values': [1]}
2    {'summary': 'Trees', 'values': [2]}
2    {'summary': 'Flooded vegetation', 'values': [4]}
2    {'summary': 'Crops', 'values': [5]}
2    {'summary': 'Built area', 'values': [7]}
2    {'summary': 'Bare ground', 'values': [8]}
2    {'summary': 'Snow/ice', 'values': [9]}
2    {'summary': 'Clouds', 'values': [10]}
2    {'summary': 'Rangeland', 'values': [11]}
3    {'summary': 'No Data', 'values': [0]}
3    {'summary': 'Water', 'values': [1]}
3    {'summary': 'Trees', 'values': [2]}
3    {'summary': 'Flooded vegetation', 'values': [4]}
3    {'summary': 'Crops', 'values': [5]}
3    {'summary': 'Built area', 'values': [7]}
3    {'summary': 'Bare ground', 'values': [8]}
3    {'summary': 'Snow/ice', 'values': [9]}
3    {'summary': 'Clouds', 'values': [10]}
3    {'summary': 'Rangeland', 'values': [11]}
4    {'summary': 'No Data', 'values': [0]}
4    {'summary': 'Water', 'values': [1]}
4    {'summary': 'Trees', 'values': [2]}
4    {'summary': 'Flooded vegetation', 'values': [4]}
4    {'summary': 'Crops', 'values': [5]}
4    {'summary': 'Built area', 'values': [7]}
4    {'summary': 'Bare ground', 'values': [8]}
4    {'summary': 'Snow/ice', 'values': [9]}
4    {'summary': 'Clouds', 'values': [10]}
4    {'summary': 'Rangeland', 'values': [11]}
Name: data.file:values, dtype: object
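Combining these tools, you can flatten just the pieces you need. For example, a quick sketch pulling each item's data asset href directly out of the nested column (the 'data' asset key is specific to this collection):

# Map over the nested dicts to extract a single field per item.
df["assets"].head().apply(lambda assets: assets["data"]["href"])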