In addition to its STAC API, the Planetary Computer also provides access to STAC items as geoparquet datasets. These parquet datasets can be used for "bulk" workloads, where a search might return a very large number of items or would otherwise require many separate queries to get the desired result. In general, these parquet datasets are produced with a lag relative to what's available through the STAC API, so most use cases, including those that need recently added assets, should use our STAC API.
This example shows how to load STAC items from a Parquet dataset into a geopandas GeoDataFrame. A similar workflow would be possible with R's geoarrow package, or any other library that can read GeoParquet.
import dask.dataframe as dd
import geopandas
import planetary_computer
import pystac_client
import pandas as pd
pd.options.display.max_columns = 8
Each STAC collection providing a geoparquet dataset has a collection-level asset under the `geoparquet-items` key.
catalog = pystac_client.Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1/",
modifier=planetary_computer.sign_inplace,
)
asset = catalog.get_collection("io-lulc-9-class").assets["geoparquet-items"]
df = geopandas.read_parquet(
asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
df.head()
|  | type | stac_version | stac_extensions | id | ... | end_datetime | proj:transform | start_datetime | io:supercell_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 12Q-2017 | ... | 2018-01-01 00:00:00+00:00 | [10.0, 0.0, 178910.0, 0.0, -10.0, 2657470.0] | 2017-01-01 00:00:00+00:00 | 12Q |
| 1 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 15R-2017 | ... | 2018-01-01 00:00:00+00:00 | [10.0, 0.0, 194773.70566898846, 0.0, -10.0, 35... | 2017-01-01 00:00:00+00:00 | 15R |
| 2 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 16M-2017 | ... | 2018-01-01 00:00:00+00:00 | [10.0, 0.0, 166023.6435927535, 0.0, -10.0, 999... | 2017-01-01 00:00:00+00:00 | 16M |
| 3 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 20L-2022 | ... | 2023-01-01 00:00:00+00:00 | [10.0, 0.0, 169256.89710350422, 0.0, -10.0, 91... | 2022-01-01 00:00:00+00:00 | 20L |
| 4 | Feature | 1.0.0 | [https://stac-extensions.github.io/projection/... | 20M-2019 | ... | 2020-01-01 00:00:00+00:00 | [10.0, 0.0, 166023.6435927521, 0.0, -10.0, 999... | 2019-01-01 00:00:00+00:00 | 20M |

5 rows × 18 columns
Now we can do things like look at the count of each `proj:epsg` code.
df["proj:epsg"].value_counts()
proj:epsg
32616    60
32638    60
32650    60
32648    60
32617    60
         ..
32714    12
32710    11
32708    10
32711     6
32727     6
Name: count, Length: 120, dtype: int64
Or filter the items to a specific code and plot the footprints.
subset = df.loc[df["proj:epsg"] == 32651, ["io:tile_id", "geometry"]]
import contextily
ax = subset.plot(figsize=(4, 20), color="none", edgecolor="yellow")
contextily.add_basemap(
ax, crs=df.crs.to_string(), source=contextily.providers.Esri.NatGeoWorldMap
)
ax.set_axis_off()
Each parquet dataset has a unique schema, reflecting the unique properties captured in each collection. But there are some general patterns:

- The top-level STAC item fields are present as columns (`type`, `stac_version`, `stac_extensions`, `id`, `geometry`, `bbox`, `links`, `assets`, and `collection`).
- Fields from `properties` are lifted to the top level, including datetime-related fields like `datetime`, `start_datetime`, and `end_datetime`, common metadata (e.g. `platform`), and extension fields (e.g. `proj:bbox`, ...).
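To make that split concrete, you can partition the columns yourself. A minimal sketch (the first group of names is fixed by the STAC spec; whatever remains was lifted from `properties`):

stac_fields = {
    "type", "stac_version", "stac_extensions", "id",
    "geometry", "bbox", "links", "assets", "collection",
}
# Everything that isn't a top-level STAC field came from `properties`.
property_columns = [c for c in df.columns if c not in stac_fields]
print(property_columns)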
Depending on the number of STAC items in the collection and whether or not new items are being added, the Parquet dataset may be split into multiple files by time.
For example, the `io-lulc-9-class` collection is not partitioned and has just a single file:
import adlfs
fs = adlfs.AzureBlobFileSystem(**asset.extra_fields["table:storage_options"])
fs.ls("items/io-lulc-9-class.parquet") # Not partitioned, single result
['items/io-lulc-9-class.parquet']
Compare that to `sentinel-2-l2a`, which is partitioned by week.
fs.ls("items/sentinel-2-l2a.parquet")[:5]
['items/sentinel-2-l2a.parquet/part-0001_2015-06-29T10:25:31+00:00_2015-07-06T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0002_2015-07-06T10:25:31+00:00_2015-07-13T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0003_2015-07-13T10:25:31+00:00_2015-07-20T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0004_2015-07-20T10:25:31+00:00_2015-07-27T10:25:31+00:00.parquet',
 'items/sentinel-2-l2a.parquet/part-0005_2015-07-27T10:25:31+00:00_2015-08-03T10:25:31+00:00.parquet']
To work with a partitioned dataset, you can use a library like dask or dask-geopandas.
asset = catalog.get_collection("sentinel-2-l2a").assets["geoparquet-items"]
s2l2a = dd.read_parquet(
asset.href, storage_options=asset.extra_fields["table:storage_options"]
)
s2l2a.head()
|  | type | stac_version | stac_extensions | id | ... | s2:high_proba_clouds_percentage | s2:reflectance_conversion_factor | s2:medium_proba_clouds_percentage | s2:saturated_defective_pixel_percentage |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T35XQA_2021041... | ... | 92.546540 | 0.967449 | 4.807670 | 0.0 |
| 1 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T32TMM_2021041... | ... | 0.048035 | 0.967449 | 0.051376 | 0.0 |
| 2 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T32TMN_2021041... | ... | 0.011238 | 0.967449 | 0.022928 | 0.0 |
| 3 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T36WWC_2021041... | ... | 65.812266 | 0.967449 | 19.050561 | 0.0 |
| 4 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T36WWD_2021041... | ... | 97.629422 | 0.967449 | 1.861097 | 0.0 |

5 rows × 42 columns
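If you want the geometries as real geometry objects rather than raw bytes, dask-geopandas offers an analogous reader. A sketch, assuming dask-geopandas is installed; the call mirrors dd.read_parquet:

import dask_geopandas

# Same asset and credentials as above, but the result is a lazy
# GeoDataFrame with a proper geometry column.
s2l2a_geo = dask_geopandas.read_parquet(
    asset.href, storage_options=asset.extra_fields["table:storage_options"]
)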
You can perform filtering operations on the entire collection.
mask = (s2l2a["eo:cloud_cover"] < 10) & (s2l2a["s2:nodata_pixel_percentage"] > 90)
keep = s2l2a[mask]
keep.head()
|  | type | stac_version | stac_extensions | id | ... | s2:high_proba_clouds_percentage | s2:reflectance_conversion_factor | s2:medium_proba_clouds_percentage | s2:saturated_defective_pixel_percentage |
|---|---|---|---|---|---|---|---|---|---|
| 27 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T32RMS_2021041... | ... | 0.000000 | 0.967449 | 0.000000 | 0.0 |
| 56 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T31PFP_2021041... | ... | 2.169701 | 0.967449 | 1.014810 | 0.0 |
| 68 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T31QGU_2021041... | ... | 0.211487 | 0.967449 | 0.171659 | 0.0 |
| 77 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T31SGT_2021041... | ... | 1.710171 | 0.967449 | 1.712733 | 0.0 |
| 80 | Feature | 1.0.0 | [https://stac-extensions.github.io/eo/v1.0.0/s... | S2A_MSIL2A_20150704T101006_R022_T31QHC_2021041... | ... | 0.000000 | 0.967449 | 0.000000 | 0.0 |

5 rows × 42 columns
When you compute the results, the computation will run in parallel. See Scale with Dask for more.
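For example, materializing the filtered result scans every partition in parallel (a small sketch; on a single machine this reads a lot of data, so a Dask cluster helps):

# len() forces the computation across all partitions.
n_matching = len(keep)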
As mentioned earlier, the different collections have different properties, and so have different columns in the DataFrame.
s2l2a.columns.tolist()
['type', 'stac_version', 'stac_extensions', 'id', 'geometry', 'bbox', 'links', 'assets', 'collection', 'datetime', 'platform', 'proj:epsg', 'instruments', 's2:mgrs_tile', 'constellation', 's2:granule_id', 'eo:cloud_cover', 's2:datatake_id', 's2:product_uri', 's2:datastrip_id', 's2:product_type', 'sat:orbit_state', 's2:datatake_type', 's2:generation_time', 'sat:relative_orbit', 's2:water_percentage', 's2:mean_solar_zenith', 's2:mean_solar_azimuth', 's2:processing_baseline', 's2:snow_ice_percentage', 's2:vegetation_percentage', 's2:thin_cirrus_percentage', 's2:cloud_shadow_percentage', 's2:nodata_pixel_percentage', 's2:unclassified_percentage', 's2:dark_features_percentage', 's2:not_vegetated_percentage', 's2:degraded_msi_data_percentage', 's2:high_proba_clouds_percentage', 's2:reflectance_conversion_factor', 's2:medium_proba_clouds_percentage', 's2:saturated_defective_pixel_percentage']
Different collections are partitioned at different frequencies, depending on the update cadence, the number of STAC items, and the size of each STAC item. Look for an `msft:partition_info` property on the asset to check whether the dataset is partitioned. The `partition_frequency` is a pandas offset alias.
asset.extra_fields["msft:partition_info"]
{'is_partitioned': True, 'partition_frequency': 'W-MON'}
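Because the frequency is a standard pandas offset alias, it plugs directly into pandas' datetime tooling. A small sketch, reproducing the weekly (Monday-anchored) partition boundaries seen in the file names above:

freq = asset.extra_fields["msft:partition_info"]["partition_frequency"]  # 'W-MON'
# Offset aliases work anywhere pandas accepts a frequency string:
pd.date_range("2015-06-29", periods=4, freq=freq)
# DatetimeIndex(['2015-06-29', '2015-07-06', '2015-07-13', '2015-07-20'], ...)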
STAC items are highly nested data structures, while libraries like pandas were mostly designed for working with non-nested data types. Consider a column like `assets`, which is a dictionary mapping asset keys to asset objects (which include an `href` and other properties).
df["assets"].head()
0    {'data': {'file:size': 53208880, 'file:values'...
1    {'data': {'file:size': 114187155, 'file:values...
2    {'data': {'file:size': 53981476, 'file:values'...
3    {'data': {'file:size': 165601021, 'file:values...
4    {'data': {'file:size': 97175834, 'file:values'...
Name: assets, dtype: object
The `pd.json_normalize` function can be used to expand this single column of nested data into many columns, one per nested field:
import pandas as pd
assets = pd.json_normalize(df["assets"].head())
assets
|  | data.file:size | data.file:values | data.href | data.raster:bands | ... | tilejson.href | tilejson.roles | tilejson.title | tilejson.type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 53208880 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |
| 1 | 114187155 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |
| 2 | 53981476 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |
| 3 | 165601021 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |
| 4 | 97175834 | [{'summary': 'No Data', 'values': [0]}, {'summ... | https://ai4edataeuwest.blob.core.windows.net/i... | [{'nodata': 0, 'spatial_resolution': 10}] | ... | https://planetarycomputer.microsoft.com/api/da... | [tiles] | TileJSON with default rendering | application/json |

5 rows × 15 columns
And the `explode` method will transform each element of a list-like value into its own row:
assets["data.file:values"]
0    [{'summary': 'No Data', 'values': [0]}, {'summ...
1    [{'summary': 'No Data', 'values': [0]}, {'summ...
2    [{'summary': 'No Data', 'values': [0]}, {'summ...
3    [{'summary': 'No Data', 'values': [0]}, {'summ...
4    [{'summary': 'No Data', 'values': [0]}, {'summ...
Name: data.file:values, dtype: object
assets["data.file:values"].explode()
0    {'summary': 'No Data', 'values': [0]}
0    {'summary': 'Water', 'values': [1]}
0    {'summary': 'Trees', 'values': [2]}
0    {'summary': 'Flooded vegetation', 'values': [4]}
0    {'summary': 'Crops', 'values': [5]}
0    {'summary': 'Built area', 'values': [7]}
0    {'summary': 'Bare ground', 'values': [8]}
0    {'summary': 'Snow/ice', 'values': [9]}
0    {'summary': 'Clouds', 'values': [10]}
0    {'summary': 'Rangeland', 'values': [11]}
1    {'summary': 'No Data', 'values': [0]}
1    {'summary': 'Water', 'values': [1]}
1    {'summary': 'Trees', 'values': [2]}
1    {'summary': 'Flooded vegetation', 'values': [4]}
1    {'summary': 'Crops', 'values': [5]}
1    {'summary': 'Built area', 'values': [7]}
1    {'summary': 'Bare ground', 'values': [8]}
1    {'summary': 'Snow/ice', 'values': [9]}
1    {'summary': 'Clouds', 'values': [10]}
1    {'summary': 'Rangeland', 'values': [11]}
2    {'summary': 'No Data', 'values': [0]}
2    {'summary': 'Water', 'values': [1]}
2    {'summary': 'Trees', 'values': [2]}
2    {'summary': 'Flooded vegetation', 'values': [4]}
2    {'summary': 'Crops', 'values': [5]}
2    {'summary': 'Built area', 'values': [7]}
2    {'summary': 'Bare ground', 'values': [8]}
2    {'summary': 'Snow/ice', 'values': [9]}
2    {'summary': 'Clouds', 'values': [10]}
2    {'summary': 'Rangeland', 'values': [11]}
3    {'summary': 'No Data', 'values': [0]}
3    {'summary': 'Water', 'values': [1]}
3    {'summary': 'Trees', 'values': [2]}
3    {'summary': 'Flooded vegetation', 'values': [4]}
3    {'summary': 'Crops', 'values': [5]}
3    {'summary': 'Built area', 'values': [7]}
3    {'summary': 'Bare ground', 'values': [8]}
3    {'summary': 'Snow/ice', 'values': [9]}
3    {'summary': 'Clouds', 'values': [10]}
3    {'summary': 'Rangeland', 'values': [11]}
4    {'summary': 'No Data', 'values': [0]}
4    {'summary': 'Water', 'values': [1]}
4    {'summary': 'Trees', 'values': [2]}
4    {'summary': 'Flooded vegetation', 'values': [4]}
4    {'summary': 'Crops', 'values': [5]}
4    {'summary': 'Built area', 'values': [7]}
4    {'summary': 'Bare ground', 'values': [8]}
4    {'summary': 'Snow/ice', 'values': [9]}
4    {'summary': 'Clouds', 'values': [10]}
4    {'summary': 'Rangeland', 'values': [11]}
Name: data.file:values, dtype: object
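Combining these tools, you can flatten just the pieces you need. For example, a quick sketch pulling each item's data asset href directly out of the nested column (the 'data' asset key is specific to this collection):

# Map over the nested dicts to extract a single field per item.
df["assets"].head().apply(lambda assets: assets["data"]["href"])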