In this example we will discover and access NASA's Harmonized Landsat Sentinel-2 (HLS) version 2 assets, which are archived in Cloud Optimized GeoTIFF (COG) format in the LP DAAC Cumulus cloud space. COGs can be used like any other GeoTIFF file, but have some added features that make them more efficient within the cloud data access paradigm. These features include overviews and internal tiling. Below we will demonstrate how to leverage these features.
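As a quick illustration of why those two features matter, here is a minimal sketch assuming the rasterio package and a hypothetical COG URL (the full demonstration with real HLS assets comes later):

# Minimal COG sketch; 'cog_url' is a hypothetical placeholder, not a real asset
import rasterio
from rasterio.windows import Window

cog_url = 'https://example.com/sample_cog.tif'  # hypothetical URL for illustration
with rasterio.open(cog_url) as src:
    print(src.overviews(1))  # overview (reduced-resolution) levels stored for band 1
    # Internal tiling means this window read only fetches the bytes it needs
    subset = src.read(1, window=Window(col_off=0, row_off=0, width=512, height=512))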
First we need to find the data. Specifically, we want to find data that intersects with our region of interest for our desired date range. We will use the SpatioTemporal Asset Catalog (STAC) specification and the CMR-STAC API to identify the data assets that meet our search criteria.
SpatioTemporal Asset Catalog (STAC) is a specification that provides a common language for interpreting geospatial information in order to standardize the indexing and discovery of data.
The STAC specification is made up of a collection of related, yet independent specifications that, when used together, provide search and discovery capabilities for remote assets.
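For orientation, the central STAC element is the Item, a GeoJSON Feature with a small set of required fields. A minimal, hypothetical example (expressed here as a Python dict, with placeholder values) looks like this:

# A minimal, hypothetical STAC Item; real Items carry many more properties and assets
example_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "bbox": [-102.0, 40.0, -101.0, 41.0],
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-102.0, 40.0], [-101.0, 40.0], [-101.0, 41.0],
                                  [-102.0, 41.0], [-102.0, 40.0]]]},
    "properties": {"datetime": "2021-05-01T00:00:00Z"},
    "assets": {"data": {"href": "https://example.com/data.tif"}},
    "links": []
}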
In the following sections, we will explore each STAC element using NASA's Common Metadata Repository (CMR) STAC application programming interface (API), or CMR-STAC API for short.
The CMR-STAC API is NASA's implementation of the STAC API specification for all NASA data holdings within EOSDIS. The current implementation does not allow for queries across the entire NASA catalog. Users must execute searches within provider catalogs (e.g., LPCLOUD) to find the STAC Items they are searching for. All the providers can be found at the CMR-STAC endpoint here: https://cmr.earthdata.nasa.gov/stac/.
In this exercise, we will query the LPCLOUD provider to identify STAC Items from the Harmonized Landsat Sentinel-2 (HLS) collection that fall within our region of interest (ROI) and within our specified time range.
import requests
from pystac_client import Client # https://pystac-client.readthedocs.io/en/latest/index.html
from collections import defaultdict
import json
import geopandas
import geoviews as gv
from cartopy import crs
gv.extension('bokeh', 'matplotlib')
GET request to the CMR STAC API

Use the requests package to submit a GET request to the CMR STAC API. We'll parse the response and extract the information we need to navigate the STAC Catalog.
stac_url = 'https://cmr.earthdata.nasa.gov/stac'
provider_cat = requests.get(stac_url)
The CMR STAC API endpoint lists the available providers. Each provider is a separate STAC Catalog endpoint against which spatiotemporal queries can be submitted.
providers = [p['title'] for p in provider_cat.json()['links'] if 'child' in p['rel']]
providers
LPCLOUD provider

provider = 'LPCLOUD'
provider_url = f'{stac_url}/{provider}'
provider_url
cat = requests.get(provider_url)
cat.json()
LPCLOUD Catalog

cols = [{l['href'].split('/')[-1]: l['href']} for l in cat.json()['links'] if 'child' in l['rel']]
for c in cols:
    print(c)
Notice that only 10 collections are returned here, but the LPCLOUD provider has over 100 data products available in Earthdata Cloud. This is because the CMR STAC API returns 10 collections by default. A limit parameter can be added to the end of the LPCLOUD STAC API endpoint to increase the number of collections returned at one time, but this can sometimes be time consuming (a sketch of the limit approach follows the loop below). Here we'll loop through each next page link to make a separate request for the next 10 collections from each available page.
try:
    print('Requesting page:')
    # The [0] indexing raises an IndexError once no 'next' link remains, ending the loop
    while nxt_pg := [l for l in cat.json()['links'] if 'next' in l['rel']][0]:
        print(f"{nxt_pg['href'].split('=')[-1]}...", end=' ')
        cat = requests.get(nxt_pg['href'])
        cols.extend([{l['href'].split('/')[-1]: l['href']} for l in cat.json()['links'] if 'child' in l['rel']])
except IndexError:
    print('No additional pages')
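As an alternative to paging, the limit parameter mentioned above can be passed as a query parameter to request more collections in one call. This is a minimal sketch; the maximum value the CMR-STAC API accepts may vary:

# Sketch of the limit-parameter alternative described above;
# the maximum limit value accepted by the CMR-STAC API may vary
cat_limited = requests.get(provider_url, params={'limit': 100})
cols_limited = [{l['href'].split('/')[-1]: l['href']} for l in cat_limited.json()['links'] if 'child' in l['rel']]
len(cols_limited)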
LPCLOUD Catalog

print(f'LPCLOUD has {len(cols)} Collections')
for c in cols:
    print(c)
Below we'll specify a STAC Collection, HLSL30.v2.0, and request its STAC metadata.
collection = 'HLSL30.v2.0'
collection_link = list(filter(lambda c: collection == list(c.keys())[0], cols))[0]
collection_link
requests.get(collection_link[collection]).json()
For this next section we'll use a Python package called pystac_client to submit a spatiotemporal query for data assets across multiple Collections. We will define our area of interest using the GeoJSON file from the previous exercise, while also specifying the data collections and the time range needed for our example.
field = geopandas.read_file('../data/ne_w_agfields.geojson')
field
fieldShape = field['geometry'][0]
fieldShape
base = gv.tile_sources.EsriImagery.opts(width=650, height=500)
farmField = gv.Polygons(fieldShape).opts(line_color='yellow', line_width=10, color=None)
base * farmField
We will now specify the search criteria we are interested in, i.e., the date range, the region of interest (ROI), and the data collections, to pass to the STAC API.
roi = json.loads(field.to_json())['features'][0]['geometry']
roi
#date_range = "2021-05-01T00:00:00Z/2021-08-30T23:59:59Z" # closed interval
#date_range = "2021-05-01T00:00:00Z/.." # open interval - does not currently work with the CMR-STAC API
date_range = "2021-05/2021-08"
Note, a STAC Collection is synonymous with what we usually consider a data product.
collections = ['HLSL30.v2.0', 'HLSS30.v2.0']
collections
catalog = Client.open(provider_url)
search = catalog.search(
    collections=collections,
    intersects=roi,
    datetime=date_range
)
search.matched()
We now have a search object containing the STAC records that matched our query. Now, let's pull out all of the STAC Items as a list of PySTAC Item objects and explore their contents.
item_collection = list(search.get_items())  # newer pystac-client releases also provide search.items()
item_collection[0:5]
item_collection[0].to_dict()
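Beyond the full dictionary view, a few standard PySTAC Item attributes are handy for quick inspection:

# Inspect standard PySTAC Item attributes on the first matched Item
first_item = item_collection[0]
print(first_item.id)                                # granule identifier
print(first_item.datetime)                          # acquisition datetime
print(first_item.properties.get('eo:cloud_cover'))  # percent cloud cover
print(list(first_item.assets.keys()))               # available asset (band) names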
While the CMR-STAC API is a powerful search and discovery utility, it is still maturing and currently does not have the full gamut of filtering capabilities that the STAC API specification allows for. Hence, additional filtering is required if we want to filter by a property such as cloud cover. Below we will loop through and filter the item_collection by a specified cloud cover, as well as extract the bands we need for an Enhanced Vegetation Index (EVI) calculation later.
Now we will set the max cloud cover allowable and extract the band links for those Items that match or are less than the max cloud cover.
cloudcover = 25
We will also specify the STAC Assets (i.e., bands/layers) of interest for both the S30 and L30 collections
s30_bands = ['B8A', 'B04', 'B02', 'Fmask'] # S30 bands for EVI calculation and quality filtering -> NIR, RED, BLUE, Quality
l30_bands = ['B05', 'B04', 'B02', 'Fmask'] # L30 bands for EVI calculation and quality filtering -> NIR, RED, BLUE, Quality
evi_band_links = []
for i in item_collection:
    if i.properties['eo:cloud_cover'] <= cloudcover:
        if i.collection_id == 'HLSS30.v2.0':
            evi_bands = s30_bands
        elif i.collection_id == 'HLSL30.v2.0':
            evi_bands = l30_bands
        for a in i.assets:
            if any(b == a for b in evi_bands):
                evi_band_links.append(i.assets[a].href)
len(evi_band_links) / 4  # number of Items matching our cloud criteria (4 band links per Item)
The filtering done in the previous steps produces a list of links to STAC Assets. Let's print out the first ten links.
evi_band_links[:10]
Notice that in the list of links we have multiple tiles, i.e., T14TKL and T13TGF, that intersect with our region of interest. These tiles represent neighboring UTM zones. We will split the list of links into separate lists for each tile.
We now have a list of links to data assets that meet our search and filtering criteria. The commands that follow will split this list into logical groupings using Python routines.
Split by Universal Transverse Mercator (UTM) tile specified in the file name (e.g., T14TKL & T13TGF)
tile_dicts = defaultdict(list) # https://stackoverflow.com/questions/26367812/appending-to-list-in-python-dictionary
for l in evi_band_links:
    tile = l.split('.')[-6]
    tile_dicts[tile].append(l)
tile_dicts.keys()
tile_dicts['T13TGF'][:5]
Now we will create a separate list of data links for each tile.
tile_links_T14TKL = tile_dicts['T14TKL']
tile_links_T13TGF = tile_dicts['T13TGF']
tile_links_T13TGF[:10]
bands_dicts = defaultdict(list)
for b in tile_links_T13TGF:
    band = b.split('.')[-2]
    bands_dicts[band].append(b)
bands_dicts.keys()
bands_dicts['B04']
To finish off this exercise, we will save the individual link lists as separate text files with descriptive names.
for k, v in bands_dicts.items():
    name = f'HTTPS_T13TGF_{k}_Links.txt'
    with open(f'../data/{name}', 'w') as f:
        for l in v:
            f.write(f"{l}\n")
for k, v in bands_dicts.items():
    name = f'S3_T13TGF_{k}_Links.txt'
    with open(f'../data/{name}', 'w') as f:
        for l in v:
            s3l = l.replace('https://data.lpdaac.earthdatacloud.nasa.gov/', 's3://')
            f.write(f"{s3l}\n")