In this example we will discover and access NASA's Harmonized Landsat Sentinel-2 (HLS) version 2 assets, which are archived in Cloud Optimized GeoTIFF (COG) format in the LP DAAC Cumulus cloud space. COGs can be used like any other GeoTIFF file, but have some added features that make them more efficient within the cloud data access paradigm. These features include overviews and internal tiling. Below we will demonstrate how to leverage these features.
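As a quick illustration of why those two features matter, here is a minimal sketch assuming the rasterio package and a hypothetical COG URL (the full demonstration with real HLS assets comes later):

# Minimal COG sketch; 'cog_url' is a hypothetical placeholder, not a real asset
import rasterio
from rasterio.windows import Window

cog_url = 'https://example.com/sample_cog.tif'  # hypothetical URL for illustration
with rasterio.open(cog_url) as src:
    print(src.overviews(1))  # overview (reduced-resolution) levels stored for band 1
    # Internal tiling means this window read only fetches the bytes it needs
    subset = src.read(1, window=Window(col_off=0, row_off=0, width=512, height=512))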
First we need to find the data. Specifically, we want to find data that intersects with our region of interest for our desired date range. We will use the SpatioTemporal Asset Catalog (STAC) specification and the CMR-STAC API to identify the data assets that meet our search criteria.
SpatioTemporal Asset Catalog (STAC) is a specification that provides a common language for interpreting geospatial information in order to standardize the indexing and discovery of data.
The STAC specification is made up of a collection of related, yet independent specifications that, when used together, provide search and discovery capabilities for remote assets.
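For orientation, the central STAC element is the Item, a GeoJSON Feature with a small set of required fields. A minimal, hypothetical example (expressed here as a Python dict, with placeholder values) looks like this:

# A minimal, hypothetical STAC Item; real Items carry many more properties and assets
example_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "bbox": [-102.0, 40.0, -101.0, 41.0],
    "geometry": {"type": "Polygon",
                 "coordinates": [[[-102.0, 40.0], [-101.0, 40.0], [-101.0, 41.0],
                                  [-102.0, 41.0], [-102.0, 40.0]]]},
    "properties": {"datetime": "2021-05-01T00:00:00Z"},
    "assets": {"data": {"href": "https://example.com/data.tif"}},
    "links": []
}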
In the following sections, we will explore each STAC element using NASA's Common Metadata Repository (CMR) STAC application programming interface (API), or CMR-STAC API for short.
The CMR-STAC API is NASA's implementation of the STAC API specification for all NASA data holdings within EOSDIS. The current implementation does not allow for queries across the entire NASA catalog. Users must execute searches within provider catalogs (e.g., LPCLOUD) to find the STAC Items they are searching for. All the providers can be found at the CMR-STAC endpoint here: https://cmr.earthdata.nasa.gov/stac/.
In this exercise, we will query the LPCLOUD provider to identify STAC Items from the Harmonized Landsat Sentinel-2 (HLS) collection that fall within our region of interest (ROI) and within our specified time range.
import requests
from pystac_client import Client # https://pystac-client.readthedocs.io/en/latest/index.html
from collections import defaultdict
import json
import geopandas
import geoviews as gv
from cartopy import crs
gv.extension('bokeh', 'matplotlib')
GET request to the CMR STAC API

Use the requests package to submit a GET request to the CMR STAC API. We'll parse the response and extract the information we need to navigate the STAC Catalog.
stac_url = 'https://cmr.earthdata.nasa.gov/stac'
provider_cat = requests.get(stac_url)
The CMR STAC API endpoint lists the available providers. Each provider is a separate STAC Catalog endpoint against which spatiotemporal queries can be submitted.
providers = [p['title'] for p in provider_cat.json()['links'] if 'child' in p['rel']]
providers
LPCLOUD provider

provider = 'LPCLOUD'
provider_url = f'{stac_url}/{provider}'
provider_url
cat = requests.get(provider_url)
cat.json()
LPCLOUD Catalog

cols = [{l['href'].split('/')[-1]: l['href']} for l in cat.json()['links'] if 'child' in l['rel']]
for c in cols:
    print(c)
Notice that only 10 collections are returned here, but the LPCLOUD provider has over 100 data products available in Earthdata Cloud. This is because the CMR STAC API returns 10 collections by default. A limit parameter can be added to the end of the LPCLOUD STAC API endpoint to increase the number of collections returned at one time, but this can sometimes be time consuming (a sketch of the limit approach follows the loop below). Here we'll loop through each next page link to make a separate request for the next 10 collections from each available page.
try:
    print('Requesting page:')
    # The [0] indexing raises an IndexError once no 'next' link remains, ending the loop
    while nxt_pg := [l for l in cat.json()['links'] if 'next' in l['rel']][0]:
        print(f"{nxt_pg['href'].split('=')[-1]}...", end=' ')
        cat = requests.get(nxt_pg['href'])
        cols.extend([{l['href'].split('/')[-1]: l['href']} for l in cat.json()['links'] if 'child' in l['rel']])
except IndexError:
    print('No additional pages')
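As an alternative to paging, the limit parameter mentioned above can be passed as a query parameter to request more collections in one call. This is a minimal sketch; the maximum value the CMR-STAC API accepts may vary:

# Sketch of the limit-parameter alternative described above;
# the maximum limit value accepted by the CMR-STAC API may vary
cat_limited = requests.get(provider_url, params={'limit': 100})
cols_limited = [{l['href'].split('/')[-1]: l['href']} for l in cat_limited.json()['links'] if 'child' in l['rel']]
len(cols_limited)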
LPCLOUD Catalog

print(f'LPCLOUD has {len(cols)} Collections')
for c in cols:
    print(c)
Below we'll specify a STAC Collection, HLSL30.v2.0, and request its STAC metadata.
collection = 'HLSL30.v2.0'
collection_link = list(filter(lambda c: collection == list(c.keys())[0], cols))[0]
collection_link
requests.get(collection_link[collection]).json()
For this next section we'll use a Python package called pystac_client to submit a spatiotemporal query for data assets across multiple Collections. We will define our area of interest using the GeoJSON file from the previous exercise, while also specifying the data collections and the time range needed for our example.
field = geopandas.read_file('../data/ne_w_agfields.geojson')
field
fieldShape = field['geometry'][0]
fieldShape
base = gv.tile_sources.EsriImagery.opts(width=650, height=500)
farmField = gv.Polygons(fieldShape).opts(line_color='yellow', line_width=10, color=None)
base * farmField
We will now specify the search criteria we are interested in, i.e., the date range, the region of interest (ROI), and the data collections, to pass to the STAC API.
roi = json.loads(field.to_json())['features'][0]['geometry']
roi
#date_range = "2021-05-01T00:00:00Z/2021-08-30T23:59:59Z" # closed interval
#date_range = "2021-05-01T00:00:00Z/.." # open interval - does not currently work with the CMR-STAC API
date_range = "2021-05/2021-08"
Note, a STAC Collection is synonymous with what we usually consider a data product.
collections = ['HLSL30.v2.0', 'HLSS30.v2.0']
collections
catalog = Client.open(provider_url)
search = catalog.search(
    collections=collections,
    intersects=roi,
    datetime=date_range
)
search.matched()
We now have a search object containing the STAC records that matched our query. Now, let's pull out all of the STAC Items as a list of PySTAC Item objects and explore their contents.
item_collection = list(search.get_items())  # newer pystac-client releases also provide search.items()
item_collection[0:5]
item_collection[0].to_dict()
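Beyond the full dictionary view, a few standard PySTAC Item attributes are handy for quick inspection:

# Inspect standard PySTAC Item attributes on the first matched Item
first_item = item_collection[0]
print(first_item.id)                                # granule identifier
print(first_item.datetime)                          # acquisition datetime
print(first_item.properties.get('eo:cloud_cover'))  # percent cloud cover
print(list(first_item.assets.keys()))               # available asset (band) names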
While the CMR-STAC API is a powerful search and discovery utility, it is still maturing and currently does not have the full gamut of filtering capabilities that the STAC API specification allows for. Hence, additional filtering is required if we want to filter by a property such as cloud cover. Below we will loop through and filter the item_collection by a specified cloud cover, as well as extract the bands we need for an Enhanced Vegetation Index (EVI) calculation later.
Now we will set the max cloud cover allowable and extract the band links for those Items that match or are less than the max cloud cover.
cloudcover = 25
We will also specify the STAC Assets (i.e., bands/layers) of interest for both the S30 and L30 collections
s30_bands = ['B8A', 'B04', 'B02', 'Fmask'] # S30 bands for EVI calculation and quality filtering -> NIR, RED, BLUE, Quality
l30_bands = ['B05', 'B04', 'B02', 'Fmask'] # L30 bands for EVI calculation and quality filtering -> NIR, RED, BLUE, Quality
evi_band_links = []
for i in item_collection:
    if i.properties['eo:cloud_cover'] <= cloudcover:
        if i.collection_id == 'HLSS30.v2.0':
            evi_bands = s30_bands
        elif i.collection_id == 'HLSL30.v2.0':
            evi_bands = l30_bands
        for a in i.assets:
            if any(b == a for b in evi_bands):
                evi_band_links.append(i.assets[a].href)
len(evi_band_links) / 4  # number of Items matching our cloud criteria (4 band links per Item)
The filtering done in the previous steps produces a list of links to STAC Assets. Let's print out the first ten links.
evi_band_links[:10]
Notice that in the list of links we have multiple tiles, i.e., T14TKL and T13TGF, that intersect with our region of interest. These tiles represent neighboring UTM zones. We will split the list of links into separate lists for each tile.
We now have a list of links to data assets that meet our search and filtering criteria. The commands that follow will split this list into logical groupings using Python routines.
Split by Universal Transverse Mercator (UTM) tile specified in the file name (e.g., T14TKL & T13TGF)
tile_dicts = defaultdict(list) # https://stackoverflow.com/questions/26367812/appending-to-list-in-python-dictionary
for l in evi_band_links:
    tile = l.split('.')[-6]
    tile_dicts[tile].append(l)
tile_dicts.keys()
tile_dicts['T13TGF'][:5]
Now we will create a separate list of data links for each tile.
tile_links_T14TKL = tile_dicts['T14TKL']
tile_links_T13TGF = tile_dicts['T13TGF']
tile_links_T13TGF[:10]
bands_dicts = defaultdict(list)
for b in tile_links_T13TGF:
    band = b.split('.')[-2]
    bands_dicts[band].append(b)
bands_dicts.keys()
bands_dicts['B04']
To finish off this exercise, we will save the individual link lists as separate text files with descriptive names.
for k, v in bands_dicts.items():
    name = f'HTTPS_T13TGF_{k}_Links.txt'
    with open(f'../data/{name}', 'w') as f:
        for l in v:
            f.write(f"{l}\n")
for k, v in bands_dicts.items():
    name = f'S3_T13TGF_{k}_Links.txt'
    with open(f'../data/{name}', 'w') as f:
        for l in v:
            s3l = l.replace('https://data.lpdaac.earthdatacloud.nasa.gov/', 's3://')
            f.write(f"{s3l}\n")