In the previous exercise we searched for and discovered cloud data assets that met certain search criteria (i.e., that intersect with our region of interest and fall within a specified date range). In this exercise we will leverage those data to create a normalized difference vegetation index (NDVI) time series. This exercise will demonstrate how to access and process data (including quality filtering) directly from S3.
%matplotlib inline
import matplotlib.pyplot as plt
from datetime import datetime
import os
import subprocess
import requests
import boto3
from pystac_client import Client
from collections import defaultdict
import numpy as np
import xarray as xr
import rasterio as rio
from rasterio.session import AWSSession
from rasterio.plot import show
import rioxarray
import geopandas
import pyproj
from pyproj import Proj
from shapely.ops import transform
import geoviews as gv
from cartopy import crs
import hvplot.xarray
import holoviews as hv
gv.extension('bokeh', 'matplotlib')
To perform direct S3 data access one needs to acquire temporary S3 credentials. These credentials give users direct access to S3 buckets in NASA Earthdata Cloud. AWS credentials should not be shared, so take precautions when using them in notebooks or scripts. Note that these temporary credentials are valid for only 1 hour. For more information regarding the temporary credentials, visit https://data.lpdaac.earthdatacloud.nasa.gov/s3credentialsREADME.
def get_temp_creds():
temp_creds_url = 'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials'
return requests.get(temp_creds_url).json()
temp_creds_req = get_temp_creds()
#temp_creds_req # !!! BEWARE, removing the # on this line will print your temporary S3 credentials.
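Since the credentials expire after an hour, a longer-running notebook may need to refresh them. A minimal expiry check, assuming the response carries an 'expiration' field in an ISO-like timestamp format (the field name and format here are assumptions; verify them against the actual response):

```python
from datetime import datetime, timezone

def creds_expired(creds, buffer_seconds=300):
    """Return True if temporary credentials are expired or close to expiring.

    Assumes an 'expiration' field like '2021-07-09 17:29:01+00:00';
    adjust the parsing if the service formats it differently.
    """
    expiry = datetime.fromisoformat(creds['expiration'])
    remaining = (expiry - datetime.now(timezone.utc)).total_seconds()
    return remaining < buffer_seconds
```

When the check returns True, call get_temp_creds() again and rebuild anything (e.g., the boto3 session below) that holds the old credentials.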
Create a boto3 session and configure our rasterio environment for data access
Create a boto3 Session object using your temporary credentials. This Session can then be used to pass those credentials and get S3 objects from applicable buckets.
session = boto3.Session(aws_access_key_id=temp_creds_req['accessKeyId'],
aws_secret_access_key=temp_creds_req['secretAccessKey'],
aws_session_token=temp_creds_req['sessionToken'],
region_name='us-west-2')
For this exercise, we are going to open a context manager for the notebook using the rasterio.env module to store the required GDAL and AWS configurations we need to access the data in Earthdata Cloud. While the context manager is open (rio_env.__enter__()), we will be able to run the open or get data commands that would typically be executed within a with statement, thus allowing us to interact with the data more freely. We'll close the context (rio_env.__exit__()) at the end of the notebook.
GDAL environment variables must be configured to access Earthdata Cloud data assets. Geospatial data access Python packages like rasterio and rioxarray depend on GDAL, leveraging GDAL's "Virtual File Systems" to read remote files. GDAL has many environment variables that control its behavior. Changing these settings can mean the difference between being able to access a file or not, and can also have an impact on performance.
rio_env = rio.Env(AWSSession(session),
GDAL_DISABLE_READDIR_ON_OPEN='TRUE',
GDAL_HTTP_COOKIEFILE=os.path.expanduser('~/cookies.txt'),
GDAL_HTTP_COOKIEJAR=os.path.expanduser('~/cookies.txt'))
rio_env.__enter__()
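A few other GDAL configuration options can also matter for remote access performance. The option names below are real GDAL settings, but the values are illustrative rather than tuned; they could be passed as extra keyword arguments to rio.Env alongside the ones above:

```python
# Illustrative values only; tune for your own workload.
gdal_tuning = dict(
    CPL_VSIL_CURL_ALLOWED_EXTENSIONS='.tif',  # skip probing for sidecar files
    GDAL_HTTP_MAX_RETRY='10',                 # retry transient HTTP errors
    GDAL_HTTP_RETRY_DELAY='0.5',              # seconds between retries
)
```

For example: rio.Env(AWSSession(session), **gdal_tuning, ...).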
In the previous section, we used the NASA CMR-STAC API to discover HLS assets that intersect with our search criteria, i.e., ROI, date range, and collections. The search results were filtered and saved as text files, by individual band, for each tile. We will read in the text files for tile T13TGF for the RED (L30: B04 & S30: B04), NIR (L30: B05 & S30: B8A), and Fmask bands.
[t for t in os.listdir('../data') if '.txt' in t]
red_s3_links = open('../data/S3_T13TGF_B04_Links.txt').read().splitlines()
red_s3_links
The near-infrared (NIR) band for Landsat is B05 while the NIR band for Sentinel-2 is B8A. In the next step we will read in and combine the lists into a single NIR list.
nir_bands = ['B05', 'B8A']
nir_link_text = [x for x in os.listdir('../data') if any(b in x for b in nir_bands) and 'S3' in x]
nir_s3_links = []
for file in nir_link_text:
nir_s3_links.extend(open(f'../data/{file}').read().splitlines())
nir_s3_links
fmask_s3_links = open('../data/S3_T13TGF_Fmask_Links.txt').read().splitlines()
fmask_s3_links
In this example we will use the gdalbuildvrt utility to create a time series virtual raster format (VRT) file. The utility, however, expects the links to be formatted as GDAL virtual file system (VSI) paths rather than the actual asset links. We will therefore use the VSI path to access our assets. The example below shows the VSI path substitution for S3 (vsis3) links.
/vsis3/lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2020191T172901.v1.5.B04.tif
See the GDAL Virtual File Systems for more information regarding GDAL VSI.
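The substitution is simple enough to wrap in a small helper (the name to_vsi_path is ours, not part of the tutorial's data files):

```python
def to_vsi_path(url):
    """Convert an s3:// asset link to GDAL's /vsis3/ virtual file system path."""
    if url.startswith('s3://'):
        return url.replace('s3://', '/vsis3/', 1)
    # https:// links would use GDAL's /vsicurl/ handler instead
    return '/vsicurl/' + url

to_vsi_path('s3://lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2020191T172901.v1.5.B04.tif')
# -> '/vsis3/lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2020191T172901.v1.5.B04.tif'
```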
Update the links to use the vsis3 path
with open('../data/S3_T13TGF_RED_VSI_Links.txt', 'w') as f:
links_vsi = [r.replace('s3://', '/vsis3/' ) + '\n' for r in red_s3_links]
for link in links_vsi:
f.write(link)
with open('../data/S3_T13TGF_NIR_VSI_Links.txt', 'w') as f:
links_vsi = [r.replace('s3://', '/vsis3/' ) + '\n' for r in nir_s3_links]
for link in links_vsi:
f.write(link)
with open('../data/S3_T13TGF_FMASK_VSI_Links.txt', 'w') as f:
links_vsi = [r.replace('s3://', '/vsis3/' ) + '\n' for r in fmask_s3_links]
for link in links_vsi:
f.write(link)
We will use the input GeoJSON file to clip the source data to our desired region of interest.
field = geopandas.read_file('../data/ne_w_agfields.geojson')
fieldShape = field['geometry'][0]
To clip
the source data to our input feature boundary, we need to transform the feature boundary from its original WGS84 coordinate reference system to the projected reference system of the source HLS file (i.e., UTM Zone 13).
foa_url = red_s3_links[0]
with rio.open(foa_url) as src:
hls_proj = src.crs.to_string()
hls_proj
geo_CRS = Proj('+proj=longlat +datum=WGS84 +no_defs', preserve_units=True) # Source coordinate system of the ROI
project = pyproj.Transformer.from_proj(geo_CRS, hls_proj) # Set up the transformation
fsUTM = transform(project.transform, fieldShape)
There are multiple ways to read COG data in as a time series. The subprocess package is used in this example to run GDAL's build virtual raster file (gdalbuildvrt) executable outside our Python session. First we'll need to construct a string object with the command and its parameters (including our temporary credentials). Then, we run the command using the subprocess.call() function.
build_red_vrt = f"gdalbuildvrt ../data/red_stack.vrt -separate -input_file_list ../data/S3_T13TGF_RED_VSI_Links.txt --config AWS_ACCESS_KEY_ID {temp_creds_req['accessKeyId']} --config AWS_SECRET_ACCESS_KEY {temp_creds_req['secretAccessKey']} --config AWS_SESSION_TOKEN {temp_creds_req['sessionToken']} --config GDAL_DISABLE_READDIR_ON_OPEN TRUE"
#build_red_vrt # !!! BEWARE, removing the # on this line will print your temporary S3 credentials.
We now have a fully configured gdalbuildvrt
string that we can pass to Python's subprocess
module to run the gdalbuildvrt
executable outside our Python environment.
Run gdalbuildvrt to construct a VRT on disk from the S3 links
%%time
subprocess.call(build_red_vrt, shell=True)
0 means success! You'll have some troubleshooting to do if you get any other value. In this tutorial, the path to the output VRT file and the input file list are the first things to check.
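When the return code is nonzero, capturing the command's stderr is more informative than the exit code alone. A convenience wrapper (run_gdal_command is our own name, not part of the tutorial) using subprocess.run instead of subprocess.call:

```python
import shlex
import subprocess

def run_gdal_command(cmd):
    """Run a command, raising with the captured stderr if it fails."""
    result = subprocess.run(shlex.split(cmd), capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f'command failed ({result.returncode}): {result.stderr.strip()}')
    return result.returncode
```

Splitting the string with shlex avoids invoking a shell; note the credentials would still appear in the process arguments, so the same precautions apply.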
While we're here, we'll build the VRT files for the NIR layers and the Fmask layers.
build_nir_vrt = f"gdalbuildvrt ../data/nir_stack.vrt -separate -input_file_list ../data/S3_T13TGF_NIR_VSI_Links.txt --config AWS_ACCESS_KEY_ID {temp_creds_req['accessKeyId']} --config AWS_SECRET_ACCESS_KEY {temp_creds_req['secretAccessKey']} --config AWS_SESSION_TOKEN {temp_creds_req['sessionToken']} --config GDAL_DISABLE_READDIR_ON_OPEN TRUE"
subprocess.call(build_nir_vrt, shell=True)
build_fmask_vrt = f"gdalbuildvrt ../data/fmask_stack.vrt -separate -input_file_list ../data/S3_T13TGF_FMASK_VSI_Links.txt --config AWS_ACCESS_KEY_ID {temp_creds_req['accessKeyId']} --config AWS_SECRET_ACCESS_KEY {temp_creds_req['secretAccessKey']} --config AWS_SESSION_TOKEN {temp_creds_req['sessionToken']} --config GDAL_DISABLE_READDIR_ON_OPEN TRUE"
subprocess.call(build_fmask_vrt, shell=True)
We can now read the VRT files into our Python session. A drawback of reading VRTs into Python is that the time coordinate variable needs to be constructed. Below we not only read in the VRT file using rioxarray, but we also repurpose the band variable, which is generated automatically, to hold our time information.
%%time
chunks=dict(band=1, x=1024, y=1024)
#chunks=dict(band=1, x=512, y=512)
red = rioxarray.open_rasterio('../data/red_stack.vrt', chunks=chunks) # Read in VRT
red = red.rename({'band':'time'}) # Rename the 'band' coordinate variable to 'time'
red['time'] = [datetime.strptime(x.split('.')[-5], '%Y%jT%H%M%S') for x in red_s3_links] # Extract the time information from the RED input file names and assign it to the time coordinate variable
red = red.sortby('time') # Sort by the time coordinate variable
red
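The strptime pattern above parses the HLS filename timestamps, which encode the year, day of year, and time of acquisition. For example:

```python
from datetime import datetime

link = '/vsis3/lp-prod-protected/HLSS30.015/HLS.S30.T13TGF.2020191T172901.v1.5.B04.tif'
stamp = link.split('.')[-5]                      # '2020191T172901'
when = datetime.strptime(stamp, '%Y%jT%H%M%S')   # %j = day of year
print(when)                                      # 2020-07-09 17:29:01 (day 191 of 2020)
```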
Above we use the chunks parameter in the rioxarray.open_rasterio() function to enable the Dask backing. This allows lazy reading of the data, meaning the data are not actually read into memory at this point. What we have is an object with some metadata and a pointer to the source data. The data will be streamed to us when we call for it, but not stored in memory until we call the Dask compute() or persist() methods.
Inspect the time coordinate
red.time
Up until now, we haven't read any of the HLS data into memory. Now we will use the persist()
method to load the data into memory.
red_clip = red.rio.clip([fsUTM]).persist()
red_clip
Above, we persisted the clipped results to memory using the persist()
method. This doesn't necessarily need to be done, but it will substantially improve the performance of the visualization of the time series below.
Plot red_clip with hvplot
red_clip.hvplot.image(x='x', y='y', width=800, height=600, colorbar=True, cmap='Reds').opts(clim=(0.0, red_clip.values.max()))
%%time
chunks = dict(band=1, x=1024, y=1024)
nir = rioxarray.open_rasterio('../data/nir_stack.vrt', chunks=chunks) # Read in VRT
nir = nir.rename({'band':'time'}) # Rename the 'band' coordinate variable to 'time'
nir['time'] = [datetime.strptime(x.split('.')[-5], '%Y%jT%H%M%S') for x in nir_s3_links] # Extract the time information from the NIR input file names and assign it to the time coordinate variable
nir = nir.sortby('time') # Sort by the time coordinate variable
nir
%%time
chunks = dict(band=1, x=1024, y=1024)
fmask = rioxarray.open_rasterio('../data/fmask_stack.vrt', chunks=chunks) # Read in VRT
fmask = fmask.rename({'band':'time'}) # Rename the 'band' coordinate variable to 'time'
fmask['time'] = [datetime.strptime(x.split('.')[-5], '%Y%jT%H%M%S') for x in fmask_s3_links] # Extract the time information from the Fmask input file names and assign it to the time coordinate variable
fmask = fmask.sortby('time') # Sort by the time coordinate variable
fmask
Create an xarray dataset
We will now combine the RED, NIR, and Fmask arrays into a dataset and create/add a new NDVI variable.
hls_ndvi = xr.Dataset({'red': red, 'nir': nir, 'fmask': fmask, 'ndvi': (nir - red) / (nir + red)})
hls_ndvi
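The NDVI expression is the same element-wise arithmetic that xarray applies across the whole stack. On plain NumPy arrays with made-up reflectance values:

```python
import numpy as np

red_vals = np.array([0.10, 0.20, 0.05])  # illustrative red reflectances
nir_vals = np.array([0.50, 0.60, 0.45])  # illustrative NIR reflectances
ndvi = (nir_vals - red_vals) / (nir_vals + red_vals)
# Dense, healthy vegetation pushes NDVI toward 1
```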
Above, we created a new NDVI variable. Now, we will clip and plot our results.
ndvi_clip = hls_ndvi.rio.clip([fsUTM]).persist()
ndvi_clip
ndvi_clip.ndvi.hvplot.image(x='x', y='y', groupby='time', width=800, height=600, colorbar=True, cmap='YlGn').opts(clim=(0.0, 1.0))
You may have noticed that the images for some of the time steps are 'blurrier' than others. This is because they are contaminated in some way, be it by clouds, cloud shadows, snow, or ice.
We want to keep NDVI values only where the Fmask indicates clear conditions (no clouds, no cloud shadow, no snow/ice, no water). In the HLS v1.5 Fmask bit encoding, a value of 64 corresponds to an otherwise clear observation with only the aerosol-quality bits set (indicating low aerosol), so we filter on that value below.
ndvi_clip_filter = hls_ndvi.ndvi.where(fmask==64, np.nan).rio.clip([fsUTM]).persist()
ndvi_clip_filter
ndvi_clip_filter.hvplot.image(x='x', y='y', groupby='time', width=800, height=600, colorbar=True, cmap='YlGn').opts(clim=(0.0, 1.0))
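Decoding the Fmask bits makes the fmask == 64 test explicit. The bit positions below follow the HLS v1.5 user guide (double-check them against the product documentation for other product versions):

```python
def decode_fmask(value):
    """Unpack an HLS v1.5 Fmask value into named quality flags."""
    return {
        'cirrus':         bool(value & 1),
        'cloud':          bool(value & 2),
        'adjacent_cloud': bool(value & 4),
        'cloud_shadow':   bool(value & 8),
        'snow_ice':       bool(value & 16),
        'water':          bool(value & 32),
        'aerosol_level':  (value >> 6) & 3,  # 0=climatology, 1=low, 2=moderate, 3=high
    }

decode_fmask(64)  # only the aerosol bits set (low aerosol): an otherwise clear pixel
```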
Finally, we will use xarray's groupby
operation to aggregate by month.
ndvi_clip_filter.groupby('time.month').mean('time').hvplot.image(x = 'x', y = 'y', crs = hls_proj, groupby='month', cmap='YlGn', width=800, height=600, colorbar=True).opts(clim=(0.0, 1.0))
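Conceptually, the 'time.month' aggregation groups the time steps by calendar month and averages within each group. A plain-Python sketch of that logic, with made-up per-scene NDVI means:

```python
import numpy as np
from datetime import datetime

dates = [datetime(2020, 7, 9), datetime(2020, 7, 25), datetime(2020, 8, 2)]
values = np.array([0.60, 0.70, 0.55])    # illustrative per-scene NDVI means

groups = {}
for d, v in zip(dates, values):
    groups.setdefault(d.month, []).append(v)
monthly_mean = {m: round(float(np.mean(vs)), 2) for m, vs in groups.items()}
print(monthly_mean)  # {7: 0.65, 8: 0.55}
```

xarray performs the same grouping lazily across every pixel of the stack, which is why the Dask-backed arrays make this efficient.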
rio_env.__exit__()