AWS S3 buckets can be configured to control access to the data they contain. Some buckets are completely free and open to users, like the Sentinel-2 Cloud Optimized GeoTIFF data within Open Data on AWS. Others are free and open but require authentication, like NASA assets stored in the NASA Cumulus cloud space. Still others require the user to confirm they will pay for data that is pulled out of an AWS Region (e.g., requester pays). Below we walk through how to access buckets with each of these configurations, and show how to access assets via both HTTPS and S3 (when that is an option).
%matplotlib inline
import matplotlib.pyplot as plt
import requests
import boto3
import rasterio as rio # https://rasterio.readthedocs.io/en/latest/
from rasterio.plot import show
from rasterio.session import AWSSession
from osgeo import gdal
import rioxarray # https://corteva.github.io/rioxarray/stable/index.html
GDAL is a foundational piece of geospatial software that is leveraged by many popular geospatial packages, both open-source and closed-source. The rasterio package is no exception. Rasterio leverages GDAL to, among other things, read and write raster data files, e.g., GeoTIFFs/Cloud Optimized GeoTIFFs. To read remote files, i.e., files/objects stored in the cloud, GDAL uses its Virtual File System API. In a perfect world, one would be able to point a Virtual File System (there are several) at a remote data asset and have the asset retrieved, but that is not always the case. GDAL has a host of configuration options/environment variables that adjust its behavior to, for example, make a request more performant or pass AWS credentials to the distribution system. Below, we'll identify the environment variables that will help us get our data from the cloud.
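To make the Virtual File System idea concrete, here is a minimal sketch of how an HTTPS or S3 URL maps onto GDAL's `/vsicurl/` and `/vsis3/` path prefixes. Rasterio performs a translation along these lines when a URL is passed to `rio.open`; the helper name `to_vsi_path` and the example URLs are hypothetical and only for illustration.

```python
from urllib.parse import urlparse


def to_vsi_path(url: str) -> str:
    """Translate an HTTPS or S3 URL into a GDAL virtual file system path."""
    parts = urlparse(url)
    if parts.scheme in ("http", "https"):
        # /vsicurl/ keeps the full URL, scheme included.
        return f"/vsicurl/{url}"
    if parts.scheme == "s3":
        # /vsis3/ takes bucket/key with the scheme stripped.
        return f"/vsis3/{parts.netloc}{parts.path}"
    raise ValueError(f"Unsupported scheme: {parts.scheme!r}")


# Hypothetical URLs, used only to show the two forms.
print(to_vsi_path("https://example.com/data/B04.tif"))
print(to_vsi_path("s3://example-bucket/data/B04.tif"))
```

Paths in this form are what tools that call GDAL directly (for example `gdalinfo`) expect, which is why the same asset can be addressed either through HTTPS or through S3.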
fo_url = 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/18/S/UH/2020/7/S2A_18SUH_20200729_0_L2A/B04.tif'
%%time
with rio.Env(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR', # https://rasterio.readthedocs.io/en/latest/topics/configuration.html
             AWS_NO_SIGN_REQUEST='YES',
             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif'):
    with rio.open(fo_url) as src:
        print(src.profile)
        print(f'Overviews levels: {src.overviews(1)}')
        print(type(src))
        fig, ax = plt.subplots(1, figsize=(12, 12))
        show(src)
foa_url = "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSS30.020/HLS.S30.T13TGF.2020274T174141.v2.0/HLS.S30.T13TGF.2020274T174141.v2.0.B04.tif"
If you do not have a netrc file with the proper credentials in your user directory and do not set the proper GDAL environment variables, you will get a fun/misleading error.
In this example we need to set GDAL_HTTP_COOKIEFILE and GDAL_HTTP_COOKIEJAR. These two configuration settings, along with the configurations we set above, will allow us to access these data.
%%time
with rio.Env(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',
             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',
             GDAL_HTTP_COOKIEFILE='~/cookies.txt',
             GDAL_HTTP_COOKIEJAR='~/cookies.txt'):
    with rio.open(foa_url) as src:
        print(src.profile)
        print(f'Overviews levels: {src.overviews(1)}')
        print(type(src))
        hls_ov_levels = src.overviews(1) # We'll use this later on...
        hls_proj = src.crs.to_string() # We'll use this later on too...
        fig, ax = plt.subplots(1, figsize=(12, 12))
        show(src, cmap='Reds')
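Because the authenticated HTTPS request depends on a netrc entry, it can help to check for one before opening the asset. The sketch below uses only the standard library. The host name `urs.earthdata.nasa.gov` (the Earthdata Login host) and the helper name `has_netrc_entry` are assumptions that do not appear in the text above, so adjust them to match your setup.

```python
import netrc
import os

EARTHDATA_HOST = "urs.earthdata.nasa.gov"  # assumed Earthdata Login host


def has_netrc_entry(host, netrc_path):
    """Return True if netrc_path holds a login and password for host."""
    try:
        auth = netrc.netrc(netrc_path).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        # Missing, unreadable, or malformed file: treat as "no entry".
        return False
    # auth is a (login, account, password) tuple, or None if nothing matches.
    return auth is not None and bool(auth[0]) and bool(auth[2])


netrc_path = os.path.expanduser("~/.netrc")
if not has_netrc_entry(EARTHDATA_HOST, netrc_path):
    print(f"No entry for {EARTHDATA_HOST} in {netrc_path}")
```

Note that a `default` entry in the netrc file also counts as a match here, since that is how the standard library resolves hosts.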
session = boto3.Session()
rp_url = 's3://usgs-landsat/collection02/level-2/standard/oli-tirs/2020/032/031/LC08_L2SP_032031_20200704_20210330_02_T1/LC08_L2SP_032031_20200704_20210330_02_T1_SR_B4.TIF'
#rp_url = 's3://usgs-landsat/collection02/level-2/standard/oli-tirs/2020/032/032/LC08_L2SP_032032_20200704_20210330_02_T1/LC08_L2SP_032032_20200704_20210330_02_T1_SR_B4.TIF'
with rio.Env(AWSSession(session, requester_pays=True),
             AWS_NO_SIGN_REQUEST='NO',
             GDAL_DISABLE_READDIR_ON_OPEN='TRUE'):
    with rio.open(rp_url) as src:
        print(src.profile)
        print(f'Overviews levels: {src.overviews(1)}')
        print(type(src))
        fig, ax = plt.subplots(1, figsize=(12, 12))
        show(src, cmap='Reds')
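`rio.Env` scopes its configuration to the `with` block. GDAL also reads configuration options from environment variables, so a similar scoped effect can be sketched for GDAL-based tools used outside rasterio. The helper `temporary_env` below is hypothetical; `AWS_REQUEST_PAYER` is, to my understanding, the GDAL option that corresponds to a requester-pays request.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**options):
    """Set environment variables for the duration of a with-block, then restore."""
    previous = {key: os.environ.get(key) for key in options}
    os.environ.update(options)
    try:
        yield
    finally:
        for key, old in previous.items():
            if old is None:
                # The variable did not exist before, so remove it again.
                os.environ.pop(key, None)
            else:
                os.environ[key] = old


with temporary_env(AWS_REQUEST_PAYER="requester"):
    # GDAL-based calls made here would see the requester-pays option.
    print(os.environ["AWS_REQUEST_PAYER"])
```

Restoring the previous values on exit keeps a requester-pays setting from leaking into later requests that should not incur charges.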
Temporary S3 bucket access credentials: https://lpdaac.earthdata.nasa.gov/s3credentials
nasa_hls_s3_url = 's3://lp-prod-protected/HLSS30.020/HLS.S30.T13TGF.2020274T174141.v2.0/HLS.S30.T13TGF.2020274T174141.v2.0.B04.tif'
def get_temp_creds():
    temp_creds_url = 'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials'
    return requests.get(temp_creds_url).json()
temp_creds_req = get_temp_creds()
#temp_creds_req
session = boto3.Session(aws_access_key_id=temp_creds_req['accessKeyId'],
                        aws_secret_access_key=temp_creds_req['secretAccessKey'],
                        aws_session_token=temp_creds_req['sessionToken'],
                        region_name='us-west-2')
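Temporary credentials stop working after a while, so it can be useful to check them before building a session. The sketch below maps the response keys used above to `boto3.Session` keyword arguments and tests an expiry timestamp. It assumes the response also carries an `expiration` field holding an ISO-style timestamp with a UTC offset; that field, the helper names, and the example values are assumptions and are not taken from the text above.

```python
from datetime import datetime, timezone


def session_kwargs(creds, region="us-west-2"):
    """Map the s3credentials JSON keys to boto3.Session keyword arguments."""
    return {
        "aws_access_key_id": creds["accessKeyId"],
        "aws_secret_access_key": creds["secretAccessKey"],
        "aws_session_token": creds["sessionToken"],
        "region_name": region,
    }


def is_expired(creds, now=None):
    """Return True once the (assumed) 'expiration' timestamp has passed."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromisoformat(creds["expiration"])
    return now >= expires


# Made-up values standing in for a real response.
example = {
    "accessKeyId": "EXAMPLEKEYID",
    "secretAccessKey": "example-secret",
    "sessionToken": "example-token",
    "expiration": "2021-01-27 00:50:09+00:00",
}
print(is_expired(example))
```

With a real response, `boto3.Session(**session_kwargs(creds))` would build the same session as the cell above, and `is_expired` could signal when to call `get_temp_creds()` again.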
with rio.Env(AWSSession(session),
             GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',
             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',
             VSI_CACHE=True,
             region_name='us-west-2'):
    with rio.open(nasa_hls_s3_url, 'r') as src:
        print(src.profile)
%%time
with rio.Env(AWSSession(session),
             GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',
             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',
             VSI_CACHE=True,
             region_name='us-west-2'):
    with rio.open(nasa_hls_s3_url, 'r') as src:
        print(src.profile)
        print(f'Overviews levels: {src.overviews(1)}')
        print(type(src))
        fig, ax = plt.subplots(1, figsize=(12, 12))
        show(src, cmap='Reds')
The image above is, in fact, the same NASA HLS asset we requested using HTTPS. Notice how much faster the rendering is when direct S3 access is used.
All the examples above use Rasterio directly to open the cloud assets. Reading from the resulting dataset returns a numpy.ndarray object. We can also use a package called rioxarray to read our cloud assets in as an xarray/dask array object.
Access via rioxarray
with rio.Env(AWSSession(session),
             GDAL_DISABLE_READDIR_ON_OPEN='TRUE',
             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',
             VSI_CACHE=True,
             region_name='us-west-2'):
    with rioxarray.open_rasterio(nasa_hls_s3_url) as src:
        ds = src.squeeze('band', drop=True)
        print(ds)
Note about context manager/environments
ds
ds.values # Throws an error...on purpose
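The error comes from lazy loading: the array's values are only read when they are requested, and by that point the `with` blocks that provided the open dataset and its configuration have already exited. The sketch below is a toy model of that behavior using only the standard library. It is not how rioxarray is implemented, and the `LazyValues` class is hypothetical.

```python
import os
import tempfile


class LazyValues:
    """Toy stand-in for a lazily loaded array: reads only when asked."""

    def __init__(self, handle):
        self._handle = handle

    @property
    def values(self):
        # The read happens here, not when the object is created.
        self._handle.seek(0)
        return self._handle.read()


# Write a small file to play the role of the remote asset.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 2 3")

with open(tmp.name) as handle:
    lazy = LazyValues(handle)
    inside = lazy.values  # works: the handle is still open

try:
    lazy.values  # fails: the handle was closed when the block exited
except ValueError as err:
    outside_error = str(err)

os.remove(tmp.name)
print(inside, "|", outside_error)
```

The same reasoning explains the fix for the real data: request the values while the context managers are still active.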
Access the data values within the rasterio environment context manager (re-run the cell if you get an error the first time)
with rio.Env(AWSSession(session),
             GDAL_DISABLE_READDIR_ON_OPEN='TRUE',
             CPL_VSIL_CURL_ALLOWED_EXTENSIONS='tif',
             VSI_CACHE=True,
             region_name='us-west-2'):
    with rioxarray.open_rasterio(nasa_hls_s3_url) as src:
        ds = src.squeeze('band', drop=True)
        print(ds.values)