This notebook is based on the NSIDC-Data-Access-Notebook provided through NSIDC's Github organization.
Now that we've visualized our study areas, we will first explore data coverage, size, and the availability of customization services (subsetting, reformatting, reprojection), and then access the associated files.
___A note on data access options:___ We will be pursuing data discovery and access "programmatically" using Application Programming Interfaces, or APIs.
What is an API? You can think of an API as a middleman between an application or end user (in this case, us) and a data provider. Here the data provider is both the Common Metadata Repository (CMR), which houses data information, and NSIDC as the data distributor. These APIs are generally structured as a URL with a base plus individual key-value pairs separated by '&'.
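For example, the CMR collection search we will use below can be written as a single URL whose query string is just key-value pairs joined by '&' (a minimal illustration; we will build the same query with the requests library shortly):

# Build a key-value-pair API request by hand for illustration
base = 'https://cmr.earthdata.nasa.gov/search/collections.json'
kvps = {'short_name': 'ATL07', 'page_size': 10}
url = base + '?' + '&'.join(f'{k}={v}' for k, v in kvps.items())
print(url)  # https://cmr.earthdata.nasa.gov/search/collections.json?short_name=ATL07&page_size=10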
There are other discovery and access methods available from NSIDC listed on the data set landing page (e.g. ATL07 Data Access) and NASA Earthdata Search. Programmatic API access is beneficial for those of you who want to incorporate data access into your visualization and analysis workflow. This method is also reproducible and documented to ensure data provenance.
Here are the steps we will walk through in this customize and access notebook: exploring data coverage and size over our region and time of interest, discovering customization service availability, authenticating with Earthdata Login, and constructing and submitting our data requests.
import requests
import json
import math
import earthaccess
# This is our functions module. We created several functions used in this notebook and the Visualize and Analyze notebook.
import tutorial_helper_functions as fn
The Common Metadata Repository (CMR) is a high-performance, high-quality, continuously evolving metadata system that catalogs Earth Science data and associated service metadata records. These metadata records are registered, modified, discovered, and accessed through programmatic interfaces leveraging standard protocols and APIs. Note that not all NSIDC data can be searched at the file level using CMR, particularly those outside of the NASA DAAC program.
CMR API documentation: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html
Data sets are selected by data set IDs (e.g. ATL07, ATL10, and MOD29). In the CMR API documentation, a data set id is referred to as a "short name". These short names are located at the top of each NSIDC data set landing page in white underneath the full title, after 'DATA SET:'.
We are using the Python Requests package to access the CMR. Data are then converted to JSON format; a language-independent, human-readable, open-standard file format. More than one version can exist for a given data set:
# Create dictionary of data set parameters we'll use in our access API command below. We'll start with data set IDs (e.g. ATL07) of interest here, also known as "short name".
data_dict = {
    'sea_ice_fb': {'short_name': 'ATL10'},
    'sea_ice_height': {'short_name': 'ATL07'},
    'ist': {'short_name': 'MOD29'},
}
# Get JSON response from CMR collection metadata to grab version numbers and add the most recent version number to data_dict
cmr_collections_url = 'https://cmr.earthdata.nasa.gov/search/collections.json'
for k, params in data_dict.items():
    response = requests.get(cmr_collections_url, params=params)
    results = json.loads(response.content)
    # Find all instances of 'version_id' in the collection metadata
    versions = [el['version_id'] for el in results['feed']['entry']]
    # Drop provisional versions containing letters, then keep the most recent
    versions = [v for v in versions if not any(c.isalpha() for c in v)]
    data_dict[k]['version'] = max(versions, key=int)  # compare numerically, e.g. '005' < '061'
We will add spatial and temporal filters to the data dictionary. The bounding box coordinates cover our region of interest over the East Siberian Sea and the temporal range covers March 23, 2019.
# Bounding Box spatial parameter in 'W,S,E,N' decimal degrees format
bounding_box = '140,72,153,80'
# Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
temporal = '2019-03-23T00:00:00Z,2019-03-23T23:59:59Z'
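If you prefer to derive these strings from datetime objects rather than typing them out, a small sketch:

# Equivalent construction of the temporal string from datetime objects
from datetime import datetime

start = datetime(2019, 3, 23, 0, 0, 0)
end = datetime(2019, 3, 23, 23, 59, 59)
temporal = f'{start:%Y-%m-%dT%H:%M:%S}Z,{end:%Y-%m-%dT%H:%M:%S}Z'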
# Add bounding_box and temporal to each data set in the dictionary
for k, v in data_dict.items():
    v['bounding_box'] = bounding_box
    v['temporal'] = temporal
We will use the granule_info function to query the CMR granule API. The function prints the number of granules, their average size, and their total volume, and returns the granule count.
for k, v in data_dict.items():
    gran_num = fn.granule_info(v)
    v['gran_num'] = gran_num
There are 2 granules of ATL10 version 005 over my area and time of interest. The average size of each granule is 168.34 MB and the total size of all 2 granules is 336.69 MB
There are 4 granules of ATL07 version 005 over my area and time of interest. The average size of each granule is 320.07 MB and the total size of all 4 granules is 1280.29 MB
There are 13 granules of MOD29 version 61 over my area and time of interest. The average size of each granule is 2.80 MB and the total size of all 13 granules is 36.40 MB
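For reference, a helper like granule_info can be implemented as a thin wrapper around the CMR granule search endpoint. A minimal sketch, assuming the granules.json response reports granule_size in MB (the actual fn.granule_info may differ):

def granule_info_sketch(params):
    # Query CMR granule search with short_name, version, bounding_box, and temporal
    granule_url = 'https://cmr.earthdata.nasa.gov/search/granules.json'
    response = requests.get(granule_url, params={**params, 'page_size': 2000})
    granules = response.json()['feed']['entry']
    if not granules:
        print('No granules found')
        return 0
    sizes = [float(g['granule_size']) for g in granules]
    print(f'{len(granules)} granules, {sum(sizes)/len(sizes):.2f} MB average, '
          f'{sum(sizes):.2f} MB total')
    return len(granules)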
Note that subsetting, reformatting, or reprojecting can alter the size of the granules if those services are applied to your request.
The NSIDC DAAC supports customization (subsetting, reformatting, reprojection) services on many of our NASA Earthdata mission collections. Let's discover whether or not our data sets have these services available using the print_service_options function. If services are available, we will also determine the specific service options supported for each data set, which we will then add to our data dictionary.
An Earthdata Login account is required to query data services and to access data from the NSIDC DAAC. If you do not already have an Earthdata Login account, visit http://urs.earthdata.nasa.gov to register. We are going to use the earthaccess library to authenticate with our Earthdata Login credentials, and we recommend storing those credentials in a netrc file. If you prefer not to use a netrc file, skip this step; you will instead be prompted to enter your credentials below. To create a netrc file, add the following line to a .netrc file in your home directory:
machine urs.earthdata.nasa.gov login <USERNAME> password <PASSWORD>
where <USERNAME> and <PASSWORD> are replaced by your Earthdata Login username and password.
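If you'd rather create the file from Python, here is a minimal one-time sketch (substitute your own credentials; on Windows the file is typically named _netrc):

# One-time setup: write the netrc entry and restrict permissions
from pathlib import Path

netrc = Path.home() / '.netrc'
netrc.write_text('machine urs.earthdata.nasa.gov login <USERNAME> password <PASSWORD>\n')
netrc.chmod(0o600)  # the file must not be readable by other users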
auth = earthaccess.login()
EDL_USERNAME and EDL_PASSWORD are not set in the current environment, try setting them or use a different strategy (netrc, interactive)
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 04/16/2023
Using .netrc file for EDL
We now need to create an HTTP session in order to store cookies and pass our credentials to the data service URLs. The capability URL below is what we will query to determine service information.
import warnings
warnings.filterwarnings('ignore')

# Query the service capability URL for each data set
for k, v in data_dict.items():
    capability_url = f"https://n5eil02u.ecs.nsidc.org/egi/capabilities/{v['short_name']}.{v['version']}.xml"

    # Create session to store cookie and pass credentials to the capabilities URL
    session = earthaccess.get_requests_https_session()
    s = session.get(capability_url)
    response = session.get(s.url)  # re-request the resolved URL after any login redirect
    response.raise_for_status()  # raise bad request to check that Earthdata Login credentials were accepted

    # Print all services available for this data set
    fn.print_service_options(v, response)
Services available for ATL10:
    Bounding box subsetting
    Shapefile subsetting
    Temporal subsetting
    Variable subsetting
    Reformatting to the following options: ['TABULAR_ASCII', 'NetCDF4-CF']

Services available for ATL07:
    Bounding box subsetting
    Shapefile subsetting
    Temporal subsetting
    Variable subsetting
    Reformatting to the following options: ['TABULAR_ASCII', 'NetCDF4-CF']

Services available for MOD29:
    Bounding box subsetting
    Variable subsetting
    Reformatting to the following options: ['GeoTIFF']
    Reprojection to the following options: ['GEOGRAPHIC', 'UNIVERSAL TRANSVERSE MERCATOR', 'POLAR STEREOGRAPHIC', 'SINUSOIDAL']
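Under the hood, the capabilities document is XML. A helper like print_service_options can collect the advertised services roughly as follows (a sketch; the SubsetAgent, Format, and Projection element names are assumptions based on the NSIDC capabilities schema and may differ):

import xml.etree.ElementTree as ET

root = ET.fromstring(response.content)
# Collect service options advertised in the capabilities XML
subagents = [el.attrib for el in root.iter('SubsetAgent')]           # subsetting services
formats = [el.get('value') for el in root.iter('Format')]            # reformatting options
projections = [el.get('value') for el in root.iter('Projection')]    # reprojection options
print('Subsetting agents:', subagents)
print('Formats:', [f for f in formats if f])
print('Projections:', [p for p in projections if p])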
We already added our CMR search keywords to our data dictionary, so now we need to add the service options we want to request. A list of all available service keywords for use with NSIDC's access and service API is available in our Key-Value-Pair table, part of our Programmatic access guide. For our ATL10 and ATL07 requests, we are interested in bounding box and temporal subsetting. For MOD29, we are interested in spatial subsetting. These options crop the data to the specified ranges of interest. We will enter those values into our data dictionary below.
Spatial subsetting: Output files are cropped to the specified bounding box extent.
Temporal subsetting: Output files are cropped to the specified temporal range extent.
# Spatial and temporal subsetting for ATL10
data_dict['sea_ice_fb']['bbox'] = bounding_box
data_dict['sea_ice_fb']['time'] = '2019-03-23T00:00:00,2019-03-23T23:59:59'
# Spatial and temporal subsetting for ATL07
data_dict['sea_ice_height']['bbox'] = bounding_box
data_dict['sea_ice_height']['time'] = '2019-03-23T00:00:00,2019-03-23T23:59:59'
# Spatial subsetting for MOD29
data_dict['ist']['bbox'] = bounding_box
For ATL07, we are also interested in variable subsetting.
Variable subsetting: Subsets the data set variable or group of variables. For hierarchical data, all lower level variables are returned if a variable group or subgroup is specified.
For ATL07, we will use only the strong beams, since these provide better coverage and resolution due to stronger surface returns. According to the user guide, the spacecraft was in the backward orientation during our day of interest, making the gt*l beams the strong beams.
We'll use these primary geolocation, height and quality variables of interest for each of the three strong beams. The following descriptions are provided in the ATL07 Data Dictionary, with additional information on the algorithm and variable descriptions in the ATBD (Algorithm Theoretical Basis Document).
- delta_time: Number of GPS seconds since the ATLAS SDP epoch.
- latitude: Latitude, WGS84, North=+; latitude of segment center.
- longitude: Longitude, WGS84, East=+; longitude of segment center.
- height_segment_height: Mean height from the along-track segment fit determined by the sea ice algorithm.
- height_segment_confidence: Confidence level in the surface height estimate based on the number of photons, the background noise rate, and the error analysis.
- height_segment_quality: Height segment quality flag; 1 is good quality, 0 is bad.
- height_segment_surface_error_est: Error estimate of the surface height (reported in meters).
- height_segment_length_seg: Along-track length of the segment containing n_photons_actual.
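As an aside, delta_time can be converted to a calendar date; a rough sketch assuming the standard ATLAS SDP epoch of 2018-01-01T00:00:00 UTC and ignoring the GPS-UTC leap-second offset (about 18 s in 2019):

from datetime import datetime, timedelta, timezone

ATLAS_SDP_EPOCH = datetime(2018, 1, 1, tzinfo=timezone.utc)

def delta_time_to_utc(delta_time_s):
    # Approximate conversion; ignores leap seconds
    return ATLAS_SDP_EPOCH + timedelta(seconds=float(delta_time_s))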
We will enter these variables into the data dictionary below:
data_dict['sea_ice_height']['coverage'] = '/gt1l/sea_ice_segments/delta_time,\
/gt1l/sea_ice_segments/latitude,\
/gt1l/sea_ice_segments/longitude,\
/gt1l/sea_ice_segments/heights/height_segment_confidence,\
/gt1l/sea_ice_segments/heights/height_segment_height,\
/gt1l/sea_ice_segments/heights/height_segment_quality,\
/gt1l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt1l/sea_ice_segments/heights/height_segment_length_seg,\
/gt2l/sea_ice_segments/delta_time,\
/gt2l/sea_ice_segments/latitude,\
/gt2l/sea_ice_segments/longitude,\
/gt2l/sea_ice_segments/heights/height_segment_confidence,\
/gt2l/sea_ice_segments/heights/height_segment_height,\
/gt2l/sea_ice_segments/heights/height_segment_quality,\
/gt2l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt2l/sea_ice_segments/heights/height_segment_length_seg,\
/gt3l/sea_ice_segments/delta_time,\
/gt3l/sea_ice_segments/latitude,\
/gt3l/sea_ice_segments/longitude,\
/gt3l/sea_ice_segments/heights/height_segment_confidence,\
/gt3l/sea_ice_segments/heights/height_segment_height,\
/gt3l/sea_ice_segments/heights/height_segment_quality,\
/gt3l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt3l/sea_ice_segments/heights/height_segment_length_seg'
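Typing the coverage list by hand is error-prone; the same string can also be generated from the beam and variable names (a small sketch producing an identical value):

# Build the coverage string programmatically from beam and variable names
beams = ['gt1l', 'gt2l', 'gt3l']
geo_vars = ['delta_time', 'latitude', 'longitude']
height_vars = ['height_segment_confidence', 'height_segment_height',
               'height_segment_quality', 'height_segment_surface_error_est',
               'height_segment_length_seg']
coverage = ','.join(
    f'/{beam}/sea_ice_segments/{var}'
    for beam in beams
    for var in geo_vars + [f'heights/{h}' for h in height_vars]
)
data_dict['sea_ice_height']['coverage'] = coverage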
The data can be requested asynchronously or synchronously. The asynchronous option allows concurrent requests to be queued and processed as orders; those orders are delivered to the specified email address, or they can be accessed programmatically as shown below. Synchronous requests (request_mode set to stream rather than async, per the programmatic access guide) download the data as soon as processing is complete. For this tutorial, we will select the asynchronous method.
# Set NSIDC data access base URL
base_url = 'https://n5eil02u.ecs.nsidc.org/egi/request'

for k, v in data_dict.items():
    # Set the request mode to asynchronous
    v['request_mode'] = 'async'
    # Set the page size to the maximum for asynchronous requests
    page_size = 2000
    v['page_size'] = page_size
    # Determine the number of orders needed for requests over 2000 granules
    page_num = math.ceil(v['gran_num'] / page_size)
    v['page_num'] = page_num
    del v['gran_num']
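With at most 13 granules per data set here and a page size of 2000, math.ceil(gran_num / page_size) evaluates to 1, so each of our requests fits in a single order.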
Programmatic API requests are formatted as HTTPS URLs that contain key-value-pairs specifying the service operations that we specified above. We will first create a string of key-value-pairs from our data dictionary and we'll feed those into our API endpoint. This API endpoint can be executed via command line, a web browser, or in Python below.
endpoint_list = []
for k, v in data_dict.items():
    # Convert the data set's parameter dictionary to a string of key-value pairs
    param_string = '&'.join(f'{key}={val}' for key, val in v.items())

    # API base URL + request parameters
    api_request = f'{base_url}?{param_string}'
    endpoint_list.append(api_request)

    # For requests over 2000 granules, add an endpoint per additional page
    if v['page_num'] > 1:
        for page_val in range(2, v['page_num'] + 1):
            v['page_num'] = page_val
            param_string = '&'.join(f'{key}={val}' for key, val in v.items())
            endpoint_list.append(f'{base_url}?{param_string}')

print('\n'.join('\n' + s for s in endpoint_list))
https://n5eil02u.ecs.nsidc.org/egi/request?short_name=ATL10&version=005&bounding_box=140,72,153,80&temporal=2019-03-23T00:00:00Z,2019-03-23T23:59:59Z&page_size=2000&page_num=1&bbox=140,72,153,80&time=2019-03-23T00:00:00,2019-03-23T23:59:59&request_mode=async

https://n5eil02u.ecs.nsidc.org/egi/request?short_name=ATL07&version=005&bounding_box=140,72,153,80&temporal=2019-03-23T00:00:00Z,2019-03-23T23:59:59Z&page_size=2000&page_num=1&bbox=140,72,153,80&time=2019-03-23T00:00:00,2019-03-23T23:59:59&coverage=/gt1l/sea_ice_segments/delta_time,/gt1l/sea_ice_segments/latitude,/gt1l/sea_ice_segments/longitude,/gt1l/sea_ice_segments/heights/height_segment_confidence,/gt1l/sea_ice_segments/heights/height_segment_height,/gt1l/sea_ice_segments/heights/height_segment_quality,/gt1l/sea_ice_segments/heights/height_segment_surface_error_est,/gt1l/sea_ice_segments/heights/height_segment_length_seg,/gt2l/sea_ice_segments/delta_time,/gt2l/sea_ice_segments/latitude,/gt2l/sea_ice_segments/longitude,/gt2l/sea_ice_segments/heights/height_segment_confidence,/gt2l/sea_ice_segments/heights/height_segment_height,/gt2l/sea_ice_segments/heights/height_segment_quality,/gt2l/sea_ice_segments/heights/height_segment_surface_error_est,/gt2l/sea_ice_segments/heights/height_segment_length_seg,/gt3l/sea_ice_segments/delta_time,/gt3l/sea_ice_segments/latitude,/gt3l/sea_ice_segments/longitude,/gt3l/sea_ice_segments/heights/height_segment_confidence,/gt3l/sea_ice_segments/heights/height_segment_height,/gt3l/sea_ice_segments/heights/height_segment_quality,/gt3l/sea_ice_segments/heights/height_segment_surface_error_est,/gt3l/sea_ice_segments/heights/height_segment_length_seg&request_mode=async

https://n5eil02u.ecs.nsidc.org/egi/request?short_name=MOD29&version=61&bounding_box=140,72,153,80&temporal=2019-03-23T00:00:00Z,2019-03-23T23:59:59Z&page_size=2000&page_num=1&bbox=140,72,153,80&request_mode=async
We will now download the data using the request_data function, which utilizes the Python requests library. Our parameter dictionary and HTTP session will be passed to the function to allow Earthdata Login access. The data will be downloaded directly to this notebook's directory in a new Outputs folder, and the progress of each order will be reported. The data are returned in separate folders per order, so we'll use the clean_folder function to remove those individual folders.
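Before we call the helper below, here is a minimal sketch of the asynchronous order flow that a function like request_data implements, assuming the EGI XML responses contain orderId and requestStatus/status elements (the actual helper also reports progress and handles failed orders):

import io
import time
import zipfile
import xml.etree.ElementTree as ET

def request_data_sketch(params, session):
    # Submit the asynchronous order; the response is an XML status document
    submit = session.get(base_url, params=params)
    order_id = ET.fromstring(submit.content).find('.//orderId').text

    # Poll the order status URL until processing finishes
    status_url = f'{base_url}/{order_id}'
    status = 'processing'
    while status in ('pending', 'processing'):
        time.sleep(10)
        status_root = ET.fromstring(session.get(status_url).content)
        status = status_root.find('.//requestStatus/status').text

    # On success, download and extract the zipped output
    if status == 'complete':
        zip_url = f'https://n5eil02u.ecs.nsidc.org/esir/{order_id}.zip'
        with zipfile.ZipFile(io.BytesIO(session.get(zip_url).content)) as z:
            z.extractall('Outputs')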
# Remove the page_num parameter from data_dict before ordering
for k, v in data_dict.items():
    del v['page_num']

# Now use the request_data function to request the data
for k, v in data_dict.items():
    # Create a session to store cookies and pass our Earthdata Login credentials
    session = earthaccess.get_requests_https_session()
    fn.request_data(v, session)

# Clean up the output folder
fn.clean_folder()
Request HTTP response: 201
Order request URL: https://n5eil02u.ecs.nsidc.org/egi/request?short_name=ATL10&version=005&bounding_box=140%2C72%2C153%2C80&temporal=2019-03-23T00%3A00%3A00Z%2C2019-03-23T23%3A59%3A59Z&page_size=2000&bbox=140%2C72%2C153%2C80&time=2019-03-23T00%3A00%3A00%2C2019-03-23T23%3A59%3A59&request_mode=async
order ID: 5000004021435
status URL: https://n5eil02u.ecs.nsidc.org/egi/request/5000004021435
HTTP response from order response URL: 201
Initial request status is processing
Status is not complete. Trying again.
Retry request status is: complete
Zip download URL: https://n5eil02u.ecs.nsidc.org/esir/5000004021435.zip
Beginning download of zipped output...
Data request is complete.

Request HTTP response: 201
Order request URL: https://n5eil02u.ecs.nsidc.org/egi/request?short_name=ATL07&version=005&bounding_box=140%2C72%2C153%2C80&temporal=2019-03-23T00%3A00%3A00Z%2C2019-03-23T23%3A59%3A59Z&page_size=2000&bbox=140%2C72%2C153%2C80&time=2019-03-23T00%3A00%3A00%2C2019-03-23T23%3A59%3A59&coverage=%2Fgt1l%2Fsea_ice_segments%2Fdelta_time%2C%2Fgt1l%2Fsea_ice_segments%2Flatitude%2C%2Fgt1l%2Fsea_ice_segments%2Flongitude%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_confidence%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_height%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_quality%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_surface_error_est%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_length_seg%2C%2Fgt2l%2Fsea_ice_segments%2Fdelta_time%2C%2Fgt2l%2Fsea_ice_segments%2Flatitude%2C%2Fgt2l%2Fsea_ice_segments%2Flongitude%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_confidence%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_height%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_quality%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_surface_error_est%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_length_seg%2C%2Fgt3l%2Fsea_ice_segments%2Fdelta_time%2C%2Fgt3l%2Fsea_ice_segments%2Flatitude%2C%2Fgt3l%2Fsea_ice_segments%2Flongitude%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_confidence%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_height%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_quality%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_surface_error_est%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_length_seg&request_mode=async
order ID: 5000004021436
status URL: https://n5eil02u.ecs.nsidc.org/egi/request/5000004021436
HTTP response from order response URL: 201
Initial request status is processing
Status is not complete. Trying again.
Retry request status is: complete
Zip download URL: https://n5eil02u.ecs.nsidc.org/esir/5000004021436.zip
Beginning download of zipped output...
Data request is complete.

Request HTTP response: 201
Order request URL: https://n5eil02u.ecs.nsidc.org/egi/request?short_name=MOD29&version=61&bounding_box=140%2C72%2C153%2C80&temporal=2019-03-23T00%3A00%3A00Z%2C2019-03-23T23%3A59%3A59Z&page_size=2000&bbox=140%2C72%2C153%2C80&request_mode=async
order ID: 5000004021437
status URL: https://n5eil02u.ecs.nsidc.org/egi/request/5000004021437
HTTP response from order response URL: 201
Initial request status is processing
Status is not complete. Trying again.
Retry request status is: complete
Zip download URL: https://n5eil02u.ecs.nsidc.org/esir/5000004021437.zip
Beginning download of zipped output...
Data request is complete.
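Similarly, a helper like clean_folder might flatten the Outputs directory along these lines (a hypothetical sketch; the actual helper may differ):

import os
import shutil

def clean_folder_sketch(path='Outputs'):
    # Move every file up to the top-level Outputs folder...
    for root, dirs, files in os.walk(path):
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(path, name)
            if src != dst:
                shutil.move(src, dst)
    # ...then remove the now-empty per-order subfolders
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            shutil.rmtree(full)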
To review, we have explored data availability and volume over a region and time of interest, discovered and selected data customization options, constructed API endpoints for our requests, and downloaded data. Let's move on to the analysis portion of the tutorial.