This notebook is based on the NSIDC-Data-Access-Notebook provided through NSIDC's Github organization.
Now that we've visualized our study areas, we will first explore data coverage, size, and the availability of customization services (subsetting, reformatting, reprojection), and then access the associated files.
___A note on data access options:___ We will be pursuing data discovery and access "programmatically" using Application Programming Interfaces, or APIs.
What is an API? You can think of an API as a middleman between an application or end user (in this case, us) and a data provider. Here the data provider is both the Common Metadata Repository (CMR), which houses data information, and NSIDC as the data distributor. These APIs are generally structured as a URL with a base plus individual key-value pairs separated by '&'.
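For example, the CMR collection search we will use below can be written as a single URL whose query string is just key-value pairs joined by '&' (a minimal illustration; we will build the same query with the requests library shortly):

# Build a key-value-pair API request by hand for illustration
base = 'https://cmr.earthdata.nasa.gov/search/collections.json'
kvps = {'short_name': 'ATL07', 'page_size': 10}
url = base + '?' + '&'.join(f'{k}={v}' for k, v in kvps.items())
print(url)  # https://cmr.earthdata.nasa.gov/search/collections.json?short_name=ATL07&page_size=10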
There are other discovery and access methods available from NSIDC listed on the data set landing page (e.g. ATL07 Data Access) and NASA Earthdata Search. Programmatic API access is beneficial for those of you who want to incorporate data access into your visualization and analysis workflow. This method is also reproducible and documented to ensure data provenance.
Here are the steps we will walk through in this customize and access notebook: exploring data coverage and size over our region and time of interest, discovering customization service availability, authenticating with Earthdata Login, and constructing and submitting our data requests.
import requests
import json
import math
import earthaccess
# This is our functions module. We created several functions used in this notebook and the Visualize and Analyze notebook.
import tutorial_helper_functions as fn
The Common Metadata Repository (CMR) is a high-performance, high-quality, continuously evolving metadata system that catalogs Earth Science data and associated service metadata records. These metadata records are registered, modified, discovered, and accessed through programmatic interfaces leveraging standard protocols and APIs. Note that not all NSIDC data can be searched at the file level using CMR, particularly those outside of the NASA DAAC program.
CMR API documentation: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html
Data sets are selected by data set IDs (e.g. ATL07, ATL10, and MOD29). In the CMR API documentation, a data set id is referred to as a "short name". These short names are located at the top of each NSIDC data set landing page in white underneath the full title, after 'DATA SET:'.
We are using the Python Requests package to access the CMR. Data are then converted to JSON format; a language-independent, human-readable, open-standard file format. More than one version can exist for a given data set:
# Create dictionary of data set parameters we'll use in our access API command below. We'll start with data set IDs (e.g. ATL07) of interest here, also known as "short name".
data_dict = {
    'sea_ice_fb': {'short_name': 'ATL10'},
    'sea_ice_height': {'short_name': 'ATL07'},
    'ist': {'short_name': 'MOD29'},
}
# Get JSON response from CMR collection metadata to grab version numbers and add the most recent version number to data_dict
cmr_collections_url = 'https://cmr.earthdata.nasa.gov/search/collections.json'
for k, params in data_dict.items():
    response = requests.get(cmr_collections_url, params=params)
    results = json.loads(response.content)
    # Find all instances of 'version_id' in the collection metadata
    versions = [el['version_id'] for el in results['feed']['entry']]
    # Drop provisional versions containing letters, then keep the most recent
    versions = [v for v in versions if not any(c.isalpha() for c in v)]
    data_dict[k]['version'] = max(versions, key=int)  # compare numerically, e.g. '005' < '061'
We will add spatial and temporal filters to the data dictionary. The bounding box coordinates cover our region of interest over the East Siberian Sea and the temporal range covers March 23, 2019.
# Bounding Box spatial parameter in 'W,S,E,N' decimal degrees format
bounding_box = '140,72,153,80'
# Each date in yyyy-MM-ddTHH:mm:ssZ format; date range in start,end format
temporal = '2019-03-23T00:00:00Z,2019-03-23T23:59:59Z'
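If you prefer to derive these strings from datetime objects rather than typing them out, a small sketch:

# Equivalent construction of the temporal string from datetime objects
from datetime import datetime

start = datetime(2019, 3, 23, 0, 0, 0)
end = datetime(2019, 3, 23, 23, 59, 59)
temporal = f'{start:%Y-%m-%dT%H:%M:%S}Z,{end:%Y-%m-%dT%H:%M:%S}Z'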
# Add bounding_box and temporal to each data set in the dictionary
for k, v in data_dict.items():
    v['bounding_box'] = bounding_box
    v['temporal'] = temporal
We will use the granule_info function to query the CMR granule API. The function prints the number of granules, their average size, and their total volume, and returns the granule count.
for k, v in data_dict.items():
    gran_num = fn.granule_info(v)
    v['gran_num'] = gran_num
There are 2 granules of ATL10 version 005 over my area and time of interest. The average size of each granule is 168.34 MB and the total size of all 2 granules is 336.69 MB
There are 4 granules of ATL07 version 005 over my area and time of interest. The average size of each granule is 320.07 MB and the total size of all 4 granules is 1280.29 MB
There are 13 granules of MOD29 version 61 over my area and time of interest. The average size of each granule is 2.80 MB and the total size of all 13 granules is 36.40 MB
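For reference, a helper like granule_info can be implemented as a thin wrapper around the CMR granule search endpoint. A minimal sketch, assuming the granules.json response reports granule_size in MB (the actual fn.granule_info may differ):

def granule_info_sketch(params):
    # Query CMR granule search with short_name, version, bounding_box, and temporal
    granule_url = 'https://cmr.earthdata.nasa.gov/search/granules.json'
    response = requests.get(granule_url, params={**params, 'page_size': 2000})
    granules = response.json()['feed']['entry']
    if not granules:
        print('No granules found')
        return 0
    sizes = [float(g['granule_size']) for g in granules]
    print(f'{len(granules)} granules, {sum(sizes)/len(sizes):.2f} MB average, '
          f'{sum(sizes):.2f} MB total')
    return len(granules)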
Note that subsetting, reformatting, or reprojecting can alter the size of the granules if those services are applied to your request.
The NSIDC DAAC supports customization (subsetting, reformatting, reprojection) services on many of our NASA Earthdata mission collections. Let's discover whether or not our data sets have these services available using the print_service_options function. If services are available, we will also determine the specific service options supported for each data set, which we will then add to our data dictionary.
An Earthdata Login account is required to query data services and to access data from the NSIDC DAAC. If you do not already have an Earthdata Login account, visit http://urs.earthdata.nasa.gov to register. We are going to use the earthaccess library to authenticate with our Earthdata Login credentials, and we recommend storing those credentials in a netrc file. If you prefer not to use a netrc file, skip this step; you will instead be prompted to enter your credentials below. To create a netrc file, add the following line to a .netrc file in your home directory:
machine urs.earthdata.nasa.gov login <USERNAME> password <PASSWORD>
where <USERNAME> and <PASSWORD> are replaced by your Earthdata Login username and password.
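If you'd rather create the file from Python, here is a minimal one-time sketch (substitute your own credentials; on Windows the file is typically named _netrc):

# One-time setup: write the netrc entry and restrict permissions
from pathlib import Path

netrc = Path.home() / '.netrc'
netrc.write_text('machine urs.earthdata.nasa.gov login <USERNAME> password <PASSWORD>\n')
netrc.chmod(0o600)  # the file must not be readable by other users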
auth = earthaccess.login()
EDL_USERNAME and EDL_PASSWORD are not set in the current environment, try setting them or use a different strategy (netrc, interactive)
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 04/16/2023
Using .netrc file for EDL
We now need to create an HTTP session in order to store cookies and pass our credentials to the data service URLs. The capability URL below is what we will query to determine service information.
import warnings
warnings.filterwarnings('ignore')

# Query the service capability URL for each data set
for k, v in data_dict.items():
    capability_url = f"https://n5eil02u.ecs.nsidc.org/egi/capabilities/{v['short_name']}.{v['version']}.xml"

    # Create session to store cookie and pass credentials to the capabilities URL
    session = earthaccess.get_requests_https_session()
    s = session.get(capability_url)
    response = session.get(s.url)  # re-request the resolved URL after any login redirect
    response.raise_for_status()  # raise bad request to check that Earthdata Login credentials were accepted

    # Print all services available for this data set
    fn.print_service_options(v, response)
Services available for ATL10:
    Bounding box subsetting
    Shapefile subsetting
    Temporal subsetting
    Variable subsetting
    Reformatting to the following options: ['TABULAR_ASCII', 'NetCDF4-CF']

Services available for ATL07:
    Bounding box subsetting
    Shapefile subsetting
    Temporal subsetting
    Variable subsetting
    Reformatting to the following options: ['TABULAR_ASCII', 'NetCDF4-CF']

Services available for MOD29:
    Bounding box subsetting
    Variable subsetting
    Reformatting to the following options: ['GeoTIFF']
    Reprojection to the following options: ['GEOGRAPHIC', 'UNIVERSAL TRANSVERSE MERCATOR', 'POLAR STEREOGRAPHIC', 'SINUSOIDAL']
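Under the hood, the capabilities document is XML. A helper like print_service_options can collect the advertised services roughly as follows (a sketch; the SubsetAgent, Format, and Projection element names are assumptions based on the NSIDC capabilities schema and may differ):

import xml.etree.ElementTree as ET

root = ET.fromstring(response.content)
# Collect service options advertised in the capabilities XML
subagents = [el.attrib for el in root.iter('SubsetAgent')]           # subsetting services
formats = [el.get('value') for el in root.iter('Format')]            # reformatting options
projections = [el.get('value') for el in root.iter('Projection')]    # reprojection options
print('Subsetting agents:', subagents)
print('Formats:', [f for f in formats if f])
print('Projections:', [p for p in projections if p])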
We already added our CMR search keywords to our data dictionary, so now we need to add the service options we want to request. A list of all available service keywords for use with NSIDC's access and service API is available in our Key-Value-Pair table, part of our Programmatic access guide. For our ATL10 and ATL07 requests, we are interested in bounding box and temporal subsetting. For MOD29, we are interested in spatial subsetting. These options crop the data to the specified ranges of interest. We will enter those values into our data dictionary below.
Spatial subsetting: Output files are cropped to the specified bounding box extent.
Temporal subsetting: Output files are cropped to the specified temporal range extent.
# Spatial and temporal subsetting for ATL10
data_dict['sea_ice_fb']['bbox'] = bounding_box
data_dict['sea_ice_fb']['time'] = '2019-03-23T00:00:00,2019-03-23T23:59:59'
# Spatial and temporal subsetting for ATL07
data_dict['sea_ice_height']['bbox'] = bounding_box
data_dict['sea_ice_height']['time'] = '2019-03-23T00:00:00,2019-03-23T23:59:59'
# Spatial subsetting for MOD29
data_dict['ist']['bbox'] = bounding_box
For ATL07, we are also interested in variable subsetting.
Variable subsetting: Subsets the data set variable or group of variables. For hierarchical data, all lower level variables are returned if a variable group or subgroup is specified.
For ATL07, we will use only the strong beams, since these provide better coverage and resolution due to stronger surface returns. According to the user guide, the spacecraft was in the backward orientation during our day of interest, making the gt*l beams the strong beams.
We'll use these primary geolocation, height and quality variables of interest for each of the three strong beams. The following descriptions are provided in the ATL07 Data Dictionary, with additional information on the algorithm and variable descriptions in the ATBD (Algorithm Theoretical Basis Document).
- delta_time: Number of GPS seconds since the ATLAS SDP epoch.
- latitude: Latitude, WGS84, North=+; latitude of segment center.
- longitude: Longitude, WGS84, East=+; longitude of segment center.
- height_segment_height: Mean height from the along-track segment fit determined by the sea ice algorithm.
- height_segment_confidence: Confidence level in the surface height estimate based on the number of photons, the background noise rate, and the error analysis.
- height_segment_quality: Height segment quality flag; 1 is good quality, 0 is bad.
- height_segment_surface_error_est: Error estimate of the surface height (reported in meters).
- height_segment_length_seg: Along-track length of the segment containing n_photons_actual.
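As an aside, delta_time can be converted to a calendar date; a rough sketch assuming the standard ATLAS SDP epoch of 2018-01-01T00:00:00 UTC and ignoring the GPS-UTC leap-second offset (about 18 s in 2019):

from datetime import datetime, timedelta, timezone

ATLAS_SDP_EPOCH = datetime(2018, 1, 1, tzinfo=timezone.utc)

def delta_time_to_utc(delta_time_s):
    # Approximate conversion; ignores leap seconds
    return ATLAS_SDP_EPOCH + timedelta(seconds=float(delta_time_s))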
We will enter these variables into the data dictionary below:
data_dict['sea_ice_height']['coverage'] = '/gt1l/sea_ice_segments/delta_time,\
/gt1l/sea_ice_segments/latitude,\
/gt1l/sea_ice_segments/longitude,\
/gt1l/sea_ice_segments/heights/height_segment_confidence,\
/gt1l/sea_ice_segments/heights/height_segment_height,\
/gt1l/sea_ice_segments/heights/height_segment_quality,\
/gt1l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt1l/sea_ice_segments/heights/height_segment_length_seg,\
/gt2l/sea_ice_segments/delta_time,\
/gt2l/sea_ice_segments/latitude,\
/gt2l/sea_ice_segments/longitude,\
/gt2l/sea_ice_segments/heights/height_segment_confidence,\
/gt2l/sea_ice_segments/heights/height_segment_height,\
/gt2l/sea_ice_segments/heights/height_segment_quality,\
/gt2l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt2l/sea_ice_segments/heights/height_segment_length_seg,\
/gt3l/sea_ice_segments/delta_time,\
/gt3l/sea_ice_segments/latitude,\
/gt3l/sea_ice_segments/longitude,\
/gt3l/sea_ice_segments/heights/height_segment_confidence,\
/gt3l/sea_ice_segments/heights/height_segment_height,\
/gt3l/sea_ice_segments/heights/height_segment_quality,\
/gt3l/sea_ice_segments/heights/height_segment_surface_error_est,\
/gt3l/sea_ice_segments/heights/height_segment_length_seg'
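Typing the coverage list by hand is error-prone; the same string can also be generated from the beam and variable names (a small sketch producing an identical value):

# Build the coverage string programmatically from beam and variable names
beams = ['gt1l', 'gt2l', 'gt3l']
geo_vars = ['delta_time', 'latitude', 'longitude']
height_vars = ['height_segment_confidence', 'height_segment_height',
               'height_segment_quality', 'height_segment_surface_error_est',
               'height_segment_length_seg']
coverage = ','.join(
    f'/{beam}/sea_ice_segments/{var}'
    for beam in beams
    for var in geo_vars + [f'heights/{h}' for h in height_vars]
)
data_dict['sea_ice_height']['coverage'] = coverage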
The data can be requested asynchronously or synchronously. The asynchronous option allows concurrent requests to be queued and processed as orders; those orders are delivered to the specified email address, or they can be accessed programmatically as shown below. Synchronous requests (request_mode set to stream rather than async, per the programmatic access guide) download the data as soon as processing is complete. For this tutorial, we will select the asynchronous method.
# Set NSIDC data access base URL
base_url = 'https://n5eil02u.ecs.nsidc.org/egi/request'

for k, v in data_dict.items():
    # Set the request mode to asynchronous
    v['request_mode'] = 'async'
    # Set the page size to the maximum for asynchronous requests
    page_size = 2000
    v['page_size'] = page_size
    # Determine the number of orders needed for requests over 2000 granules
    page_num = math.ceil(v['gran_num'] / page_size)
    v['page_num'] = page_num
    del v['gran_num']
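With at most 13 granules per data set here and a page size of 2000, math.ceil(gran_num / page_size) evaluates to 1, so each of our requests fits in a single order.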
Programmatic API requests are formatted as HTTPS URLs that contain key-value-pairs specifying the service operations that we specified above. We will first create a string of key-value-pairs from our data dictionary and we'll feed those into our API endpoint. This API endpoint can be executed via command line, a web browser, or in Python below.
endpoint_list = []
for k, v in data_dict.items():
    # Convert the data set's parameter dictionary to a string of key-value pairs
    param_string = '&'.join(f'{key}={val}' for key, val in v.items())

    # API base URL + request parameters
    api_request = f'{base_url}?{param_string}'
    endpoint_list.append(api_request)

    # For requests over 2000 granules, add an endpoint per additional page
    if v['page_num'] > 1:
        for page_val in range(2, v['page_num'] + 1):
            v['page_num'] = page_val
            param_string = '&'.join(f'{key}={val}' for key, val in v.items())
            endpoint_list.append(f'{base_url}?{param_string}')

print('\n'.join('\n' + s for s in endpoint_list))
https://n5eil02u.ecs.nsidc.org/egi/request?short_name=ATL10&version=005&bounding_box=140,72,153,80&temporal=2019-03-23T00:00:00Z,2019-03-23T23:59:59Z&page_size=2000&page_num=1&bbox=140,72,153,80&time=2019-03-23T00:00:00,2019-03-23T23:59:59&request_mode=async

https://n5eil02u.ecs.nsidc.org/egi/request?short_name=ATL07&version=005&bounding_box=140,72,153,80&temporal=2019-03-23T00:00:00Z,2019-03-23T23:59:59Z&page_size=2000&page_num=1&bbox=140,72,153,80&time=2019-03-23T00:00:00,2019-03-23T23:59:59&coverage=/gt1l/sea_ice_segments/delta_time,/gt1l/sea_ice_segments/latitude,/gt1l/sea_ice_segments/longitude,/gt1l/sea_ice_segments/heights/height_segment_confidence,/gt1l/sea_ice_segments/heights/height_segment_height,/gt1l/sea_ice_segments/heights/height_segment_quality,/gt1l/sea_ice_segments/heights/height_segment_surface_error_est,/gt1l/sea_ice_segments/heights/height_segment_length_seg,/gt2l/sea_ice_segments/delta_time,/gt2l/sea_ice_segments/latitude,/gt2l/sea_ice_segments/longitude,/gt2l/sea_ice_segments/heights/height_segment_confidence,/gt2l/sea_ice_segments/heights/height_segment_height,/gt2l/sea_ice_segments/heights/height_segment_quality,/gt2l/sea_ice_segments/heights/height_segment_surface_error_est,/gt2l/sea_ice_segments/heights/height_segment_length_seg,/gt3l/sea_ice_segments/delta_time,/gt3l/sea_ice_segments/latitude,/gt3l/sea_ice_segments/longitude,/gt3l/sea_ice_segments/heights/height_segment_confidence,/gt3l/sea_ice_segments/heights/height_segment_height,/gt3l/sea_ice_segments/heights/height_segment_quality,/gt3l/sea_ice_segments/heights/height_segment_surface_error_est,/gt3l/sea_ice_segments/heights/height_segment_length_seg&request_mode=async

https://n5eil02u.ecs.nsidc.org/egi/request?short_name=MOD29&version=61&bounding_box=140,72,153,80&temporal=2019-03-23T00:00:00Z,2019-03-23T23:59:59Z&page_size=2000&page_num=1&bbox=140,72,153,80&request_mode=async
We will now download the data using the request_data function, which utilizes the Python requests library. Our parameter dictionary and HTTP session will be passed to the function to allow Earthdata Login access. The data will be downloaded directly to this notebook's directory in a new Outputs folder, and the progress of each order will be reported. The data are returned in separate folders per order, so we'll use the clean_folder function to remove those individual folders.
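Before we call the helper below, here is a minimal sketch of the asynchronous order flow that a function like request_data implements, assuming the EGI XML responses contain orderId and requestStatus/status elements (the actual helper also reports progress and handles failed orders):

import io
import time
import zipfile
import xml.etree.ElementTree as ET

def request_data_sketch(params, session):
    # Submit the asynchronous order; the response is an XML status document
    submit = session.get(base_url, params=params)
    order_id = ET.fromstring(submit.content).find('.//orderId').text

    # Poll the order status URL until processing finishes
    status_url = f'{base_url}/{order_id}'
    status = 'processing'
    while status in ('pending', 'processing'):
        time.sleep(10)
        status_root = ET.fromstring(session.get(status_url).content)
        status = status_root.find('.//requestStatus/status').text

    # On success, download and extract the zipped output
    if status == 'complete':
        zip_url = f'https://n5eil02u.ecs.nsidc.org/esir/{order_id}.zip'
        with zipfile.ZipFile(io.BytesIO(session.get(zip_url).content)) as z:
            z.extractall('Outputs')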
# Remove the page_num parameter from data_dict before ordering
for k, v in data_dict.items():
    del v['page_num']

# Now use the request_data function to request the data
for k, v in data_dict.items():
    # Create a session to store cookies and pass our Earthdata Login credentials
    session = earthaccess.get_requests_https_session()
    fn.request_data(v, session)

# Clean up the output folder
fn.clean_folder()
Request HTTP response: 201
Order request URL: https://n5eil02u.ecs.nsidc.org/egi/request?short_name=ATL10&version=005&bounding_box=140%2C72%2C153%2C80&temporal=2019-03-23T00%3A00%3A00Z%2C2019-03-23T23%3A59%3A59Z&page_size=2000&bbox=140%2C72%2C153%2C80&time=2019-03-23T00%3A00%3A00%2C2019-03-23T23%3A59%3A59&request_mode=async
order ID: 5000004021435
status URL: https://n5eil02u.ecs.nsidc.org/egi/request/5000004021435
HTTP response from order response URL: 201
Initial request status is processing
Status is not complete. Trying again.
Retry request status is: complete
Zip download URL: https://n5eil02u.ecs.nsidc.org/esir/5000004021435.zip
Beginning download of zipped output...
Data request is complete.

Request HTTP response: 201
Order request URL: https://n5eil02u.ecs.nsidc.org/egi/request?short_name=ATL07&version=005&bounding_box=140%2C72%2C153%2C80&temporal=2019-03-23T00%3A00%3A00Z%2C2019-03-23T23%3A59%3A59Z&page_size=2000&bbox=140%2C72%2C153%2C80&time=2019-03-23T00%3A00%3A00%2C2019-03-23T23%3A59%3A59&coverage=%2Fgt1l%2Fsea_ice_segments%2Fdelta_time%2C%2Fgt1l%2Fsea_ice_segments%2Flatitude%2C%2Fgt1l%2Fsea_ice_segments%2Flongitude%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_confidence%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_height%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_quality%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_surface_error_est%2C%2Fgt1l%2Fsea_ice_segments%2Fheights%2Fheight_segment_length_seg%2C%2Fgt2l%2Fsea_ice_segments%2Fdelta_time%2C%2Fgt2l%2Fsea_ice_segments%2Flatitude%2C%2Fgt2l%2Fsea_ice_segments%2Flongitude%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_confidence%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_height%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_quality%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_surface_error_est%2C%2Fgt2l%2Fsea_ice_segments%2Fheights%2Fheight_segment_length_seg%2C%2Fgt3l%2Fsea_ice_segments%2Fdelta_time%2C%2Fgt3l%2Fsea_ice_segments%2Flatitude%2C%2Fgt3l%2Fsea_ice_segments%2Flongitude%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_confidence%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_height%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_quality%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_surface_error_est%2C%2Fgt3l%2Fsea_ice_segments%2Fheights%2Fheight_segment_length_seg&request_mode=async
order ID: 5000004021436
status URL: https://n5eil02u.ecs.nsidc.org/egi/request/5000004021436
HTTP response from order response URL: 201
Initial request status is processing
Status is not complete. Trying again.
Retry request status is: complete
Zip download URL: https://n5eil02u.ecs.nsidc.org/esir/5000004021436.zip
Beginning download of zipped output...
Data request is complete.

Request HTTP response: 201
Order request URL: https://n5eil02u.ecs.nsidc.org/egi/request?short_name=MOD29&version=61&bounding_box=140%2C72%2C153%2C80&temporal=2019-03-23T00%3A00%3A00Z%2C2019-03-23T23%3A59%3A59Z&page_size=2000&bbox=140%2C72%2C153%2C80&request_mode=async
order ID: 5000004021437
status URL: https://n5eil02u.ecs.nsidc.org/egi/request/5000004021437
HTTP response from order response URL: 201
Initial request status is processing
Status is not complete. Trying again.
Retry request status is: complete
Zip download URL: https://n5eil02u.ecs.nsidc.org/esir/5000004021437.zip
Beginning download of zipped output...
Data request is complete.
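Similarly, a helper like clean_folder might flatten the Outputs directory along these lines (a hypothetical sketch; the actual helper may differ):

import os
import shutil

def clean_folder_sketch(path='Outputs'):
    # Move every file up to the top-level Outputs folder...
    for root, dirs, files in os.walk(path):
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(path, name)
            if src != dst:
                shutil.move(src, dst)
    # ...then remove the now-empty per-order subfolders
    for entry in os.listdir(path):
        full = os.path.join(path, entry)
        if os.path.isdir(full):
            shutil.rmtree(full)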
To review, we have explored data availability and volume over a region and time of interest, discovered and selected data customization options, constructed API endpoints for our requests, and downloaded data. Let's move on to the analysis portion of the tutorial.