
How to use the Radiant MLHub API to browse and download the NASA Tropical Storm Wind Speed Competition Data

This Jupyter notebook, which you may copy and adapt for any use, shows basic examples of how to use the API to download labels and source imagery for the NASA Tropical Storm Wind Speed Competition dataset. Full documentation for the API is available at docs.mlhub.earth.

We'll show you how to set up your authorization, list the collections in the dataset, retrieve the items (the data contained within them) from those collections, and load the data into a dataframe.

Each item in these collections is described in JSON format compliant with the STAC label extension definition.

Citation

M. Maskey, R. Ramachandran, I. Gurung, B. Freitag, M. Ramasubramanian, J. Miller (2020) "Tropical Cyclone Wind Estimation Competition Dataset", Version 1.0, Radiant MLHub. [Date Accessed] https://doi.org/10.34911/rdnt.xs53up

Dependencies

This notebook uses the radiant-mlhub Python client to interact with the API and the pandas library to compile the data. If you are running this notebook using Binder, then these dependencies have already been installed. If you are running this notebook locally, you will need to install them yourself.

See the official radiant-mlhub docs for more documentation of the full functionality of that library.
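
For a local environment, an install along these lines should work (a minimal sketch, assuming pip is available; adjust for conda or another package manager as needed):

In [ ]:
# Install the two dependencies used in this notebook (assumes pip)
!pip install radiant-mlhub pandas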

Authentication

Create an API Key

Access to the Radiant MLHub API requires an API key. To get your API key, go to dashboard.mlhub.earth. If you have not used Radiant MLHub before, you will need to sign up and create a new account; otherwise, sign in. In the API Keys tab, you can create an API key, which you will need below. Do not share your API key with others: your usage may be limited, and sharing your API key is a security risk.

Configure the Client

Once you have your API key, you need to configure the radiant_mlhub library to use that key. There are a number of ways to configure this (see the Authentication docs for details).

For these examples, we will set the MLHUB_API_KEY environment variable. Run the cell below to save your API key as an environment variable that the client library will recognize.

If you are running this notebook locally and have configured a profile as described in the Authentication docs, then you do not need to execute this cell.
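
As one example of the profile-based approach, the radiant-mlhub package includes a command-line helper that stores your key in a profile. A sketch, shown commented out so it is not run by accident (see the Authentication docs for the authoritative steps):

In [ ]:
# Store your API key in a default profile instead of an environment variable
# (interactive; prompts for the key)
# !mlhub configure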

In [ ]:
import os

os.environ['MLHUB_API_KEY'] = 'PASTE_YOUR_API_KEY_HERE'
In [ ]:
import json
import tarfile
from glob import glob
from pathlib import Path

import pandas as pd
from radiant_mlhub import Dataset, client

Explore the Collections

A Radiant MLHub Dataset is a group of related Collections. We can use the Dataset.list method to get a list of the available datasets as Python objects and inspect their id and title attributes.

In [ ]:
for dataset in Dataset.list():
    print(f'{dataset.id}: ({dataset.title})')

We're interested in the "Tropical Cyclone Wind Estimation Competition" dataset. We can fetch this dataset using its ID (nasa_tropical_storm_competition) and then use the collections property to list the source imagery and label collections associated with this dataset.

In [ ]:
dataset = Dataset.fetch('nasa_tropical_storm_competition')

print('Source Imagery Collections\n--------------------------')
for collection in dataset.collections.source_imagery:
    print(collection.id)

print('')

print('Label Collections\n-----------------')
for collection in dataset.collections.labels:
    print(collection.id)

We can see that this dataset has two collections containing source imagery and one collection containing labels.

The following cell gets the first item from each collection and prints the item ID, as well as a summary of the assets associated with the item.

In [ ]:
def print_summary(item, collection):
    print(f'Collection: {collection.id}')
    print(f'Item: {item["id"]}')
    print('Assets:')
    for asset_name, asset in item.get('assets', {}).items():
        print(f'- {asset_name}: {asset["title"]} [{asset["type"]}]')
    
    print('\n')

for collection in dataset.collections:
    item = next(client.list_collection_items(collection.id, limit=1))
    print_summary(item, collection)

Items in the *train_labels collection have a "labels" JSON asset containing the wind speed label for each source image. Items in the *test_source and *train_source collections have both a "features" JSON asset containing image features and an "image" JPEG asset.
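
Based on the fields this notebook reads later (relative_time, ocean, and wind_speed), the JSON assets look roughly like this; the values below are invented for illustration only:

In [ ]:
# Illustrative only -- the keys mirror those read later in this notebook,
# but the values here are made up
example_features = {'relative_time': '0', 'ocean': '2'}  # shape of a features.json
example_labels = {'wind_speed': '35'}                    # shape of a labels.json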

Download Assets

In the following section, we download the asset archives for both the test and train collections. MLHub makes archives available that contain all of the assets for a given collection. Calling dataset.download retrieves the archives for every collection in the dataset (nasa_tropical_storm_competition_train_source, nasa_tropical_storm_competition_train_labels, and nasa_tropical_storm_competition_test_source); we then extract the files that we need.

In [ ]:
# Use this to download to a data folder in the current working directory
# download_dir = Path('./data').resolve()

# Use this to download to the typical Mac user Downloads folder
download_dir = Path('~/Downloads').expanduser().resolve()

# Use this to download to the typical Linux /tmp directory
# download_dir = Path('/tmp')
In [ ]:
# NOTE: Extracting the archives takes a while so this cell may take 5-10 minutes to complete
archive_paths = dataset.download(output_dir=download_dir)
for archive_path in archive_paths:
    print(f'Extracting {archive_path}...')
    with tarfile.open(archive_path) as tfile:
        tfile.extractall(path=download_dir)
print('Done')
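
As a quick sanity check, you can list the extracted collection folders (a minimal sketch; it assumes the archives were extracted into download_dir as above):

In [ ]:
# List the extracted collection directories, skipping the archive files
for path in sorted(download_dir.glob('nasa_tropical_storm_competition*')):
    if path.is_dir():
        print(path.name)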

Loading Data into a Dataframe

The cells below load the training and test items into two dataframes and sort the rows of each by Image ID. For the training data, each source item is matched with its label by swapping the collection name in the item's path.

In [ ]:
train_data = []

train_source = 'nasa_tropical_storm_competition_train_source'
train_labels = 'nasa_tropical_storm_competition_train_labels'

jpg_names = glob(str(download_dir / train_source / '**' / '*.jpg'))

for jpg_path in jpg_names:
    jpg_path = Path(jpg_path)
    
    # Get the IDs and file paths
    features_path = jpg_path.parent / 'features.json'
    image_id = '_'.join(jpg_path.parent.stem.rsplit('_', 3)[-2:])
    storm_id = image_id.split('_')[0]
    labels_path = str(jpg_path.parent / 'labels.json').replace(train_source, train_labels)


    # Load the features data
    with open(features_path) as src:
        features_data = json.load(src)
        
    # Load the labels data
    with open(labels_path) as src:
        labels_data = json.load(src)

    train_data.append([
        image_id, 
        storm_id, 
        int(features_data['relative_time']), 
        int(features_data['ocean']), 
        int(labels_data['wind_speed'])
    ])

# Pass the list directly (rather than wrapping it in np.array, which would
# cast every column to strings) so the integer columns keep their dtypes
train_df = pd.DataFrame(
    train_data,
    columns=['Image ID', 'Storm ID', 'Relative Time', 'Ocean', 'Wind Speed']
).sort_values(by=['Image ID']).reset_index(drop=True)

train_df.head()
In [ ]:
test_data = []

test_source = 'nasa_tropical_storm_competition_test_source'

jpg_names = glob(str(download_dir / test_source / '**' / '*.jpg'))

for jpg_path in jpg_names:
    jpg_path = Path(jpg_path)

    # Get the IDs and file paths
    features_path = jpg_path.parent / 'features.json'
    image_id = '_'.join(jpg_path.parent.stem.rsplit('_', 3)[-2:])
    storm_id = image_id.split('_')[0]

    # Load the features data
    with open(features_path) as src:
        features_data = json.load(src)

    test_data.append([
        image_id, 
        storm_id, 
        int(features_data['relative_time']), 
        int(features_data['ocean']), 
    ])

# As above, pass the list directly so the integer columns keep their dtypes
test_df = pd.DataFrame(
    test_data,
    columns=['Image ID', 'Storm ID', 'Relative Time', 'Ocean']
).sort_values(by=['Image ID']).reset_index(drop=True)

test_df.head()
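
With the dataframes built, a quick summary of the training labels is one way to confirm the data loaded as expected (this assumes the train_df defined above):

In [ ]:
# Basic distribution of the wind speed labels
train_df['Wind Speed'].describe()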