This Jupyter notebook, which you may copy and adapt for any use, shows basic examples of how to use the API to download labels and source imagery for the NASA Tropical Storm Wind Speed Competition dataset. Full documentation for the API is available at docs.mlhub.earth.
We'll show you how to set up your authorization, retrieve items (and the data contained within them) from the dataset's collections, and load the data into a dataframe.
Each item in the dataset's collections is described in JSON format compliant with the STAC label extension definition.
M. Maskey, R. Ramachandran, I. Gurung, B. Freitag, M. Ramasubramanian, J. Miller (2020) "Tropical Cyclone Wind Estimation Competition Dataset", Version 1.0, Radiant MLHub. [Date Accessed] https://doi.org/10.34911/rdnt.xs53up
This notebook utilizes the radiant-mlhub Python client for interacting with the API and the pandas library for compiling the data. If you are running this notebook using Binder, then these dependencies have already been installed. If you are running this notebook locally, you will need to install them yourself. See the official radiant-mlhub docs for more documentation of the full functionality of that library.
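For example, you can install the dependencies from PyPI in a notebook cell like the one below (a minimal sketch, assuming pip is available in your environment; numpy is included because it is imported later in this notebook):

# Uncomment and run to install the dependencies from PyPI
# %pip install radiant-mlhub pandas numpy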
Access to the Radiant MLHub API requires an API key. To get your API key, go to dashboard.mlhub.earth. If you have not used Radiant MLHub before, you will need to sign up and create a new account. Otherwise, sign in. In the API Keys tab, you'll be able to create API key(s), which you will need. Do not share your API key with others: your usage may be limited and sharing your API key is a security risk.
Once you have your API key, you need to configure the radiant_mlhub library to use that key. There are a number of ways to configure this (see the Authentication docs for details). For these examples, we will set the MLHUB_API_KEY environment variable. Run the cell below to save your API key as an environment variable that the client library will recognize.
If you are running this notebook locally and have configured a profile as described in the Authentication docs, then you do not need to execute this cell.
import os
os.environ['MLHUB_API_KEY'] = 'PASTE_YOUR_API_KEY_HERE'
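If you'd rather not paste the key directly into the notebook, one alternative sketch uses the standard-library getpass module to prompt for it at runtime:

import getpass

# Prompt for the API key without echoing it to the notebook output
os.environ['MLHUB_API_KEY'] = getpass.getpass('MLHub API key: ')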
import json
import re
from glob import glob
import tarfile
import numpy as np
import pandas as pd
from pathlib import Path
from radiant_mlhub import Dataset, Collection, client
A Radiant MLHub Dataset is a group of related Collections. We can use the Dataset.list method to get a list of the available datasets as Python objects and inspect their id and title attributes.
for dataset in Dataset.list():
    print(f'{dataset.id}: ({dataset.title})')
We're interested in the "Tropical Cyclone Wind Estimation Competition" dataset. We can fetch this dataset using its ID (nasa_tropical_storm_competition) and then use the collections property to list the source imagery and label collections associated with this dataset.
dataset = Dataset.fetch('nasa_tropical_storm_competition')

print('Source Imagery Collections\n--------------------------')
for collection in dataset.collections.source_imagery:
    print(collection.id)

print('')
print('Label Collections\n-----------------')
for collection in dataset.collections.labels:
    print(collection.id)
We can see that this dataset has two collections containing source imagery and one collection containing labels.
The following cell gets the first item from each collection and prints the item ID, as well as a summary of the assets associated with the item.
def print_summary(item, collection):
    print(f'Collection: {collection.id}')
    print(f'Item: {item["id"]}')
    print('Assets:')
    for asset_name, asset in item.get('assets', {}).items():
        print(f'- {asset_name}: {asset["title"]} [{asset["type"]}]')
    print('\n')

for collection in dataset.collections:
    item = next(client.list_collection_items(collection.id, limit=1))
    print_summary(item, collection)
Items in the *train_labels collection have a "labels" JSON asset containing wind speed labels for each source image. Items in the *test_source and *train_source collections have both a "features" JSON asset containing image features and an "image" JPEG asset.
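To make that structure concrete, the two JSON assets look roughly like the following (the values below are illustrative, not real data; the keys shown are the ones this notebook reads, and the files may contain additional fields):

# Illustrative contents of the two JSON assets (values are made up)
features_data = {'relative_time': '0', 'ocean': '2'}  # features.json
labels_data = {'wind_speed': '43'}                    # labels.json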
In the following section, we download all JSON assets for both the test and train collections. Radiant MLHub makes archives available that contain all the assets for a given collection. We will download the archives for the dataset's collections and then extract the items that we need.
# Use this to download to a data folder in the current working directory
# download_dir = Path('./data').resolve()

# Use this to download to the typical Mac user Downloads folder
download_dir = Path('~/Downloads').expanduser().resolve()

# Use this to download to the typical Linux /tmp directory
# download_dir = Path('/tmp')
# NOTE: Extracting the archives takes a while, so this cell may take 5-10 minutes to complete
archive_paths = dataset.download(output_dir=download_dir)

for archive_path in archive_paths:
    print(f'Extracting {archive_path}...')
    with tarfile.open(archive_path) as tfile:
        tfile.extractall(path=download_dir)

print('Done')
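As a quick sanity check (not part of the original flow), you can list the extracted collection directories:

# List the extracted collection directories
sorted(p.name for p in download_dir.iterdir()
       if p.is_dir() and p.name.startswith('nasa_tropical_storm_competition'))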
The cells below will load the training and test items into separate dataframes (joining the features and labels for the training items) and sort the rows by Image ID.
train_data = []

train_source = 'nasa_tropical_storm_competition_train_source'
train_labels = 'nasa_tropical_storm_competition_train_labels'

jpg_names = glob(str(download_dir / train_source / '**' / '*.jpg'))

for jpg_path in jpg_names:
    jpg_path = Path(jpg_path)

    # Get the IDs and file paths
    features_path = jpg_path.parent / 'features.json'
    image_id = '_'.join(jpg_path.parent.stem.rsplit('_', 3)[-2:])
    storm_id = image_id.split('_')[0]
    labels_path = str(jpg_path.parent / 'labels.json').replace(train_source, train_labels)

    # Load the features data
    with open(features_path) as src:
        features_data = json.load(src)

    # Load the labels data
    with open(labels_path) as src:
        labels_data = json.load(src)

    train_data.append([
        image_id,
        storm_id,
        int(features_data['relative_time']),
        int(features_data['ocean']),
        int(labels_data['wind_speed'])
    ])

train_df = pd.DataFrame(
    np.array(train_data),
    columns=['Image ID', 'Storm ID', 'Relative Time', 'Ocean', 'Wind Speed']
).sort_values(by=['Image ID']).reset_index(drop=True)

train_df.head()
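Note that because the rows pass through np.array, which coerces mixed types to a common string dtype, every column in the dataframe ends up holding strings. If you want numeric columns, one way to convert them is:

# Convert the numeric columns back to integers
train_df = train_df.astype({'Relative Time': int, 'Ocean': int, 'Wind Speed': int})
train_df.dtypes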
test_data = []

test_source = 'nasa_tropical_storm_competition_test_source'

jpg_names = glob(str(download_dir / test_source / '**' / '*.jpg'))

for jpg_path in jpg_names:
    jpg_path = Path(jpg_path)

    # Get the IDs and file paths
    features_path = jpg_path.parent / 'features.json'
    image_id = '_'.join(jpg_path.parent.stem.rsplit('_', 3)[-2:])
    storm_id = image_id.split('_')[0]

    # Load the features data
    with open(features_path) as src:
        features_data = json.load(src)

    test_data.append([
        image_id,
        storm_id,
        int(features_data['relative_time']),
        int(features_data['ocean']),
    ])

test_df = pd.DataFrame(
    np.array(test_data),
    columns=['Image ID', 'Storm ID', 'Relative Time', 'Ocean']
).sort_values(by=['Image ID']).reset_index(drop=True)

test_df.head()
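Finally, as a quick visual check you can open one of the downloaded source images. The sketch below assumes the Pillow library, which is not one of this notebook's dependencies and would need to be installed separately:

from PIL import Image

# Open the first test image found by the glob above; in a notebook, the
# returned Image object renders inline
Image.open(jpg_names[0])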