
How to use the Radiant MLHub API

The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the Radiant MLHub site and about the organization behind it at the Radiant Earth Foundation site.

This Jupyter notebook, which you may copy and adapt for any use, shows basic examples of how to use the API. Full documentation for the API is available at docs.mlhub.earth.

We'll show you how to set up your authorization, see the list of available collections and datasets, and retrieve the items (the data contained within them) from those collections.

Each item in our collections is described in JSON format, compliant with the STAC label extension definition.
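
To give a rough sense of what these items look like before we dive in, the cell below sketches a purely illustrative STAC Item skeleton with a couple of label extension fields; the field names follow the STAC specification, but all values are placeholders rather than actual data returned by the API.

In [ ]:
# Purely illustrative skeleton of a STAC Item that uses the label extension.
# All values here are placeholders, not actual data returned by the API.
example_item = {
    'type': 'Feature',
    'id': 'example_item_id',            # hypothetical item ID
    'collection': 'example_collection_id',
    'geometry': None,                   # GeoJSON geometry of the labeled area
    'properties': {
        'datetime': '2019-01-01T00:00:00Z',
        'label:description': 'Example crop type labels',  # STAC label extension field
        'label:type': 'vector',                           # STAC label extension field
    },
    'assets': {},                       # downloadable files (labels, documentation, ...)
    'links': [],                        # relationships, e.g. to source imagery items
}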

Dependencies

This notebook utilizes the radiant-mlhub Python client for interacting with the API. If you are running this notebook using Binder, this dependency has already been installed. If you are running this notebook locally, you will need to install it yourself.
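
If you do need to install the client locally, it is published on PyPI. A minimal sketch (the leading ! runs the command as a shell command from a notebook cell):

In [ ]:
# Install the Radiant MLHub Python client from PyPI
# (not needed on Binder, where it is already installed)
!pip install radiant-mlhub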

See the official radiant-mlhub docs for more documentation of the full functionality of that library.

Authentication

Create an API Key

Access to the Radiant MLHub API requires an API key. To get your API key, go to dashboard.mlhub.earth. If you have not used Radiant MLHub before, you will need to sign up and create a new account; otherwise, sign in. In the API Keys tab, you can create the API key(s) you will need. Do not share your API key with others: your usage may be limited, and sharing your API key is a security risk.

Configure the Client

Once you have your API key, you need to configure the radiant_mlhub library to use that key. There are a number of ways to configure this (see the Authentication docs for details).

For these examples, we will set the MLHUB_API_KEY environment variable. Run the cell below to save your API key as an environment variable that the client library will recognize.

If you are running this notebook locally and have configured a profile as described in the Authentication docs, then you do not need to execute this cell.

In [ ]:
import os

os.environ['MLHUB_API_KEY'] = 'PASTE_YOUR_API_KEY_HERE'
In [ ]:
from radiant_mlhub import client, get_session

List data collections

A collection in the Radiant MLHub API is a STAC Collection representing a group of resources (represented as STAC Items and their associated assets) covering a given spatial and temporal extent. A Radiant MLHub collection may contain resources representing training labels, source imagery, or (rarely) both.

The client.list_collections function lists all available collections along with their properties. The following cell uses it to print the ID, license (if available), and citation (if available) for each collection.

In [ ]:
collections = client.list_collections()
for c in collections:
    collection_id = c['id']
    license = c.get('license', 'N/A')
    citation = c.get('sci:citation', 'N/A')

    print(f'ID:       {collection_id}\nLicense:  {license}\nCitation: {citation}\n')

Collection objects have many other properties besides the ones shown above. The cell below prints the ref_african_crops_kenya_01_labels collection object in its entirety.

In [ ]:
kenya_crops_labels = next(c for c in collections if c['id'] == 'ref_african_crops_kenya_01_labels')
kenya_crops_labels

Select an Item

Collections have items associated with them that are used to catalog assets (labels or source imagery) for that collection. Collections vary greatly in the number of items associated with them; some may contain only a handful of items, while others may contain hundreds of thousands of items.

The following cell uses the client.list_collection_items function to get the first item in the ref_african_crops_kenya_01_labels collection. client.list_collection_items is a Python generator that yields a dictionary for each item in the collection (you can read more about how to use Python generators here).

In [ ]:
# NOTE: Here we are using the "id" property of the collection that we fetched above as the collection_id
#  argument to the list_collection_items function
items_iterator = client.list_collection_items(kenya_crops_labels['id'])

# Get the first item
first_item = next(items_iterator)
first_item

IMPORTANT: Some collections may have hundreds of thousands of items (e.g. bigearthnet_v1_source). Looping over all of the items for these massive collections may take a very long time (perhaps on the order of hours) and is not recommended. To prevent accidentally looping over every item, the client.list_collection_items function limits the total number of returned items to 100 by default. You can change this limit using the limit argument:

client.list_collection_items(collection['id'], limit=150)
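
If you only want to peek at a handful of items, another option is to wrap the generator in itertools.islice, which stops iterating as soon as it has enough items; a minimal sketch using the collection fetched above:

In [ ]:
from itertools import islice

# A minimal sketch: pull only the first 5 items from the generator without
# iterating over the rest of the collection.
first_five = list(islice(client.list_collection_items(kenya_crops_labels['id']), 5))
print(f'Fetched {len(first_five)} items')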

If you would like to download all of the assets associated with a collection, it is far more efficient to use the client.download_archive method.

List Available Assets

Each STAC Item has assets associated with it, representing the actual source imagery or labels described by that Item.

The cell below summarizes the assets for the first item that we selected above by printing the key within the assets dictionary, the asset title, and the media type.

In [ ]:
for asset_key, asset in first_item['assets'].items():
    title = asset['title']
    media_type = asset['type']
    print(f'{asset_key}: {title} [{media_type}]')

Download Assets

To download these assets, we will first set up a helper function to get the download link from the asset and then download the content to a local file.

NOTE: If you are running these notebooks using Binder, these resources will be downloaded to the remote file system that the notebooks are running on, not to your local file system. If you want to download the files to your machine, you will need to clone the repo and run the notebook locally.

In [ ]:
import urllib.parse
from pathlib import Path
import requests


def download(item, asset_key, output_dir='.'):
    # Try to get the given asset and return None if it does not exist
    asset = item.get('assets', {}).get(asset_key)
    if asset is None:
        print(f'Asset "{asset_key}" does not exist in this item')
        return None

    # Try to get the download URL from the asset and return None if it does not exist
    download_url = asset.get('href')
    if download_url is None:
        print(f'Asset "{asset_key}" does not have an "href" property, cannot download.')
        return None

    session = get_session()
    r = session.get(download_url, allow_redirects=True, stream=True)

    filename = urllib.parse.urlsplit(r.url).path.split('/')[-1]
    output_path = Path(output_dir) / filename

    with output_path.open('wb') as dst:
        for chunk in r.iter_content(chunk_size=512 * 1024):
            if chunk:
                dst.write(chunk)

    print(f'Downloaded to {output_path.resolve()}')

Download Labels

We can download the labels asset of first_item by calling our download helper:

In [ ]:
download(first_item, 'labels')

Download Metadata

Likewise, we can download the documentation PDF and the property descriptions CSV.

In [ ]:
download(first_item, 'documentation')
download(first_item, 'property_descriptions')

Download Source Imagery

The Item that we fetched earlier represents a collection of labels. This item also contains references to all of the source imagery used to generate these labels in its links property. Any source imagery links will have a rel type of "source".

In the cell below, we get a list of all the source links associated with this item and fetch the first one.

In [ ]:
source_links = [link for link in first_item['links'] if link['rel'] == 'source']
print(f'Number of Source Items: {len(source_links)}')

session = get_session()
r = session.get(source_links[0]['href'])
source_item = r.json()
print('First Item\n--------')
source_item

Once we have the source item, we can use our download function to download assets associated with that item.

The cell below downloads just the first three bands of the source item that we just fetched (a Sentinel-2 scene).

In [ ]:
download(source_item, 'B01')
download(source_item, 'B02')
download(source_item, 'B03')
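
If you want every asset attached to this particular source item, one option is to loop over its assets dictionary with the download helper defined above; a sketch (for entire collections, the archive download described in the next section is far more efficient):

In [ ]:
# A sketch: download every asset attached to this source item using the
# download helper defined above.
for asset_key in source_item['assets']:
    download(source_item, asset_key)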

Download All Assets

Looping through all items and downloading the associated assets may be very time-consuming for larger datasets like BigEarthNet or LandCoverNet. Instead, MLHub provides TAR archives of all collections that can be downloaded using the /archive/{collection_id} endpoint.

The following cell uses the client.download_archive function to download the ref_african_crops_kenya_01_labels archive to the current working directory.

In [ ]:
client.download_archive('ref_african_crops_kenya_01_labels', output_dir='.')