#!/usr/bin/env python
# coding: utf-8

# Radiant MLHub Logo

# ## Using the Radiant MLHub API
#
# The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the [Radiant MLHub site](https://mlhub.earth) and about the organization behind it at the [Radiant Earth Foundation site](https://radiant.earth).
#
# This Jupyter notebook, which you may copy and adapt for any use, shows basic examples of how to use the API. Full documentation for the API is available at [docs.mlhub.earth](https://docs.mlhub.earth).
#
# We'll show you how to set up your authentication, see the list of available collections and datasets, and retrieve the items (the data contained within them) from those collections.
#
# All collections in the Radiant MLHub repository are cataloged using [STAC](https://stacspec.org/). Collections that include labels/annotations are additionally described using the [Label Extension](https://github.com/stac-extensions/label).

# ### Authentication
#
# Access to the Radiant MLHub API requires an API key. To get your API key, go to [mlhub.earth](https://mlhub.earth/) and click the "Sign in / Register" button in the top right to log in. If you have not used Radiant MLHub before, you will need to sign up and create a new account; otherwise, just sign in. Once you have signed in, click on your user avatar in the top right and select "Settings & API keys" from the dropdown menu.
#
# In the **API Keys** section of that page, you can create new API key(s). *Do not share* your API key with others, as this may pose a security risk.
#
# Next, we create an `MLHUB_API_KEY` variable that `pystac-client` will later use to add our API key to all requests:

# In[1]:


import getpass

MLHUB_API_KEY = getpass.getpass(prompt="MLHub API Key: ")
MLHUB_ROOT_URL = "https://api.radiant.earth/mlhub/v1"


# Finally, we connect to the Radiant MLHub API using our API key:

# In[2]:


import itertools as it
import os.path
import shutil
import tempfile
from pprint import pprint
from urllib.parse import urljoin

import requests
from pystac import ExtensionNotImplemented
from pystac.extensions.scientific import ScientificExtension
from pystac_client import Client

client = Client.open(
    MLHUB_ROOT_URL, parameters={"key": MLHUB_API_KEY}, ignore_conformance=True
)


# ### List datasets
#
# A **dataset** in the Radiant MLHub API is a JSON object that represents a group of STAC Collections that belong together. A typical dataset will include one Collection of source imagery and one Collection of labels, but this is not always the case. Some datasets consist of a single Collection with both labels and source imagery, others may contain multiple source imagery or label Collections, and others may contain only labels.
#
# *Datasets are not a STAC entity*, so we must work with them by making direct requests to the API rather than using `pystac-client`.
#
# We start by creating a `requests.Session` subclass so that the API key is included in all of our requests.
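#
# For comparison, the next cell sketches what each call would look like *without* such a helper: the API root and the `key` parameter must be repeated by hand on every request. (This cell is purely illustrative and not part of the walkthrough proper.)

# In[ ]:


# Illustrative only: query the /datasets endpoint with plain `requests`,
# passing the root URL and API key explicitly on this one call.
response = requests.get(
    f"{MLHUB_ROOT_URL}/datasets",
    params={"key": MLHUB_API_KEY},
)
response.raise_for_status()
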
# In[3]:


class MLHubSession(requests.Session):
    """Session that adds the API key and root URL to every MLHub request."""

    def __init__(self, *args, api_key=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.params.update({"key": api_key})

    def request(self, method, url, *args, **kwargs):
        url_prefix = MLHUB_ROOT_URL.rstrip("/") + "/"
        # Strip any leading slash so that urljoin appends the path to the API
        # root instead of replacing the root's own path ("/mlhub/v1").
        url = urljoin(url_prefix, url.lstrip("/"))
        return super().request(method, url, *args, **kwargs)


session = MLHubSession(api_key=MLHUB_API_KEY)


# Next, we list the available datasets using the `/datasets` endpoint:

# In[4]:


response = session.get("/datasets")
datasets = response.json()

dataset_limit = 30

print(f"Total Datasets: {len(datasets)}")
print("-----")
for dataset in it.islice(datasets, dataset_limit):
    dataset_id = dataset["id"]
    dataset_title = dataset["title"] or "No Title"
    print(f"{dataset_id}: {dataset_title}")
if len(datasets) > dataset_limit:
    print("...")


# Let's take a look at the Kenya Crop Type dataset.

# In[5]:


crop_dataset = next(
    dataset for dataset in datasets if dataset["id"] == "ref_african_crops_kenya_02"
)
pprint(crop_dataset)


# We can see that the metadata includes an ID and title, citation information, a bounding box for the dataset, and a list of collections included in the dataset. If we take a closer look at the `collections` list, we can see that each collection has an `id` and a `type`. We can use the `type` to figure out whether a collection contains labels, source imagery, or both, and we can use the `id` to fetch that collection (see below).

# In[6]:


pprint(crop_dataset["collections"])


# ### List data collections
#
# A **collection** in the Radiant MLHub API is a [STAC Collection](https://github.com/radiantearth/stac-spec/tree/master/collection-spec) representing a group of resources (represented as [STAC Items](https://github.com/radiantearth/stac-spec/tree/master/item-spec) and their associated assets) covering a given spatial and temporal extent. A Radiant MLHub collection may contain resources representing training labels, source imagery, or (rarely) both.
#
# Use the `client.get_collections` method to list all available collections and view their properties. The following cell prints the ID, license (if available), and citation (if available) for the first 20 available collections.

# In[7]:


# Materialize the generator returned by get_collections so that it can be
# iterated more than once (here and again in the next cell).
collections = list(client.get_collections())

for c in it.islice(collections, 20):
    collection_id = c.id
    license = c.license or "N/A"
    try:
        sci = ScientificExtension.ext(c)
        citation = sci.citation or "N/A"
    except ExtensionNotImplemented:
        citation = "N/A"
    print(f"ID: {collection_id}\nLicense: {license}\nCitation: {citation}\n")


# Collection objects have many other properties besides the ones shown above. The cell below prints the `ref_african_crops_kenya_01_labels` collection object in its entirety.

# In[8]:


kenya_crops_labels = next(
    c for c in collections if c.id == "ref_african_crops_kenya_01_labels"
)
kenya_crops_labels.to_dict()


# #### Download Data Archives
#
# A typical workflow for downloading assets from a STAC Catalog would involve looping through all Items and downloading the associated assets. However, the ML training datasets published through Radiant MLHub can have thousands or hundreds of thousands of Items, making this workflow *very* time-consuming for larger datasets. For faster access to the assets for an entire dataset, MLHub provides TAR archives of all collections, which can be downloaded using the `/archive/{collection_id}` endpoint.
#
# We will use the `MLHubSession` instance we created above to ensure that our API key is sent with each request.
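#
# The next cell downloads the archive in a single request and buffers the entire response in memory, which is fine for this small labels collection. For larger archives, a streamed download is usually preferable; the sketch below is one possible variant (not part of the original walkthrough) that relies only on the standard `requests` streaming interface (`stream=True` and `Response.iter_content`).

# In[ ]:


def download_archive(session, collection_id, dst_path, chunk_size=1024 * 1024):
    """Stream a collection archive to disk without buffering it in memory."""
    with session.get(
        f"/archive/{collection_id}", allow_redirects=True, stream=True
    ) as response:
        response.raise_for_status()
        with open(dst_path, "wb") as dst:
            for chunk in response.iter_content(chunk_size=chunk_size):
                dst.write(chunk)
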
# In[9]:


# Create a temporary directory
tmp_dir = tempfile.mkdtemp()
archive_path = os.path.join(tmp_dir, "ref_african_crops_kenya_01_labels.tar.gz")

# Fetch the archive and save it to disk
response = session.get(
    "/archive/ref_african_crops_kenya_01_labels", allow_redirects=True
)
with open(archive_path, "wb") as dst:
    dst.write(response.content)


# Finally, we clean up the temporary directory. (If you want to unpack and inspect the archive first, see the appendix at the end of this notebook.)

# In[10]:


shutil.rmtree(tmp_dir)


# ### Next Steps
#
# This tutorial was a quick introduction to working with the Radiant MLHub API in a notebook. For more, see:
#
# - [Reading Data from the STAC API](./reading-stac.ipynb)
# - [How to use the Radiant MLHub API to browse and download the LandCoverNet dataset](https://github.com/microsoft/PlanetaryComputerExamples/blob/main/tutorials/radiant-mlhub-landcovernet.ipynb)
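
# ### Appendix: Unpacking a Downloaded Archive
#
# The archives served by the `/archive/{collection_id}` endpoint are gzipped TAR files, so they can be unpacked with Python's standard `tarfile` module. The sketch below is illustrative and assumes it is run *before* the cleanup cell above, while `archive_path` and `tmp_dir` still exist.

# In[ ]:


import tarfile

# Extract the full archive into the temporary directory and list the result.
with tarfile.open(archive_path, "r:gz") as archive:
    archive.extractall(tmp_dir)
print(os.listdir(tmp_dir))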