
Radiant Earth Spot the Crop Challenge

A Guide to Accessing the Data on Radiant MLHub

This notebook walks you through the steps to get access to Radiant MLHub and download the data for the Radiant Earth Spot the Crop Challenge.

Radiant MLHub API

The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the Radiant MLHub site and about the organization behind it at the Radiant Earth Foundation site.

Full documentation for the API is available at docs.mlhub.earth.

Each item in our collections is described in JSON format, compliant with the STAC label extension definition.

Dependencies

This notebook utilizes the radiant-mlhub Python client for interacting with the API, as well as the pandas library. If you are running this notebook using Binder, these dependencies have already been installed. If you are running this notebook locally, you will need to install them yourself.

See the official radiant-mlhub docs for more documentation of the full functionality of that library.
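
If you are working locally, a typical installation looks like the following (this assumes a pip-based environment; adapt it to your own package manager if needed):

In [ ]:
# Install the MLHub client and pandas (only needed when not running on Binder)
!pip install radiant-mlhub pandas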

In [1]:
# Required libraries
from radiant_mlhub import Collection
import tarfile
import os
import json
from pathlib import Path
import pandas as pd

Downloading Datasets and Loading Asset File Paths into a Pandas Dataframe

The cells in this notebook show you how to download all of the datasets for this competition and read the STAC metadata into a pandas dataframe. Two dataframes are created, one for train and one for test, containing all of the information you need to filter by datetime, satellite platform, and asset type. Each row also contains the file path of the asset it describes. Assets with a None value in the datetime and satellite_platform columns are assets related to the label item.

You must replace the YOUR_API_KEY_HERE text with your API key, which you can obtain by creating a free account on the MLHub Dashboard and opening the API Keys tab at the top of the page.

In [5]:
os.environ['MLHUB_API_KEY'] = 'YOUR_API_KEY_HERE'

collections = [
    'ref_south_africa_crops_competition_v1_train_labels',
    'ref_south_africa_crops_competition_v1_train_source_s1', # Comment this out if you do not wish to download the Sentinel-1 Data
    'ref_south_africa_crops_competition_v1_train_source_s2',
    'ref_south_africa_crops_competition_v1_test_labels',
    'ref_south_africa_crops_competition_v1_test_source_s1', # Comment this out if you do not wish to download the Sentinel-1 Data
    'ref_south_africa_crops_competition_v1_test_source_s2'
]

def download(collection_id):
    print(f'Downloading {collection_id}...')
    collection = Collection.fetch(collection_id)
    path = collection.download('.')
    # Extract the downloaded archive, then delete it to save disk space
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall()
    os.remove(path)
    
def resolve_path(base, path):
    return Path(os.path.join(base, path)).resolve()
    
def load_df(collection_id):
    # Load the collection-level STAC metadata
    with open(f'{collection_id}/collection.json') as f:
        collection = json.load(f)

    # Collect the links to every label item in the collection
    item_links = [link['href'] for link in collection['links'] if link['rel'] == 'item']

    rows = []
    for item_link in item_links:
        item_path = f'{collection_id}/{item_link}'
        current_path = os.path.dirname(item_path)
        with open(item_path) as f:
            item = json.load(f)
        tile_id = item['id'].split('_')[-1]

        # Label assets have no datetime or satellite platform
        for asset_key, asset in item['assets'].items():
            rows.append([
                tile_id,
                None,
                None,
                asset_key,
                str(resolve_path(current_path, asset['href']))
            ])

        # Follow the 'source' links to the Sentinel-1/Sentinel-2 source items
        for link in item['links']:
            if link['rel'] != 'source':
                continue
            link_path = resolve_path(current_path, link['href'])
            source_path = os.path.dirname(link_path)
            # Skip source items that were not downloaded (e.g. Sentinel-1 commented out above)
            try:
                with open(link_path) as f:
                    source_item = json.load(f)
            except FileNotFoundError:
                continue
            datetime = source_item['properties']['datetime']
            satellite_platform = source_item['collection'].split('_')[-1]
            for asset_key, asset in source_item['assets'].items():
                rows.append([
                    tile_id,
                    datetime,
                    satellite_platform,
                    asset_key,
                    str(resolve_path(source_path, asset['href']))
                ])
    return pd.DataFrame(rows, columns=['tile_id', 'datetime', 'satellite_platform', 'asset', 'file_path'])

for c in collections:
    download(c)

train_df = load_df('ref_south_africa_crops_competition_v1_train_labels')
test_df = load_df('ref_south_africa_crops_competition_v1_test_labels')
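
With the metadata loaded, you can quickly inspect the dataframes. For example, rows with no datetime or satellite_platform correspond to label assets; a minimal sketch, run after the cell above:

In [ ]:
# Preview the training dataframe and count the label-related rows
print(train_df.head())
print(len(train_df[train_df['datetime'].isna()]))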

Filter on Asset Types

This cell selects the rows in the test dataframe corresponding to the field_ids label rasters.

In [6]:
test_df.loc[test_df['asset'] == 'field_ids']
Out[6]:
tile_id datetime satellite_platform asset file_path
1 0590 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...
87 1026 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...
173 0100 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...
219 0332 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...
305 0756 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...
... ... ... ... ... ...
68277 0376 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...
68363 1062 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...
68409 0382 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...
68455 0349 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...
68541 0947 None None field_ids /Users/kevinbooth/Projects/notebooks/Projects/...

1137 rows × 5 columns
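
The file_path column points at the raster files on disk, so a filtered dataframe can be used directly to load imagery. The following sketch assumes the rasterio package is installed (it is not part of the dependencies above) and that the assets are GeoTIFFs:

In [ ]:
import rasterio

# Illustrative example: open the first field_ids raster from the filtered dataframe
field_id_paths = test_df.loc[test_df['asset'] == 'field_ids', 'file_path']
with rasterio.open(field_id_paths.iloc[0]) as src:
    field_ids = src.read(1)  # 2D array of field IDs for this tile
print(field_ids.shape)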

Filter on Satellite Platform

This cell selects only the assets related to the Sentinel-1 source imagery.

In [7]:
test_df.loc[test_df['satellite_platform'] == 's1']
Out[7]:
tile_id datetime satellite_platform asset file_path
4 0590 2017-04-01T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
5 0590 2017-04-01T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...
6 0590 2017-04-06T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
7 0590 2017-04-06T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...
8 0590 2017-04-13T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
... ... ... ... ... ...
68621 0947 2017-11-15T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...
68622 0947 2017-11-20T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
68623 0947 2017-11-20T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...
68624 0947 2017-11-27T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
68625 0947 2017-11-27T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...

64078 rows × 5 columns
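
Filters can be combined. For example, to keep only the Sentinel-1 VV band assets (a minimal sketch):

In [ ]:
# Combine platform and asset filters: Sentinel-1 VV backscatter only
test_df.loc[(test_df['satellite_platform'] == 's1') & (test_df['asset'] == 'VV')]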

Filter on Datetime

This cell selects only the assets which fall within the specified datetime range.

In [8]:
test_df.loc[(test_df['datetime'] >= '2017-04-01T00:00:00+0000') & (test_df['datetime'] < '2017-05-01T00:00:00+0000')]
Out[8]:
tile_id datetime satellite_platform asset file_path
4 0590 2017-04-01T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
5 0590 2017-04-01T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...
6 0590 2017-04-06T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
7 0590 2017-04-06T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...
8 0590 2017-04-13T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
... ... ... ... ... ...
68551 0947 2017-04-18T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...
68552 0947 2017-04-25T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
68553 0947 2017-04-25T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...
68554 0947 2017-04-30T00:00:00+0000 s1 VH /Users/kevinbooth/Projects/notebooks/Projects/...
68555 0947 2017-04-30T00:00:00+0000 s1 VV /Users/kevinbooth/Projects/notebooks/Projects/...

9270 rows × 5 columns
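
The string comparison above works because the timestamps are ISO 8601 formatted. If you prefer to work with real datetime objects, you can convert the column with pandas; a minimal sketch:

In [ ]:
# Parse the ISO strings into timezone-aware timestamps (label rows become NaT)
df = test_df.copy()
df['datetime'] = pd.to_datetime(df['datetime'])
start, end = pd.Timestamp('2017-04-01', tz='UTC'), pd.Timestamp('2017-05-01', tz='UTC')
df.loc[(df['datetime'] >= start) & (df['datetime'] < end)]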