The Radiant MLHub API gives access to open Earth imagery training data for machine learning applications. You can learn more about the repository at the Radiant MLHub site and about the organization behind it at the Radiant Earth Foundation site.
Full documentation for the API is available at docs.mlhub.earth.
Each item in the collection is described in JSON format that complies with the STAC label extension definition.
This notebook uses the radiant-mlhub Python client to interact with the API, along with the pandas library. If you are running this notebook on Binder, these dependencies are already installed. If you are running it locally, you will need to install them yourself.
See the official radiant-mlhub docs for full documentation of that library's functionality.
# Required libraries
from radiant_mlhub import Collection
import tarfile
import os
import json
from pathlib import Path
import pandas as pd
The cells in this notebook show how to download all of the datasets for this competition and read the STAC metadata into pandas dataframes. There are two dataframes, one for train and one for test, which contain the information you need to filter by datetime, satellite platform, and asset type. Each row also contains the file path of the asset it describes. Assets with a None value in the datetime and satellite_platform columns belong to the label item.
You must replace the YOUR_API_KEY_HERE text with your API key, which you can obtain by creating a free account on the MLHub Dashboard and opening the API Keys tab at the top of the page.
os.environ['MLHUB_API_KEY'] = 'YOUR_API_KEY_HERE'
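It is easy to run the download cells with the placeholder still in place. As a minimal sketch, the check below confirms that the MLHUB_API_KEY environment variable holds something other than the placeholder. The helper name `api_key_is_set` is hypothetical and is not part of the radiant-mlhub client.

```python
import os

PLACEHOLDER = 'YOUR_API_KEY_HERE'  # the placeholder text used in this notebook


def api_key_is_set(env=None):
    """Return True if MLHUB_API_KEY is present and is not the placeholder.

    Hypothetical helper for illustration; not part of radiant-mlhub.
    """
    env = os.environ if env is None else env
    key = env.get('MLHUB_API_KEY', '').strip()
    return bool(key) and key != PLACEHOLDER
```

Calling `api_key_is_set()` before the download loop, and stopping if it returns False, surfaces the problem before any request is made.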
collections = [
    'ref_south_africa_crops_competition_v1_train_labels',
    'ref_south_africa_crops_competition_v1_train_source_s1', # Comment this out if you do not wish to download the Sentinel-1 Data
    'ref_south_africa_crops_competition_v1_train_source_s2',
    'ref_south_africa_crops_competition_v1_test_labels',
    'ref_south_africa_crops_competition_v1_test_source_s1', # Comment this out if you do not wish to download the Sentinel-1 Data
    'ref_south_africa_crops_competition_v1_test_source_s2'
]
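def download(collection_id):
    print(f'Downloading {collection_id}...')
    collection = Collection.fetch(collection_id)
    # Download the collection archive into the current directory
    path = collection.download('.')
    # Extract the archive, then delete it to save disk space
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall()
    os.remove(path)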
def resolve_path(base, path):
    return Path(os.path.join(base, path)).resolve()
def load_df(collection_id):
    with open(f'{collection_id}/collection.json', 'r') as f:
        collection = json.load(f)
    rows = []
    # Collect the relative paths of every item in the collection
    item_links = []
    for link in collection['links']:
        if link['rel'] != 'item':
            continue
        item_links.append(link['href'])
    for item_link in item_links:
        item_path = f'{collection_id}/{item_link}'
        current_path = os.path.dirname(item_path)
        with open(item_path, 'r') as f:
            item = json.load(f)
        tile_id = item['id'].split('_')[-1]
        # Label assets have no datetime or satellite platform
        for asset_key, asset in item['assets'].items():
            rows.append([
                tile_id,
                None,
                None,
                asset_key,
                str(resolve_path(current_path, asset['href']))
            ])
        # Follow each 'source' link to the imagery item it points to
        for link in item['links']:
            if link['rel'] != 'source':
                continue
            link_path = resolve_path(current_path, link['href'])
            source_path = os.path.dirname(link_path)
            try:
                with open(link_path, 'r') as f:
                    source_item = json.load(f)
            except FileNotFoundError:
                # The source collection was not downloaded; skip it
                continue
            datetime = source_item['properties']['datetime']
            satellite_platform = source_item['collection'].split('_')[-1]
            for asset_key, asset in source_item['assets'].items():
                rows.append([
                    tile_id,
                    datetime,
                    satellite_platform,
                    asset_key,
                    str(resolve_path(source_path, asset['href']))
                ])
    return pd.DataFrame(rows, columns=['tile_id', 'datetime', 'satellite_platform', 'asset', 'file_path'])
for c in collections:
    download(c)
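Re-running the loop above fetches every archive again. As a rough sketch, a check like the one below could be used to skip collections that are already on disk. The helper `is_extracted` is hypothetical, and it assumes that each archive unpacks into a folder named after the collection ID containing a collection.json file, which is the layout load_df reads from.

```python
from pathlib import Path


def is_extracted(collection_id, base='.'):
    """Return True if <base>/<collection_id>/collection.json already exists.

    Hypothetical helper; assumes the archive unpacks into a folder named
    after the collection ID.
    """
    return (Path(base) / collection_id / 'collection.json').is_file()
```

With this in place, the loop body could call download(c) only when `is_extracted(c)` returns False.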
train_df = load_df('ref_south_africa_crops_competition_v1_train_labels')
test_df = load_df('ref_south_africa_crops_competition_v1_test_labels')
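Because label assets carry None in the datetime and satellite_platform columns, the rows can be split into label rows and source imagery rows with a null check. The sketch below uses a tiny stand-in dataframe with the same columns that load_df returns; the tile IDs, asset keys, and file paths in it are made up for illustration.

```python
import pandas as pd

# Stand-in for train_df / test_df; all values here are invented.
df = pd.DataFrame(
    [
        ['0001', None, None, 'labels', '/data/labels_0001.tif'],
        ['0001', '2017-04-01T00:00:00+0000', 's1', 'VV', '/data/s1_0001_VV.tif'],
        ['0001', '2017-04-03T00:00:00+0000', 's2', 'B04', '/data/s2_0001_B04.tif'],
    ],
    columns=['tile_id', 'datetime', 'satellite_platform', 'asset', 'file_path'],
)

# Rows without a platform describe label assets; the rest are source imagery.
label_rows = df[df['satellite_platform'].isna()]
source_rows = df[df['satellite_platform'].notna()]

# Count the source assets available for each platform.
assets_per_platform = source_rows.groupby('satellite_platform').size()
```

The same two masks can be applied to train_df or test_df directly.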
This cell selects the rows of the test dataframe whose asset is the field_ids raster for the labels.
test_df.loc[test_df['asset'] == 'field_ids']
This cell selects only the assets that belong to the Sentinel-1 source imagery.
test_df.loc[test_df['satellite_platform'] == 's1']
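Filters on several columns can be combined with boolean masks. The sketch below wraps that pattern in a hypothetical helper, `select_paths`, which returns the file paths matching an optional platform and an optional asset key. It assumes only the column names produced by load_df.

```python
import pandas as pd


def select_paths(df, platform=None, asset=None):
    """Return file paths of rows matching the given platform and/or asset key.

    Hypothetical helper; a filter left as None is not applied.
    """
    mask = pd.Series(True, index=df.index)
    if platform is not None:
        mask &= df['satellite_platform'] == platform
    if asset is not None:
        mask &= df['asset'] == asset
    return df.loc[mask, 'file_path'].tolist()
```

For example, `select_paths(test_df, platform='s1')` would list the paths of every Sentinel-1 asset in the test dataframe.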
This cell selects only the assets whose datetime falls within the specified range (start inclusive, end exclusive).
test_df.loc[(test_df['datetime'] >= '2017-04-01T00:00:00+0000') & (test_df['datetime'] < '2017-05-01T00:00:00+0000')]
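The comparison above is made on strings, so it matches chronological order only while every timestamp shares the same format and UTC offset. As an alternative sketch, the datetime column can be parsed into timezone-aware timestamps first. The dataframe below is a made-up stand-in, not competition data.

```python
import pandas as pd

# Stand-in for the 'datetime' and 'asset' columns; values are invented.
df = pd.DataFrame({
    'datetime': [
        None,
        '2017-04-01T00:00:00+0000',
        '2017-04-15T00:00:00+0000',
        '2017-05-01T00:00:00+0000',
    ],
    'asset': ['labels', 'VV', 'VV', 'VV'],
})

# Parse to UTC timestamps; None becomes NaT, which never satisfies a comparison.
parsed = pd.to_datetime(df['datetime'], utc=True)

start = pd.Timestamp('2017-04-01', tz='UTC')
end = pd.Timestamp('2017-05-01', tz='UTC')

# Keep rows in [start, end): start inclusive, end exclusive.
in_april = df[(parsed >= start) & (parsed < end)]
```

Parsing once and reusing the result also makes it straightforward to group by month or sort chronologically.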