This notebook walks through the steps to submit data to ESS-DIVE through the API. It is a tutorial notebook and contains extra guidance and information about the API and the metadata creation process.
The ESS-DIVE Dataset API is a service that enables projects to programmatically submit and manage datasets with ESS-DIVE, as an alternative to using the ESS-DIVE Online form for data uploads. The service encodes metadata using the JSON-LD specification, a standard for encoding Linked Data in JSON that Google will use to index metadata for searches. Using the standardized JSON-LD schema dramatically increases the visibility of datasets, and also lets projects write submission code once and reuse it for periodic uploads to ESS-DIVE.
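For orientation, here is a minimal sketch of the kind of JSON-LD object this notebook builds. The field values are placeholders; the complete object is assembled step by step in the cells below.
# Minimal JSON-LD dataset record (placeholder values; the real object is built later in this notebook)
minimal_json_ld = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset title",
    "description": ["A short abstract describing the dataset."],
    "creator": [{"givenName": "Ada", "familyName": "Lovelace"}]
}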
⭐ Current maximum upload limit: 500 GB per upload attempt.
Please contact ess-dive-support@lbl.gov if you need to submit more than 500 GB of data at once; additional permissions are required.
For additional information about the API, review the documentation at https://api-sandbox.ess-dive.lbl.gov.
Email ESS-DIVE at ess-dive-support@lbl.gov if you require assistance.
⭐️ If you are not already registered to submit data with ESS-DIVE, follow the steps on the Register to Submit Data page: https://docs.ess-dive.lbl.gov/contributing-data/new-contributor-registration.
%pip install requests
Requirement already satisfied: requests in /Users/emilyarobles/anaconda3/lib/python3.9/site-packages (2.28.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/emilyarobles/anaconda3/lib/python3.9/site-packages (from requests) (1.26.9)
Requirement already satisfied: certifi>=2017.4.17 in /Users/emilyarobles/anaconda3/lib/python3.9/site-packages (from requests) (2021.10.8)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/emilyarobles/anaconda3/lib/python3.9/site-packages (from requests) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /Users/emilyarobles/anaconda3/lib/python3.9/site-packages (from requests) (3.3)
Note: you may need to restart the kernel to use updated packages.
Enter your Authentication Token below by running the cell, then pasting your token into the text box. Do not re-run the cell after entering your token, unless you are updating your token. See step 1 for instructions to access your token through ESS-DIVE.
Reminder: Tokens expire every 24 hours. Always repeat this setup step when you update your token.
import requests
import os
import json
from ipywidgets import widgets, interact
import re
token_text = widgets.Text("", description="Token:")
display(token_text)
Text(value='', description='Token:')
token = token_text.value
base = "https://api-sandbox.ess-dive.lbl.gov/"
header_authorization = "bearer {}".format(token)
endpoint = "packages"
if token == '':
    print('Please enter your token after running the cell above.')
else:
    print('You have successfully entered your token. Please remember to update every 24hrs.')
You have successfully entered your token. Please remember to update every 24hrs.
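Optionally, you can verify the token before continuing. This is a sketch that assumes the token grants access to the GET packages endpoint used at the end of this notebook; a 401 response usually means the token is missing or expired.
# Optional sanity check: confirm the token is accepted by the API
check_response = requests.get("{}{}".format(base, endpoint),
                              headers={"Authorization": header_authorization})
if check_response.status_code == 401:
    print("Token rejected (401). Refresh your token and re-run the setup cells.")
else:
    print("Token accepted (status {}).".format(check_response.status_code))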
The following cells build the JSON-LD metadata for a single dataset. The original code for creating a JSON object can be found at https://docs.ess-dive.lbl.gov/programmatic-tools/ess-dive-dataset-api/python-example#create-metadata. For the purposes of this tutorial, we have organized the elements into smaller groups.
For each metadata field we have provided general guidance on what to include. For more information about our requirements for metadata quality and to review checks that are performed during the publication review process, please see our Dataset Requirements documentation: https://docs.ess-dive.lbl.gov/contributing-data/package-level-metadata
First, let's add your information as provider. You will need your project name and personal information, the same as it appears in your ESS-DIVE profile:
Revise the example information directly in the cell below. After running the cell, check that the output matches your project information correctly.
provider_details = {
"name": "SPRUCE",
"member": {
"@id": "http://orcid.org/0000-0001-7293-3561",
"givenName": "Paul",
"familyName": "Hanson",
"email": "hansonpj@ornl.gov",
"jobTitle": "Principal Investigator"
}
}
#DO NOT EDIT BELOW
print('Please review your information and revise if needed.')
print(f"Project name: {provider_details['name']}\nSubmitter ORCID: {provider_details['member']['@id']}\nFirst name: {provider_details['member']['givenName']}\nLast name: {provider_details['member']['familyName']}\nEmail: {provider_details['member']['email']}\nJob title: {provider_details['member']['jobTitle']}")
Please review your information and revise if needed.
Project name: SPRUCE
Submitter ORCID: http://orcid.org/0000-0001-7293-3561
First name: Paul
Last name: Hanson
Email: hansonpj@ornl.gov
Job title: Principal Investigator
Coming soon: Use project identifier instead of manually entering project metadata.
#provider_spruce = {
# "identifier": {
# "@type": "PropertyValue",
# "propertyID": "ess-dive",
# "value": "1e6d50d3-9532-43fb-a63f-bdcb4350bf0c"
# }
# }
Revise the examples in the cell below to include dataset authors in the order that you would like them to appear in the citation. Please add the ORCID for all authors, especially the first author, if possible. You can add or delete authors as needed.
creators = [
{
"@id": "http://orcid.org/0000-0001-7293-3561",
"givenName": "Paul J",
"familyName": "Hanson",
"affiliation": "Oak Ridge National Laboratory",
"email": "hansonpj@ornl.gov"
},
{
"givenName": "Jeffrey",
"familyName": "Riggs",
"affiliation": "Oak Ridge National Laboratory"
},
{
"givenName": "C",
"familyName": "Nettles",
"affiliation": "Oak Ridge National Laboratory"
},
{
"givenName": "William",
"familyName": "Dorrance",
"affiliation": "Oak Ridge National Laboratory"
},
{
"givenName": "Les",
"familyName": "Hook",
"affiliation": "Oak Ridge National Laboratory"
}
]
# DO NOT EDIT BELOW
author_list_preview = []
for i in creators:
    names = i['familyName'] + ' ' + i['givenName'][0]
    author_list_preview.append(names)
print(f'This is how your author list will appear in citation. Please make revisions if needed. \n{author_list_preview}')
This is how your author list will appear in citation. Please make revisions if needed.
['Hanson P', 'Riggs J', 'Nettles C', 'Dorrance W', 'Hook L']
List the person who should be contacted by users seeking further information about the data. Only one contact is allowed. Including this individual's ORCID is strongly encouraged.
contact = {
"@id": "http://orcid.org/0000-0001-7293-3561",
"givenName": "Paul J",
"familyName": "Hanson",
"email": "hansonpj@ornl.gov"
}
Replace the example below with a title 7-20 words in length.
dataset_title = "SPRUCE S1 Bog Environmental Monitoring Data: 2010-2016"
# DO NOT EDIT
print(dataset_title)
SPRUCE S1 Bog Environmental Monitoring Data: 2010-2016
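Since the guidance above asks for a 7-20 word title, you can optionally check the length using the same regex-based word count applied to the abstract later in this notebook.
# Optional: count words in the title against the 7-20 word guidance
title_word_count = len(re.findall(r'\w+', dataset_title))
print(f"Dataset title is {title_word_count} words in length.")
if not 7 <= title_word_count <= 20:
    print("Consider revising the title to be 7-20 words long.")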
Enter all variables that you would like to include in your dataset, separated by commas. You can use your own variables, or choose standard names from the Global Change Master Directory (GCMD) Keywords list.
variables = 'ENTER, VARIABLES, HERE'
# DO NOT EDIT
variables_list = [v.strip() for v in variables.split(',')]  # strip stray spaces after commas
print(variables_list)
Repeat for keywords; enter all keywords that you would like to include in your dataset, separated by commas. You can use your own keywords, or choose standard names from the Global Change Master Directory (GCMD) Keywords list.
keywords = 'ENTER, KEYWORDS, HERE'
# DO NOT EDIT
keywords_list = [k.strip() for k in keywords.split(',')]  # strip stray spaces after commas
print(keywords_list)
Enter an abstract for your dataset below. The abstract should be at least 100 words in length and include all details necessary to understand the purpose of your dataset and the research question it addresses, in addition to information required for data use and reproducibility.
abstract = ['This data set reports selected ambient environmental monitoring data from the S1 bog in Minnesota for the period June 2010 through December 2016. Measurements of the environmental conditions at these stations will serve as a pre-treatment baseline for experimental treatments and provide driver data for future modeling activities. The site is the S1 bog, a Picea mariana [black spruce] – Sphagnum spp. bog forest in northern Minnesota, 40 km north of Grand Rapids, in the USDA Forest Service Marcell Experimental Forest (MEF). There are/were three monitoring sites located in the bog: Stations 1 and 2 are co-located at the southern end of the bog and Station 3 is located north central and adjacent to an existing U.S. Forest Service monitoring well. There are eight data files with selected results of ambient environmental monitoring in the S1 bog for the period June 2010 through December 2016. One file has the complete set of measurements and the other seven have the available data for a given calendar year. Not all measurements started in June 2010 and EM3 measurements ended in May 2014. Further details about the data package are in the attached pdf file (SPRUCE_EM_DATA_2010_2016_20170620).']
# DO NOT EDIT
print(f'Abstract: {abstract[0]}')
for i in abstract:
    res = len(re.findall(r'\w+', i))
    print(f"\nDataset abstract is {res} words in length.")
Abstract: This data set reports selected ambient environmental monitoring data from the S1 bog in Minnesota for the period June 2010 through December 2016. Measurements of the environmental conditions at these stations will serve as a pre-treatment baseline for experimental treatments and provide driver data for future modeling activities. The site is the S1 bog, a Picea mariana [black spruce] – Sphagnum spp. bog forest in northern Minnesota, 40 km north of Grand Rapids, in the USDA Forest Service Marcell Experimental Forest (MEF). There are/were three monitoring sites located in the bog: Stations 1 and 2 are co-located at the southern end of the bog and Station 3 is located north central and adjacent to an existing U.S. Forest Service monitoring well. There are eight data files with selected results of ambient environmental monitoring in the S1 bog for the period June 2010 through December 2016. One file has the complete set of measurements and the other seven have the available data for a given calendar year. Not all measurements started in June 2010 and EM3 measurements ended in May 2014. Further details about the data package are in the attached pdf file (SPRUCE_EM_DATA_2010_2016_20170620).

Dataset abstract is 196 words in length.
Revise the example in the cell below to add the description and coordinates of your data collection sites. You can add more sites if needed.
geographic_info = [
{
"description": "Site ID: S1 Bog Site name: S1 Bog, Marcell Experimental Forest Description: The site is the 8.1-ha S1 bog, a Picea mariana [black spruce] - Sphagnum spp. ombrotrophic bog forest in northern Minnesota, 40 km north of Grand Rapids, in the USDA Forest Service Marcell Experimental Forest (MEF). The S1 bog was harvested in successive strip cuts in 1969 and 1974 and the cut areas were allowed to naturally regenerate. Stations 1 and 2 are located in a 1974 strip that is characterized by a medium density of 3-5 meter black spruce and larch trees with an open canopy. The area was suitable for siting a monitoring station for representative meteorological conditions on the S1 bog. Station 3 is located in a 1969 harvest strip that is characterized by a higher density of 3-5 meter black spruce and larch trees with a generally closed canopy. Measurements at this station represent conditions in the surrounding stand. Site Photographs are in the attached document",
"geo": [
{
"name": "Northwest",
"latitude": 47.50285,
"longitude": -93.48283
},
{
"name": "Southeast",
"latitude": 47.50285,
"longitude": -93.48283
}
]
},
# SITE TWO
{
"description": "Description of sectond site",
"geo": [
{
"name": "Northwest",
"latitude": 47.50285,
"longitude": -93.48283
},
{
"name": "Southeast",
"latitude": 47.50285,
"longitude": -93.48283
}
]
}
]
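As an optional sanity check, the sketch below flags any coordinates outside the valid ranges (latitude -90 to 90, longitude -180 to 180). It only checks ranges; it does not verify that the points actually describe your sites.
# Optional: flag coordinates outside valid latitude/longitude ranges
for site in geographic_info:
    for point in site["geo"]:
        if not (-90 <= point["latitude"] <= 90) or not (-180 <= point["longitude"] <= 180):
            print(f"Check {point['name']}: ({point['latitude']}, {point['longitude']}) is out of range.")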
Enter start and end dates for your data in YYYY-MM-DD format.
start_end_dates = {
"startDate": "2010-07-16",
"endDate": "2016-12-31"
}
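Optionally, you can confirm both dates parse as YYYY-MM-DD before submitting; this sketch uses only the standard library.
from datetime import datetime

# Optional: confirm both dates are valid YYYY-MM-DD strings
for label, value in start_end_dates.items():
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        print(f"{label} ({value}) is not in YYYY-MM-DD format.")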
Revise the example text below to describe the methods used to collect and process your data.
methods = ['The stations are equipped with standard sensors for measuring meteorological parameters, solar radiation, soil temperature and moisture, and groundwater temperature and elevation. Note that some sensor locations are relative to nearby vegetation and bog microtopographic features (i.e., hollows and hummocks). See Table 1 in the attached pdf (SPRUCE_EM_DATA_2010_2016_20170620) for a list of measurements and further details. Sensors and data loggers were initially installed and became operational in June, July, and August of 2010. Additional sensors were added in September 2011. Station 3 was removed from service on May 12, 2014. These data are considered at Quality Level 1. Level 1 indicates an internally consistent data product that has been subjected to quality checks and data management procedures. Established calibration procedures were followed.']
# DO NOT EDIT
print(f'Methods: {methods[0]}')
Methods: The stations are equipped with standard sensors for measuring meteorological parameters, solar radiation, soil temperature and moisture, and groundwater temperature and elevation. Note that some sensor locations are relative to nearby vegetation and bog microtopographic features (i.e., hollows and hummocks). See Table 1 in the attached pdf (SPRUCE_EM_DATA_2010_2016_20170620) for a list of measurements and further details. Sensors and data loggers were initially installed and became operational in June, July, and August of 2010. Additional sensors were added in September 2011. Station 3 was removed from service on May 12, 2014. These data are considered at Quality Level 1. Level 1 indicates an internally consistent data product that has been subjected to quality checks and data management procedures. Established calibration procedures were followed.
Revise the example text below to include information about project funding. All datasets submitted to ESS-DIVE should include DOE BER in their list of funders (see below), except in special cases.
funding = {
"@id": "http://dx.doi.org/10.13039/100006206",
"name": "U.S. DOE > Office of Science > Biological and Environmental Research (BER)"
}
The cell below incorporates all of your previous entries and creates your JSON-LD object.
There are three additional fields you may wish to revise: "@id", "datePublished", and "license". If your dataset has been previously published and you would like to publish it on ESS-DIVE using the same DOI, enter that DOI in the "@id" field (as in the example). "datePublished" defaults to the current date; if your dataset has been previously published, enter the year of publication instead (2015 in the example). Additionally, all datasets carry the default license: http://creativecommons.org/licenses/by/4.0/
json_ld = {
"@context": "http://schema.org/",
"@type": "Dataset",
"@id": "http://dx.doi.org/10.3334/CDIAC/spruce.001",
"name": dataset_title,
"description": abstract,
"creator": creators,
"datePublished": "2015",
"keywords": keywords_list,
"variableMeasured": variables_list,
"license": "http://creativecommons.org/licenses/by/4.0/",
"spatialCoverage": geographic_info,
"funder": funding,
"temporalCoverage": start_end_dates,
"editor": contact,
"provider": provider_details,
"measurementTechnique": methods
}
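Before submitting, you may want to preview the assembled object. This optional snippet pretty-prints it with the json module imported earlier.
# Optional: preview the assembled JSON-LD object before submitting
print(json.dumps(json_ld, indent=2))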
There are three options for creating a new dataset:
Use the following cell to submit only your JSON-LD object. This will create a dataset with only metadata and no files.
post_packages_url = "{}{}".format(base, endpoint)
post_package_response = requests.post(post_packages_url,
                                      headers={"Authorization": header_authorization},
                                      json=json_ld)

if post_package_response.status_code == 201:
    # Success
    response = post_package_response.json()
    print(f"View URL:{response['viewUrl']}")
    print(f"Name:{response['dataset']['name']}")
else:
    # There was an error
    print(post_package_response.text)
View URL:https://data-sandbox.ess-dive.lbl.gov/view/ess-dive-9de3d58449bb69a-20230511T050754345911
Name:SPRUCE S1 Bog Environmental Monitoring Data: 2010-2016
To submit the JSON-LD object along with a data file, use the following cell block. Replace "file_path" with the path to your file.
files_tuples_array = []
upload_file = "file_path"

files_tuples_array.append(("json-ld", json.dumps(json_ld)))
files_tuples_array.append(("data", open(upload_file, 'rb')))

post_packages_url = "{}{}".format(base, endpoint)
post_package_response = requests.post(post_packages_url,
                                      headers={"Authorization": header_authorization},
                                      files=files_tuples_array)

if post_package_response.status_code == 201:
    # Success
    response = post_package_response.json()
    print(f"View URL:{response['viewUrl']}")
    print(f"Name:{response['dataset']['name']}")
else:
    # There was an error
    print(post_package_response.text)
If you have many files to upload, you can place them all inside a single directory and point the following code at it (replace the example path with your own):
files_tuples_array = []
files_upload_directory = "/Users/emilyarobles/Desktop/API_TEST_FILES/"
files = os.listdir(files_upload_directory)

files_tuples_array.append(("json-ld", json.dumps(json_ld)))
for filename in files:
    file_directory = files_upload_directory + filename
    files_tuples_array.append(("data", open(file_directory, 'rb')))

post_packages_url = "{}{}".format(base, endpoint)
post_package_response = requests.post(post_packages_url,
                                      headers={"Authorization": header_authorization},
                                      files=files_tuples_array)

if post_package_response.status_code == 201:
    # Success
    response = post_package_response.json()
    print(f"View URL:{response['viewUrl']}")
    print(f"Name:{response['dataset']['name']}")
else:
    # There was an error
    print(post_package_response.text)
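The upload cells above open file handles that are never explicitly closed. This is usually harmless in a short-lived notebook session, but if you prefer to clean up, the following sketch works with the tuple structure used here.
# Optional cleanup: close every file handle opened for the upload
for part_name, value in files_tuples_array:
    if hasattr(value, "close"):
        value.close()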
It is possible to update both the metadata and the data files of an existing dataset. The following examples demonstrate updating metadata and adding new files to the dataset created in the previous sections.
Use the PUT function to update the metadata of a dataset. This example updates the dataset's title (name). You will need the ESS-DIVE identifier of the dataset you want to revise.
dataset_id = input('Enter an ESS-DIVE Identifier here: ')
put_package_url = "{}{}/{}".format(base, endpoint, dataset_id)
metadata_update_dict = {"name": "Updated Dataset Name"}

put_package_response = requests.put(put_package_url,
                                    headers={"Authorization": header_authorization},
                                    json=metadata_update_dict)
Check the results for the changed metadata attribute.
# Check for errors
if put_package_response.status_code == 200:
    # Success
    response = put_package_response.json()
    print(f"View URL:{response['viewUrl']}")
    print(f"Name:{response['dataset']['name']}")
else:
    # There was an error
    print(put_package_response.text)
Use the PUT function to update a dataset's metadata and files together. This example sets the dataset's date published to 2019 and adds a new data file.
dataset_id = input('Enter an ESS-DIVE Identifier here: ')

files_tuples_array = []
upload_file = "path/to/your_file"
metadata_update_dict = {"datePublished": "2019"}

files_tuples_array.append(("json-ld", json.dumps(metadata_update_dict)))
files_tuples_array.append(("data", open(upload_file, 'rb')))

put_package_url = "{}{}/{}".format(base, endpoint, dataset_id)
put_package_response = requests.put(put_package_url,
                                    headers={"Authorization": header_authorization},
                                    files=files_tuples_array)
Check the results for the changed metadata attribute.
# Check for errors
if put_package_response.status_code == 200:
    # Success
    response = put_package_response.json()
    print(f"View URL:{response['viewUrl']}")
    print(f"Date Published:{response['dataset']['datePublished']}")
    print(f"Files In Dataset:{response['dataset']['distribution']}")
else:
    # There was an error
    print(put_package_response.text)
Finally, use the GET function to retrieve the datasets you have submitted.
get_packages_url = "{}{}".format(base, endpoint)
get_packages_response = requests.get(get_packages_url,
                                     headers={"Authorization": header_authorization})

if get_packages_response.status_code == 200:
    # Success
    print(get_packages_response.json())
else:
    # There was an error
    print(get_packages_response.text)