Figuring out how to transform a day's ONC CTD data from a SoG node into a netCDF file that is part of an ERDDAP dataset:

- include only qaqcFlag == 1 samples
- store the qaqcFlag arrays as variable attributes
- calculate mean, standard deviation, and count for each variable in each time bin
- generate a /opt/tomcat/content/erddap/datasets.xml fragment

from collections import OrderedDict
import os
import arrow
from lxml import etree
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from salishsea_tools import data_tools
from salishsea_tools.places import PLACES
%matplotlib inline
scalardata Web Service

Access to the ONC web services requires a user token which you can generate on the Web Services API tab of your ONC account profile page. I have stored mine in an environment variable so as not to publish it to the world in this notebook.
TOKEN = os.environ['ONC_USER_TOKEN']
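For this to work, the ONC_USER_TOKEN environment variable must be set in the shell before the notebook server is launched; a minimal sketch, assuming a bash-like shell (the token value is a placeholder):

$ export ONC_USER_TOKEN=<your token>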
Request a day's worth of CTD salinity and temperature data and parse them into an xarray.Dataset:
onc_data = data_tools.get_onc_data(
'scalardata', 'getByStation', TOKEN,
station='SCVIP', deviceCategory='CTD',
sensors='salinity,temperature',
dateFrom=data_tools.onc_datetime('2015-12-27 00:00', 'utc'),
)
ctd_data = data_tools.onc_json_to_dataset(onc_data)
ctd_data
<xarray.Dataset> Dimensions: (sampleTime: 86398) Coordinates: * sampleTime (sampleTime) datetime64[ns] 2015-12-27T00:00:00.259000 ... Data variables: salinity (sampleTime) float64 31.23 31.23 31.23 31.23 31.23 31.23 ... temperature (sampleTime) float64 9.959 9.959 9.959 9.959 9.959 9.959 ... Attributes: dateFrom: 2015-12-27T00:00:00.000Z deviceCategory: CTD dateTo: None rowLimit: None outputFormat: None sensors: salinity,temperature nextDateFrom: 2015-12-28T00:00:00.090Z totalActualSamples: 172796 station: SCVIP
ctd_data.data_vars['salinity']
<xarray.DataArray 'salinity' (sampleTime: 86398)> array([ 31.22826354, 31.22836401, 31.22826354, ..., 28.26344878, 28.26375019, 28.26415208]) Coordinates: * sampleTime (sampleTime) datetime64[ns] 2015-12-27T00:00:00.259000 ... Attributes: unitOfMeasure: g/kg qaqcFlag: [1 1 1 ..., 4 4 4] sensorName: Reference Salinity actualSamples: 86398
qaqcFlag Values

Filter the salinity data to exclude samples for which qaqcFlag != 1:
ctd_data.salinity
<xarray.DataArray 'salinity' (sampleTime: 86398)> array([ 31.22826354, 31.22836401, 31.22826354, ..., 28.26344878, 28.26375019, 28.26415208]) Coordinates: * sampleTime (sampleTime) datetime64[ns] 2015-12-27T00:00:00.259000 ... Attributes: unitOfMeasure: g/kg qaqcFlag: [1 1 1 ..., 4 4 4] sensorName: Reference Salinity actualSamples: 86398
salinity_qaqc_mask = ctd_data.salinity.attrs['qaqcFlag'] == 1
salinity = xr.DataArray(
name='salinity',
data=ctd_data.salinity[salinity_qaqc_mask].values,
coords={'time': ctd_data.salinity.sampleTime[salinity_qaqc_mask].values},
)
salinity
<xarray.DataArray 'salinity' (time: 54393)> array([ 31.22826354, 31.22836401, 31.22826354, ..., 31.22977061, 31.22967014, 31.22977061]) Coordinates: * time (time) datetime64[ns] 2015-12-27T00:00:00.259000 ...
Filter the temperature data to exclude samples for which qaqcFlag != 1:
ctd_data.temperature
<xarray.DataArray 'temperature' (sampleTime: 86398)> array([ 9.959 , 9.959 , 9.9591, ..., 9.9531, 9.9533, 9.9531]) Coordinates: * sampleTime (sampleTime) datetime64[ns] 2015-12-27T00:00:00.259000 ... Attributes: unitOfMeasure: C qaqcFlag: [1 1 1 ..., 1 1 1] sensorName: Temperature actualSamples: 86398
temperature_qaqc_mask = ctd_data.temperature.attrs['qaqcFlag'] == 1
temperature = xr.DataArray(
name='temperature',
data=ctd_data.temperature[temperature_qaqc_mask].values,
coords={'time': ctd_data.temperature.sampleTime[temperature_qaqc_mask].values},
)
temperature
<xarray.DataArray 'temperature' (time: 86395)> array([ 9.959 , 9.959 , 9.9591, ..., 9.9531, 9.9533, 9.9531]) Coordinates: * time (time) datetime64[ns] 2015-12-27T00:00:00.259000 ...
Station-specific metadata for the dataset:
xr_metadata = {
'SCVIP': {
'place_name': 'Central node',
'ONC_station': 'Central',
'ONC_stationCode': PLACES['Central node']['ONC stationCode'],
'ONC_stationDescription':
'Pacific, Salish Sea, Strait of Georgia, Central, Strait of Georgia VENUS Instrument Platform',
'ONC_data_product_url': 'http://dmas.uvic.ca/DataSearch?location=SCVIP&deviceCategory=CTD',
},
'SEVIP': {
'place_name': 'East node',
'ONC_station': 'East',
'ONC_stationCode': PLACES['East node']['ONC stationCode'],
'ONC_stationDescription':
'Pacific, Salish Sea, Strait of Georgia, East, Strait of Georgia VENUS Instrument Platform',
'ONC_data_product_url': 'http://dmas.uvic.ca/DataSearch?location=SEVIP&deviceCategory=CTD',
},
}
Define an aggregation function to count the samples in each resampling interval:
def count(values, axis):
    """Return the number of samples in a resampling interval."""
    return values.size
Create a dataset of resampled data and their statistics:
onc_station = 'SCVIP'
ds = xr.Dataset(
data_vars={
'salinity': xr.DataArray(
name='salinity',
data=salinity.resample('15Min', 'time', how='mean'),
attrs={
'ioos_category': 'Salinity',
'standard_name': 'sea_water_reference_salinity',
'long_name': 'reference salinity',
'units': 'g/kg',
'aggregation_operation': 'mean',
'aggregation_interval': 15*60,
'aggregation_interval_units': 'seconds',
},
),
'salinity_std_dev': xr.DataArray(
name='salinity_std_dev',
data=salinity.resample('15Min', 'time', how='std'),
attrs={
'ioos_category': 'Salinity',
                'standard_name': 'sea_water_reference_salinity_standard_deviation',
'long_name': 'reference salinity standard deviation',
'units': 'g/kg',
'aggregation_operation': 'standard deviation',
'aggregation_interval': 15*60,
'aggregation_interval_units': 'seconds',
},
),
'salinity_sample_count': xr.DataArray(
name='salinity_sample_count',
data=salinity.resample('15Min', 'time', how=count),
attrs={
'standard_name': 'sea_water_reference_salinity_sample_count',
'long_name': 'reference salinity sample count',
'aggregation_operation': 'count',
'aggregation_interval': 15*60,
'aggregation_interval_units': 'seconds',
},
),
'temperature': xr.DataArray(
name='temperature',
data=temperature.resample('15Min', 'time', how='mean'),
attrs={
'ioos_category': 'Temperature',
'standard_name': 'sea_water_temperature',
'long_name': 'temperature',
                'units': 'degrees_Celsius',
'aggregation_operation': 'mean',
'aggregation_interval': 15*60,
'aggregation_interval_units': 'seconds',
},
),
'temperature_std_dev': xr.DataArray(
name='temperature_std_dev',
data=temperature.resample('15Min', 'time', how='std'),
attrs={
'ioos_category': 'Temperature',
'standard_name': 'sea_water_temperature_standard_deviation',
'long_name': 'temperature standard deviation',
                'units': 'degrees_Celsius',
'aggregation_operation': 'standard deviation',
'aggregation_interval': 15*60,
'aggregation_interval_units': 'seconds',
},
),
'temperature_sample_count': xr.DataArray(
name='temperature_sample_count',
data=temperature.resample('15Min', 'time', how=count),
attrs={
'standard_name': 'sea_water_temperature_sample_count',
'long_name': 'temperature sample count',
'aggregation_operation': 'count',
'aggregation_interval': 15*60,
'aggregation_interval_units': 'seconds',
},
),
},
coords={
'depth': PLACES[xr_metadata[onc_station]['place_name']]['depth'],
'longitude': PLACES[xr_metadata[onc_station]['place_name']]['lon lat'][0],
'latitude': PLACES[xr_metadata[onc_station]['place_name']]['lon lat'][1],
},
attrs={
'history': """
{0} Download raw data from ONC scalardata API.
{0} Filter to exclude data with qaqcFlag != 1.
{0} Resample data to 15 minute intervals using mean, standard deviation and count as aggregation functions.
{0} Store as netCDF4 file.
""".format(arrow.now().format('YYYY-MM-DD HH:mm:ss')),
'ONC_station': xr_metadata[onc_station]['ONC_station'],
'ONC_stationCode': PLACES[xr_metadata[onc_station]['place_name']]['ONC stationCode'],
'ONC_stationDescription': xr_metadata[onc_station]['ONC_stationDescription'],
'ONC_data_product_url': xr_metadata[onc_station]['ONC_data_product_url'],
},
)
If any of the DataArrays are short compared to the others the missing values are filled with NaNs. That makes sense for temperature and salinity values, and their standard deviations, but not for their sample counts. So, we change NaNs to zeros in the sample count DataArrays:
ds.salinity_sample_count.values = np.nan_to_num(ds.salinity_sample_count.values)
ds.temperature_sample_count.values = np.nan_to_num(ds.temperature_sample_count.values)
ds
<xarray.Dataset> Dimensions: (time: 96) Coordinates: * time (time) datetime64[ns] 2015-12-27 ... longitude float64 -123.4 depth int64 294 latitude float64 49.04 Data variables: salinity (time) float64 31.23 31.23 31.23 31.23 31.23 ... salinity_std_dev (time) float64 0.0001472 0.0001813 0.0001711 ... temperature_sample_count (time) int64 900 900 900 900 900 900 900 900 ... temperature_std_dev (time) float64 0.0001285 0.0001598 0.000142 ... salinity_sample_count (time) float64 900.0 899.0 900.0 900.0 900.0 ... temperature (time) float64 9.959 9.959 9.959 9.959 9.958 ... Attributes: history: 2016-09-19 13:07:14 Download raw data from ONC scalardata API. 2016-09-19 13:07:14 Filter to exclude data with qaqcFlag != 1. 2016-09-19 13:07:14 Resample data to 15 minute intervals using mean, standard deviation and count as aggregation functions. 2016-09-19 13:07:14 Store as netCDF4 file. ONC_stationDescription: Pacific, Salish Sea, Strait of Georgia, Central, Strait of Georgia VENUS Instrument Platform ONC_stationCode: SCVIP ONC_station: Central ONC_data_product_url: http://dmas.uvic.ca/DataSearch?location=SCVIP&deviceCategory=CTD
print(ds.salinity_sample_count)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))
salinity.plot(ax=ax1)
ax1.set_title('Raw Data')
ds.salinity.plot(ax=ax2)
ax2.set_title('15 min Averaged')
ds.salinity_std_dev.plot(ax=ax3)
ax3.set_title('15 min Std Dev')
<xarray.DataArray 'salinity_sample_count' (time: 96)> array([ 900., 899., 900., 900., 900., 899., 899., 900., 900., 900., 899., 897., 900., 896., 899., 900., 899., 899., 900., 900., 899., 900., 899., 899., 900., 899., 899., 899., 899., 897., 900., 899., 900., 900., 900., 899., 899., 899., 899., 896., 895., 899., 899., 900., 898., 898., 900., 899., 899., 900., 900., 899., 900., 899., 900., 899., 898., 898., 900., 900., 446., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) Coordinates: * time (time) datetime64[ns] 2015-12-27 2015-12-27T00:15:00 ... longitude float64 -123.4 depth int64 294 latitude float64 49.04 Attributes: standard_name: sea_water_reference_salinity_sample_count aggregation_operation: count aggregation_interval: 900 aggregation_interval_units: seconds long_name: reference salinity sample count
<matplotlib.text.Text at 0x7fab0b9b0e10>
print(ds.temperature_sample_count)
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))
temperature.plot(ax=ax1)
ax1.set_title('Raw Data')
ds.temperature.plot(ax=ax2)
ax2.set_title('15 min Averaged')
ds.temperature_std_dev.plot(ax=ax3)
ax3.set_title('15 min Std Dev')
<xarray.DataArray 'temperature_sample_count' (time: 96)> array([900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 899, 901, 899, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 897, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 900, 899, 900, 900]) Coordinates: * time (time) datetime64[ns] 2015-12-27 2015-12-27T00:15:00 ... longitude float64 -123.4 depth int64 294 latitude float64 49.04 Attributes: standard_name: sea_water_temperature_sample_count aggregation_operation: count aggregation_interval: 900 aggregation_interval_units: seconds long_name: temperature sample count
<matplotlib.text.Text at 0x7fab0b8d7be0>
ERDDAP requires that all files in a dataset have the same units for their time variable. On the other hand, xarray defaults to using the first time value in the dataset as the time-base for the units. So, we have to explicitly define the time units as an encoding when the dataset is stored as a netCDF4 file.
ds.to_netcdf(
'/results/observations/ONC/CTD/{station}/{station}_CTD_15m_20151227.nc'
.format(station=onc_station),
encoding={'time': {'units': 'minutes since 1970-01-01 00:00'}})
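To confirm that the encoding took effect, the file can be re-opened with time decoding disabled so that the stored units string is visible. This check is an illustration added here, not part of the original workflow:

check = xr.open_dataset(
    '/results/observations/ONC/CTD/SCVIP/SCVIP_CTD_15m_20151227.nc',
    decode_times=False)
# With decode_times=False the raw units string stays in the time attrs
print(check.time.attrs['units'])  # expect: minutes since 1970-01-01 00:00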
Use the /opt/tomcat/webapps/erddap/WEB-INF/GenerateDatasetsXml.sh script to generate the initial version of an XML fragment for a dataset:
$ cd /opt/tomcat/webapps/erddap/WEB-INF/
$ bash GenerateDatasetsXml.sh EDDTableFromNcFiles /results/observations/ONC/CTD/SCVIP/
The EDDTableFromNcFiles and /results/observations/ONC/CTD/SCVIP/ arguments tell the script which EDDType and what parent directory to use, avoiding having to type those in answer to prompts. Answer the remaining prompts, for example:
File name regex (e.g., ".*\.nc") (default="")
? .*SCVIP_CTD_15m_\d{8}\.nc$
A sample full file name (default="")
? /results/observations/ONC/CTD/SCVIP/SCVIP_CTD_15m_20160724.nc
DimensionsCSV (or "" for default) (default="")
?
ReloadEveryNMinutes (e.g., 10080) (default="")
? 10080
PreExtractRegex (default="")
?
PostExtractRegex (default="")
?
ExtractRegex (default="")
?
Column name for extract (default="")
?
Sorted column source name (default="")
?
Sort files by sourceName (default="")
?
infoUrl (default="")
? https://salishsea-meopar-tools.readthedocs.org/en/latest/results_server/
institution (default="")
? UBC EOAS
summary (default="")
?
title (default="")
? ONC, Strait of Georgia, Central Node, Salinity and Temperature, 15min, v1
The output is written to /results/erddap/logs/GenerateDatasetsXml.out.
The metadata dictionary below contains information for dataset attribute tags whose values need to be changed, or that need to be added for all datasets. The keys are the dataset attribute names. The values are dicts containing a required text item and perhaps an optional after item. The value associated with the text key is the text content for the attribute tag. When present, the value associated with the after key is the name of the dataset attribute after which a new attribute tag containing the text value is to be inserted.
metadata = OrderedDict([
('cdm_data_type', {'text': 'TimeSeries'}),
('cdm_timeseries_variables', {
'text': 'depth, longitude, latitude',
'after': 'cdm_data_type',
}),
('institution_fullname', {
'text': 'Earth, Ocean & Atmospheric Sciences, University of British Columbia',
'after': 'institution',
}),
('license', {
'text': '''The Salish Sea MEOPAR observation datasets are copyright 2013 – present
by the Salish Sea MEOPAR Project Contributors, The University of British Columbia, and Ocean Networks Canada.
They are licensed under the Apache License, Version 2.0. http://www.apache.org/licenses/LICENSE-2.0
Raw instrument data on which this dataset is based were provided by Ocean Networks Canada.''',
}),
('project', {
'text':'Salish Sea MEOPAR NEMO Model',
'after': 'title',
}),
('creator_name', {
'text': 'Salish Sea MEOPAR Project Contributors',
}),
('creator_email', {
'text': 'sallen@eos.ubc.ca',
'after': 'creator_name',
}),
('creator_url', {
'text': 'https://salishsea-meopar-docs.readthedocs.org/',
}),
('acknowledgement', {
'text': 'MEOPAR, ONC, Compute Canada',
'after': 'creator_url',
}),
('drawLandMask', {
'text': 'over',
'after': 'acknowledgement',
}),
])
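For example, the cdm_timeseries_variables entry above has an after item, so the code later in this notebook inserts a new tag immediately after the cdm_data_type one:

<att name="cdm_data_type">TimeSeries</att>
<att name="cdm_timeseries_variables">depth, longitude, latitude</att>

Entries without an after item, like license, instead replace the text of an att tag that GenerateDatasetsXml.sh already produced.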
The datasets dictionary below provides the content for the dataset title and summary attributes. The title attribute content appears in the datasets list table (among other places). The summary attribute content appears (among other places) when a user hovers the cursor over the ? icon beside the title content in the datasets list table. The text that is inserted into the summary attribute tag by code later in this notebook is the title content followed by the summary content, separated by a blank line.
The keys of the datasets dict are the datasetID strings that are used in many places by the ERDDAP server. They are structured as follows:

- ubc to indicate that the dataset was produced at UBC
- ONC to indicate that the dataset is a product of filtering, resampling, etc. of raw instrument data provided by Ocean Networks Canada (ONC)
- SCVIP the ONC station code
- CTD the device category
- 15m the resampling interval
- V1 the dataset version

So: ubcONCSCVIPCTD15mV1 is the version 1 dataset of 15 minute resampled CTD temperature and salinity data from the ONC Strait of Georgia Central node VENUS instrument platform.
The dataset version part of the datasetID is used to indicate changes in the variables contained in the dataset. All datasets start at V1 and their summary ends with a notation about the variables that they contain; e.g.

v1: reference salinity, reference salinity standard deviation, reference salinity sample counts,
temperature, temperature standard deviation, temperature sample counts variables

When a dataset version is incremented a line describing the change is added to the end of its summary.
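A minimal sketch of the naming scheme; the make_dataset_id helper is hypothetical, purely to illustrate how the parts compose:

def make_dataset_id(station_code, device_category, interval, version):
    """Hypothetical helper illustrating the datasetID naming scheme."""
    return 'ubcONC{}{}{}V{}'.format(
        station_code, device_category, interval, version)

make_dataset_id('SCVIP', 'CTD', '15m', 1)  # 'ubcONCSCVIPCTD15mV1'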
datasets = {
'ubcONCSCVIPCTD15mV1' :{
'type': 'resampled CTD',
'title': 'ONC, Strait of Georgia, Central Node, Salinity and Temperature, 15min, v1',
'summary':'''Temperature and salinity data from the Ocean Networks Canada (ONC)
Strait of Georgia Central Node VENUS Instrument Platform CTD.
The data are resampled from the raw instrument data to 15 minute mean values.
They are accompanied by standard deviations and sample counts for each of the 15 minute
aggregation intervals.
v1: reference salinity, reference salinity standard deviation, reference salinity sample counts,
temperature, temperature standard deviation, temperature sample counts variables''',
'keywords': '''15min aggregation, ONC Central Node VENUS Instrument Platform, Ocean Networks Canada,
depth, UBC EOAS, Strait of Georgia, latitude, longitude, ocean, SCVIP, observations, CTD,
Oceans > Ocean Temperature > Water Temperature,
reference salinity, salinity_sample_count, salinity_std_dev, sea_water_reference_salinity,
sea_water_reference_salinity_sample_count, sea_water_reference_salinity_standard_deviation,
sea_water_temperature, sea_water_temperature_sample_count, sea_water_temperature_standard_deviation,
temperature, temperature_sample_count, temperature_std_dev, time''',
'fileNameRegex': '.*SCVIP_CTD_15m_\d{8}\.nc$'
},
'ubcONCSEVIPCTD15mV1' :{
'type': 'resampled CTD',
'title': 'ONC, Strait of Georgia, East Node, Salinity and Temperature, 15min, v1',
'summary':'''Temperature and salinity data from the Ocean Networks Canada (ONC)
Strait of Georgia East Node VENUS Instrument Platform CTD.
The data are resampled from the raw instrument data to 15 minute mean values.
They are accompanied by standard deviations and sample counts for each of the 15 minute
aggregation intervals.
v1: reference salinity, reference salinity standard deviation, reference salinity sample counts,
temperature, temperature standard deviation, temperature sample counts variables''',
'keywords': '''15min aggregation, ONC East Node VENUS Instrument Platform, Ocean Networks Canada,
depth, UBC EOAS, Strait of Georgia, latitude, longitude, ocean, SEVIP, observations, CTD,
Oceans > Ocean Temperature > Water Temperature,
reference salinity, salinity_sample_count, salinity_std_dev, sea_water_reference_salinity,
sea_water_reference_salinity_sample_count, sea_water_reference_salinity_standard_deviation,
sea_water_temperature, sea_water_temperature_sample_count, sea_water_temperature_standard_deviation,
temperature, temperature_sample_count, temperature_std_dev, time''',
'fileNameRegex': '.*SEVIP_CTD_15m_\d{8}\.nc$'
},
}
A few convenience functions to reduce code repetition:
def print_tree(root):
"""Display an XML tree fragment with indentation.
"""
print(etree.tostring(root, pretty_print=True).decode('ascii'))
def find_att(root, att):
"""Return the dataset attribute element named att
or raise a ValueError exception if it cannot be found.
"""
e = root.find('.//att[@name="{}"]'.format(att))
if e is None:
raise ValueError('{} attribute element not found'.format(att))
return e
The code below:

- parses the output of GenerateDatasetsXml.sh into an XML tree data structure
- sets the datasetID dataset attribute value
- sets the recursive dataset attribute value to false
- sets the fileNameRegex dataset attribute value because it loses its \ characters during parsing(?)
- adds a cf_role attribute element with value timeseries_id to the time variable
- applies the metadata dict, and the title, summary, and keywords values from the datasets dict defined above
- sets colour map limits for the salinity, temperature, and depth variables, and deletes the colorBar* attributes from the variables for which they are nonsensical
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('/results/erddap/logs/GenerateDatasetsXml.out', parser)
root = tree.getroot()
datasetID = 'ubcONCSEVIPCTD15mV1'
root.attrib['datasetID'] = datasetID
root.find('.//recursive').text = 'false'
root.find('.//fileNameRegex').text = datasets[datasetID]['fileNameRegex']
vars = [e.text for e in tree.findall('//sourceName')]
vars
['time', 'salinity', 'temperature', 'temperature_std_dev', 'salinity_std_dev', 'salinity_sample_count', 'temperature_sample_count', 'latitude', 'longitude', 'depth']
e = etree.Element('att', name='cf_role')
e.text = 'timeseries_id'
tree.find('//dataVariable[{}]/addAttributes'.format(vars.index('time')+1)).append(e)
for att, info in metadata.items():
    e = etree.Element('att', name=att)
    e.text = info['text']
    try:
        # Entries with an 'after' item are inserted as new att tags
        # after the named attribute tag
        root.find('.//att[@name="{}"]'.format(info['after'])).addnext(e)
    except KeyError:
        # Entries without an 'after' item replace the text of att tags
        # that GenerateDatasetsXml.sh already produced
        find_att(root, att).text = info['text']
title = datasets[datasetID]['title']
find_att(root, 'title').text = title
find_att(root, 'summary').text = '{0}\n\n{1}'.format(title, datasets[datasetID]['summary'])
find_att(root, 'keywords').text = datasets[datasetID]['keywords']
# Salinity colour map limits
e = tree.find(
'//dataVariable[{}]/addAttributes/att[@name="colorBarMinimum"]'
.format(vars.index('salinity')+1))
e.text = '0.0'
e = tree.find(
'//dataVariable[{}]/addAttributes/att[@name="colorBarMaximum"]'
.format(vars.index('salinity')+1))
e.text = '34.0'
# Temperature colour map limits
e = tree.find(
'//dataVariable[{}]/addAttributes/att[@name="colorBarMinimum"]'
.format(vars.index('temperature')+1))
e.text = '4.0'
e = tree.find(
'//dataVariable[{}]/addAttributes/att[@name="colorBarMaximum"]'
.format(vars.index('temperature')+1))
e.text = '20.0'
# Depth colour map limits
e = tree.find(
'//dataVariable[{}]/addAttributes/att[@name="colorBarMinimum"]'
.format(vars.index('depth')+1))
e.text = '0.0'
e = tree.find(
'//dataVariable[{}]/addAttributes/att[@name="colorBarMaximum"]'
.format(vars.index('depth')+1))
e.text = '450.0'
# Delete nonsensical colourBar* attributes
no_cbar_vars = [
'temperature_sample_count', 'temperature_std_dev',
'salinity_sample_count', 'salinity_std_dev']
for var in no_cbar_vars:
for att in ('colorBarMinimum', 'colorBarMaximum'):
e = tree.find(
'//dataVariable[{0}]/addAttributes/att[@name="{1}"]'
.format(vars.index(var)+1, att))
e.getparent().remove(e)
Inspect the resulting dataset XML fragment below and edit the dicts and code cell above until it is what is required for the dataset:
print_tree(root)
<dataset type="EDDTableFromNcFiles" datasetID="ubcONCSEVIPCTD15mV1" active="true"> <reloadEveryNMinutes>10080</reloadEveryNMinutes> <updateEveryNMillis>10000</updateEveryNMillis> <fileDir>/results/observations/ONC/CTD/SEVIP/</fileDir> <recursive>false</recursive> <fileNameRegex>.*SEVIP_CTD_15m_\d{8}\.nc$</fileNameRegex> <metadataFrom>last</metadataFrom> <preExtractRegex/> <postExtractRegex/> <extractRegex/> <columnNameForExtract/> <sortedColumnSourceName>time</sortedColumnSourceName> <sortFilesBySourceNames>time</sortFilesBySourceNames> <fileTableInMemory>false</fileTableInMemory> <accessibleViaFiles>false</accessibleViaFiles> <!-- sourceAttributes> <att name="_NCProperties">version=1|netcdflibversion=4.4.1|hdf5libversion=1.8.17</att> <att name="coordinates">latitude longitude depth</att> <att name="history"> 2016-09-10 15:28:46 Download raw data from ONC scalardata API. 2016-09-10 15:28:46 Filter to exclude data with qaqcFlag != 1. 2016-09-10 15:28:46 Resample data to 15 minute intervals using mean, standard deviation and count as aggregation functions. 2016-09-10 15:28:46 Store as netCDF4 file. </att> <att name="ONC_data_product_url">http://dmas.uvic.ca/DataSearch?location=SEVIP&deviceCategory=CTD</att> <att name="ONC_station">East</att> <att name="ONC_stationCode">SEVIP</att> <att name="ONC_stationDescription">Pacific, Salish Sea, Strait of Georgia, East, Strait of Georgia VENUS Instrument Platform</att> </sourceAttributes --> <!-- Please specify the actual cdm_data_type (TimeSeries?) and related info below, for example... <att name="cdm_timeseries_variables">station, longitude, latitude</att> <att name="subsetVariables">station, longitude, latitude</att> --> <addAttributes> <att name="cdm_data_type">TimeSeries</att> <att name="cdm_timeseries_variables">depth, longitude, latitude</att> <att name="Conventions">COARDS, CF-1.6, ACDD-1.3</att> <att name="creator_name">Salish Sea MEOPAR Project Contributors</att> <att name="creator_email">sallen@eos.ubc.ca</att> <att name="creator_url">https://salishsea-meopar-docs.readthedocs.org/</att> <att name="acknowledgement">MEOPAR, ONC, Compute Canada</att> <att name="drawLandMask">over</att> <att name="infoUrl">https://salishsea-meopar-tools.readthedocs.org/en/latest/results_server/</att> <att name="institution">UBC EOAS</att> <att name="institution_fullname">Earth, Ocean & Atmospheric Sciences, University of British Columbia</att> <att name="keywords">15min aggregation, ONC East Node VENUS Instrument Platform, Ocean Networks Canada, depth, UBC EOAS, Strait of Georgia, latitude, longitude, ocean, SEVIP, observations, CTD, Oceans &gt; Ocean Temperature &gt; Water Temperature, reference salinity, salinity_sample_count, salinity_std_dev, sea_water_reference_salinity, sea_water_reference_salinity_sample_count, sea_water_reference_salinity_standard_deviation, sea_water_temperature, sea_water_temperature_sample_count, sea_water_temperature_standard_deviation, temperature, temperature_sample_count, temperature_std_dev, time</att> <att name="keywords_vocabulary">GCMD Science Keywords</att> <att name="license">The Salish Sea MEOPAR observation datasets are copyright 2013 – present by the Salish Sea MEOPAR Project Contributors, The University of British Columbia, and Ocean Networks Canada. They are licensed under the Apache License, Version 2.0. 
http://www.apache.org/licenses/LICENSE-2.0 Raw instrument data on which this dataset is based were provided by Ocean Networks Canada.</att> <att name="sourceUrl">(local files)</att> <att name="standard_name_vocabulary">CF Standard Name Table v29</att> <att name="summary">ONC, Strait of Georgia, East Node, Salinity and Temperature, 15min, v1 Temperature and salinity data from the Ocean Networks Canada (ONC) Strait of Georgia East Node VENUS Instrument Platform CTD. The data are resampled from the raw instrument data to 15 minute mean values. They are accompanied by standard deviations and sample counts for each of the 15 minute aggregation intervals. v1: reference salinity, reference salinity standard deviation, reference salinity sample counts, temperature, temperature standard deviation, temperature sample counts variables</att> <att name="title">ONC, Strait of Georgia, East Node, Salinity and Temperature, 15min, v1</att> <att name="project">Salish Sea MEOPAR NEMO Model</att> </addAttributes> <dataVariable> <sourceName>time</sourceName> <destinationName>time</destinationName> <dataType>long</dataType> <!-- sourceAttributes> <att name="calendar">proleptic_gregorian</att> <att name="units">minutes since 1970-01-01</att> </sourceAttributes --> <addAttributes> <att name="long_name">Time</att> <att name="standard_name">time</att> <att name="cf_role">timeseries_id</att> </addAttributes> </dataVariable> <dataVariable> <sourceName>salinity</sourceName> <destinationName>salinity</destinationName> <dataType>double</dataType> <!-- sourceAttributes> <att name="aggregation_interval" type="long">900</att> <att name="aggregation_interval_units">seconds</att> <att name="aggregation_operation">mean</att> <att name="ioos_category">Salinity</att> <att name="long_name">reference salinity</att> <att name="standard_name">sea_water_reference_salinity</att> <att name="units">g/kg</att> </sourceAttributes --> <addAttributes> <att name="colorBarMaximum" type="double">34.0</att> <att name="colorBarMinimum" type="double">0.0</att> </addAttributes> </dataVariable> <dataVariable> <sourceName>temperature</sourceName> <destinationName>temperature</destinationName> <dataType>double</dataType> <!-- sourceAttributes> <att name="aggregation_interval" type="long">900</att> <att name="aggregation_interval_units">seconds</att> <att name="aggregation_operation">mean</att> <att name="ioos_category">Temperature</att> <att name="long_name">temperature</att> <att name="standard_name">sea_water_temperature</att> <att name="units">degrees_Celcius</att> </sourceAttributes --> <addAttributes> <att name="colorBarMaximum" type="double">20.0</att> <att name="colorBarMinimum" type="double">4.0</att> </addAttributes> </dataVariable> <dataVariable> <sourceName>temperature_std_dev</sourceName> <destinationName>temperature_std_dev</destinationName> <dataType>double</dataType> <!-- sourceAttributes> <att name="aggregation_interval" type="long">900</att> <att name="aggregation_interval_units">seconds</att> <att name="aggregation_operation">standard deviation</att> <att name="ioos_category">Temperature</att> <att name="long_name">temperature standard deviation</att> <att name="standard_name">sea_water_temperature_standard_deviation</att> <att name="units">degrees_Celcius</att> </sourceAttributes --> <addAttributes/> </dataVariable> <dataVariable> <sourceName>salinity_std_dev</sourceName> <destinationName>salinity_std_dev</destinationName> <dataType>double</dataType> <!-- sourceAttributes> <att name="aggregation_interval" type="long">900</att> 
<att name="aggregation_interval_units">seconds</att> <att name="aggregation_operation">standard deviation</att> <att name="ioos_category">Salinity</att> <att name="long_name">reference salinity standard deviation</att> <att name="standard_name">sea_water_reference_salinity_standard_deviation</att> <att name="units">g/kg</att> </sourceAttributes --> <addAttributes/> </dataVariable> <dataVariable> <sourceName>salinity_sample_count</sourceName> <destinationName>salinity_sample_count</destinationName> <dataType>long</dataType> <!-- sourceAttributes> <att name="aggregation_interval" type="long">900</att> <att name="aggregation_interval_units">seconds</att> <att name="aggregation_operation">count</att> <att name="long_name">reference salinity sample count</att> <att name="standard_name">sea_water_reference_salinity_sample_count</att> </sourceAttributes --> <addAttributes/> </dataVariable> <dataVariable> <sourceName>temperature_sample_count</sourceName> <destinationName>temperature_sample_count</destinationName> <dataType>long</dataType> <!-- sourceAttributes> <att name="aggregation_interval" type="long">900</att> <att name="aggregation_interval_units">seconds</att> <att name="aggregation_operation">count</att> <att name="long_name">temperature sample count</att> <att name="standard_name">sea_water_temperature_sample_count</att> </sourceAttributes --> <addAttributes/> </dataVariable> <dataVariable> <sourceName>latitude</sourceName> <destinationName>latitude</destinationName> <dataType>double</dataType> <!-- sourceAttributes> </sourceAttributes --> <addAttributes> <att name="colorBarMaximum" type="double">90.0</att> <att name="colorBarMinimum" type="double">-90.0</att> <att name="long_name">Latitude</att> <att name="standard_name">latitude</att> <att name="units">degrees_north</att> </addAttributes> </dataVariable> <dataVariable> <sourceName>longitude</sourceName> <destinationName>longitude</destinationName> <dataType>double</dataType> <!-- sourceAttributes> </sourceAttributes --> <addAttributes> <att name="colorBarMaximum" type="double">180.0</att> <att name="colorBarMinimum" type="double">-180.0</att> <att name="long_name">Longitude</att> <att name="standard_name">longitude</att> <att name="units">degrees_east</att> </addAttributes> </dataVariable> <dataVariable> <sourceName>depth</sourceName> <destinationName>depth</destinationName> <dataType>long</dataType> <!-- sourceAttributes> </sourceAttributes --> <addAttributes> <att name="colorBarMaximum" type="double">450.0</att> <att name="colorBarMinimum" type="double">0.0</att> <att name="colorBarPalette">OceanDepth</att> <att name="long_name">Depth</att> <att name="standard_name">depth</att> <att name="units">m</att> </addAttributes> </dataVariable> </dataset>
Store the XML fragment for the dataset:
with open('/results/erddap_datasets_xml/{}.xml'.format(datasetID), 'wb') as f:
f.write(etree.tostring(root, pretty_print=True))
Edit /opt/tomcat/content/erddap/datasets.xml to include the XML fragment for the dataset that was stored by the above cell.
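The fragment belongs inside the top-level erddapDatasets element, alongside the other dataset elements; schematically (a sketch of the file layout, not its full content):

<erddapDatasets>
  ...
  <dataset type="EDDTableFromNcFiles" datasetID="ubcONCSEVIPCTD15mV1" active="true">
    ...
  </dataset>
</erddapDatasets>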
Create a flag file to signal the ERDDAP server process to load the dataset:
$ cd /results/erddap/flag/
$ touch <datasetID>
Confirm that the dataset and its metadata were correctly added to ERDDAP by inspecting https://salishsea.eos.ubc.ca/erddap/tabledap/. If there is a problem, error messages can be found in /results/erddap/logs/log.txt.