Exploring netCDF Datasets from ERDDAP Servers

This notebook provides discussion, examples, and best practices for working with netCDF datasets from ERDDAP servers in Python. Topics include:

  • ERDDAP servers
  • The netcdf4-python library
  • The salishsea_tools.nc_tools code module
  • Reading netCDF datasets from an ERDDAP server into Python data structures
  • Exploring netCDF dataset dimensions, variables, and attributes
  • Working with netCDF variable data as NumPy arrays

This notebook is a companion to the Exploring netCDF Files notebook. That notebook focuses on reading data from netCDF files stored on your computer.

This notebook is about reading data from netCDF datasets stored on ERDDAP servers on the Internet. The Salish Sea project maintains an ERDDAP server at https://salishsea.eos.ubc.ca/erddap/ that we will use here as our data source.

ERDDAP is a web application tool created by NOAA's Environmental Research Division (ERD). To users it is a data server that gives you a simple, consistent way to download subsets of scientific datasets in common file formats like netCDF. ERDDAP servers provide access to data both via web pages for humans, and RESTful web services for computer programs.

Creating netCDF files and working with their attribute metadata is documented elsewhere: http://salishsea-meopar-docs.readthedocs.org/en/latest/code-notes/salishsea-nemo/nemo-forcing/netcdf4.html

The netcdf4-python library does all of the heavy lifting to let us work with netCDF datasets. Follow the link to get to the library documentation. The salishsea_tools.nc_tools code module provides some shortcut functions for exploring netCDF datasets. Let's go ahead and import those two packages. We'll also import numpy because we're going to use it later and it's good Python form to keep all of our imports at the top of the file.

This notebook assumes that you are working in Python 3. If you don't have a Python 3 environment set up, please see our Anaconda Python Distribution docs for instructions on how to set one up.

In [1]:
import netCDF4 as nc
import numpy as np

from salishsea_tools import nc_tools

Note that:

  • By convention, we alias netCDF4 to nc and numpy to np so that we don't have to type as much
  • For the same reason we use the from ... import ... form to get nc_tools so that we can avoid typing salishsea_tools.nc_tools everywhere

netCDF4 provides a Dataset object that allows us to load a netCDF dataset into a Python data structure by simply passing in the ERDDAP server URL and dataset ID. Let's explore the Salish Sea NEMO model bathymetry data:

In [2]:
grid = nc.Dataset('https://salishsea.eos.ubc.ca/erddap/griddap/ubcSSnBathymetry2V1')

How do we know what URL to use?

  • https://salishsea.eos.ubc.ca/erddap/ is the domain and the conventional name of an ERDDAP server.
  • griddap means that we want a gridded dataset (e.g. model results) in contrast to tabular data (e.g. observations)
  • ubcSSnBathymetry2V1 is the dataset ID that uniquely identifies the dataset on the server

You can see page that shows all of the datasets on an ERDDAP server by browsing to a URL like https://salishsea.eos.ubc.ca/erddap/info/.

netCDF datasets are organized around 4 big concepts:

  • groups
  • dimensions
  • variables
  • attributes

NEMO doesn't use netCDF groups, so we'll ignore them.

nc_tools provides useful (convenience) functions to look at the other 3.

In [4]:
nc_tools.show_dimensions(grid)
<class 'netCDF4._netCDF4.Dimension'>: name = 'gridX', size = 398

<class 'netCDF4._netCDF4.Dimension'>: name = 'gridY', size = 898

In [5]:
nc_tools.show_variables(grid)
odict_keys(['gridY', 'gridX', 'longitude', 'latitude', 'bathymetry'])

So, we have a dataset that has 2 dimensions called gridX and gridY of size 398 and 898, respectively, and 5 variables called gridY, gridX, longitude, latitude, and bathymetry. We'll see how the dimensions and variables are related, and how to work with the data in the variables in a moment, but first, let's look at the dataset attributes:

In [6]:
nc_tools.show_dataset_attrs(grid)
file format: NETCDF3_CLASSIC
acknowledgement: MEOPAR, ONC, Compute Canada
cdm_data_type: Grid
comment: Bathymetry, Latitudes and Longitudes
Conventions: CF-1.6, COARDS, ACDD-1.3
coverage_content_type: modelResult
creator_email: [email protected]
creator_name: Salish Sea MEOPAR Project Contributors
creator_url: https://salishsea-meopar-docs.readthedocs.org/
drawLandMask: over
history: [2016-02-05 16:35:19] Created dataset.
[2016-03-02 18:08:56] Changed all variables to zlib=True.
[2016-03-02 18:08:56] Added least_significant_digit=1 and fill_value=0 to bathymetry variable.
2016-03-03T20:34:57Z (local files)
2016-03-03T20:34:57Z http://skookum.eos.ubc.ca:8080/erddap/griddap/ubcSSnBathymetry2V1.das
infoUrl: https://salishsea-meopar-tools.readthedocs.org/en/latest/results_server/index.html#salish-sea-model-results
institution: UBC EOAS
institution_fullname: Earth, Ocean & Atmospheric Sciences, University of British Columbia
keywords: bathymetry, bottom, data, model results, depth, floor, latitude, longitude, nemo, ocean, oceans,
Oceans > Bathymetry/Seafloor Topography > Bathymetry, salishsea, sea, sea_floor_depth, seafloor, topography
keywords_vocabulary: GCMD Science Keywords
license: The Salish Sea MEOPAR NEMO model results are copyright 2013-2021
by the Salish Sea MEOPAR Project Contributors and The University of British Columbia.

They are licensed under the Apache License, Version 2.0. http://www.apache.org/licenses/LICENSE-2.0
project: Salish Sea MEOPAR NEMO Model
references: https://bitbucket.org/salishsea/nemo-forcing/src/tipgrid/mesh_mask_SalishSea2.nc
source: https://github.com/SalishSeaCast/tools/blob/master/bathymetry/NEMOBathymetryfromMeshMask.ipynb
sourceUrl: (local files)
standard_name_vocabulary: CF Standard Name Table v29
summary: Salish Sea NEMO Model Grid, Geo-location and Bathymetry, v1

Longitude, latitude, and bathymetry of the Salish Sea NEMO model grid.
The bathymetry values are those calculated by NEMO from the input bathymetry file.
NEMO modifies the input bathymetry to remove isolated holes, and too-small partial steps.
The model grid includes the Juan de Fuca Strait, the Strait of Georgia, Puget Sound,
and Johnstone Strait on the coasts of Washington State and British Columbia.

v1: longitude, latitude and bathymetry variables
title: Salish Sea NEMO Model Grid, Geo-location and Bathymetry, v1

netCDF attributes are metadata. In the cast of the dataset attributes they tell us about the dataset as a whole: how, when, and by whom it was created, how it has been modified, etc. The meanings of the various attributes and the conventions for them that we use in the Salish Sea MEOPAR project are documented elsewhere. Variables also have attributes and nc_tools provides a function to display them too:

In [7]:
nc_tools.show_variable_attrs(grid, 'longitude')
<class 'netCDF4._netCDF4.Variable'>
float32 longitude(gridY, gridX)
    _ChunkSize: [898 398]
    colorBarMaximum: 180.0
    colorBarMinimum: -180.0
    long_name: Longitude
    standard_name: longitude
    units: degrees_east
unlimited dimensions: 
current shape = (898, 398)
filling off

This tells us a whole lot of useful information about the longitude data values in our bathymetry dataset, for instance:

  • They are 32-bit floating point values
  • They are associated with the gridY and gridX dimensions, in that order
  • The units are degrees measured eastward (from the Greenwich meridian)
  • etc.

You can list as many variable names as you want in the show_variable_attrs() call to get information about several variables at once. If you don't provide any variables names, you get the attributes of all of the variables in the dataset:

In [8]:
nc_tools.show_variable_attrs(grid)
<class 'netCDF4._netCDF4.Variable'>
int32 gridY(gridY)
    actual_range: [  0 897]
    long_name: Y
    units: count
unlimited dimensions: 
current shape = (898,)
filling off

<class 'netCDF4._netCDF4.Variable'>
int16 gridX(gridX)
    actual_range: [  0 397]
    long_name: X
    units: count
unlimited dimensions: 
current shape = (398,)
filling off

<class 'netCDF4._netCDF4.Variable'>
float32 longitude(gridY, gridX)
    _ChunkSize: [898 398]
    colorBarMaximum: 180.0
    colorBarMinimum: -180.0
    long_name: Longitude
    standard_name: longitude
    units: degrees_east
unlimited dimensions: 
current shape = (898, 398)
filling off

<class 'netCDF4._netCDF4.Variable'>
float32 latitude(gridY, gridX)
    _ChunkSize: [898 398]
    colorBarMaximum: 90.0
    colorBarMinimum: -90.0
    long_name: Latitude
    standard_name: latitude
    units: degrees_north
unlimited dimensions: 
current shape = (898, 398)
filling off

<class 'netCDF4._netCDF4.Variable'>
float32 bathymetry(gridY, gridX)
    _ChunkSize: [898 398]
    _FillValue: 0.0
    colorBarMaximum: 450.0
    colorBarMinimum: 0.0
    colorBarPalette: OceanDepth
    grid: Salish Sea 2
    least_significant_digit: 1.0
    long_name: Depth of Bottom
    standard_name: sea_floor_depth
    units: m
unlimited dimensions: 
current shape = (898, 398)
filling off

Before we can go further exploring and working with the variables we need to associate them with Python variables names. We do that by accessing them by name in the variables attribute of our Dataset object. variables is a Python dict. We can use any Python variable names we like, so let's shorten them (being careful not to sacrifice readability for ease of typing):

In [9]:
lons = grid.variables['longitude']
lats = grid.variables['latitude']
bathy = grid.variables['bathymetry']

Having done that, we can now access the attributes of our variables using dotted notation:

In [10]:
bathy.units, bathy.long_name
Out[10]:
('m', 'Depth of Bottom')

Our variables are instances of the netCDF.Variable object. In addition to their attributes, they carry a bunch of other useful properties and methods that you can read about in the netCDF4-python docs. Perhaps more importantly the data associated with the variables are stored as NumPy arrays. So, we can use NumPy indexing and slicing to access the data values. For instance, to get the latitudes and longitudes of the 4 corners of the domain:

In [11]:
lats.shape
Out[11]:
(898, 398)
In [12]:
print('Latitudes and longitudes of domain corners:')
pt = (0, 0)
print('  0, 0:        ', lats[pt], lons[pt])
pt = (0, lats.shape[1] - 1)
print('  0, x-max:    ', lats[pt], lons[pt])
pt = (lats.shape[0] - 1, 0)
print('  y-max, 0:    ', lats[pt], lons[pt])
pt = (lats.shape[0] - 1, lats.shape[1] - 1)
print('  y-max, x-max:', lats[pt], lons[pt])
Latitudes and longitudes of domain corners:
  0, 0:         46.8597 -123.429
  0, x-max:     47.6009 -121.318
  y-max, 0:     50.3899 -126.4
  y-max, x-max: 51.1048 -124.342

You can also access the entire variable data array, or subsets of it using slicing. The [:] slice notation is a convenient shorthand that means "the entire array".

In [13]:
lats[:]
Out[13]:
array([[ 46.85966492,  46.86154556,  46.86342621, ...,  47.59721375,
         47.59906769,  47.60092163],
       [ 46.86278915,  46.86481476,  46.86677933, ...,  47.60125732,
         47.60311127,  47.60496521],
       [ 46.86606979,  46.86814499,  46.87015915, ...,  47.60529709,
         47.60715485,  47.60900879],
       ..., 
       [ 50.38191605,  50.38397598,  50.38602448, ...,  51.09400177,
         51.09560776,  51.09720612],
       [ 50.38591766,  50.38798523,  50.39004135, ...,  51.09781265,
         51.0994072 ,  51.10100174],
       [ 50.38992691,  50.39200592,  50.39406967, ...,  51.10162354,
         51.10321808,  51.10480118]], dtype=float32)
In [14]:
lons[42:45, 128:135]
Out[14]:
array([[-122.884552  , -122.87927246, -122.87399292, -122.86871338,
        -122.86342621, -122.85814667, -122.85286713],
       [-122.88778687, -122.88250732, -122.87722778, -122.87194824,
        -122.8666687 , -122.86138916, -122.85610962],
       [-122.89102936, -122.88574982, -122.88047028, -122.87519073,
        -122.86991119, -122.86463165, -122.85934448]], dtype=float32)
In [15]:
lons[:2, :2], lats[-2:, -2:]
Out[15]:
(array([[-123.42943573, -123.42411804],
        [-123.43196869, -123.42677307]], dtype=float32),
 array([[ 51.0994072 ,  51.10100174],
        [ 51.10321808,  51.10480118]], dtype=float32))

Note that the zero and maximum dimension values may be omitted for slices that extend to the ends of array dimensions.

In some cases, like our bathymetry depths, the netCDF variable has a _FillingValue attribute value that is equal to values in the variable data. In that case the data are represented by a NumPy Masked Array with the mask applied there the data values equal the _FillingValue:

In [16]:
bathy[:]
Out[16]:
masked_array(data =
 [[-- -- -- ..., -- -- --]
 [-- -- -- ..., -- -- --]
 [-- -- -- ..., -- -- --]
 ..., 
 [-- -- -- ..., -- -- --]
 [-- -- -- ..., -- -- --]
 [-- -- -- ..., -- -- --]],
             mask =
 [[ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]
 ..., 
 [ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]],
       fill_value = 0.0)

You can test to see if a variables data is masked like this:

In [17]:
np.ma.is_masked(bathy[:])
Out[17]:
True

Masked arrays are useful because require less storage than a comparable size fully populated array. Also, when masked arrays are plotted the maked values are all plotted in the same colour (white by default). We'll see in other example notebooks how this allows us to very easily plot our bathymetry in a meaningfully way, and use it, or other values to mask velocity component, salinity, etc. results so that they show values only in the water areas of the domain.