Exploring netCDF Files

This notebook provides discussion, examples, and best practices for working with netCDF files in Python. Topics include:

  • The netcdf4-python library
  • The salishsea_tools.nc_tools code module
  • Reading netCDF files into Python data structures
  • Exploring netCDF dataset dimensions, variables, and attributes
  • Working with netCDF variable data as NumPy arrays

This notebook is a companion to the Exploring netCDF Datasets from ERDDAP notebook. That notebooks focusses on reading data from netCDF datasets stored on ERDDAP servers on the Internet. The Salish Sea project maintains an ERDDAP server at https://salishsea.eos.ubc.ca/erddap/.

Creating netCDF files and working with their attribute metadata is documented elsewhere: http://salishsea-meopar-docs.readthedocs.org/en/latest/code-notes/salishsea-nemo/nemo-forcing/netcdf4.html

The netcdf4-python library does all of the heavy lifting to let us work with netCDF files and their data. Follow the link to get to the library documentation. The salishsea_tools.nc_tools code module provides some shortcut functions for exploring netCDF datasets. Let's go ahead and import those two packages, We'll also import numpy because we're going to use it later and it's good Python form to keep all of our imports at the top of the file.

This notebook assumes that you are working in Python 3. If you don't have a Python 3 environment set up, please see our Anaconda Python Distribution docs for instructions on how to set one up.

In [1]:
import netCDF4 as nc
import numpy as np

from salishsea_tools import nc_tools

Note that:

  • By convention, we alias netCDF4 to nc and numpy to np so that we don't have to type as much
  • For the same reason we use the from ... import ... form to get nc_tools so that we can avoid typing salishsea_tools.nc_tools everywhere

netCDF provides a Dataset object that allows us to load the contents of a netCDF file into a Python data structure by simply passing in the path and file name. Let's explore the Salish Sea NEMO model bathymetry data:

In [2]:
grid = nc.Dataset('../../NEMO-forcing/grid/bathy_meter_SalishSea2.nc')

Note:

netCDF4.Dataset can also access datasets stored on ERDDAP servers like https://salishsea.eos.ubc.ca/erddap/. If you don't have local access to the tools/NEMO-forcing/grid/bathy_meter_SalishSea2.nc, please see the Exploring netCDF Datasets from ERDDAP notebook.

netCDF files are organized around 4 big concepts:

  • groups
  • dimensions
  • variables
  • attributes

NEMO doesn't use netCDF groups, so we'll ignore them.

nc_tools provides useful (convenience) functions to look at the other 3.

In [3]:
nc_tools.show_dimensions(grid)
<class 'netCDF4._netCDF4.Dimension'>: name = 'y', size = 898

<class 'netCDF4._netCDF4.Dimension'>: name = 'x', size = 398

In [4]:
nc_tools.show_variables(grid)
odict_keys(['nav_lon', 'nav_lat', 'Bathymetry'])

So, we have a dataset that has 2 dimensions called y and x of size 898 and 398, respectively, and 3 variables called nav_lon, nav_lat, and Bathymetry. We'll see how the dimensions and variables are related, and how to work with the data in the variables in a moment, but first, let's look at the dataset attributes:

In [5]:
nc_tools.show_dataset_attrs(grid)
file format: NETCDF4
Conventions: CF-1.6
title: Salish Sea NEMO Bathymetry
institution: Dept of Earth, Ocean & Atmospheric Sciences, University of British Columbia
references: https://bitbucket.org/salishsea/nemo-forcing/src/tip/grid/bathy_meter_SalishSea.nc
comment: Based on 1_bathymetry_seagrid_WestCoast.nc file from 2-Oct-2013 WCSD_PREP tarball provided by J-P Paquin.
source: 
https://github.com/SalishSeaCast/tools/blob/master/bathymetry/SalishSeaBathy.ipynb
https://github.com/SalishSeaCast/tools/blob/master/bathymetry/SmoothMouthJdF.ipynb

history: 
[2013-10-30 13:18] Created netCDF4 zlib=True dataset.
[2013-10-30 15:22] Set depths between 0 and 4m to 4m and those >428m to 428m.
[2013-10-31 17:10] Algorithmic smoothing.
[2013-11-21 19:53] Reverted to pre-smothing dataset (repo rev 3b301b5b9b6d).
[2013-11-21 20:14] Updated dataset and variable attributes to CF-1.6 conventions & project standards.
[2013-11-21 20:47] Removed east end of Jervis Inlet and Toba Inlet region due to deficient source bathymetry data in Cascadia dataset.
[2013-11-21 21:52] Algorithmic smoothing.
[2014-01-01 14:44] Smoothed mouth of Juan de Fuca

netCDF attributes are metadata. In the cast of the dataset attributes they tell us about the dataset as a whole: how, when, and by whom it was created, how it has been modified, etc. The meanings of the various attributes and the conventions for them that we use in the Salish Sea MEOPAR project are documented elsewhere. Variables also have attributes and nc_tools provides a function to display them too:

In [6]:
nc_tools.show_variable_attrs(grid, 'nav_lon')
<class 'netCDF4._netCDF4.Variable'>
float64 nav_lon(y, x)
    units: degrees east
    valid_range: [-126.40029144 -121.31835175]
    long_name: Longitude
unlimited dimensions: 
current shape = (898, 398)
filling on, default _FillValue of 9.969209968386869e+36 used

This tells us a whole lot of useful information about the longitude data values in our bathymetry dataset, for instance:

  • They are 64-bit floating point values
  • They are associated with the y and x dimensions, in that order
  • The units are degrees measured eastward (presumably from the Greenwich meridian)
  • etc.

You can list as many variable names as you want in the show_variable_attrs() call to get information about several variables at once. If you don't provide any variables names, you get the attributes of all of the variables in the dataset:

In [7]:
nc_tools.show_variable_attrs(grid)
<class 'netCDF4._netCDF4.Variable'>
float64 nav_lon(y, x)
    units: degrees east
    valid_range: [-126.40029144 -121.31835175]
    long_name: Longitude
unlimited dimensions: 
current shape = (898, 398)
filling on, default _FillValue of 9.969209968386869e+36 used

<class 'netCDF4._netCDF4.Variable'>
float64 nav_lat(y, x)
    units: degrees north
    valid_range: [ 46.85966492  51.10480118]
    long_name: Latitude
unlimited dimensions: 
current shape = (898, 398)
filling on, default _FillValue of 9.969209968386869e+36 used

<class 'netCDF4._netCDF4.Variable'>
float64 Bathymetry(y, x)
    _FillValue: 0.0
    least_significant_digit: 1
    units: m
    valid_range: [   0.  428.]
    long_name: Depth
    positive: down
unlimited dimensions: 
current shape = (898, 398)
filling on

Before we can go further exploring and working with the variables we need to associate them with Python variables names. We do that by accessing them by name in the variables attribute of our Dataset object. variables is a Python dict. We can use any Python variable names we like, so let's shorten them (being careful not to sacrifice readability for ease of typing):

In [8]:
lons = grid.variables['nav_lon']
lats = grid.variables['nav_lat']
bathy = grid.variables['Bathymetry']

Having done that, we can now access the attributes of our variables using dotted notation:

In [9]:
bathy.units, bathy.valid_range
Out[9]:
('m', array([   0.,  428.]))

Our variables are instances of the netCDF.Variable object. In addition to their attributes, they carry a bunch of other useful properties and methods that you can read about in the netCDF4-python docs. Perhaps more importantly the data associated with the variables are stored as NumPy arrays. So, we can use NumPy indexing and slicing to access the data values. For instance, to get the latitudes and longitudes of the 4 corners of the domain:

In [10]:
lats.shape
Out[10]:
(898, 398)
In [11]:
print('Latitudes and longitudes of domain corners:')
pt = (0, 0)
print('  0, 0:        ', lats[pt], lons[pt])
pt = (0, lats.shape[1] - 1)
print('  0, x-max:    ', lats[pt], lons[pt])
pt = (lats.shape[0] - 1, 0)
print('  y-max, 0:    ', lats[pt], lons[pt])
pt = (lats.shape[0] - 1, lats.shape[1] - 1)
print('  y-max, x-max:', lats[pt], lons[pt])
Latitudes and longitudes of domain corners:
  0, 0:         46.859664917 -123.42943573
  0, x-max:     47.6009216309 -121.318351746
  y-max, 0:     50.3899269104 -126.400291443
  y-max, x-max: 51.104801178 -124.34198761

You can also access the entire variable data array, or subsets of it using slicing. The [:] slice notation is a convenient shorthand that means "the entire array".

In [12]:
lats[:]
Out[12]:
array([[ 46.85966492,  46.86154556,  46.86342621, ...,  47.59721375,
         47.59906769,  47.60092163],
       [ 46.86278915,  46.86481476,  46.86677933, ...,  47.60125732,
         47.60311127,  47.60496521],
       [ 46.86606979,  46.86814499,  46.87015915, ...,  47.60529709,
         47.60715485,  47.60900879],
       ..., 
       [ 50.38191605,  50.38397598,  50.38602448, ...,  51.09400177,
         51.09560776,  51.09720612],
       [ 50.38591766,  50.38798523,  50.39004135, ...,  51.09781265,
         51.0994072 ,  51.10100174],
       [ 50.38992691,  50.39200592,  50.39406967, ...,  51.10162354,
         51.10321808,  51.10480118]])
In [13]:
lons[42:45, 128:135]
Out[13]:
array([[-122.884552  , -122.87927246, -122.87399292, -122.86871338,
        -122.86342621, -122.85814667, -122.85286713],
       [-122.88778687, -122.88250732, -122.87722778, -122.87194824,
        -122.8666687 , -122.86138916, -122.85610962],
       [-122.89102936, -122.88574982, -122.88047028, -122.87519073,
        -122.86991119, -122.86463165, -122.85934448]])
In [14]:
lons[:2, :2], lats[-2:, -2:]
Out[14]:
(array([[-123.42943573, -123.42411804],
        [-123.43196869, -123.42677307]]), array([[ 51.0994072 ,  51.10100174],
        [ 51.10321808,  51.10480118]]))

Note that the zero and maximum dimension values may be omitted for slices that extend to the ends of array dimensions.

In some cases, like our bathymetry depths, the netCDF variable has a _FillingValue attribute value that is equal to values in the variable data. In that case the data are represented by a NumPy Masked Array with the mask applied there the data values equal the _FillingValue:

In [15]:
bathy[:]
Out[15]:
masked_array(data =
 [[-- -- -- ..., -- -- --]
 [-- -- -- ..., -- -- --]
 [-- -- -- ..., -- -- --]
 ..., 
 [-- -- -- ..., -- -- --]
 [-- -- -- ..., -- -- --]
 [-- -- -- ..., -- -- --]],
             mask =
 [[ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]
 ..., 
 [ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]
 [ True  True  True ...,  True  True  True]],
       fill_value = 0.0)

You can test to see if a variables data is masked like this:

In [16]:
np.ma.is_masked(bathy[:])
Out[16]:
True

Masked arrays are useful because require less storage than a comparable size fully populated array. Also, when masked arrays are plotted the maked values are all plotted in the same colour (white by default). We'll see in other example notebooks how this allows us to very easily plot our bathymetry in a meaningfully way, and use it, or other values to mask velocity component, salinity, etc. results so that they show values only in the water areas of the domain.