Exploring the COSIMA Cookbook

Statement of problem

COSIMA produces a large amount of data, and we need to be able to find it in order to analyse it. The data is stored in multiple locations. One of these is the outputs directory in the ik11 project, which contains a subdirectory for each model resolution; each of these in turn contains a subdirectory for each model configuration.

In [1]:
!ls /g/data/ik11/outputs
access-om2  access-om2-01  access-om2-025  README
In [2]:
!ls /g/data/ik11/outputs/access-om2-01
01deg_jra55v13_ryf9091		    01deg_jra55v13_ryf9091_tides_control
01deg_jra55v13_ryf9091_5Kv	    01deg_jra55v13_ryf9091_tides_fixed
01deg_jra55v13_ryf9091_k_smag_iso3  01deg_jra55v140_iaf
01deg_jra55v13_ryf9091_OFAM3visc    01deg_jra55v140_iaf_cycle2
01deg_jra55v13_ryf9091_qian_wp

All the data is contained in netCDF files, of which there are many: at the time of writing, there are 48118 netCDF files in the above directories.
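
A count like this can be reproduced with a short directory walk. The sketch below is a minimal illustration, assuming netCDF files can be recognised by a `.nc` suffix; the helper name `count_netcdf_files` is hypothetical and is not part of the Cookbook.

```python
from pathlib import Path


def count_netcdf_files(root):
    """Count files ending in .nc anywhere below root.

    Assumes the .nc suffix is enough to identify a netCDF file.
    """
    return sum(1 for path in Path(root).rglob("*.nc") if path.is_file())


if __name__ == "__main__":
    # Path taken from the listing above; it only exists on NCI systems,
    # so the count is skipped elsewhere.
    outputs = Path("/g/data/ik11/outputs")
    if outputs.is_dir():
        print(count_netcdf_files(outputs))
```

On a large file system this walk can take a long time, which is part of the motivation for indexing the files once in a database.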

GOAL: access data by specifying an experiment and a variable

COSIMA Cookbook solution

To achieve this goal, the COSIMA Cookbook provides tools that search directories for netCDF data files, read metadata from the files about the data they contain, and save this metadata to an SQL database.

The Cookbook also provides an API to query the database and retrieve data by experiment and variable name.
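
The idea of indexing metadata once and then querying it by experiment and variable can be illustrated with a toy index. The sketch below is not the Cookbook's implementation or schema: the table layout, the `build_index` and `find_files` helpers, and the hard-coded records are all assumptions made for illustration (the real tools read the metadata from the netCDF files themselves).

```python
import sqlite3

# Hypothetical metadata records: (experiment, ncfile, variable).
# In practice these would be read from the netCDF files.
RECORDS = [
    ("expt_a", "output000/file_u.nc", "u"),
    ("expt_a", "output001/file_u.nc", "u"),
    ("expt_a", "output000/file_v.nc", "v"),
    ("expt_b", "output000/file_u.nc", "u"),
]


def build_index(records):
    """Create an in-memory SQLite index with one row per (file, variable)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ncvars (experiment TEXT, ncfile TEXT, variable TEXT)"
    )
    conn.executemany("INSERT INTO ncvars VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def find_files(conn, experiment, variable):
    """Return the files that hold a variable for one experiment, sorted by name."""
    rows = conn.execute(
        "SELECT ncfile FROM ncvars "
        "WHERE experiment = ? AND variable = ? ORDER BY ncfile",
        (experiment, variable),
    )
    return [row[0] for row in rows]


conn = build_index(RECORDS)
print(find_files(conn, "expt_a", "u"))
```

Once the index exists, finding the files for a variable is a single query rather than a scan of the file system.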

In [3]:
import cosima_cookbook as cc
In [4]:
session = cc.database.create_session()
In [5]:
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='u', session=session, n=1)
Out[5]:
<xarray.DataArray 'u' (time: 3, st_ocean: 75, yu_ocean: 2700, xu_ocean: 3600)>
dask.array<open_dataset-ae10db66d09bd8fd874cff78fcc65433u, shape=(3, 75, 2700, 3600), dtype=float32, chunksize=(1, 19, 135, 180), chunktype=numpy.ndarray>
Coordinates:
  * xu_ocean  (xu_ocean) float64 -279.9 -279.8 -279.7 -279.6 ... 79.8 79.9 80.0
  * yu_ocean  (yu_ocean) float64 -81.09 -81.05 -81.0 -80.96 ... 89.92 89.96 90.0
  * st_ocean  (st_ocean) float64 0.5413 1.681 2.94 ... 5.511e+03 5.709e+03
  * time      (time) object 1958-01-16 12:00:00 ... 1958-03-16 12:00:00
Attributes:
    long_name:      i-current
    units:          m/sec
    valid_range:    [-10.  10.]
    cell_methods:   time: mean
    time_avg_info:  average_T1,average_T2,average_DT
    coordinates:    geolon_c geolat_c
    standard_name:  sea_water_x_velocity

The question then becomes: how do I find out which experiment to use, and which variables are available? Currently the API provides get_experiments, which returns a list of experiments, and get_variables, which returns a list of variables for a given experiment.

In [6]:
cc.querying.get_experiments(session, all=True)
Out[6]:
experiment contact email created description notes root_dir ncfiles
0 01deg_jra55v13_ryf9091_OFAM3visc Andrew Kiss [email protected] 2020-03-29 00:00:00 0.1 degree ACCESS-OM2 global model configurati... None /g/data/ik11/outputs/access-om2-01/01deg_jra55... 50
1 01deg_jra55v13_ryf9091_tides_fixed Adele Morrison [email protected] 2020-06-11 00:00:00 0.1 degree ACCESS-OM2 global model configurati... Mostly 1 month run lengths, but a couple of mo... /g/data/ik11/outputs/access-om2-01/01deg_jra55... 1823
2 01deg_jra55v13_ryf9091_k_smag_iso3 Andrew Kiss [email protected] 2020-03-29 00:00:00 0.1 degree ACCESS-OM2 global model configurati... None /g/data/ik11/outputs/access-om2-01/01deg_jra55... 128
3 01deg_jra55v13_ryf9091_5Kv Ryan Holmes [email protected] 2020-03-01 00:00:00 As for 01deg_jra55v13_ryf9091 except with a ba... None /g/data/ik11/outputs/access-om2-01/01deg_jra55... 102
4 1deg_jra55v131_ryf_nonuniform_albedo Andrew Kiss [email protected] 2020-03-24 00:00:00 1 degree ACCESS-OM2 global model configuration... None /g/data/ik11/outputs/access-om2/1deg_jra55v131... 260
5 01deg_jra55v13_ryf9091_tides_control None None NaT None None /g/data/ik11/outputs/access-om2-01/01deg_jra55... 668
6 1deg_jra55v131_ryf_const_albedo Andrew Kiss [email protected] 2020-03-24 00:00:00 1 degree ACCESS-OM2 global model configuration... None /g/data/ik11/outputs/access-om2/1deg_jra55v131... 260
7 01deg_jra55v13_ryf9091_tides None None NaT None None /g/data/ik11/outputs/access-om2-01/01deg_jra55... 2578
8 025deg_jra55_ryf9091_gadi_noGM Ryan Holmes [email protected] 2020-04-01 00:00:00 0.25 degree ACCESS-OM2 global model configurat... None /g/data/ik11/outputs/access-om2-025/025deg_jra... 316
9 1deg_jra55_iaf_v2.0.0rc3_nonuniform_albedo Andrew Kiss [email protected] 2020-05-30 00:00:00 1 degree ACCESS-OM2 global model configuration... None /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... 4660
10 025deg_jra55_ryf9091_gadi_norediGM Ryan Holmes [email protected] 2020-04-01 00:00:00 0.25 degree ACCESS-OM2 global model configurat... None /g/data/ik11/outputs/access-om2-025/025deg_jra... 312
11 01deg_jra55v13_ryf9091 Andy Hogg [email protected] 2020-06-11 00:00:00 0.1 degree ACCESS-OM2 global model configurati... Additional daily outputs saved from 1 Jan 1950... /g/data/ik11/outputs/access-om2-01/01deg_jra55... 11491
12 1deg_jra55_ryf9091_gadi Ryan Holmes [email protected] 2020-02-01 00:00:00 1 degree ACCESS-OM2 global model configuration... None /g/data/ik11/outputs/access-om2/1deg_jra55_ryf... 10266
13 025deg_jra55_ryf9091_gadi Ryan Holmes [email protected] 2020-02-01 00:00:00 0.25 degree ACCESS-OM2 global model configurat... None /g/data/ik11/outputs/access-om2-025/025deg_jra... 8840
14 1deg_jra55_iaf_v2.0.0rc3 Andrew Kiss [email protected] 2020-05-30 00:00:00 1 degree ACCESS-OM2 global model configuration... None /g/data/ik11/outputs/access-om2/1deg_jra55_iaf... 4660
15 01deg_jra55v13_ryf9091_qian_wp Qian Li [email protected] 2020-03-13 00:00:00 Wind perturbation experiment None /g/data/ik11/outputs/access-om2-01/01deg_jra55... 36
16 MRI-JRA55-do-1-4-0 Hiroyuki Tsujino [email protected] 2019-03-08 08:53:09 MRI JRA55-do 1.4.0 dataset prepared for input4... Based on JRA-55 reanalysis (1958-01 to 2019-01... /g/data/ik11/inputs/JRA-55/MRI-JRA55-do/MRI-JR... 682
17 JRA55-RYF-1-4 Kial Stewart [email protected] 2020-04-17 00:00:00 This dataset is derived from JRA55-do (JRA-55 ... Further information on source dataset availabl... /g/data/ik11/inputs/JRA-55/RYF/indexing/JRA55-... 10
18 1deg_jra55v13_iaf_spinup1_B1 None None NaT None None /g/data/hh5/tmp/cosima/access-om2/1deg_jra55v1... 3924
19 025deg_jra55v13_iaf_gmredi6 None None NaT None None /g/data/hh5/tmp/cosima/access-om2-025/025deg_j... 4726
20 01deg_jra55v13_iaf None None NaT None None /g/data/hh5/tmp/cosima/access-om2-01/01deg_jra... 2700
21 01deg_jra55v140_iaf Andrew Kiss [email protected] 2020-06-09 00:00:00 0.1 degree ACCESS-OM2 global model configurati... Source code: https://github.com/COSIMA/access-... /g/data/cj50/access-om2/raw-output/access-om2-... 26168
22 01deg_jra55v140_iaf_cycle2 Andrew Kiss [email protected] 2020-08-20 00:00:00 0.1 degree ACCESS-OM2 global model configurati... Run configuration and history: https://github.... /g/data/cj50/access-om2/raw-output/access-om2-... 25198
In [7]:
variables = cc.querying.get_variables(session, experiment='01deg_jra55v140_iaf')
variables
Out[7]:
name long_name frequency ncfile # ncfiles time_start time_end
0 pfmice_i None None output243/ocean/o2i.nc 244 None None
1 sslx_i None None output243/ocean/o2i.nc 244 None None
2 ssly_i None None output243/ocean/o2i.nc 244 None None
3 sss_i None None output243/ocean/o2i.nc 244 None None
4 sst_i None None output243/ocean/o2i.nc 244 None None
... ... ... ... ... ... ... ...
254 time time static output243/ocean/ocean-2d-drag_coeff.nc 3660 1900-01-01 00:00:00 2019-01-01 00:00:00
255 xt_ocean tcell longitude static output120/ocean/ocean-2d-dxt.nc 1708 1900-01-01 00:00:00 1900-01-01 00:00:00
256 xu_ocean ucell longitude static output243/ocean/ocean-2d-drag_coeff.nc 1952 1900-01-01 00:00:00 2019-01-01 00:00:00
257 yt_ocean tcell latitude static output120/ocean/ocean-2d-dxt.nc 1708 1900-01-01 00:00:00 1900-01-01 00:00:00
258 yu_ocean ucell latitude static output243/ocean/ocean-2d-drag_coeff.nc 1952 1900-01-01 00:00:00 2019-01-01 00:00:00

259 rows × 7 columns

But there are sometimes duplicate variables with different frequencies:

In [8]:
variables[variables.name == 'surface_salt']
Out[8]:
name long_name frequency ncfile # ncfiles time_start time_end
54 surface_salt Practical Salinity 1 daily output243/ocean/ocean-2d-surface_salt-1-daily-... 244 1958-01-01 00:00:00 2019-01-01 00:00:00
197 surface_salt Practical Salinity 1 monthly output243/ocean/ocean-2d-surface_salt-1-monthl... 244 1958-01-01 00:00:00 2019-01-01 00:00:00

If you simply try to load this data you will get an error, because you would be trying to combine data from different files with different temporal frequencies:

In [9]:
cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt', session=session)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-4c4ac916b058> in <module>
----> 1 cc.querying.getvar(expt='01deg_jra55v140_iaf', variable='surface_salt', session=session)

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.10/lib/python3.8/site-packages/cosima_cookbook/querying.py in getvar(expt, variable, session, ncfile, start_time, end_time, n, frequency, **kwargs)
    178     xr_kwargs.update(kwargs)
    179 
--> 180     ds = xr.open_mfdataset(
    181         (str(f.NCFile.ncfile_path) for f in ncfiles),
    182         parallel=True,

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.10/lib/python3.8/site-packages/xarray/backends/api.py in open_mfdataset(paths, chunks, concat_dim, compat, preprocess, engine, lock, data_vars, coords, combine, autoclose, parallel, join, attrs_file, **kwargs)
    974             # Redo ordering from coordinates, ignoring how they were ordered
    975             # previously
--> 976             combined = combine_by_coords(
    977                 datasets,
    978                 compat=compat,

/g/data3/hh5/public/apps/miniconda3/envs/analysis3-20.10/lib/python3.8/site-packages/xarray/core/combine.py in combine_by_coords(datasets, compat, data_vars, coords, fill_value, join, combine_attrs)
    787             indexes = concatenated.indexes.get(dim)
    788             if not (indexes.is_monotonic_increasing or indexes.is_monotonic_decreasing):
--> 789                 raise ValueError(
    790                     "Resulting object does not have monotonic"
    791                     " global indexes along dimension {}".format(dim)

ValueError: Resulting object does not have monotonic global indexes along dimension time
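
The `getvar` signature shown in the traceback includes a `frequency` argument, so one way to avoid this error is to request a single frequency explicitly. The sketch below wraps that call. The wrapper name `load_at_frequency` is hypothetical, and the frequency string is assumed to match a value in the frequency column of the variables table (for example `'1 monthly'`).

```python
def load_at_frequency(expt, variable, frequency, session=None, **kwargs):
    """Load one variable at a single temporal frequency.

    Extra keyword arguments are passed through to getvar unchanged.
    """
    # Imported here so the function can be defined on systems where the
    # Cookbook is not installed.
    import cosima_cookbook as cc

    if session is None:
        # Fall back to a new session on the default database.
        session = cc.database.create_session()
    return cc.querying.getvar(
        expt=expt,
        variable=variable,
        session=session,
        frequency=frequency,
        **kwargs,
    )


# Example (requires access to the Cookbook database):
# salt = load_at_frequency(
#     "01deg_jra55v140_iaf", "surface_salt", "1 monthly", session=session
# )
```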

Exploring a Cookbook Database

The COSIMA Cookbook explore submodule seeks to solve the issue of how to find relevant experiments and variables within a Cookbook database and simplify the process of loading this data.

It does this by providing GUI elements that users can embed in their Jupyter notebooks to filter and query the database.

Requirements: The explore submodule requires the cosima-cookbook version found in the conda/analysis3-20.07 (or later) kernel on NCI (or your own up-to-date Cookbook installation).

In [10]:
from cosima_cookbook import explore

Database Explorer

The first component is DatabaseExplorer, which is used to find relevant experiments. You can re-use an existing session; if no session is specified, it will start with the default database.

Filtering can be applied to narrow down the list of experiments. Select one or more keywords to reduce the listed experiments to those that contain all the selected keywords. To show only experiments that contain given variables, select the variables from the list of available variables in the database and push the '>>' button to move them to the right-hand box. When the filter button is pushed, only experiments that contain the variables in the right-hand box are shown. Variables can be removed from the filtering box by selecting them and pushing '<<'.

Note that the list of available variables contains all variables in the database, and filtering by keyword does not change it. Both filtering methods are applied together to find the list of matching experiments, but they are independent in all other respects.
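
The combined effect of the two filters can be sketched in plain Python. This only illustrates the logic described above and is not the widget's implementation: the `filter_experiments` helper and the small `catalogue` dictionary (including its keywords) are hypothetical, and requiring every selected variable to be present is an assumption.

```python
def filter_experiments(catalogue, keywords=(), variables=()):
    """Return experiments that have all given keywords and all given variables."""
    wanted_keywords = set(keywords)
    wanted_variables = set(variables)
    return sorted(
        name
        for name, info in catalogue.items()
        # "<=" on sets tests that every wanted item is present.
        if wanted_keywords <= info["keywords"]
        and wanted_variables <= info["variables"]
    )


# Hypothetical catalogue: experiment name -> its keywords and variables.
catalogue = {
    "expt_a": {"keywords": {"tenth", "ryf"}, "variables": {"u", "v", "surface_salt"}},
    "expt_b": {"keywords": {"tenth", "iaf"}, "variables": {"u", "surface_salt"}},
    "expt_c": {"keywords": {"quarter", "ryf"}, "variables": {"u", "v"}},
}

print(filter_experiments(catalogue, keywords=["tenth"]))
print(filter_experiments(catalogue, keywords=["tenth"], variables=["v"]))
```

With no keywords and no variables selected, both conditions are trivially true and every experiment is returned, which mirrors the unfiltered starting state.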

Note also that the list of available variables is pre-filtered: all variables from restart files and variables that can be unambiguously identified as coordinate variables are not listed. It is possible to remove this pre-filtering by deselecting the checkboxes underneath the variable list.

By default all variables from all model components are shown in the selection box. To display only variables from one model component select the required component from the dropdown menu which defaults to "All models".

The search box can be used to further narrow the list of available variables. When text is entered into the search box only variables that contain that text in their variable name or their long_name attribute will be displayed in the selection box.
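
The matching rule described above amounts to a substring test on two fields. The sketch below shows one way to express it; the `search_variables` helper is hypothetical, and case-insensitive matching is an assumption, since the text does not say how the widget treats case. The sample names and long names are taken from outputs shown earlier in this notebook.

```python
def search_variables(variables, text):
    """Keep (name, long_name) pairs whose name or long_name contains text."""
    needle = text.lower()  # case-insensitive matching is an assumption
    return [
        (name, long_name)
        for name, long_name in variables
        # long_name can be missing (None), so fall back to an empty string.
        if needle in name.lower() or needle in (long_name or "").lower()
    ]


variables = [
    ("u", "i-current"),
    ("surface_salt", "Practical Salinity"),
    ("sst_i", None),
]

print(search_variables(variables, "salin"))
```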

When a variable is selected the long_name is displayed below the variable selector box. In some cases when filtering and/or searching, a variable will be automatically selected but may not show as highlighted in the selector box. This is undesirable, but currently unavoidable.

When an experiment is selected and the 'Load Experiment' button is pushed, an ExperimentExplorer GUI element opens below the DatabaseExplorer. A detailed explanation of the ExperimentExplorer is in the next section.

(Note: The widgets have been exported to be viewable in an HTML page, but they will ONLY function properly if loaded in a Jupyter notebook.)

In [11]:
from cosima_cookbook import explore
dbx = explore.DatabaseExplorer(session=session)
dbx

Experiment Explorer

The ExperimentExplorer can be used independently of the DatabaseExplorer if you already know the experiment you wish to load.

You can re-use an existing database session, or not supply that argument and a new session will be created automatically with the default database. If you pass an experiment name this experiment will be loaded by default, but it is not necessary to do so, as any experiment present in the database can be selected from a drop-down menu at the top.

The box showing the available variables is the same as the one in the filtering element from DatabaseExplorer, with exactly the same functionality to show only variables from selected models, search by variable name and long name, and filter out coordinates and restarts.

When a variable is selected the long name is displayed below the box as before, but it also populates the frequency drop-down and the date range slider to the right. Identical variables can be present in a data set with different temporal frequencies. In this case it is necessary to choose a frequency, as those variables cannot be loaded into the same xarray.DataArray. When a frequency is selected, the date range slider may change to reflect the dates available at that frequency, if these differ between frequencies.

It is advisable to reduce the date range if you know you only need data for a limited period: loading is much quicker because fewer files need to be opened and have their metadata checked.
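
Why a narrower date range means fewer files can be seen with a small overlap test. The sketch below is only an illustration under stated assumptions: the `files_in_range` helper and the file entries are hypothetical, and it assumes each file's first and last time stamps are known in advance (the variables table shown earlier has time_start and time_end columns). The `getvar` signature in the traceback above also lists `start_time` and `end_time` arguments.

```python
from datetime import date


def files_in_range(files, start, end):
    """Return names of files whose time span overlaps the range [start, end]."""
    return [
        name
        for name, time_start, time_end in files
        # Two closed intervals overlap when each starts before the other ends.
        if time_start <= end and time_end >= start
    ]


# Hypothetical index entries: (file name, first time stamp, last time stamp).
files = [
    ("file_2005.nc", date(2005, 1, 1), date(2005, 12, 31)),
    ("file_2006.nc", date(2006, 1, 1), date(2006, 12, 31)),
    ("file_2007.nc", date(2007, 1, 1), date(2007, 12, 31)),
]

print(files_in_range(files, date(2006, 10, 1), date(2007, 3, 31)))
```

Only the files that overlap the requested range need to be opened, so the cost of loading scales with the length of the range rather than with the length of the whole experiment.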

Once you have selected a variable and confirmed that the frequency and date range are correct, push the "Load" button and the data will be loaded into an xarray.DataArray object. When this is done, the metadata from the loaded data will be displayed at the end of the cell output.

The relevant command used to load the data is displayed, so that it can be copied, reused, and/or modified.

The loaded data is available as the .data attribute of the ExperimentExplorer object. At any time a different variable from the same or a different experiment can be loaded, and the .data attribute will be updated to reflect the new data.

In [12]:
ee = explore.ExperimentExplorer(session=session, experiment='01deg_jra55v140_iaf')
ee
In [14]:
ee.data
Out[14]:
<xarray.DataArray 'surface_salt' (time: 99, yt_ocean: 2700, xt_ocean: 3600)>
dask.array<concatenate, shape=(99, 2700, 3600), dtype=float32, chunksize=(1, 540, 720), chunktype=numpy.ndarray>
Coordinates:
  * xt_ocean  (xt_ocean) float64 -279.9 -279.8 -279.7 ... 79.75 79.85 79.95
  * yt_ocean  (yt_ocean) float64 -81.11 -81.07 -81.02 ... 89.89 89.94 89.98
  * time      (time) object 2006-10-16 12:00:00 ... 2014-12-16 12:00:00
Attributes:
    long_name:      Practical Salinity
    units:          psu
    valid_range:    [-10. 100.]
    cell_methods:   time: mean
    time_avg_info:  average_T1,average_T2,average_DT
    coordinates:    geolon_t geolat_t
    standard_name:  sea_surface_salinity