{bdg-primary}`Ocean`
{bdg-secondary}`Modelling`
{bdg-warning}`Special Issue`
{bdg-info}`Python`
Can data-driven models for weather and climate predictions learn the underlying physics of the system against which they are trained? Or are they simply capable of identifying statistical patterns without any clear link to the underlying physics? Furner et al. (2022) run a sensitivity analysis of a regression-based ocean temperature model, trained against simulations from a 3D ocean model setup, demonstrating that regression models are capable of learning much of the physics of the underlying system.
This notebook aims to complement the science and methodological development embedded within the original paper, using an open infrastructure that allows users to combine interactive code with text and graphical objects, translating research outputs into findable, accessible, interoperable and reusable outputs and code.
The notebook demonstrates the inputs, the training of the regression model of ocean temperature and its sensitivity analysis.
```{bibliography}
:style: plain
:list: bullet
:filter: topic % "3286b92f-4fae-4cc6-a29e-e408bc844542"
```
The authors of the notebook acknowledge the original creator for providing public code available at Regression Model OceanTemp. An updated and enhanced version of the code is available at Reproducibility of "A sensitivity analysis of a regression model of ocean temperature".
```{admonition} Computing environment
:class: dropdown
*The submitted notebook was implemented and tested within a [cloud computing infrastructure, kindly provided by the European Open Science Cloud (EOSC)](https://github.com/pangeo-data/pangeo-eosc). The following lines describe the characteristics of this environment.*

| Characteristic | Value |
|:------- | ------:|
|**Operating System** | Ubuntu 22.04.1|
|**Architecture** | x86_64|
| CPU op-mode(s) | 32-bit, 64-bit|
| Address sizes | 48 bits physical, 48 bits virtual|
| Byte Order | Little Endian|
|**CPU(s)** | 128|
| On-line CPU(s) list | 0-127|
|**Vendor ID** | AuthenticAMD|
| Model name | AMD EPYC 7543 32-Core Processor|
| CPU family | 25|
| Model | 1|
| Thread(s) per core | 2|
| Core(s) per socket | 32|
| Socket(s) | 2|
| Stepping | 1|
| Frequency boost | enabled|
| CPU max MHz | 2800.0000|
| CPU min MHz | 1500.0000|
| BogoMIPS | 5589.72|
| Flags | fpu vme de pse|
```
# System imports
import sys
import warnings
import pandas as pd
# Data handling
import intake
# Modelling
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV
# Custom functions from the authors' repo
sys.path.append("src/general/")
sys.path.append("src/pipelines/")
import CreateDataName as cn
import ReadRoutinesModifiedCloud as rr
from make_scatter_data_plots_cloud import make_scatter_plots
from make_error_data_plots_cloud import make_error_plots
# Custom functions of the reproducibility team
from constants_cloud import *
warnings.filterwarnings(action="ignore")
The training and validation data derive from running the Massachusetts Institute of Technology general circulation model (MITgcm), a physically based model capable of simulating ocean or atmosphere dynamics thanks to isomorphisms in the governing equations. This research project relies on a simple 2° sector configuration that captures the fundamental dynamics of the ocean, including a realistic overturning circulation. The configuration features a single ocean basin, described below.
The domain runs from 60°S to 60°N and is just over 20° wide in longitude. It is bounded by land along its northern and southern edges, and a strip of land runs along the eastern/western boundary from 60°N to 40°S. Below this, in the southernmost 20°, the simulator has a periodic boundary condition, allowing flow that exits to the east and west to re-enter the domain at the western and eastern boundaries. Bathymetry is a flat bottom of 5,000 m over most of the domain, with a 2° region of 2,500-m depth at the southernmost 20° of the eastern edge.
The simulator has a 12-hr time step, with fields output daily; we focus on the daily-mean outputs.
```{note}
The submitted notebook (located at `notebooks/notebook_submitted.ipynb`) downloads input datasets and saved models from Zenodo repositories. This notebook version skips the download step by fetching input files directly from cloud storage kindly provided by the European Open Science Cloud (EOSC).
```
The original dataset is available here. This dataset is used to train the regression model described in the paper. To download this data locally, run the following line of code. Note that this process will take ∼1 hour and will require at least 52 GB of available space.
%%time
!bash "infrastructure/downloader_datasets.sh" > "../outputs/logs/download_datasets.txt"
Training this model and performing all the experiments presented in the original paper requires substantial computing power, so, to facilitate running this notebook, we have publicly released a few trained models, their predictions, and input arrays, available here. To download this data locally, run the following line of code. This process will take ∼15 minutes and requires at least ∼20 GB of available space.
%%time
!bash "infrastructure/downloader_experiments.sh" > "../outputs/logs/download_experiments.txt"
The paper uses a 2° sector configuration to simulate ocean dynamics. This configuration features a single ocean basin, with limited topography, simplified coastlines, and constant idealized forcing. This has been used in a number of idealized simulations of Southern Ocean processes and their impacts on the global circulation. This configuration, while relatively simple, captures the fundamental dynamics of the ocean, including a realistic overturning circulation.
In the following script, we load the MITgcm dataset and visualize the configuration used to simulate the ocean dynamics, focusing particularly on the domain geometry and the temperature field.
%run "src/pipelines/graphs_section_cloud.py"
Notice from the figures presented that the depth axis is scaled to give each GCM grid cell equal spacing. The simulator shows a realistic temperature distribution, with warm surface water near the equator and cooler water near the poles and in the deep ocean. Temperature changes are largest in the very north of the domain and throughout the southern region. Though changes per day are small, they accumulate over time to give cycles of around 0.2° in some regions of the domain.
Model training is performed for the purely linear regression model (only linear terms) and the nonlinear regression model (linear and multiplicative terms). Run parameters for the control and withholding experiments were set through the `run_vars` variable in the script `src/pipelines/TrainLinearRegressionModel.py`.
For illustrative purposes, in the following section we showcase how to set parameters for a linear regression model that takes as input all the relevant variables, that is: temperature, salinity, U (East–West) and V (North–South) current components, density, the U, V, and W (vertical) components of the Gent-McWilliams (GM) bolus velocities, sea surface height (SSH), latitude, longitude, and depth. The GM bolus velocities are a parameterization of the velocities resulting from ocean eddies and are used in the GM scheme to calculate the advective effects of eddy action on tracers.
First, we define a dictionary named `run_vars`, which has all the input variables set to `True` and `poly_degree` set to `1`; the latter parameter defines a purely linear regression.
# Run parameters
run_vars = {
"dimension": 3,
"lat": True,
"lon": True,
"dep": True,
"current": True,
"bolus_vel": True,
"sal": True,
"eta": True,
"density": True,
"poly_degree": 1,
"StepSize": 1,
"predict": "DelT",
}
data_name = data_prefix + cn.create_dataname(run_vars)
The following chunk of code opens the input data catalog, reads the MITgcm, climatology, and density files, and splits the normalized data into training, validation, and testing subsets. The training-validation split ratio is set by default to 0.7, while the validation-testing split ratio is set to 0.9.
# Load the data
cat = intake.open_catalog("data/inputs_paper.yml")
mitgcm_filename = cat["MITGCM_model"]
density_file = cat["density_data"]
clim_filename = cat["clim_data"]
trainval_ratio = 0.7
valtest_ratio = 0.9
(
norm_inputs_tr,
_,
_,
norm_outputs_tr,
norm_outputs_val,
_,
_,
_,
_,
_,
_,
_,
_,
) = rr.ReadMITGCM(mitgcm_filename, clim_filename, density_file, data_name, run_vars)
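To make these ratios concrete, the sketch below shows how a 0.7/0.9 split would translate into sequential training, validation, and test indices. This is a minimal illustration only: the actual splitting is handled inside `rr.ReadMITGCM`, and the sample count used here is hypothetical.
# Minimal sketch (assumption): sequential split implied by the 0.7/0.9 ratios.
# The real split is performed inside rr.ReadMITGCM and may differ in detail.
import numpy as np

n_samples = 1000                                     # hypothetical number of samples
train_end = int(trainval_ratio * n_samples)          # first 70% -> training
val_end = int(valtest_ratio * n_samples)             # next 20% -> validation
train_idx = np.arange(0, train_end)
val_idx = np.arange(train_end, val_end)
test_idx = np.arange(val_end, n_samples)             # last 10% -> testing
print(len(train_idx), len(val_idx), len(test_idx))   # 700 200 100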
After splitting our dataset into the corresponding subsamples (training, validation, and testing), we proceed to train the model with the following hyperparameters: the ridge regularization strength `alpha` is fixed at 0.001, and the cross-validation splitting strategy is set to 3 folds.
# Model hyperparameters for the grid search
parameters = [{"alpha": [0.001]}]
n_folds = 3
lr = linear_model.Ridge(fit_intercept=False)
lr = GridSearchCV(
lr, param_grid=parameters, cv=n_folds, scoring="neg_mean_squared_error", refit=True
)
# Fit the model
lr.fit(norm_inputs_tr, norm_outputs_tr)
# Parameters for the estimator.
pd.DataFrame.from_dict(lr.get_params(), orient="index", columns=["values"])
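Because the grid search was run with `refit=True`, the selected regularization strength and its mean cross-validated score can also be inspected through scikit-learn's standard attributes; the short, optional check below assumes the fitted `lr` object from the cell above.
# Optional check on the grid search outcome (standard GridSearchCV attributes)
print(lr.best_params_)   # alpha selected by the 3-fold cross-validation
print(lr.best_score_)    # mean negative MSE of the best parameter setting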
We update the dictionary `run_vars` such that `poly_degree` is set to `2`. This defines a model with second-order pairwise polynomial terms, capturing a limited amount of nonlinear behavior through interactions between terms. This means that, as well as the above inputs, we have multiplicative pairs of features.
# Setting up the parameters for the run
run_vars = {
"dimension": 3,
"lat": True,
"lon": True,
"dep": True,
"current": True,
"bolus_vel": True,
"sal": True,
"eta": True,
"density": True,
"poly_degree": 2,
"StepSize": 1,
"predict": "DelT",
}
data_name = data_prefix + cn.create_dataname(run_vars)
model_name = model_prefix + data_name
Note: Some models were trained using large input datasets which will not fit in the memory available on a standard Binder instance. Therefore, below we display the code needed for loading the input data in a markdown cell rather than an executable cell. Instead, in the Sensitivity analysis (Figure 5) section we demonstrate how the steps below work on another model which fits in memory, and we run the prediction function there.
# Open data
cat = intake.open_catalog("data/inputs_models.yml")
# Loading input feature arrays
norm_inputs_tr = cat[model_name].to_dask()["norm_inputs_tr"]
norm_inputs_val = cat[model_name].to_dask()["norm_inputs_val"]
The following call makes the scatter plot of predicted versus true temperature change, allowing us to evaluate the performance of the control regressor.
make_scatter_plots(run_vars, data_name, model_name, norm_inputs_tr, norm_inputs_val, lim=15000, fig_prefix="fig2")
The model captures the central part of the distribution well. While the majority of the temperature change is dominated by small near-zero changes, capturing these is key to producing a good forecast system. To a lesser extent, the regressor also captures the tails of the distribution, where temperature changes are larger, although the underprediction is more significant here. However, it is noteworthy that the model still shows some skill for these points, given that the model used is very simple and there are a relatively limited number of training samples in the tails.
To understand how the regressor performs spatially, temporally averaged absolute errors are plotted. These averaged errors are shown in Figure 3. Note that the regressor is only applied away from boundary and land points (in its current form, it cannot deal with spatial locations that are not surrounded on all sides by ocean points); hence, points close to land are not included in these plots. The script below computes and plots these temporally averaged errors.
%run "src/pipelines/spatial_patterns_errors_cloud.py"
The largest errors of the regressor are located in the north of the domain and in the Southern Ocean. We see that the errors in the north of the domain are co-located with regions of high vertical advective temperature fluxes and regions of high convective fluxes. These results imply the regression model struggles to fully capture the vertical processes, and the associated heat flux, in the north of the domain. The high errors in the Southern Ocean are again co-located with regions of high vertical diffusive fluxes, this time both explicit and implicit, and vertical advection. Throughout the ocean interior, where temperature changes and the associated fluxes are small, errors are also small, as would be expected.
Figure 4a shows coefficients averaged over all input locations for each variable type (i.e., for most variables, there are 27 inputs, corresponding to the 27 neighboring cells; we average over these to give a single value for each variable (temperature, salinity, etc.) and for each polynomial combination of variables).
Figure 4b shows the coefficients related to polynomial interactions of temperature with temperature. These are the raw coefficients, without any averaging applied.
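To make the averaging behind Figure 4a concrete, the cell below is a minimal, self-contained sketch with synthetic coefficients for two hypothetical variable groups of 27 neighbouring-cell inputs each; the real grouping and coefficient extraction are handled inside `coecien_analysis_cloud.py`.
# Illustrative only: average absolute coefficient magnitude per variable group
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
coefs = rng.normal(size=54)                        # synthetic coefficients
groups = ["temperature"] * 27 + ["salinity"] * 27  # hypothetical group labels
df = pd.DataFrame({"group": groups, "abs_coef": np.abs(coefs)})
print(df.groupby("group")["abs_coef"].mean())      # one averaged value per group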
%run "src/pipelines/coecien_analysis_cloud.py"
High-weighted inputs (those with a large-magnitude coefficient) are variables which are heavily used in the predictions and are therefore considered important for the predictive skill; the figures above show which variable groups fall into this category.
In this section, we run a series of withholding experiments. For each of the variables described, with the exception of temperature, we train a new regressor leaving out that one variable group; for example, we train a new regressor with all the existing inputs except for salinity at all surrounding points and any multiplicative terms including salinity (an illustrative configuration is sketched below).
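For illustration, the dictionary below sketches how a salinity-withholding experiment could be configured by flipping the corresponding flag in `run_vars`; note that the withholding experiments shown in this notebook load pre-trained models rather than retraining from this configuration.
# Illustrative only: configuration for a hypothetical salinity-withholding run
run_vars_no_sal = {
    "dimension": 3,
    "lat": True,
    "lon": True,
    "dep": True,
    "current": True,
    "bolus_vel": True,
    "sal": False,        # withhold salinity (and multiplicative terms involving it)
    "eta": True,
    "density": True,
    "poly_degree": 2,
    "StepSize": 1,
    "predict": "DelT",
}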
We update the dictionary `run_vars` such that `poly_degree` is set to `1`. This lets us evaluate the performance of the purely linear model, that is, the model trained without any multiplicative terms.
# Setting up the parameters for the run - Setting Polynomial Degree=1
run_vars = {
"dimension": 3,
"lat": True,
"lon": True,
"dep": True,
"current": True,
"bolus_vel": True,
"sal": True,
"eta": True,
"density": True,
"poly_degree": 1,
"StepSize": 1,
"predict": "DelT",
}
data_name = data_prefix + cn.create_dataname(run_vars)
model_name = model_prefix + data_name
# Path to the input arrays from an intake catalog
cat = intake.open_catalog("data/inputs_models.yml")
inputs_tr_filename = cat[model_name](partition="Tr")
inputs_val_filename = cat[model_name](partition="Val")
# Loading Input Arrays
norm_inputs_tr = inputs_tr_filename.to_dask()["norm_inputs_tr"]
norm_inputs_val = inputs_val_filename.to_dask()["norm_inputs_val"]
make_scatter_plots(
run_vars,
data_name,
model_name,
norm_inputs_tr,
norm_inputs_val,
plot_val=False,
fig_prefix="fig5",
)
We see that, without multiplicative terms, the model can capture the mean behavior of the system (zero change in temperature) but is unable to capture any of the variability. This mean behavior alone does not provide useful forecasts, as can be seen from the statistics for this experiment. Nonlinearity is shown to be critical to modeling the variability of temperature change.
To assess how information about the vertical structure of the ocean impacts predictions, we look at spatially averaged errors from the model trained with only a 2D neighborhood of inputs, along with the difference in error between this and the control regressor from the previous section.
We update the dictionary `run_vars` such that `dimension` is set to `2`.
# Loading the 2D model
run_vars = {
"dimension": 2,
"lat": True,
"lon": True,
"dep": True,
"current": True,
"bolus_vel": True,
"sal": True,
"eta": True,
"density": True,
"poly_degree": 2,
"StepSize": 1,
"predict": "DelT",
}
data_name = data_prefix + cn.create_dataname(run_vars)
model_name = model_prefix + data_name
exp_name = exp_prefix + model_name
cntrl_name = (
exp_prefix
+ model_prefix
+ data_prefix
+ "3dLatLonDepUVBolSalEtaDnsPolyDeg2_Step1_PredictDelT"
)
exp_name
make_error_plots(mitgcm_filename, exp_name, cntrl_name, fig_prefix="fig6")
Interestingly, this regressor shows some regions (the deep water in the south of the domain) where the errors are notably improved in a regressor using only 2D information. In this work, we have developed a regressor which learns one equation to be applied across all grid boxes in the domain. We optimize for best performance averaged over all relevant grid cells, but this does not enforce the best possible performance over each individual grid point/region, and so some of the resultant models will favor certain types of dynamics more than others. Given this, it is not unexpected that the new equations discovered for the withholding experiments (which again optimize for best performance averaged over the entire domain interior) may outperform the control in some locations, despite being poorer overall. Here, we see that the control model is able to perform well across the domain, and optimizes for good performance overall, rather than the much more varied performance seen in the withholding experiments. It seems that as the model which withholds vertical information is not capable of performing well in many regions of the domain, a solution is found which highly optimizes performance in other regions to minimize error overall.
We update the dictionary `run_vars` such that `current` is set to `False`. This lets us evaluate the impact of the currents on the performance of the model.
# Setting the run parameters
run_vars = {
"dimension": 3,
"lat": True,
"lon": True,
"dep": True,
"current": False,
"bolus_vel": True,
"sal": True,
"eta": True,
"density": True,
"poly_degree": 2,
"StepSize": 1,
"predict": "DelT",
}
data_name = data_prefix + cn.create_dataname(run_vars)
model_name = model_prefix + data_name
exp_name = exp_prefix + model_name
cntrl_name = (
exp_prefix
+ model_prefix
+ data_prefix
+ "3dLatLonDepUVBolSalEtaDnsPolyDeg2_Step1_PredictDelT"
)
exp_name
make_error_plots(mitgcm_filename, exp_name, cntrl_name, fig_prefix="fig7")
The horizontal (U and V) components of the currents directly drive horizontal advection of temperature. They are also indirectly related to horizontal diffusion, as this is increased in regions of high currents and steep gradients. As such, we would expect that suppressing information about the horizontal currents would cause increases in error in regions where horizontal advection and horizontal diffusion are high. However, again, we note that this region of increased error is one where many processes are present, and the increased errors seen also coincide, to a lesser extent, with regions of high vertical processes (advection, diffusion, and convection), which is less in line with our physical understanding. Here, errors appear more closely matched to the horizontal processes, and so a reasonable interpretation is that the model here is behaving as expected.
Please see CITATION.cff for the full citation information. The citation file can be exported to APA or BibTeX formats (learn more here).
Dataset: Malhotra, Garima, Pinto-Veizaga, Daniela, & Pena, Jorge. (2023). Reproducible Challenge - Team 3 - Sensitivity analysis- Models. https://doi.org/10.5281/zenodo.7954232 - Version 1
Furner, Rachel. (2021). MITgcm Dataset for paper: Sensitivity analysis of a data-driven model of ocean temperature (v1.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7919172 - Version 1.1
License: The code in this notebook is licensed under the MIT License. The Environmental Data Science book is licensed under the Creative Commons by Attribution 4.0 license. See further details here.
Contact: If you have any suggestions or would like to report an issue with this notebook, feel free to create an issue or send a direct message to environmental.ds.book@gmail.com.
from datetime import date
print('Notebook repository version: v1.0.19')
print(f'Last tested: {date.today()}')