This notebook demonstrates different options for running ibicus bias adjustment on larger areas and in larger computing environments.
ibicus comes with an integrated parallelization option building upon the multiprocessing module. It also integrates easily with dask to run in HPC environments. In this notebook, we demonstrate these options using the CDFt and QuantileMapping debiasers.
from ibicus.debias import CDFt, QuantileMapping
Let's get some testing data. For an explanation of the steps please refer to the "Getting started" notebook:
import numpy as np
def get_data(variable, data_path = "testing_data/"):
# Load in the data
data = np.load(f"{data_path}{variable}.npz", allow_pickle = True)
# Return arrays
return data["obs"], data["cm_hist"], data["cm_future"], {"time_obs": data["time_obs"], "time_cm_hist": data["time_cm_hist"], "time_cm_future": data["time_cm_future"]}
obs, cm_hist, cm_future, dates = get_data("tas")
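As a quick sanity check, we can inspect the array shapes (the exact values depend on the testing data):

print(obs.shape, cm_hist.shape, cm_future.shape)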
Parallelization can be activated in the existing ibicus functionalities by simply specifying parallel = True in the debiaser.apply function:
debiaser = CDFt.from_variable("tas")
debiased_cm_future = debiaser.apply(obs, cm_hist, cm_future, **dates, parallel = True, nr_processes = 8)
/home/jakobwes/Desktop/ESoWC/ibicus/notebooks/own_testing_notebooks/../../ibicus/debias/_debiaser.py:535: UserWarning: progressbar argument is ignored when parallel = True.
  warnings.warn("progressbar argument is ignored when parallel = True.")
The number of processes that run in parallel can be controlled using the nr_processes option; the default is 4 processes. For more details, see the ibicus API reference. Note that no progressbar is shown during parallelized execution.
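For instance, a minimal sketch comparing serial and parallel execution (actual timings depend on the machine and the data):

import time

# Serial execution as a baseline
start = time.time()
debiaser.apply(obs, cm_hist, cm_future, **dates, progressbar=False)
print(f"Serial apply: {time.time() - start:.1f}s")

# Parallel execution with 8 processes
start = time.time()
debiaser.apply(obs, cm_hist, cm_future, **dates, parallel=True, nr_processes=8)
print(f"Parallel apply (8 processes): {time.time() - start:.1f}s")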
We recommend using this parallelization option to speed up the execution of bias adjustment on a single machine.
For some problems, the simple parallelization presented above does not provide enough flexibility: for example, when users want to scale debiasing across many machines in an HPC environment, or when the observation and climate model data do not fit into RAM.
To address these issues, ibicus integrates easily with dask. dask is an open-source python library for parallel computing that allows users to easily scale their python code from multi-core machines to large clusters. It is integrated in both xarray and iris (see here for the xarray dask integration and here for the iris one). In both libraries, it is possible to extract the underlying dask arrays needed for computation.
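For example, with xarray, opening a dataset with the chunks argument yields dask-backed arrays, and the underlying dask array can be accessed via the .data attribute (a sketch; the file name, variable name, and chunk sizes are hypothetical):

import xarray as xr

# Opening with `chunks` yields dask-backed arrays
ds = xr.open_dataset("cm_future.nc", chunks={"latitude": 5, "longitude": 10})
cm_future = ds["tas"].data  # the underlying dask array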
For a dask introduction see here, and for a practical introduction on how to use dask on an HPC cluster see this tutorial. We will only use the dask.array module here:
import dask.array as da
Let's get some larger testing data:
obs = da.from_array(np.random.normal(270, 20, size = 50*50*10000).reshape((50, 50, 10000)), chunks=(5, 10, 10000))
cm_hist = da.from_array(np.random.normal(265, 15, size = 50*50*10000).reshape((50, 50, 10000)), chunks=(5, 10, 10000))
cm_future = da.from_array(np.random.normal(280, 30, size = 50*50*10000).reshape((50, 50, 10000)), chunks=(5, 10, 10000))
For our purposes it is crucial that the dask arrays are chunked only in the spatial dimensions: chunks can be defined along the first two dimensions, but each chunk always needs to include the full time dimension at every location. This is required to calculate the climatology at each location.
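We can check that the arrays above satisfy this requirement: the .chunks attribute shows that each chunk spans 5 x 10 grid cells but the full 10000 time steps.

print(obs.chunks)
# ((5, 5, 5, 5, 5, 5, 5, 5, 5, 5), (10, 10, 10, 10, 10), (10000,))
# -> the last (time) axis consists of a single chunk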
Given correctly chunked arrays, applying ibicus with dask is possible by simply mapping the debiaser.apply function over all chunks, using e.g. map_blocks:
debiaser = QuantileMapping.from_variable("tas")
collection = da.map_blocks(debiaser.apply, obs, cm_hist, cm_future, dtype=obs.dtype, progressbar = False, parallel = False)
debiased_cm_future = collection.compute(num_workers=8)
It is also possible to use other dask mapping functions such as blockwise. To use the ibicus apply function together with dask, it is important to specify two arguments:

- progressbar = False: otherwise the progressbar output will fill the output log. A dask progressbar can be used instead by importing dask.diagnostics.ProgressBar (see the sketch after this list).
- parallel = False (the default): otherwise the ibicus parallelization will interfere with the dask one.

For bias adjustment methods whose apply function requires additional information such as time/dates, this can be specified as keyword arguments to map_blocks. For very big runs it is also recommended to specify failsafe = True, so that if the debiaser fails at some locations, the output for the other ones can still be saved. When doing so, it is even more important to check the logs for any errors and to evaluate the output carefully.
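Putting this together, a call for a debiaser that requires dates might look as follows (a sketch: it assumes the date arrays in dates match the time dimension of the dask arrays, which is not the case for the synthetic arrays above):

from dask.diagnostics import ProgressBar

# CDFt requires dates; dask passes keyword arguments unchanged to every
# block, which works here because each chunk spans the full time dimension.
debiaser = CDFt.from_variable("tas")
collection = da.map_blocks(
    debiaser.apply, obs, cm_hist, cm_future,
    dtype=obs.dtype,
    progressbar=False, parallel=False, failsafe=True,
    **dates,
)
# The dask progressbar reports progress of the whole computation
with ProgressBar():
    debiased_cm_future = collection.compute(num_workers=8)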
Dask itself provides a wide variety of customization options, and we recommend checking those out.
A brief note on logging and warnings: when ibicus encounters issues during code execution, a warning or error message is raised, and the standard python tools to handle these can be used. ibicus also writes logs during execution and logs errors in failsafe mode. The logs are written to the "ibicus" logger (accessible via ibicus.utils.get_library_logger()), and ibicus.utils provides some options to set the logging level for ibicus. The logging output can be handled in the usual way, as specified by the logging library: it can be formatted, written to file, ignored, etc.
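For example, a minimal sketch using the standard logging library (the log file name is arbitrary):

import logging
from ibicus.utils import get_library_logger

logger = get_library_logger()  # the "ibicus" logger
logger.setLevel(logging.WARNING)  # only show warnings and errors

# Standard logging handlers apply, e.g. writing ibicus logs to a file
handler = logging.FileHandler("ibicus_run.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)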