In this tutorial we will use the EasyVVUQ GridSampler to perform a grid search on the hyperparameters of a simple Keras neural network, trained to recognize hand-written digits. This is the famous MNIST data set, of which 4 input samples (images of size 28 x 28) are shown below. These are fed into a standard feed-forward neural network, which predicts the label 0-9.
The (Keras) neural network script is located in mnist/keras_mnist.template, which will form the input template for the EasyVVUQ encoder. We will assume you are familiar with the basic EasyVVUQ building blocks. If not, you can look at the basic tutorial.
We need EasyVVUQ, TensorFlow and the TensorFlow data sets to execute this tutorial. If you need to install these, uncomment the corresponding line below.
# !pip install easyvvuq
# !pip install tensorflow
# !pip install tensorflow_datasets
While running on the localhost, we will use the FabSim3 automation toolkit for the data-processing workflow, i.e. to move the UQ ensemble to/from the localhost. To connect EasyVVUQ with FabSim3, the FabUQCampaign plugin must be installed.
The advantage of this construction is that we could offload the ensemble to a remote supercomputer using this same script by simply changing the MACHINE='localhost' flag, provided that FabSim3 is set up on the remote resource.
For an example without FabSim3, see tutorials/hyperparameter_tuning_tutorial.ipynb.
For now, import the required libraries below. fabsim3_cmd_api is an interface with FabSim3 such that the command-line FabSim3 commands can be executed from a Python script. It is stored locally in fabsim3_cmd_api.py.
import easyvvuq as uq
import os
import numpy as np
############################################
# Import the FabSim3 commandline interface #
############################################
import fabsim3_cmd_api as fab
We now set some flags:
# Work directory, where the EasyVVUQ directory will be placed
WORK_DIR = '/tmp'
# machine to run ensemble on
MACHINE = "localhost"
# target output filename generated by the code
TARGET_FILENAME = 'output.csv'
# EasyVVUQ campaign name
CAMPAIGN_NAME = 'grid_test'
# FabSim3 config name
CONFIG = 'grid_search'
# Use QCG PilotJob or not
PILOT_JOB = False
Most of these are self-explanatory. Here, CONFIG is the name of the script that gets executed for each sample, in this case grid_search, which is located in FabUQCampaign/templates/grid_search. Its contents essentially just run our Python code hyper_param_tune.py:
cd $job_results
$run_prefix
/usr/bin/env > env.log
python3 hyper_param_tune.py
Here, hyper_param_tune.py is generated by the EasyVVUQ encoder, see below. The flag PILOT_JOB regulates the use of the QCG PilotJob mechanism. If True, FabSim will submit the ensemble to the (remote) host as a QCG PilotJob, which essentially means that all individual jobs of the ensemble get packaged into a single job allocation, thereby circumventing the limit on the maximum number of simultaneous jobs that is present on many supercomputers. For more info on the QCG PilotJob click here. In this example we'll run the samples on the localhost (see MACHINE), and hence we set PILOT_JOB=False.
As is standard in EasyVVUQ, we now define the parameter space. In this case it consists of 4 hyperparameters. There is one hidden layer with n_neurons neurons, and a Dropout layer after the input and hidden layer, with dropout probabilities dropout_prob_in and dropout_prob_hidden respectively. We made the learning_rate tunable as well.
params = {}
params["n_neurons"] = {"type":"integer", "default": 32}
params["dropout_prob_in"] = {"type":"float", "default": 0.0}
params["dropout_prob_hidden"] = {"type":"float", "default": 0.0}
params["learning_rate"] = {"type":"float", "default": 0.001}
These 4 hyperparameters appear as flags in the input template mnist/keras_mnist.template. Typically such a template is generated from an input file used by some simulation code. In this case, however, mnist/keras_mnist.template is directly our Python script, with the hyperparameters replaced by flags. For instance:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dropout($dropout_prob_in),
    tf.keras.layers.Dense($n_neurons, activation='relu'),
    tf.keras.layers.Dropout($dropout_prob_hidden),
    tf.keras.layers.Dense(10)
])
is simply the neural-network construction part, with flags for the dropout probabilities and the number of neurons in the hidden layer. The encoder reads the flags, replaces them with numeric values, and subsequently writes the result to target_filename=hyper_param_tune.py:
encoder = uq.encoders.GenericEncoder('./mnist/keras_mnist.template', target_filename='hyper_param_tune.py')
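The flag substitution performed by the encoder behaves much like Python's built-in string.Template: every $flag in the template is replaced by the value drawn for that sample. A simplified standalone sketch (not the actual EasyVVUQ implementation):

```python
from string import Template

# one line from the input template, containing a $n_neurons flag
line = "tf.keras.layers.Dense($n_neurons, activation='relu'),"

# replace the flag with a numeric value, as the encoder does for every sample
encoded = Template(line).substitute(n_neurons=64)
print(encoded)  # tf.keras.layers.Dense(64, activation='relu'),
```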
Now we create the first set of EasyVVUQ actions
to create separate run directories and to encode the template:
# actions: create directories and encode input template, placing 1 hyper_param_tune.py file in each directory.
actions = uq.actions.Actions(
uq.actions.CreateRunDirectory(root=WORK_DIR, flatten=True),
uq.actions.Encode(encoder),
)
# create the EasyVVUQ main campaign object
campaign = uq.Campaign(
name=CAMPAIGN_NAME,
work_dir=WORK_DIR,
)
# add the param definitions and actions to the campaign
campaign.add_app(
name=CAMPAIGN_NAME,
params=params,
actions=actions
)
As with the uncertainty-quantification (UQ) samplers, the vary dict is used to select which of the params we actually vary. Unlike the UQ samplers, we do not specify an input probability distribution. This being a grid search, we simply specify a list of values for each hyperparameter. Parameters not in vary, but with a flag in the template, will be given the default value specified in params.
vary = {"n_neurons": [64, 128], "learning_rate": [0.005, 0.01, 0.015]}
Note: we are mixing integers and floats in the vary dict. Other data types (string, boolean) can also be used.
The vary dict is passed to the Grid_Sampler. As can be seen, it creates a tensor product of all 1D points specified in vary. If a single tensor product is not useful (e.g. because it creates combinations of parameters that do not make sense), you can also pass a list of different vary dicts. For even more flexibility you can write the required parameter combinations to a CSV file and pass it to the CSV_Sampler instead.
# create an instance of the Grid Sampler
sampler = uq.sampling.Grid_Sampler(vary)
# Associate the sampler with the campaign
campaign.set_sampler(sampler)
# print the points
print("There are %d points:" % (sampler.n_samples()))
sampler.points
There are 6 points:
[array([[64, 0.005], [64, 0.01], [64, 0.015], [128, 0.005], [128, 0.01], [128, 0.015]], dtype=object)]
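The tensor product above can be reproduced in plain Python with itertools.product; this is only a standalone illustration of what the sampler computes, not EasyVVUQ internals:

```python
from itertools import product

vary = {"n_neurons": [64, 128], "learning_rate": [0.005, 0.01, 0.015]}

# all combinations of the 1D points: 2 x 3 = 6 grid points
points = list(product(*vary.values()))
print(len(points), points[0])  # 6 (64, 0.005)
```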
Run the actions (create the run directories, each containing a hyper_param_tune.py file):
###############################
# execute the defined actions #
###############################
campaign.execute().collate()
To run the ensemble, execute:
###################################################
# run the UQ ensemble using the FabSim3 interface #
###################################################
fab.run_uq_ensemble(CONFIG, campaign.campaign_dir, script='grid_search',
machine=MACHINE, PJ=PILOT_JOB)
# wait for job to complete
fab.wait(machine=MACHINE)
Executing fabsim localhost run_uq_ensemble:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,script=grid_search,skip=0,PJ=False
True
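Under the hood, fab.run_uq_ensemble shells out to the fabsim command line. Based on the log line shown above, the command it builds can be sketched with a small hypothetical helper (build_fabsim_cmd is not part of the fab API, it only illustrates the argument format):

```python
def build_fabsim_cmd(machine, config, campaign_dir, script, skip=0, pj=False):
    # mirrors the "fabsim <machine> run_uq_ensemble:<args>" call shown in the log
    args = f"{config},campaign_dir={campaign_dir},script={script},skip={skip},PJ={pj}"
    return f"fabsim {machine} run_uq_ensemble:{args}"

print(build_fabsim_cmd("localhost", "grid_search", "/tmp/grid_testrebm6ntq", "grid_search"))
# fabsim localhost run_uq_ensemble:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,script=grid_search,skip=0,PJ=False
```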
# check if all output files are retrieved from the remote machine, returns a Boolean flag
all_good = fab.verify(CONFIG, campaign.campaign_dir, TARGET_FILENAME, machine=MACHINE)
Executing fabsim localhost fetch_results Executing fabsim localhost verify_last_ensemble:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,target_filename=output.csv,machine=localhost
if all_good:
    # copy the results from the FabSim results dir to the EasyVVUQ results dir
    fab.get_uq_samples(CONFIG, campaign.campaign_dir, sampler.n_samples(), machine=MACHINE)
else:
    print("Not all samples executed correctly")
    import sys
    sys.exit()
Executing fabsim localhost get_uq_samples:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,number_of_samples=6,skip=0
Briefly:

fab.run_uq_ensemble: submits the ensemble to the (remote) host for execution. Under the hood it uses the FabSim3 campaign2ensemble subroutine to copy the run directories from WORK_DIR to the FabSim3 SWEEP directory, located in config_files/grid_search/SWEEP. From there the ensemble is sent to the (remote) host.

fab.wait: checks the status of the jobs on the remote host every minute, and sleeps otherwise, halting further execution of the script. On the localhost this command doesn't do anything.

fab.verify: executes the verify_last_ensemble subroutine to see if the output file target_filename for each run in the SWEEP directory is present in the corresponding FabSim3 results directory. Returns a boolean flag. fab.verify will also call the FabSim fetch_results method, which actually retrieves the results from the (remote) host. So, if you want to just get the results without verifying the presence of output files, call fab.fetch_results(machine=MACHINE) instead. However, if something went wrong on the (remote) host, this will cause an error later on, since not all required output files will be transferred to the EasyVVUQ WORK_DIR.

fab.get_uq_samples: copies the samples from the (local) FabSim results directory to the (local) EasyVVUQ campaign directory. It will not delete the results from the FabSim results directory. If you want to save space, you can delete the results on the FabSim side (see the results directory in your FabSim home directory). You can also call fab.clear_results(machine, name_results_dir) to remove a specific FabSim results directory on a given machine.

If all_good == False, something went wrong on the (remote) host, and sys.exit() is called in our example, giving you the opportunity to investigate what went wrong. It can happen that a (small) number of jobs did not get executed on the remote host for some reason, whereas (most) jobs did execute successfully. In this case simply resubmitting the failed jobs could be an option:
fab.remove_succesful_runs(CONFIG, campaign.campaign_dir)
fab.resubmit_previous_ensemble(CONFIG, 'grid_search')
The first command removes from the SWEEP dir all successful run directories, i.e. those for which the output file TARGET_FILENAME has been found. For this to work, fab.verify must have been called first. Then, fab.resubmit_previous_ensemble simply resubmits the runs that are still present in the SWEEP directory, which by now only contains the failed runs. After the jobs have finished, call fab.verify again to see if TARGET_FILENAME is now present in the results directory for every run in the SWEEP dir.
Once we are sure we have all required output files, the role of FabSim is over, and we proceed with decoding the output files. In this case, our Python script wrote the training and test accuracy to a CSV file, hence we use the SimpleCSV decoder.
Note: it is also possible to use the more flexible HDF5 format, by using uq.decoders.HDF5 instead.
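For reference, the end of the templated training script could write this CSV along the following lines. This is a hypothetical sketch with placeholder metric values, so the actual mnist/keras_mnist.template may differ; the point is that the column names must match the output_columns passed to the decoder:

```python
import csv

# placeholder metrics; in the real script these would come from model.evaluate(...)
accuracy_train, accuracy_test = 0.98, 0.97

# write the output file the SimpleCSV decoder will read back
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["accuracy_train", "accuracy_test"])
    writer.writeheader()
    writer.writerow({"accuracy_train": accuracy_train,
                     "accuracy_test": accuracy_test})
```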
#############################################
# All output files are present, decode them #
#############################################
output_columns = ["accuracy_train", "accuracy_test"]
decoder = uq.decoders.SimpleCSV(
target_filename=TARGET_FILENAME,
output_columns=output_columns)
actions = uq.actions.Actions(
uq.actions.Decode(decoder),
)
campaign.replace_actions(CAMPAIGN_NAME, actions)
###########################
# Execute decoding action #
###########################
campaign.execute().collate()
data_frame = campaign.get_collation_result()
data_frame
| | run_id | iteration | n_neurons | learning_rate | dropout_prob_in | dropout_prob_hidden | accuracy_train | accuracy_test |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 64 | 0.005 | 0.0 | 0.0 | 0.959267 | 0.9544 |
| 1 | 2 | 0 | 64 | 0.010 | 0.0 | 0.0 | 0.974133 | 0.9653 |
| 2 | 3 | 0 | 64 | 0.015 | 0.0 | 0.0 | 0.979717 | 0.9712 |
| 3 | 4 | 0 | 128 | 0.005 | 0.0 | 0.0 | 0.963333 | 0.9592 |
| 4 | 5 | 0 | 128 | 0.010 | 0.0 | 0.0 | 0.978667 | 0.9718 |
| 5 | 6 | 0 | 128 | 0.015 | 0.0 | 0.0 | 0.983650 | 0.9744 |
Display the hyperparameters with the maximum test accuracy:
print("Best hyperparameters with %.2f%% test accuracy:" % (data_frame['accuracy_test'].max().values * 100,))
data_frame.loc[data_frame['accuracy_test'].idxmax()][vary.keys()]
Best hyperparameters with 97.44% test accuracy:
| | n_neurons | learning_rate |
|---|---|---|
| 5 | 128 | 0.015 |
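The idxmax-based selection can be illustrated on a plain pandas DataFrame. This is a standalone sketch using the values from the table above; the real collation result has an extra column index level, which is why the tutorial code calls .values on the maximum:

```python
import pandas as pd

df = pd.DataFrame({
    "n_neurons":     [64, 64, 64, 128, 128, 128],
    "learning_rate": [0.005, 0.01, 0.015, 0.005, 0.01, 0.015],
    "accuracy_test": [0.9544, 0.9653, 0.9712, 0.9592, 0.9718, 0.9744],
})

# idxmax gives the row index of the highest test accuracy;
# .loc then returns the hyperparameters of that row
best = df.loc[df["accuracy_test"].idxmax()]
print(int(best["n_neurons"]), best["learning_rate"])
```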
To run the example script on a remote host, a number of changes must be made. Ensure the remote host is defined in machines.yml in your FabSim3 directory, as well as the user login information. Assuming we'll run the ensemble on the Eagle supercomputer at the Poznan Supercomputing and Networking Center, the entry in machines_user.yml could look similar to the following:
eagle_vecma:
  username: "<your_username>"
  home_path_template: "/tmp/lustre/<your_username>"
  budget: "plgvecma2021"
  cores: 1
  # job wall time for each job, format Days-Hours:Minutes:Seconds
  job_wall_time: "0-0:59:00"    # job wall time for each single job without PJ
  PJ_size: "1"                  # number of requested nodes for PJ
  PJ_wall_time: "0-00:59:00"    # job wall time for PJ
  modules:
    loaded: ["python/3.7.3"]
    unloaded: []
Here:

home_path_template: the remote root directory for FabSim3, such that for instance the results on the remote machine will be stored in home_path_template/FabSim3/results.

budget: the name of the computational budget that you are allowed to use.

cores: the number of cores to use per run. Our simple Keras script just needs a single core, but applications that already have some built-in parallelism will require more cores.

job_wall_time: a time limit per run, used without the QCG PilotJob framework.

PJ_size: the number of requested nodes, used with the QCG PilotJob framework.

PJ_wall_time: a total time limit, used with the QCG PilotJob framework.

To automatically set up the ssh keys, and prevent having to log in manually for every sample, run the following from the command line:
fabsim eagle_vecma setup_ssh_keys
Once the remote machine is properly set up, we can just set:
# Use QCG PilotJob or not
PILOT_JOB = False
# machine to run ensemble on
MACHINE = "eagle_vecma"
If you now re-run the example script, the ensemble will execute on the remote host, submitting each run as a separate job. By setting PILOT_JOB=True
, all runs will be packaged in a single job.