In this tutorial we will use the EasyVVUQ GridSampler to perform a grid search on the hyperparameters of a simple Keras neural network, trained to recognize hand-written digits. This is the famous MNIST data set, of which 4 input samples (images of size 28 x 28) are shown below. These are fed into a standard feed-forward neural network, which predicts the label 0-9.
The (Keras) neural network script is located in mnist/keras_mnist.template, which will form the input template for the EasyVVUQ encoder. We will assume you are familiar with the basic EasyVVUQ building blocks. If not, you can look at the basic tutorial.
We need EasyVVUQ, TensorFlow and the TensorFlow data sets to execute this tutorial. If you need to install these, uncomment the corresponding line below.
# !pip install easyvvuq
# !pip install tensorflow
# !pip install tensorflow_datasets
While running on the localhost, we will use the FabSim3 automation toolkit for the data-processing workflow, i.e. to move the UQ ensemble to/from the localhost. To connect EasyVVUQ with FabSim3, the FabUQCampaign plugin must be installed.
The advantage of this construction is that we could offload the ensemble to a remote supercomputer using this same script by simply changing the MACHINE='localhost' flag, provided that FabSim3 is set up on the remote resource.
For an example without FabSim3, see tutorials/hyperparameter_tuning_tutorial.ipynb.
For now, import the required libraries below. fabsim3_cmd_api is an interface with FabSim3 such that the command-line FabSim3 commands can be executed from a Python script. It is stored locally in fabsim3_cmd_api.py.
import easyvvuq as uq
import os
import numpy as np
############################################
# Import the FabSim3 commandline interface #
############################################
import fabsim3_cmd_api as fab
We now set some flags:
# Work directory, where the EasyVVUQ directory will be placed
WORK_DIR = '/tmp'
# machine to run ensemble on
MACHINE = "localhost"
# target output filename generated by the code
TARGET_FILENAME = 'output.csv'
# EasyVVUQ campaign name
CAMPAIGN_NAME = 'grid_test'
# FabSim3 config name
CONFIG = 'grid_search'
# Use QCG PilotJob or not
PILOT_JOB = False
Most of these are self-explanatory. Here, CONFIG is the name of the script that gets executed for each sample, in this case grid_search, which is located in FabUQCampaign/templates/grid_search. Its contents essentially just run our Python code hyper_param_tune.py:
cd $job_results
$run_prefix
/usr/bin/env > env.log
python3 hyper_param_tune.py
Here, hyper_param_tune.py is generated by the EasyVVUQ encoder, see below. The flag PILOT_JOB regulates the use of the QCG PilotJob mechanism. If True, FabSim will submit the ensemble to the (remote) host as a QCG PilotJob, which essentially means that all individual jobs of the ensemble get packaged into a single job allocation, thereby circumventing the limit on the maximum number of simultaneous jobs that is present on many supercomputers. For more info on the QCG PilotJob click here. In this example we'll run the samples on the localhost (see MACHINE), and hence we set PILOT_JOB=False.
As is standard in EasyVVUQ, we now define the parameter space. In this case it consists of 4 hyperparameters. There is one hidden layer with n_neurons neurons, and a Dropout layer after the input and hidden layer, with dropout probabilities dropout_prob_in and dropout_prob_hidden respectively. We made the learning_rate tunable as well.
params = {}
params["n_neurons"] = {"type":"integer", "default": 32}
params["dropout_prob_in"] = {"type":"float", "default": 0.0}
params["dropout_prob_hidden"] = {"type":"float", "default": 0.0}
params["learning_rate"] = {"type":"float", "default": 0.001}
These 4 hyperparameters appear as flags in the input template mnist/keras_mnist.template. Typically such a template is generated from an input file used by some simulation code. In this case, however, mnist/keras_mnist.template is directly our Python script, with the hyperparameters replaced by flags. For instance:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dropout($dropout_prob_in),
    tf.keras.layers.Dense($n_neurons, activation='relu'),
    tf.keras.layers.Dropout($dropout_prob_hidden),
    tf.keras.layers.Dense(10)
])
is simply the neural-network construction part, with flags for the dropout probabilities and the number of neurons in the hidden layer. The encoder reads the flags, replaces them with numeric values, and subsequently writes the result to target_filename=hyper_param_tune.py:
encoder = uq.encoders.GenericEncoder('./mnist/keras_mnist.template', target_filename='hyper_param_tune.py')
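The flag substitution performed by the encoder behaves much like Python's built-in string.Template: every $flag in the template is replaced by the value drawn for that sample. A simplified standalone sketch (not the actual EasyVVUQ implementation):

```python
from string import Template

# one line from the input template, containing a $n_neurons flag
line = "tf.keras.layers.Dense($n_neurons, activation='relu'),"

# replace the flag with a numeric value, as the encoder does for every sample
encoded = Template(line).substitute(n_neurons=64)
print(encoded)  # tf.keras.layers.Dense(64, activation='relu'),
```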
Now we create the first set of EasyVVUQ actions
to create separate run directories and to encode the template:
# actions: create directories and encode input template, placing 1 hyper_param_tune.py file in each directory.
actions = uq.actions.Actions(
uq.actions.CreateRunDirectory(root=WORK_DIR, flatten=True),
uq.actions.Encode(encoder),
)
# create the EasyVVUQ main campaign object
campaign = uq.Campaign(
name=CAMPAIGN_NAME,
work_dir=WORK_DIR,
)
# add the param definitions and actions to the campaign
campaign.add_app(
name=CAMPAIGN_NAME,
params=params,
actions=actions
)
As with the uncertainty-quantification (UQ) samplers, the vary dict is used to select which of the params we actually vary. Unlike the UQ samplers, we do not specify an input probability distribution. This being a grid search, we simply specify a list of values for each hyperparameter. Parameters not in vary, but with a flag in the template, will be given the default value specified in params.
vary = {"n_neurons": [64, 128], "learning_rate": [0.005, 0.01, 0.015]}
Note: we are mixing integers and floats in the vary dict. Other data types (string, boolean) can also be used.
The vary dict is passed to the Grid_Sampler. As can be seen, it creates a tensor product of all 1D points specified in vary. If a single tensor product is not useful (e.g. because it creates combinations of parameters that do not make sense), you can also pass a list of different vary dicts. For even more flexibility you can write the required parameter combinations to a CSV file and pass it to the CSV_Sampler instead.
# create an instance of the Grid Sampler
sampler = uq.sampling.Grid_Sampler(vary)
# Associate the sampler with the campaign
campaign.set_sampler(sampler)
# print the points
print("There are %d points:" % (sampler.n_samples()))
sampler.points
There are 6 points:
[array([[64, 0.005], [64, 0.01], [64, 0.015], [128, 0.005], [128, 0.01], [128, 0.015]], dtype=object)]
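The tensor product above can be reproduced in plain Python with itertools.product; this is only a standalone illustration of what the sampler computes, not EasyVVUQ internals:

```python
from itertools import product

vary = {"n_neurons": [64, 128], "learning_rate": [0.005, 0.01, 0.015]}

# all combinations of the 1D points: 2 x 3 = 6 grid points
points = list(product(*vary.values()))
print(len(points), points[0])  # 6 (64, 0.005)
```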
Run the actions (create the run directories, each containing a hyper_param_tune.py file):
###############################
# execute the defined actions #
###############################
campaign.execute().collate()
To run the ensemble, execute:
###################################################
# run the UQ ensemble using the FabSim3 interface #
###################################################
fab.run_uq_ensemble(CONFIG, campaign.campaign_dir, script='grid_search',
machine=MACHINE, PJ=PILOT_JOB)
# wait for job to complete
fab.wait(machine=MACHINE)
Executing fabsim localhost run_uq_ensemble:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,script=grid_search,skip=0,PJ=False
True
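Under the hood, fab.run_uq_ensemble shells out to the fabsim command line. Based on the log line shown above, the command it builds can be sketched with a small hypothetical helper (build_fabsim_cmd is not part of the fab API, it only illustrates the argument format):

```python
def build_fabsim_cmd(machine, config, campaign_dir, script, skip=0, pj=False):
    # mirrors the "fabsim <machine> run_uq_ensemble:<args>" call shown in the log
    args = f"{config},campaign_dir={campaign_dir},script={script},skip={skip},PJ={pj}"
    return f"fabsim {machine} run_uq_ensemble:{args}"

print(build_fabsim_cmd("localhost", "grid_search", "/tmp/grid_testrebm6ntq", "grid_search"))
# fabsim localhost run_uq_ensemble:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,script=grid_search,skip=0,PJ=False
```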
# check if all output files are retrieved from the remote machine, returns a Boolean flag
all_good = fab.verify(CONFIG, campaign.campaign_dir, TARGET_FILENAME, machine=MACHINE)
Executing fabsim localhost fetch_results Executing fabsim localhost verify_last_ensemble:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,target_filename=output.csv,machine=localhost
if all_good:
    # copy the results from the FabSim results dir to the EasyVVUQ results dir
    fab.get_uq_samples(CONFIG, campaign.campaign_dir, sampler.n_samples(), machine=MACHINE)
else:
    print("Not all samples executed correctly")
    import sys
    sys.exit()
Executing fabsim localhost get_uq_samples:grid_search,campaign_dir=/tmp/grid_testrebm6ntq,number_of_samples=6,skip=0
Briefly:

fab.run_uq_ensemble: submits the ensemble to the (remote) host for execution. Under the hood it uses the FabSim3 campaign2ensemble subroutine to copy the run directories from WORK_DIR to the FabSim3 SWEEP directory, located in config_files/grid_search/SWEEP. From there the ensemble is sent to the (remote) host.

fab.wait: checks the status of the jobs on the remote host every minute, and sleeps otherwise, halting further execution of the script. On the localhost this command doesn't do anything.

fab.verify: executes the verify_last_ensemble subroutine to see if the output file target_filename for each run in the SWEEP directory is present in the corresponding FabSim3 results directory. Returns a boolean flag. fab.verify will also call the FabSim fetch_results method, which actually retrieves the results from the (remote) host. So, if you want to just get the results without verifying the presence of output files, call fab.fetch_results(machine=MACHINE) instead. However, if something went wrong on the (remote) host, this will cause an error later on, since not all required output files will be transferred to the EasyVVUQ WORK_DIR.

fab.get_uq_samples: copies the samples from the (local) FabSim results directory to the (local) EasyVVUQ campaign directory. It will not delete the results from the FabSim results directory. If you want to save space, you can delete the results on the FabSim side (see the results directory in your FabSim home directory). You can also call fab.clear_results(machine, name_results_dir) to remove a specific FabSim results directory on a given machine.

If all_good == False, something went wrong on the (remote) host, and sys.exit() is called in our example, giving you the opportunity to investigate what went wrong. It can happen that a (small) number of jobs did not get executed on the remote host for some reason, whereas (most) jobs did execute successfully. In this case simply resubmitting the failed jobs could be an option:
fab.remove_succesful_runs(CONFIG, campaign.campaign_dir)
fab.resubmit_previous_ensemble(CONFIG, 'grid_search')
The first command removes from the SWEEP dir all successful run directories, i.e. those for which the output file TARGET_FILENAME has been found. For this to work, fab.verify must have been called first. Then, fab.resubmit_previous_ensemble simply resubmits the runs that are still present in the SWEEP directory, which by now only contains the failed runs. After the jobs have finished, call fab.verify again to see if TARGET_FILENAME is now present in the results directory for every run in the SWEEP dir.
Once we are sure we have all required output files, the role of FabSim is over, and we proceed with decoding the output files. In this case, our Python script wrote the training and test accuracy to a CSV file, hence we use the SimpleCSV decoder.
Note: it is also possible to use the more flexible HDF5 format, by using uq.decoders.HDF5 instead.
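For reference, the end of the templated training script could write this CSV along the following lines. This is a hypothetical sketch with placeholder metric values, so the actual mnist/keras_mnist.template may differ; the point is that the column names must match the output_columns passed to the decoder:

```python
import csv

# placeholder metrics; in the real script these would come from model.evaluate(...)
accuracy_train, accuracy_test = 0.98, 0.97

# write the output file the SimpleCSV decoder will read back
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["accuracy_train", "accuracy_test"])
    writer.writeheader()
    writer.writerow({"accuracy_train": accuracy_train,
                     "accuracy_test": accuracy_test})
```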
#############################################
# All output files are present, decode them #
#############################################
output_columns = ["accuracy_train", "accuracy_test"]
decoder = uq.decoders.SimpleCSV(
target_filename=TARGET_FILENAME,
output_columns=output_columns)
actions = uq.actions.Actions(
uq.actions.Decode(decoder),
)
campaign.replace_actions(CAMPAIGN_NAME, actions)
###########################
# Execute decoding action #
###########################
campaign.execute().collate()
data_frame = campaign.get_collation_result()
data_frame
| | run_id | iteration | n_neurons | learning_rate | dropout_prob_in | dropout_prob_hidden | accuracy_train | accuracy_test |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 64 | 0.005 | 0.0 | 0.0 | 0.959267 | 0.9544 |
| 1 | 2 | 0 | 64 | 0.010 | 0.0 | 0.0 | 0.974133 | 0.9653 |
| 2 | 3 | 0 | 64 | 0.015 | 0.0 | 0.0 | 0.979717 | 0.9712 |
| 3 | 4 | 0 | 128 | 0.005 | 0.0 | 0.0 | 0.963333 | 0.9592 |
| 4 | 5 | 0 | 128 | 0.010 | 0.0 | 0.0 | 0.978667 | 0.9718 |
| 5 | 6 | 0 | 128 | 0.015 | 0.0 | 0.0 | 0.983650 | 0.9744 |
Display the hyperparameters with the maximum test accuracy:
print("Best hyperparameters with %.2f%% test accuracy:" % (data_frame['accuracy_test'].max().values * 100,))
data_frame.loc[data_frame['accuracy_test'].idxmax()][vary.keys()]
Best hyperparameters with 97.44% test accuracy:
| | n_neurons | learning_rate |
|---|---|---|
| 5 | 128 | 0.015 |
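The idxmax-based selection can be illustrated on a plain pandas DataFrame. This is a standalone sketch using the values from the table above; the real collation result has an extra column index level, which is why the tutorial code calls .values on the maximum:

```python
import pandas as pd

df = pd.DataFrame({
    "n_neurons":     [64, 64, 64, 128, 128, 128],
    "learning_rate": [0.005, 0.01, 0.015, 0.005, 0.01, 0.015],
    "accuracy_test": [0.9544, 0.9653, 0.9712, 0.9592, 0.9718, 0.9744],
})

# idxmax gives the row index of the highest test accuracy;
# .loc then returns the hyperparameters of that row
best = df.loc[df["accuracy_test"].idxmax()]
print(int(best["n_neurons"]), best["learning_rate"])
```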
To run the example script on a remote host, a number of changes must be made. Ensure the remote host is defined in machines.yml in your FabSim3 directory, as well as the user login information. Assuming we'll run the ensemble on the Eagle supercomputer at the Poznan Supercomputing and Networking Center, the entry in machines_user.yml could look similar to the following:
eagle_vecma:
  username: "<your_username>"
  home_path_template: "/tmp/lustre/<your_username>"
  budget: "plgvecma2021"
  cores: 1
  # job wall time for each job, format Days-Hours:Minutes:Seconds
  job_wall_time: "0-0:59:00"    # job wall time for each single job without PJ
  PJ_size: "1"                  # number of requested nodes for PJ
  PJ_wall_time: "0-00:59:00"    # job wall time for PJ
  modules:
    loaded: ["python/3.7.3"]
    unloaded: []
Here:

home_path_template: the remote root directory for FabSim3, such that for instance the results on the remote machine will be stored in home_path_template/FabSim3/results.

budget: the name of the computational budget that you are allowed to use.

cores: the number of cores to use per run. Our simple Keras script just needs a single core, but applications that already have some built-in parallelism will require more cores.

job_wall_time: a time limit per run, used without the QCG PilotJob framework.

PJ_size: the number of requested nodes, used with the QCG PilotJob framework.

PJ_wall_time: a total time limit, used with the QCG PilotJob framework.

To automatically set up the ssh keys, and prevent having to log in manually for every sample, run the following from the command line:
fabsim eagle_vecma setup_ssh_keys
Once the remote machine is properly set up, we can just set:
# Use QCG PilotJob or not
PILOT_JOB = False
# machine to run ensemble on
MACHINE = "eagle_vecma"
If you now re-run the example script, the ensemble will execute on the remote host, submitting each run as a separate job. By setting PILOT_JOB=True
, all runs will be packaged in a single job.