Licensed under the MIT License.

Testing different Hyperparameters and Benchmarking¶

In this notebook, we'll cover how to test different hyperparameters for a particular dataset and how to benchmark different parameters across a group of datasets. This notebook reuses functionality that was introduced and described in the classification/notebooks/11_exploring_hyperparameters.ipynb notebook. Please refer to that notebook for the explanations, which are not repeated here.

For an example of how to scale up with remote GPU clusters on Azure Machine Learning, please view 24_exploring_hyperparameters_on_azureml.ipynb.

Testing hyperparameters¶

Ensure edits to libraries are loaded and plotting is shown in the notebook.

In [1]:

%reload_ext autoreload
%autoreload 2
%matplotlib inline

We start by importing the utilities we need.

In [2]:

import sys
import numpy as np
import scrapbook as sb
import torch
import fastai
from fastai.vision import DatasetType

sys.path.append("../../")
from utils_cv.classification.data import Urls
from utils_cv.common.data import unzip_url
from utils_cv.classification.parameter_sweeper import ParameterSweeper, clean_sweeper_df, plot_sweeper_df
from utils_cv.similarity.data import comparative_set_builder
from utils_cv.similarity.metrics import positive_image_ranks
from utils_cv.similarity.model import compute_features_learner

fastai.__version__

Out[2]:

'1.0.48'

Define the datasets and parameters we will use in this notebook.

In [3]:

DATA_PATHS = [unzip_url(Urls.fridge_objects_path, exist_ok=True), unzip_url(Urls.fridge_objects_watermark_path, exist_ok=True)]
REPS = 3
LEARNING_RATES = [1e-3, 1e-4, 1e-5]
IM_SIZES = [300, 500]
EPOCHS = [16]
DROPOUTS = [0]  # Leave dropout at zero. Higher values tend to perform significantly worse
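To get a feel for the size of the sweep before starting it, the sketch below counts the parameter permutations and training runs implied by these constants. It assumes that one model is trained per permutation, dataset and repetition, which is consistent with the 36 result rows shown later in this notebook. The lowercase variable names are only illustrative copies of the constants above.

```python
from itertools import product

# Illustrative copies of the constants defined in the cell above
learning_rates = [1e-3, 1e-4, 1e-5]
im_sizes = [300, 500]
epochs = [16]
dropouts = [0]
num_datasets = 2  # fridgeObjects and fridgeObjectsWatermark
reps = 3

# Every combination of the swept values is one parameter permutation
permutations = list(product(learning_rates, im_sizes, epochs, dropouts))

# Assumption: one training run per permutation, dataset and repetition
total_runs = len(permutations) * num_datasets * reps

print(f"{len(permutations)} permutations, {total_runs} training runs")
```

Adding a value to any of the lists multiplies the number of runs, so it is worth checking this count before launching a long sweep.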

Similarity accuracy metric¶

For image classification, we used the percentage of correctly labeled images to measure accuracy. For image retrieval, our measure is the rank of the positive example among a large number of negatives, so lower values are better and a rank of 1 means the positive example was the closest match. This was described in the 01_training_and_evaluation_introduction.ipynb notebook, and we will re-use some of the code from that notebook in the definition of the retrieval_rank() function below.

In [4]:

def retrieval_rank(learn):
    data = learn.data

    # Build multiple sets of comparative images from the validation images
    comparative_sets = comparative_set_builder(
        data.valid_ds, num_sets=1000, num_negatives=99
    )

    # Use penultimate layer as image representation
    embedding_layer = learn.model[1][-2]
        
    # Compute DNN features for all validation images
    valid_features = compute_features_learner(
        data, DatasetType.Valid, learn, embedding_layer
    )
    assert len(list(valid_features.values())[0]) == 512

    # For each comparative set compute the distances between the query image and all reference images
    for cs in comparative_sets:
        cs.compute_distances(valid_features)

    # Compute the median rank of the positive example over all comparative sets
    ranks = positive_image_ranks(comparative_sets)
    median_rank = np.median(ranks)
    return median_rank
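To make the metric concrete, here is a minimal NumPy sketch of how a rank can be derived from distances and then summarized by the median. The helper positive_rank and the toy distances are hypothetical and are not the utils_cv implementation; in particular, ties may be handled differently there.

```python
import numpy as np

def positive_rank(pos_distance, neg_distances):
    """Hypothetical helper: 1-based rank of the positive example.

    The rank is 1 plus the number of negatives that are strictly
    closer to the query than the positive example is.
    """
    neg_distances = np.asarray(neg_distances)
    return int(1 + np.sum(neg_distances < pos_distance))

# Toy comparative sets: (distance to positive, distances to negatives)
toy_sets = [
    (0.2, [0.5, 0.9, 0.7]),  # positive is closest
    (0.6, [0.5, 0.9, 0.7]),  # one negative is closer
    (1.0, [0.5, 0.9, 0.7]),  # all three negatives are closer
]

ranks = [positive_rank(pos, negs) for pos, negs in toy_sets]
median_rank = float(np.median(ranks))
print(ranks, median_rank)
```

With 99 negatives per comparative set, as in retrieval_rank() above, the rank ranges from 1 (best) to 100 (worst).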

Using Python ¶

We start by creating the ParameterSweeper object. Before we start testing, it's a good idea to see what the default parameters are. We can use the parameters property to see those default values.

In [5]:

sweeper = ParameterSweeper(metric_name="rank")
sweeper.parameters

Out[5]:

OrderedDict([('learning_rate', [0.0001]),
             ('epochs', [15]),
             ('batch_size', [16]),
             ('im_size', [299]),
             ('architecture',
              [<Architecture.resnet18: functools.partial(<function resnet18 at 0x7f443ed99598>)>]),
             ('transform', [True]),
             ('dropout', [0.5]),
             ('weight_decay', [0.01]),
             ('training_schedule',
              [<TrainingSchedule.head_first_then_body: 'head_first_then_body'>]),
             ('discriminative_lr', [False]),
             ('one_cycle_policy', [True])])

Now that we know the defaults, we can pass it the parameters we want to test, and run the parameter sweep.

In [6]:

sweeper.update_parameters(learning_rate=LEARNING_RATES, im_size=IM_SIZES, epochs=EPOCHS, dropout=DROPOUTS)
df = sweeper.run(datasets=DATA_PATHS, reps=REPS, metric_fct=retrieval_rank)
df

this Learner object self-destroyed - it still exists, but no longer usable

Out[6]:

			duration	rank
0	PARAMETERS [learning_rate: 0.0001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 300]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	27.527227	3.0
		fridgeObjectsWatermark	29.327206	7.0
	PARAMETERS [learning_rate: 0.0001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 500]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	46.494100	7.0
		fridgeObjectsWatermark	40.436677	9.0
	PARAMETERS [learning_rate: 0.001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 300]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	29.745073	2.0
		fridgeObjectsWatermark	28.454158	1.0
	PARAMETERS [learning_rate: 0.001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 500]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	44.277393	1.0
		fridgeObjectsWatermark	40.866518	1.0
	PARAMETERS [learning_rate: 1e-05]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 300]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	28.009722	20.0
		fridgeObjectsWatermark	29.721222	22.0
	PARAMETERS [learning_rate: 1e-05]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 500]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	40.376158	25.0
		fridgeObjectsWatermark	42.627545	34.0
1	PARAMETERS [learning_rate: 0.0001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 300]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	30.931857	4.0
		fridgeObjectsWatermark	26.125927	8.5
	PARAMETERS [learning_rate: 0.0001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 500]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	46.117437	11.0
		fridgeObjectsWatermark	40.555442	10.0
	PARAMETERS [learning_rate: 0.001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 300]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	29.870988	1.0
		fridgeObjectsWatermark	25.864497	1.0
	PARAMETERS [learning_rate: 0.001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 500]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	46.807896	1.0
		fridgeObjectsWatermark	41.351353	1.0
	PARAMETERS [learning_rate: 1e-05]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 300]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	25.873023	26.0
		fridgeObjectsWatermark	25.889981	26.0
	PARAMETERS [learning_rate: 1e-05]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 500]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	41.558083	23.0
		fridgeObjectsWatermark	41.196609	30.0
2	PARAMETERS [learning_rate: 0.0001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 300]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	25.954923	3.0
		fridgeObjectsWatermark	25.766089	4.0
	PARAMETERS [learning_rate: 0.0001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 500]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	40.162561	9.0
		fridgeObjectsWatermark	41.274331	9.0
	PARAMETERS [learning_rate: 0.001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 300]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	26.026493	3.0
		fridgeObjectsWatermark	25.616917	1.0
	PARAMETERS [learning_rate: 0.001]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 500]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	40.691592	1.0
		fridgeObjectsWatermark	40.468641	1.0
	PARAMETERS [learning_rate: 1e-05]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 300]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	26.239308	19.0
		fridgeObjectsWatermark	26.347744	23.0
	PARAMETERS [learning_rate: 1e-05]\|[epochs: 16]\|[batch_size: 16]\|[im_size: 500]\|[arch: resnet18]\|[transforms: True]\|[dropout: 0]\|[weight_decay: 0.01]\|[training_schedule: head_first_then_body]\|[discriminative_lr: False]\|[one_cycle_policy: True]	fridgeObjects	41.121185	33.0
		fridgeObjectsWatermark	40.931316	30.0

Visualize Results ¶

When we read the multi-index dataframe, index 0 represents the run number, index 1 represents a single permutation of parameters, and index 2 represents the dataset. To make the results easier to read, we pass the dataframe through the clean_sweeper_df helper function, which keeps only the hyperparameters that vary between permutations and displays them in a more readable way.

In [7]:

df = clean_sweeper_df(df)

Since we've run our benchmarking over 3 repetitions, we may want to just look at the averages across the different run numbers.

In [8]:

df.mean(level=(1,2)).T

Out[8]:

	P: [learning_rate: 0.0001] [im_size: 300]		P: [learning_rate: 0.0001] [im_size: 500]		P: [learning_rate: 0.001] [im_size: 300]		P: [learning_rate: 0.001] [im_size: 500]		P: [learning_rate: 1e-05] [im_size: 300]		P: [learning_rate: 1e-05] [im_size: 500]
	fridgeObjects	fridgeObjectsWatermark	fridgeObjects	fridgeObjectsWatermark	fridgeObjects	fridgeObjectsWatermark	fridgeObjects	fridgeObjectsWatermark	fridgeObjects	fridgeObjectsWatermark	fridgeObjects	fridgeObjectsWatermark
duration	28.138002	27.073074	44.258033	40.755483	28.547518	26.645191	43.925627	40.895504	26.707351	27.319649	41.018475	41.585157
rank	3.333333	6.500000	9.000000	9.333333	2.000000	1.000000	1.000000	1.000000	21.666667	23.666667	27.000000	31.333333
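The level argument of DataFrame.mean used above was removed in pandas 2.0, so on newer pandas versions the same aggregation can be written with groupby. The sketch below shows this on a small made-up dataframe with the same three index levels; the index names, parameter labels and values are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy multi-index mirroring the sweeper output: (run, parameters, dataset)
index = pd.MultiIndex.from_product(
    [[0, 1], ["P: lr=0.001", "P: lr=0.0001"], ["datasetA", "datasetB"]],
    names=["rep", "params", "dataset"],  # hypothetical level names
)
toy = pd.DataFrame(
    {"rank": [1.0, 2.0, 5.0, 7.0, 3.0, 2.0, 3.0, 9.0]}, index=index
)

# Equivalent of toy.mean(level=(1, 2)): average over the repetitions
per_dataset = toy.groupby(level=[1, 2]).mean()

# Equivalent of toy.mean(level=1): average over repetitions and datasets
per_params = toy.groupby(level=1).mean()

print(per_dataset)
print(per_params)
```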

Plot the average rank over the different runs for each dataset independently.

In [9]:

ax = df.mean(level=(1,2))["rank"].unstack().plot(kind='bar', figsize=(12, 6))

Additionally, we may simply want to see which set of hyperparameters performs best across the different datasets. We can do that by averaging the results of the different datasets.

In [10]:

df.mean(level=(1)).T

Out[10]:

	P: [learning_rate: 0.0001] [im_size: 300]	P: [learning_rate: 0.0001] [im_size: 500]	P: [learning_rate: 0.001] [im_size: 300]	P: [learning_rate: 0.001] [im_size: 500]	P: [learning_rate: 1e-05] [im_size: 300]	P: [learning_rate: 1e-05] [im_size: 500]
duration	27.605538	42.506758	27.596354	42.410565	27.013500	41.301816
rank	4.916667	9.166667	1.500000	1.000000	22.666667	29.166667
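Because a lower rank is better, the best permutation is the one with the smallest average rank. As a small sketch, the average ranks from the table above are copied by hand into a pandas Series and the winner is selected with idxmin; in the notebook itself the same call could be applied to the "rank" column of the averaged dataframe.

```python
import pandas as pd

# Average ranks copied from the output table above
mean_rank = pd.Series(
    {
        "P: [learning_rate: 0.0001] [im_size: 300]": 4.916667,
        "P: [learning_rate: 0.0001] [im_size: 500]": 9.166667,
        "P: [learning_rate: 0.001] [im_size: 300]": 1.500000,
        "P: [learning_rate: 0.001] [im_size: 500]": 1.000000,
        "P: [learning_rate: 1e-05] [im_size: 300]": 22.666667,
        "P: [learning_rate: 1e-05] [im_size: 500]": 29.166667,
    }
)

# Lower rank is better, so the best permutation has the minimum value
best_params = mean_rank.idxmin()
print(best_params, mean_rank[best_params])
```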

To make it easier to see which permutation did best, we can plot the results using the plot_sweeper_df helper function. Since the metric here is the retrieval rank, this plot helps us see which parameters give the lowest (best) ranks.

In [11]:

plot_sweeper_df(df.mean(level=(1)), sort_by="rank")

In [12]:

# Preserve some of the notebook outputs
sb.glue("nr_elements", len(df))
sb.glue("ranks", list(df.mean(level=(1))["rank"]))
sb.glue("max_duration", df.max().duration)
sb.glue("min_duration", df.min().duration)