Licensed under the MIT License.

Testing different Hyperparameters and Benchmarking

In this notebook, we'll cover how to test different hyperparameters for a particular dataset and how to benchmark different parameters across a group of datasets using AzureML. We assume familiarity with the basic concepts and parameters, which are discussed in the 01_training_introduction.ipynb, 02_multilabel_classification.ipynb and 03_training_accuracy_vs_speed.ipynb notebooks.

Similar to 11_exploring_hyperparameters.ipynb, we will learn more about how different learning rates and different image sizes affect our model's accuracy when restricted to 16 epochs, and we want to build an AzureML experiment to test out these hyperparameters.

We will be using a ResNet18 model to classify a set of images into 4 categories: 'can', 'carton', 'milk_bottle', 'water_bottle'. We will then conduct hyperparameter tuning to find the best set of parameters for this model. For this, we present an overall process for using AzureML, specifically its Hyperdrive component, to run this tuning in parallel (and not successively). We demonstrate the following key steps:

Configure AzureML Workspace
Create Remote Compute Target (GPU cluster)
Prepare Data
Prepare Training Script
Setup and Run Hyperdrive Experiment
Model Import, Re-train and Test

For key concepts of AzureML see this tutorial on model training and evaluation.

In [1]:

import os
import sys
sys.path.append("../../")

import fastai
from fastai.vision import *
import scrapbook as sb

import azureml.core
from azureml.core import Workspace, Experiment
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
import azureml.data
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import (
    RandomParameterSampling, BanditPolicy, HyperDriveConfig, PrimaryMetricGoal, choice, uniform
)
import azureml.widgets as widgets

from utils_cv.classification.data import Urls
from utils_cv.common.data import unzip_url

Ensure edits to libraries are loaded and plotting is shown in the notebook.

In [2]:

%reload_ext autoreload
%autoreload 2
%matplotlib inline

We now define some parameters which will be used in this notebook:

In [3]:

# Azure resources
subscription_id = "YOUR_SUBSCRIPTION_ID"
resource_group = "YOUR_RESOURCE_GROUP_NAME"  
workspace_name = "YOUR_WORKSPACE_NAME"  
workspace_region = "YOUR_WORKSPACE_REGION" #Possible values eastus, eastus2, etc.

# Choose a size for our cluster and the maximum number of nodes
VM_SIZE = "STANDARD_NC6" #"STANDARD_NC6S_V3"
MAX_NODES = 12

# Hyperparameter search space
IM_SIZES = [150, 300]
LEARNING_RATE_MAX = 1e-3
LEARNING_RATE_MIN = 1e-5
MAX_TOTAL_RUNS = 10 #Set to higher value to test more parameter combinations

# Image data
DATA = unzip_url(Urls.fridge_objects_path, exist_ok=True)

1. Configure AzureML workspace

Below we set up (or load an existing) AzureML workspace and print its details. Note that the resource group and workspace will get created if they do not yet exist. For more information regarding the AzureML workspace, see also the 20_azure_workspace_setup.ipynb notebook.

To simplify clean-up (see end of this notebook), we recommend creating a new resource group to run this notebook.

In [ ]:

from utils_cv.common.azureml import get_or_create_workspace

ws = get_or_create_workspace(
        subscription_id,
        resource_group,
        workspace_name,
        workspace_region)

# Print the workspace attributes
print('Workspace name: ' + ws.name, 
      'Workspace region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

2. Create Remote Target

We create a GPU cluster as our remote compute target. If a cluster with the same name already exists in our workspace, the script will load it instead. This link provides more information about how to set up a compute target on different locations.

By default, the VM size is set to use STANDARD_NC6 machines. However, if quota is available, our recommendation is to use STANDARD_NC6S_V3 machines which come with the much faster V100 GPU. We set the minimum number of nodes to zero so that the cluster won't incur additional compute charges when not in use.

In [5]:

CLUSTER_NAME = "gpu-cluster"

try:
    # Retrieve if a compute target with the same cluster name already exists
    compute_target = ComputeTarget(workspace=ws, name=CLUSTER_NAME)
    print('Found existing compute target.')
    
except ComputeTargetException:
    # If it doesn't already exist, we create a new one with the name provided
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size=VM_SIZE,
                                                           min_nodes=0,
                                                           max_nodes=MAX_NODES)

    # create the cluster
    compute_target = ComputeTarget.create(ws, CLUSTER_NAME, compute_config)
    compute_target.wait_for_completion(show_output=True)

# we can use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Creating a new compute target...
Creating
Succeeded
AmlCompute wait for completion finished
Minimum number of nodes requested have been provisioned
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-08-06T15:57:12.457000+00:00', 'errors': None, 'creationTime': '2019-08-06T15:56:43.315467+00:00', 'modifiedTime': '2019-08-06T15:57:25.740370+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 12, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_NC6'}

3. Prepare data

In this notebook, we'll use the Fridge Objects dataset, which is already stored in the correct format. We then upload our data to the AzureML workspace.

In [ ]:

# Retrieve the default datastore that was automatically created when we set up the workspace
ds = ws.get_default_datastore()

# We now upload the data to the 'data' folder of this datastore
ds.upload(
    src_dir=os.path.dirname(DATA),
    target_path='data',
    overwrite=True, # with "overwrite=True", if this data already exists on the Azure blob storage, it will be overwritten
    show_progress=True
)

Here's where you can see the data in your portal:

Datastore screenshot for Hyperdrive notebook run
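
As a rough illustration of how the uploaded files are laid out, the sketch below maps a local file under `src_dir` to its expected location under `target_path`, assuming the upload preserves relative paths. The helper `remote_blob_path`, the `/tmp/data` directory, and the file name `1.jpg` are hypothetical and used only for illustration; the sketch does not call the AzureML SDK.

```python
import posixpath


def remote_blob_path(local_file, src_dir, target_path):
    """Hypothetical helper: where a file under src_dir should end up below target_path."""
    relative = posixpath.relpath(local_file, start=src_dir)
    return posixpath.join(target_path, relative)


# Hypothetical local layout; DATA would then be '/tmp/data/fridgeObjects'
src_dir = '/tmp/data'
local_file = '/tmp/data/fridgeObjects/can/1.jpg'

remote_path = remote_blob_path(local_file, src_dir, 'data')
print(remote_path)
```

Under this assumption, the images sit below 'data/fridgeObjects' in the datastore, which matches the '/data/fridgeObjects' suffix that the training script appends to the mounted datastore path.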

4. Prepare training script

The next step is to prepare the script that AzureML Hyperdrive will use to train and evaluate models with the selected hyperparameters.

In [7]:

# creating a folder for the training script here
script_folder = os.path.join(os.getcwd(), "hyperdrive")
os.makedirs(script_folder, exist_ok=True)

In [8]:

%%writefile $script_folder/train.py

import argparse
import numpy as np
import os
import sys

import fastai
from fastai.vision import *
from fastai.vision.data import *

from azureml.core import Run

run = Run.get_context()


#------------------------------------------------------------------
# Define parameters that we are going to use for training
ARCHITECTURE = models.resnet18
EPOCHS_HEAD = 4
EPOCHS_BODY = 12
BATCH_SIZE = 16
#------------------------------------------------------------------


# Parse arguments passed by Hyperdrive
parser = argparse.ArgumentParser()

# Data path
parser.add_argument('--data-folder', type=str, dest='DATA_DIR', help="Datastore path")
parser.add_argument('--im_size', type=int, dest='IM_SIZE')
parser.add_argument('--learning_rate', type=float, dest='LEARNING_RATE')
args = parser.parse_args()
params = vars(args)

if params['IM_SIZE'] is None:
    raise ValueError("Image Size empty")
if params['LEARNING_RATE'] is None:
    raise ValueError("Learning Rate empty")
if params['DATA_DIR'] is None:
    raise ValueError("Data folder empty")

# Getting training and validation data
path = params['DATA_DIR'] + '/data/fridgeObjects'
data = (ImageList.from_folder(path)
        .split_by_rand_pct(valid_pct=0.5, seed=10)
        .label_from_folder() 
        .transform(size=params['IM_SIZE']) 
        .databunch(bs=BATCH_SIZE) 
        .normalize(imagenet_stats))

# Get model and run training
learn = cnn_learner(
    data,
    ARCHITECTURE,
    metrics=[accuracy]
)
learn.fit_one_cycle(EPOCHS_HEAD, params['LEARNING_RATE'])
learn.unfreeze()
learn.fit_one_cycle(EPOCHS_BODY, params['LEARNING_RATE'])

# Add log entries
training_losses = [x.numpy().ravel()[0] for x in learn.recorder.losses]
accuracy = [100*x[0].numpy().ravel()[0] for x in learn.recorder.metrics][-1]
run.log('data_dir',params['DATA_DIR'])
run.log('im_size', params['IM_SIZE'])
run.log('learning_rate', params['LEARNING_RATE'])
run.log('accuracy', float(accuracy))  # Logging our primary metric 'accuracy'

# Save trained model
current_directory = os.getcwd()
output_folder = os.path.join(current_directory, 'outputs')
model_name = 'im_classif_resnet'  # Name we will give our model both locally and on Azure
os.makedirs(output_folder, exist_ok=True)
learn.export(os.path.join(output_folder, model_name + ".pkl"))

Overwriting C:\Users\pabuehle\Desktop\ComputerVision\classification\notebooks\hyperparameter/train.py
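
To check the argument handling locally before submitting any runs, a small sketch like the following can help. It rebuilds the same three arguments as train.py, but marks them as required so that argparse itself reports a missing value (a design variation, not what the script above does). The function `build_parser` and the example argument values are hypothetical.

```python
import argparse


def build_parser():
    """Same three arguments as train.py, marked required so argparse reports omissions."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='DATA_DIR', required=True, help="Datastore path")
    parser.add_argument('--im_size', type=int, dest='IM_SIZE', required=True)
    parser.add_argument('--learning_rate', type=float, dest='LEARNING_RATE', required=True)
    return parser


# Hypothetical argument list, shaped like the one a child run receives
example_args = ['--data-folder', '/mnt/datastore', '--im_size', '150', '--learning_rate', '0.0005']
params = vars(build_parser().parse_args(example_args))
print(params)
```

Because of the `type=` settings, the image size arrives as an int and the learning rate as a float, so the training code can use them directly.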

5. Set up and run Hyperdrive experiment

5.1 Create Experiment

An Experiment is the main entry point for experimenting with AzureML. To create a new Experiment or get an existing one, we pass our experiment name 'hyperparameter-tuning'.

In [9]:

experiment_name = 'hyperparameter-tuning'
exp = Experiment(workspace=ws, name=experiment_name)

5.2 Define search space

Now we define the search space of hyperparameters. As shown below, to test discrete parameter values use 'choice()', and for uniform sampling use 'uniform()'. For more options, see Hyperdrive parameter expressions.

Hyperdrive provides three different parameter sampling methods: 'RandomParameterSampling', 'GridParameterSampling', and 'BayesianParameterSampling'. Details about each method can be found here. Here, we use the 'RandomParameterSampling'.
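
To build intuition for what random sampling does with this search space, here is a minimal pure-Python sketch. It does not use the AzureML SDK: `sample_configuration` is a hypothetical helper that imitates drawing one child-run configuration, with a uniformly sampled learning rate and an image size picked from a discrete list. The bounds mirror the parameters defined at the top of this notebook.

```python
import random

# Values mirror the notebook's search-space parameters
IM_SIZES = [150, 300]
LEARNING_RATE_MIN = 1e-5
LEARNING_RATE_MAX = 1e-3


def sample_configuration(rng):
    """Draw one hyperparameter combination, as random sampling would."""
    return {
        "--learning_rate": rng.uniform(LEARNING_RATE_MIN, LEARNING_RATE_MAX),
        "--im_size": rng.choice(IM_SIZES),
    }


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
samples = [sample_configuration(rng) for _ in range(10)]
for s in samples:
    print(s)
```

Each printed dictionary corresponds to the arguments one child run would receive; with only 10 draws, the learning-rate range is covered quite sparsely, which is why a larger number of runs helps.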

In [10]:

# Hyperparameter search space
param_sampling = RandomParameterSampling( {
        '--learning_rate': uniform(LEARNING_RATE_MIN, LEARNING_RATE_MAX),
        '--im_size': choice(IM_SIZES)
    }
)

early_termination_policy = BanditPolicy(slack_factor=0.15, evaluation_interval=1, delay_evaluation=20)
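
The slack factor used by the Bandit policy can be illustrated with a few lines of plain Python. The sketch below assumes a metric that is maximized: a run is cancelled when its metric falls below the best metric divided by (1 + slack_factor). The functions `bandit_cutoff` and `should_terminate` and the accuracy values are hypothetical, and the sketch leaves out the `evaluation_interval` and `delay_evaluation` settings, which control when the check is applied.

```python
def bandit_cutoff(best_metric, slack_factor):
    """Lowest metric value a run may report without being cancelled (maximize goal)."""
    return best_metric / (1 + slack_factor)


def should_terminate(run_metric, best_metric, slack_factor=0.15):
    """True if the run falls outside the allowed slack relative to the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


best_so_far = 92.0  # hypothetical best accuracy at some evaluation interval
print(round(bandit_cutoff(best_so_far, 0.15), 2))
print(should_terminate(85.0, best_so_far))
print(should_terminate(75.0, best_so_far))
```

With a slack factor of 0.15 and a best accuracy of 92%, the cutoff is about 80%, so a run at 85% would continue while a run at 75% would be cancelled.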

AzureML Estimator is the building block for training. An Estimator encapsulates the training code and parameters, the compute resources and runtime environment for a particular training scenario. We create one for our experimentation with the dependencies our model requires as follows:

pip_packages=['fastai']
conda_packages=['scikit-learn']

In [11]:

script_params = {
    '--data-folder': ds.as_mount()
}

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                use_gpu=True,
                pip_packages=['fastai'],
                conda_packages=['scikit-learn'])

We now create a HyperDriveConfig object which includes information about parameter space sampling, termination policy, primary metric, estimator and the compute target to execute the experiment runs on. We feed the following parameters to it:

our estimator object that we created in the above cell
hyperparameter sampling method, in this case it is Random Parameter Sampling
early termination policy, in this case we use Bandit Policy
primary metric name reported by our runs, in this case it is accuracy
the goal, which determines whether the primary metric has to be maximized/minimized, in this case it is to maximize our accuracy
number of total child-runs

The larger the search space, the more child-runs are needed to cover it well and find a good configuration.

In [12]:

hyperdrive_run_config = HyperDriveConfig(estimator=est,
                                         hyperparameter_sampling=param_sampling,
                                         policy=early_termination_policy,
                                         primary_metric_name='accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=MAX_TOTAL_RUNS,
                                         max_concurrent_runs=MAX_NODES)

5.3 Run Experiment

In [13]:

# Now we submit the Run to our experiment. 
hyperdrive_run = exp.submit(config=hyperdrive_run_config)

# We can see the experiment progress from this notebook by using 
widgets.RunDetails(hyperdrive_run).show()

_HyperDriveWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO'…

In [14]:

hyperdrive_run.wait_for_completion()

Out[14]:

{'runId': 'hyperparameter-tuning_1565107066432',
 'target': 'gpu-cluster',
 'status': 'Completed',
 'startTimeUtc': '2019-08-06T15:57:46.90426Z',
 'endTimeUtc': '2019-08-06T16:13:21.185098Z',
 'properties': {'primary_metric_config': '{"name": "accuracy", "goal": "maximize"}',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive',
  'platform': 'AML',
  'baggage': 'eyJvaWQiOiAiNWFlYTJmMzAtZjQxZC00ZDA0LWJiOGUtOWU0NGUyZWQzZGQ2IiwgInRpZCI6ICI3MmY5ODhiZi04NmYxLTQxYWYtOTFhYi0yZDdjZDAxMWRiNDciLCAidW5hbWUiOiAiMDRiMDc3OTUtOGRkYi00NjFhLWJiZWUtMDJmOWUxYmY3YjQ2In0',
  'ContentSnapshotId': 'c662f56a-ff58-432e-b732-8a3bc6818778'},
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://pabuehlestorage1c7e31216.blob.core.windows.net/azureml/ExperimentRun/dcid.hyperparameter-tuning_1565107066432/azureml-logs/hyperdrive.txt?sv=2018-11-09&sr=b&sig=8D2gwxb%2BYn7nbzgGVHE7QSzJ%2FG7C1swzmLD7%2Fior2vE%3D&st=2019-08-06T17%3A36%3A08Z&se=2019-08-07T01%3A46%3A08Z&sp=r'}}

Or we can check from the Azure portal with the URL we get by running

hyperdrive_run.get_portal_url()

To load an existing Hyperdrive Run instead of starting a new one, we can use

hyperdrive_run = azureml.train.hyperdrive.HyperDriveRun(exp, <your-run-id>, hyperdrive_run_config=hyperdrive_run_config)

We can also cancel the Run with

hyperdrive_run.cancel()

Once all the child-runs are finished, we can get the best run and the metrics.

In [15]:

# Get best run and print out metrics
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['arguments']
best_parameters = dict(zip(parameter_values[::2], parameter_values[1::2]))

print(f"* Best Run Id:{best_run.id}")
print(best_run)
print("\n* Best hyperparameters:")
print(best_parameters)
print(f"Accuracy = {best_run_metrics['accuracy']}")
#print("Learning Rate =", best_run_metrics['learning_rate'])

* Best Run Id:hyperparameter-tuning_1565107066432_8
Run(Experiment: hyperparameter-tuning,
Id: hyperparameter-tuning_1565107066432_8,
Type: azureml.scriptrun,
Status: Completed)

* Best hyperparameters:
{'--data-folder': '$AZUREML_DATAREFERENCE_workspaceblobstore', '--im_size': '150', '--learning_rate': '0.000552896672441507'}
Accuracy = 92.53731369972229
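
The `dict(zip(...))` step above can be tried out without a workspace. The sketch below uses the argument list from the example output and additionally converts the numeric values, since they are returned as strings. The variable names `best_im_size` and `best_learning_rate` are introduced here for illustration only.

```python
# Argument list copied from the example output above
parameter_values = [
    '--data-folder', '$AZUREML_DATAREFERENCE_workspaceblobstore',
    '--im_size', '150',
    '--learning_rate', '0.000552896672441507',
]

# Even positions hold the flag names, odd positions hold their values
best_parameters = dict(zip(parameter_values[::2], parameter_values[1::2]))

# Values arrive as strings, so convert the numeric ones before reusing them
best_im_size = int(best_parameters['--im_size'])
best_learning_rate = float(best_parameters['--learning_rate'])
print(best_im_size, best_learning_rate)
```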

6. Download and test the model

We can download the best model from the outputs/ folder and inspect it.

In [16]:

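# Download the best model into a local 'outputs' folder
current_directory = os.getcwd()
output_folder = os.path.join(current_directory, 'outputs')
os.makedirs(output_folder, exist_ok=True)

for f in best_run.get_file_names():
    if f.startswith('outputs/im_classif_resnet'):
        print("Downloading {}..".format(f))
        best_run.download_file(f, output_file_path=os.path.join(output_folder, os.path.basename(f)))

# The model was saved with fastai's learn.export(), so we load it with fastai's load_learner()
saved_model = load_learner(output_folder, 'im_classif_resnet.pkl')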

Downloading outputs/im_classif_resnet.pkl..

We can now use the retrieved best model to get predictions on unseen images, as done in the 03_training_accuracy_vs_speed.ipynb notebook, using

saved_model.predict(image)

7. Clean up

To avoid unnecessary expenses, all resources created in this notebook should be deleted once the parameter search has concluded. To simplify this clean-up step, we recommend creating a new resource group to run this notebook. This resource group can then be deleted, e.g. using the Azure Portal, which will remove all created resources.

In [ ]:

# Log some outputs using scrapbook which are used during testing to verify correct notebook execution
sb.glue("best_accuracy", best_run_metrics['accuracy'])