Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
The goal of this notebook is to understand how to train a model with different parameters to achieve either a highly accurate model that is slow at inference time, or a model with fast inference but lower accuracy.
For example, in IoT settings the inferencing device has limited computational capabilities, which means we need to design our models to have a small memory footprint. In contrast, medical scenarios often require the highest possible accuracy because the cost of misclassification could impact the well-being of a patient; in this scenario, the accuracy of the model cannot be compromised.
We have conducted various experiments on diverse datasets to find parameters that work well across a wide variety of settings, whether the priority is high accuracy or fast inference. In this notebook, we provide these parameters so that your initial models can be trained without any parameter tuning. For most datasets, these parameters are close to optimal. In the second part of the notebook, we provide guidelines on how to fine-tune these parameters based on how they impact the model.
We recommend first training your model with the default parameters, evaluating the results, and then fine-tuning parameters to achieve better results as necessary.
Let's first verify our fast.ai version:
import fastai
fastai.__version__
'1.0.57'
Ensure edits to libraries are loaded and plotting is shown in the notebook.
%reload_ext autoreload
%autoreload 2
%matplotlib inline
Import all the functions we need.
import sys
sys.path.append("../../")
import os
from pathlib import Path
import scrapbook as sb
from fastai.metrics import accuracy
from fastai.vision import (
models, ImageList, imagenet_stats, cnn_learner, get_transforms, open_image, partial
)
from utils_cv.classification.data import Urls, is_data_multilabel
from utils_cv.classification.model import hamming_accuracy, TrainMetricsRecorder
from utils_cv.common.data import unzip_url
from utils_cv.common.gpu import db_num_workers, which_processor
print(f"Fast.ai version = {fastai.__version__}")
which_processor()
Fast.ai version = 1.0.57
Torch is using GPU: Tesla V100-PCIE-16GB
Now that we've set up our notebook, let's set the hyperparameters based on which model type was selected.
For most scenarios, computer vision practitioners want to create a high-accuracy model, a fast-inference model, or a small-size model. Set your MODEL_TYPE variable to one of the following: "high_accuracy", "fast_inference", or "small_size".
We will use the FridgeObjects dataset from a previous notebook again. You can replace the DATA_PATH variable with your own data.
When choosing the batch size, remember that even mid-level GPUs run out of memory when training a deeper ResNet model with larger image resolutions. If you get an out of memory error, try reducing the batch size by a factor of 2.
# Choose between "high_accuracy", "fast_inference", or "small_size"
MODEL_TYPE = "fast_inference"
# Path to your data
DATA_PATH = unzip_url(Urls.fridge_objects_path, exist_ok=True)
# Epochs to train for
EPOCHS_HEAD = 4
EPOCHS_BODY = 12
LEARNING_RATE = 1e-4
BATCH_SIZE = 16
# Set parameters based on your selected model.
assert MODEL_TYPE in ["high_accuracy", "fast_inference", "small_size"]

if MODEL_TYPE == "high_accuracy":
    ARCHITECTURE = models.resnet50
    IM_SIZE = 500

if MODEL_TYPE == "fast_inference":
    ARCHITECTURE = models.resnet18
    IM_SIZE = 300

if MODEL_TYPE == "small_size":
    ARCHITECTURE = models.squeezenet1_1
    IM_SIZE = 300
We'll automatically determine if your dataset is a multi-label or traditional (single-label) classification problem. To do so, we'll use the is_data_multilabel helper function. In order to detect whether or not a dataset is multi-label, the helper function will check to see if the datapath contains a csv file that has a column 'labels' where the values are space-delimited. You can inspect the function by calling is_data_multilabel??.
This function assumes that your multi-label dataset is structured in the recommended format shown in the multilabel notebook.
multilabel = is_data_multilabel(DATA_PATH)
metric = accuracy if not multilabel else hamming_accuracy
On systems with powerful GPUs, JPEG decoding can become a performance bottleneck that slows down training significantly. We recommend creating a down-sized copy of the dataset if training takes too long, or if you require multiple training runs to evaluate different parameters.
The following function will automate image downsizing.
from utils_cv.classification.data import downsize_imagelist
downsize_imagelist(
    im_list=ImageList.from_folder(Path(DATA_PATH)),
    out_dir="downsized_images",
    max_dim=IM_SIZE,
)
Once complete, update the DATA_PATH variable to point to out_dir so that this notebook uses these resized images.
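For example, a minimal sketch of that switch, assuming the cell above was run with out_dir set to "downsized_images":
# Use the down-sized copy created by downsize_imagelist above
DATA_PATH = "downsized_images"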
We'll now re-apply the same steps we did in the 01_training_introduction notebook here.
Load the data:
label_list = (
    (
        ImageList.from_folder(Path(DATA_PATH))
        .split_by_rand_pct(valid_pct=0.2, seed=10)
        .label_from_folder()
    )
    if not multilabel
    else (
        ImageList.from_csv(Path(DATA_PATH), "labels.csv", folder="images")
        .split_by_rand_pct(valid_pct=0.2, seed=10)
        .label_from_df(label_delim=" ")
    )
)
data = (
    label_list.transform(tfms=get_transforms(), size=IM_SIZE)
    .databunch(bs=BATCH_SIZE, num_workers=db_num_workers())
    .normalize(imagenet_stats)
)
Create the learner.
learn = cnn_learner(data, ARCHITECTURE, metrics=metric,
callback_fns=[partial(TrainMetricsRecorder, show_graph=True)])
Train the last layer for a few epochs. This can use a larger learning rate since most of the DNN is fixed.
learn.fit_one_cycle(EPOCHS_HEAD, 10 * LEARNING_RATE)
epoch | train_loss | valid_loss | train_accuracy | valid_accuracy | time |
---|---|---|---|---|---|
0 | 2.419261 | 1.657503 | 0.250000 | 0.307692 | 00:14 |
1 | 1.778618 | 0.761048 | 0.583333 | 0.653846 | 00:09 |
2 | 1.316656 | 0.523479 | 0.781250 | 0.769231 | 00:11 |
3 | 1.052359 | 0.491936 | 0.833333 | 0.807692 | 00:09 |
Unfreeze the layers.
learn.unfreeze()
Fine-tune the network for the remaining epochs.
learn.fit_one_cycle(EPOCHS_BODY, LEARNING_RATE)
epoch | train_loss | valid_loss | train_accuracy | valid_accuracy | time |
---|---|---|---|---|---|
0 | 0.367272 | 0.446086 | 0.843750 | 0.807692 | 00:11 |
1 | 0.401298 | 0.381388 | 0.854167 | 0.807692 | 00:11 |
2 | 0.326690 | 0.309591 | 0.906250 | 0.846154 | 00:11 |
3 | 0.280659 | 0.304625 | 0.906250 | 0.884615 | 00:11 |
4 | 0.268848 | 0.249857 | 0.927083 | 0.846154 | 00:11 |
5 | 0.229108 | 0.192554 | 0.968750 | 0.923077 | 00:11 |
6 | 0.208633 | 0.224482 | 0.979167 | 0.923077 | 00:10 |
7 | 0.191411 | 0.206568 | 0.968750 | 0.923077 | 00:10 |
8 | 0.169821 | 0.233692 | 0.979167 | 0.923077 | 00:09 |
9 | 0.157189 | 0.247892 | 0.989583 | 0.884615 | 00:05 |
10 | 0.146495 | 0.253730 | 0.979167 | 0.923077 | 00:04 |
11 | 0.131543 | 0.267920 | 0.989583 | 0.884615 | 00:04 |
In the 01_training_introduction notebook, we demonstrated evaluating a CV model using performance metrics such as precision, recall, and ROC. In this section, we will evaluate our model on the characteristics the three model types trade off: accuracy, inference speed, and model size.
To keep things simple, we just look at the final evaluation metric on the validation set.
_, validation_accuracy = learn.validate(learn.data.valid_dl, metrics=[metric])
print(f"{metric.__name__} on validation set: {float(validation_accuracy):2.2f}")
accuracy on validation set: 0.88
Time model inference speed.
im_folder = learn.data.classes[0] if not multilabel else 'images'
im = open_image(f"{(Path(DATA_PATH)/im_folder).ls()[0]}")
%%timeit
learn.predict(im)
23.1 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Export the model to inspect the size of the model file.
learn.export(f"{MODEL_TYPE}")
size_in_mb = os.path.getsize(Path(DATA_PATH)/MODEL_TYPE) / (1024*1024.)
print(f"'{MODEL_TYPE}' is {round(size_in_mb, 2)}MB.")
'fast_inference' is 44.77MB.
# Preserve some of the notebook outputs
training_accuracies = [x[0].numpy().ravel()[0] for x in learn.recorder.metrics]
sb.glue("training_accuracies", training_accuracies)
sb.glue("validation_accuracy", float(validation_accuracy))
If you use the default parameters we have provided, you can get good results across a wide variety of datasets. However, as in most machine learning projects, getting the best possible results for a new dataset often requires tuning the parameters further. The following section provides guidelines on optimizing for accuracy, inference speed, or model size for a given dataset. We'll go through the parameters that will make the largest impact on your model as well as the parameters that may not be worth modifying.
Generally speaking, models for image classification come with a trade-off between training time and model accuracy. The four parameters that have the biggest impact on this trade-off are the DNN architecture, image resolution, learning rate, and number of epochs. DNN architecture and image resolution additionally affect the model's inference time and memory footprint. As a rule of thumb, deeper networks with high image resolution achieve higher accuracy at the cost of larger model sizes and slower training and inference. Shallow networks with low image resolution result in models with fast inference, fast training, and small model sizes, at the cost of lower accuracy.
When choosing an architecture, we want to make sure it fits our requirements for accuracy, memory footprint, inference speed and training speeds. Some DNNs have hundreds of layers and end up with a large memory footprint and millions of parameters to tune, while others are compact and small enough to fit onto memory limited edge devices.
Let's take a squeezenet1_1 model, a resnet18 model, and a resnet50 model, and compare them in an experiment over a diverse set of six datasets. (More about the datasets in the appendix below.)
As you can see from the graph, there is a clear trade-off when deciding between the models.
In terms of accuracy, resnet50 outperforms the rest, but it also suffers from having the highest memory footprint, and the longest training and inference times. Alternatively, squeezenet1_1 performs the worst in terms of accuracy, but has the smallest memory footprint.
Generally speaking, given enough data, the deeper the DNN and the higher the image resolution, the higher the accuracy you'll be able to achieve with your model.
import pandas as pd
from utils_cv.classification.parameter_sweeper import add_value_labels
%matplotlib inline
df = pd.DataFrame(
    {
        "accuracy": [0.9472, 0.9190, 0.8251],
        "training_duration": [385.3, 280.5, 272.5],
        "inference_duration": [34.2, 27.8, 27.6],
        "memory": [99, 45, 4.9],
        "model": ["resnet50", "resnet18", "squeezenet1_1"],
    }
).set_index("model")

ax1, ax2, ax3, ax4 = df.plot.bar(
    rot=90, subplots=True, legend=False, figsize=(8, 10)
)

for ax in [ax1, ax2, ax3, ax4]:
    for i in [0, 1, 2]:
        if i == 0:
            ax.get_children()[i].set_color("r")
        if i == 1:
            ax.get_children()[i].set_color("g")
        if i == 2:
            ax.get_children()[i].set_color("b")
ax1.set_title("Accuracy (%)")
ax2.set_title("Training Duration (seconds)")
ax3.set_title("Inference Time (seconds)")
ax4.set_title("Memory Footprint (mb)")
ax1.set_ylabel("%")
ax2.set_ylabel("seconds")
ax3.set_ylabel("seconds")
ax4.set_ylabel("mb")
ax1.set_ylim(top=df["accuracy"].max() * 1.3)
ax2.set_ylim(top=df["training_duration"].max() * 1.3)
ax3.set_ylim(top=df["inference_duration"].max() * 1.3)
ax4.set_ylim(top=df["memory"].max() * 1.3)
add_value_labels(ax1, percentage=True)
add_value_labels(ax2)
add_value_labels(ax3)
add_value_labels(ax4)
This section examines some of the key parameters used when training a deep learning model for image classification. The table below shows the default parameters we recommend:
Parameter | Default Value |
---|---|
Learning Rate | 1e-4 |
Epochs | 15 |
Batch Size | 16 |
Image Size | 300 X 300 |
Learning rate
The learning rate, or step size, is used when optimizing your model with gradient descent and tends to be one of the most important parameters to set when training your model. If your learning rate is set too low, training will progress very slowly since we're only making tiny updates to the weights of the network. However, if your learning rate is too high, it can cause undesirable divergent behavior in the loss function. Generally speaking, a learning rate of 1e-4 has been shown to work well for most datasets. If you want to reduce training time (by training for fewer epochs), you can try setting the learning rate to 5e-3, but if you notice a spike in the training or validation loss, you may want to reduce the learning rate again.
The learning rate section of appendix below has more detail.
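For example, a minimal sketch of trading epochs for a larger maximum learning rate (the values below only illustrate the rule of thumb above; they are not tuned settings):
# Sketch: fewer epochs with a larger maximum learning rate;
# reduce the learning rate if the training or validation loss spikes.
learn.fit_one_cycle(3, 5e-3)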
Epochs
An epoch is one full pass of gradient descent over the entire training set. When it comes to choosing the number of epochs, a common question is: won't too many epochs cause overfitting? It turns out that the accuracy on the test set typically does not get worse, even when training for too many epochs. Unless you are working with small datasets, using around 15 epochs tends to work well in most cases.
Batch Size
Batch size is the number of training samples used to make one update to the model parameters. A batch size of 16 or 32 works well for most cases. Larger batch sizes help speed up training, but at the expense of increased GPU memory consumption. Depending on your dataset and GPU, you can start with a batch size of 32 and move down to 16 if your GPU doesn't have enough memory. Beyond a certain batch size, improvements to training speed become marginal, hence we found 16 (or 32) to be a good trade-off between training speed and memory consumption. If you reduce the batch size, you may also have to reduce the learning rate.
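As a sketch, the DataBunch can be rebuilt with a different batch size by reusing the label_list created earlier in this notebook (halve bs again if you hit an out-of-memory error):
# Sketch: same pipeline as before, but with a batch size of 32
data = (
    label_list.transform(tfms=get_transforms(), size=IM_SIZE)
    .databunch(bs=32, num_workers=db_num_workers())
    .normalize(imagenet_stats)
)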
Image size
The default image size is 300 X 300 pixels. Using higher image resolutions can help improve model accuracy but will result in longer training and inference times.
The appendix below discusses the impact of image resolution in more detail.
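As a sketch, a higher resolution only requires changing the size argument when building the DataBunch (the variable name data_hi_res is illustrative; 500 mirrors the "high_accuracy" setting above):
# Sketch: train on 500 x 500 images instead of the default 300 x 300
data_hi_res = (
    label_list.transform(tfms=get_transforms(), size=500)
    .databunch(bs=BATCH_SIZE, num_workers=db_num_workers())
    .normalize(imagenet_stats)
)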
There are many hyperparameters used to tune DNNs, though in our experience the exact value of these parameters does not have a large impact on model performance, training/inference speed, or memory footprint.
Parameter | Good Default Value |
---|---|
Dropout | 0.5 or (0.5 on the final layer and 0.25 on all previous layers) |
Weight Decay | 0.01 |
Momentum | 0.9 or (min=0.85 and max=0.95 when using cyclical momentum) |
Dropout
Dropout discards activations at random when training your model. It is a way to keep the model from over-fitting on the training data. In fast.ai, dropout is set to 0.5 by default on the final layer and 0.25 on all other layers. Unless there is clear evidence of over-fitting, these defaults tend to work well.
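If you do see over-fitting, a minimal sketch of raising the dropout via cnn_learner's ps argument (0.6 is purely illustrative, not a recommendation):
# Sketch: increase dropout on the network head; fast.ai applies ps on the
# final layer and spreads a smaller value over the preceding head layers.
learn = cnn_learner(data, ARCHITECTURE, metrics=metric, ps=0.6)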
Weight decay (L2 regularization)
Weight decay is a regularization term applied to help minimize the network loss function. We can think of it as a penalty applied to the weights after an update, which prevents the weights from growing too large (the model may not converge if the weights get too large). In fast.ai, the default weight decay is 0.01, which we find to be almost always acceptable.
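A minimal sketch of setting it explicitly for a training run (0.01 simply restates the default from the table above):
# Sketch: pass an explicit weight decay to the training call
learn.fit_one_cycle(EPOCHS_BODY, LEARNING_RATE, wd=0.01)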
Momentum
Momentum is a way to accelerate convergence when training a model. Momentum uses a weighted average of the most recent updates applied to the current update. Fast.ai implements cyclical momentum when calling fit_one_cycle(), so the momentum will fluctuate over the course of the training cycle. We control this by setting a min and max value for the momentum. When using fit_one_cycle(), the default values of max=0.95 and min=0.85 are known to work well. If using fit(), the default value of 0.9 has been shown to work well. These defaults represent a good trade-off between training speed and the ability of the model to converge to a good solution.
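A minimal sketch of setting the cyclical momentum range explicitly (the values restate the defaults mentioned above):
# Sketch: fit_one_cycle cycles the momentum between the max and min values in moms
learn.fit_one_cycle(EPOCHS_BODY, LEARNING_RATE, moms=(0.95, 0.85))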
The ParameterSweeper module can be used to search over the parameter space to locate the "best" value for that parameter. See the exploring hyperparameters notebook for more information.
Setting a low learning rate requires training for many epochs to reach convergence. However, each additional epoch directly increases the model training time in a linear fashion. To efficiently build a model, it helps to set the learning rate in the correct range. To demonstrate this, we've tested various learning rates on 6 different datasets, training the full network for 3 or 15 epochs.
The figure on the left shows the results of different learning rates on different datasets at 15 epochs. We see that a learning rate of 1e-4 results in the best overall accuracy for the datasets we have tested. Notice that there is significant variance between the datasets, and a learning rate of 1e-3 may work better for some of them.
In the figure on the right, at 15 epochs, the results of 1e-4 are only slightly better than those of 1e-3. However, at only 3 epochs, a learning rate of 1e-3 outperforms the smaller learning rates. This makes sense: since we're limiting training to only 3 epochs, a model that updates its weights more quickly should perform better, and effectively a larger learning rate gets the model closer to convergence. This result indicates that higher learning rates (such as 1e-3) may help minimize the training time, and lower learning rates (such as 1e-5) may be better if training time is not constrained.
In both figures, we can see that learning rates of 1e-3 and 1e-4 tend to work well in general. We observe that training with 3 epochs results in lower accuracy compared to 15 epochs, and in some cases smaller learning rates may prevent the DNN from converging.
Fast.ai has implemented the one cycle policy with cyclical momentum, which adaptively optimizes the learning rate. The function takes a maximum learning rate value as an argument to help the method avoid the convergence problem. Replace the fit() method with fit_one_cycle() to use this capability.
import matplotlib.pyplot as plt
%matplotlib inline
df_dataset_comp = pd.DataFrame(
    {
        "fashionTexture": [0.8749, 0.8481, 0.2491, 0.670318, 0.1643],
        "flickrLogos32Subset": [0.9069, 0.9064, 0.2179, 0.7175, 0.1073],
        "food101Subset": [0.9294, 0.9127, 0.6891, 0.9090, 0.555827],
        "fridgeObjects": [0.9591, 0.9727, 0.272727, 0.6136, 0.181818],
        "lettuce": [0.8992, 0.9104, 0.632, 0.8192, 0.5120],
        "recycle_v3": [0.9527, 0.9581, 0.766, 0.8591, 0.2876],
        "learning_rate": [0.000100, 0.001000, 0.010000, 0.000010, 0.000001],
    }
).set_index("learning_rate")

df_epoch_comp = pd.DataFrame(
    {
        "3_epochs": [0.823808, 0.846394, 0.393808, 0.455115, 0.229120],
        "15_epochs": [0.920367, 0.918067, 0.471138, 0.764786, 0.301474],
        "learning_rate": [0.000100, 0.001000, 0.010000, 0.000010, 0.000001],
    }
).set_index("learning_rate")
plt.figure(1)
ax1 = plt.subplot(121)
ax2 = plt.subplot(122)
vals = ax2.get_yticks()
df_dataset_comp.sort_index().plot(kind="bar", rot=0, figsize=(15, 6), ax=ax1)
vals = ax1.get_yticks()
ax1.set_yticklabels(["{:,.2%}".format(x) for x in vals])
ax1.set_ylim(0, 1)
ax1.set_ylabel("Accuracy (%)")
ax1.set_title("Accuracy of Learning Rates by Datasets @ 15 Epochs")
ax1.legend(loc=2)
df_epoch_comp.sort_index().plot(kind="bar", rot=0, figsize=(15, 6), ax=ax2)
ax2.set_yticklabels(["{:,.2%}".format(x) for x in vals])
ax2.set_ylim(0, 1)
ax2.set_title("Accuracy of Learning Rates by Epochs")
ax2.legend(loc=2)
A model's input image resolution also impacts model accuracy. Usually, convolutional neural networks are able to take advantage of higher resolution images, especially if the object-of-interest is small in the overall image. But how does image size impact other model aspects?
We find that image size doesn't significantly affect the model's memory footprint given the same network architecture, but it has a huge effect on GPU memory. Image size also impacts training and inference speeds.
From the results, we can see that an increase in image resolution from 300 X 300 to 500 X 500 will increase the performance marginally at the cost of a longer training duration and slower inference speed.
import pandas as pd
from utils_cv.classification.parameter_sweeper import add_value_labels
%matplotlib inline
df = pd.DataFrame(
    {
        "accuracy": [0.9472, 0.9394, 0.9190, 0.9164, 0.8366, 0.8251],
        "training_duration": [385.3, 218.8, 280.5, 184.9, 272.5, 182.3],
        "inference_duration": [34.2, 23.2, 27.8, 17.8, 27.6, 17.3],
        "model": [
            "resnet50 X 499",
            "resnet50 X 299",
            "resnet18 X 499",
            "resnet18 X 299",
            "squeezenet1_1 X 499",
            "squeezenet1_1 X 299",
        ],
    }
).set_index("model")
df

ax1, ax2, ax3 = df.plot.bar(
    rot=90, subplots=True, legend=False, figsize=(12, 12)
)

for i in range(len(df)):
    if i < len(df) / 3:
        ax1.get_children()[i].set_color("r")
        ax2.get_children()[i].set_color("r")
        ax3.get_children()[i].set_color("r")
    if i >= len(df) / 3 and i < 2 * len(df) / 3:
        ax1.get_children()[i].set_color("g")
        ax2.get_children()[i].set_color("g")
        ax3.get_children()[i].set_color("g")
    if i >= 2 * len(df) / 3:
        ax1.get_children()[i].set_color("b")
        ax2.get_children()[i].set_color("b")
        ax3.get_children()[i].set_color("b")
ax1.set_title("Accuracy (%)")
ax2.set_title("Training Duration (seconds)")
ax3.set_title("Inference Speed (seconds)")
ax1.set_ylabel("%")
ax2.set_ylabel("seconds")
ax3.set_ylabel("seconds")
ax1.set_ylim(top=df["accuracy"].max() * 1.2)
ax2.set_ylim(top=df["training_duration"].max() * 1.2)
ax3.set_ylim(top=df["inference_duration"].max() * 1.2)
add_value_labels(ax1, percentage=True)
add_value_labels(ax2)
add_value_labels(ax3)
We conducted various experiments to explore the impact of different hyperparameters on a model's accuracy, training duration, inference speed, and memory footprint.
For our experiments, we relied on a set of six different classification datasets. When selecting these datasets, we wanted to have a variety of image types with different amounts of data and number of classes.
Dataset Name | Number of Images | Number of Classes |
---|---|---|
food101Subset | 5000 | 5 |
flickrLogos32Subset | 2740 | 33 |
fashionTexture | 1716 | 11 |
recycle_v3 | 564 | 11 |
lettuce | 380 | 2 |
fridgeObjects | 134 | 4 |
In our experiment, we look at these characteristics to evaluate the impact of various parameters. Here is how we calculated each of the following metrics:
- Accuracy: averaged over 5 runs for each dataset.
- Training Duration: the average duration over 5 runs for each dataset.
- Inference Speed: the time it takes the model to run 1000 predictions.
- Memory Footprint: the size of the model pickle file output from the learn.export(...) method.
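For reference, a minimal sketch of how such an inference-speed measurement could be reproduced for the model trained in this notebook (a plain timing loop, not the exact harness used in the experiments):
import time

# Sketch: time 1000 single-image predictions with the learner trained above
start = time.time()
for _ in range(1000):
    learn.predict(im)
print(f"1000 predictions took {time.time() - start:.1f} seconds")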