Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
Image similarity methods can be used to build image retrieval systems where, given a query image, the goal is to find all similar images in a reference set. These systems can be used, for example, on a shopping website to suggest related products.
In this tutorial we build an image retrieval system that leverages DNNs trained for image classification. Representing images by the output of a DNN is a powerful approach which has been shown to give good results on a wide variety of tasks. Given a query image, we find the most similar images in the reference set by computing the pairwise distances as illustrated below, and by returning the images with the lowest distance to the query image.
The distance between two images is computed by first representing each image by its DNN features, and then measuring the L2 distance between these feature vectors, as described in more detail later in this notebook.
This notebook starts by loading a dataset and splitting it into a training and a validation set. The training set is used to refine an ImageNet pre-trained ResNet-18 DNN, which is then used to compute the DNN features for each image. The validation set is used in an image retrieval example where, given a query image, the top similar images are displayed. This is followed by a quantitative evaluation of the proposed image similarity system.
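To make the retrieval step concrete, below is a minimal sketch (not part of the notebook's pipeline code) of how, given a pre-computed query feature and a dictionary of reference features keyed by image path, the closest images could be returned. The function name retrieve_top_k is hypothetical; the actual feature and distance computations in this notebook use the utils_cv library.
import numpy as np

def retrieve_top_k(query_feature, reference_features, k=5):
    """Return the paths of the k reference images closest to the query.

    reference_features is assumed to be a dict mapping image path -> feature
    vector, matching how features are stored later in this notebook.
    """
    dists = {
        path: np.linalg.norm(np.asarray(query_feature) - np.asarray(feat))
        for path, feat in reference_features.items()
    }
    # Sort paths by increasing distance and keep the k closest
    return sorted(dists, key=dists.get)[:k]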
# Ensure edits to libraries are loaded and plotting is shown in the notebook.
%matplotlib inline
%reload_ext autoreload
%autoreload 2
# Regular python libraries
import sys
import numpy as np
from pathlib import Path
import random
import scrapbook as sb
# fast.ai
import fastai
from fastai.vision import (
    accuracy,
    cnn_learner,
    DatasetType,
    ImageList,
    imagenet_stats,
    models,
    partial,
)
# Computer Vision repository
sys.path.extend([".", "../.."]) # to access the utils_cv library
from utils_cv.classification.data import Urls
from utils_cv.classification.model import TrainMetricsRecorder
from utils_cv.common.data import unzip_url
from utils_cv.common.gpu import which_processor, db_num_workers
from utils_cv.similarity.data import comparative_set_builder
from utils_cv.similarity.metrics import (
    compute_distances,
    positive_image_ranks,
    recall_at_k,
)
from utils_cv.similarity.model import compute_features, compute_features_learner
from utils_cv.similarity.plot import (
    plot_comparative_set,
    plot_distances,
    plot_ranks_distribution,
    plot_recalls,
)
print(f"Fast.ai version = {fastai.__version__}")
which_processor()
Fast.ai version = 1.0.57
Torch is using GPU: Tesla V100-PCIE-16GB
We start with parameter specifications and data preparation. We use the Fridge objects dataset, which is composed of 134 images divided into 4 classes: can, carton, milk bottle and water bottle. To train your own image retrieval system, simply change the DATA_PATH variable below to point to a different (single-label) dataset.
# Set dataset, model and evaluation parameters
DATA_PATH = unzip_url(Urls.fridge_objects_path, exist_ok=True)
# DNN configuration and learning parameters
EPOCHS_HEAD = 4
EPOCHS_BODY = 12
LEARNING_RATE = 10 * 1e-4
BATCH_SIZE = 16
ARCHITECTURE = models.resnet18
IM_SIZE = 300
We can now build our training data object and split off a certain percentage (here 20%) as a validation set.
# Load images into fast.ai's ImageDataBunch object
random.seed(642)
data = (
    ImageList.from_folder(DATA_PATH)
    .split_by_rand_pct(valid_pct=0.2, seed=20)
    .label_from_folder()
    .transform(size=IM_SIZE)
    .databunch(bs=BATCH_SIZE, num_workers=db_num_workers())
    .normalize(imagenet_stats)
)
print(f"""\
Training set: {len(data.train_ds.x)} images
Validation set: {len(data.valid_ds.x)} images\
"""
)
# Display example images
data.show_batch(rows=3, figsize=(6, 6))
Training set: 108 images
Validation set: 26 images
We begin by retrieving a ResNet-18 CNN from fast.ai's library which is pre-trained on ImageNet, and fine-tune the model on our training set. We use the same training parameters and approach as in our classification notebooks: first training only the (new) final layer, and then the full DNN.
Note how we train the DNN here on an image classification task, but will later use it as a featurizer for image similarity.
learn = cnn_learner(
    data,
    ARCHITECTURE,
    metrics=[accuracy],
    callback_fns=[partial(TrainMetricsRecorder, show_graph=True)],
    ps=0,  # Leave dropout at zero. Higher values tend to perform significantly worse.
)
# Train the last layer using a larger rate since most of the DNN is fixed.
learn.fit_one_cycle(EPOCHS_HEAD, 10 * LEARNING_RATE)
epoch | train_loss | valid_loss | train_accuracy | valid_accuracy | time |
---|---|---|---|---|---|
0 | 1.037022 | 1.059019 | 0.645833 | 0.653846 | 00:05 |
1 | 0.652866 | 0.458422 | 0.937500 | 0.846154 | 00:03 |
2 | 0.510966 | 0.241510 | 0.937500 | 0.846154 | 00:03 |
3 | 0.418584 | 0.090530 | 0.947917 | 0.961538 | 00:03 |
Let's now unfreeze all the layers and fine-tune the model further.
learn.unfreeze()
learn.fit_one_cycle(EPOCHS_BODY, LEARNING_RATE)
epoch | train_loss | valid_loss | train_accuracy | valid_accuracy | time |
---|---|---|---|---|---|
0 | 0.100561 | 0.821248 | 0.968750 | 0.884615 | 00:03 |
1 | 0.136889 | 5.214930 | 0.958333 | 0.653846 | 00:03 |
2 | 0.384597 | 3.188640 | 0.875000 | 0.653846 | 00:03 |
3 | 0.627707 | 14.539825 | 0.781250 | 0.576923 | 00:03 |
4 | 0.689022 | 11.872132 | 0.833333 | 0.769231 | 00:03 |
5 | 0.655718 | 4.452466 | 0.906250 | 0.500000 | 00:03 |
6 | 0.592736 | 9.349227 | 0.906250 | 0.576923 | 00:03 |
7 | 0.543367 | 5.243893 | 0.895833 | 0.730769 | 00:03 |
8 | 0.485164 | 0.528972 | 0.958333 | 0.923077 | 00:03 |
9 | 0.412898 | 0.050779 | 1.000000 | 0.961538 | 00:04 |
10 | 0.362047 | 0.058257 | 0.968750 | 0.961538 | 00:03 |
11 | 0.350966 | 0.031828 | 0.968750 | 1.000000 | 00:03 |
Before computing the feature representation for each image, let's look at the model's architecture and in particular its last layers. Fast.ai's ResNet-18 model ends in a different set of final layers than the original architecture (here: (1): Sequential). As discussed at the start of this notebook, we use the output of the penultimate layer of this head (here: (5): BatchNorm1d) as our image representation.
learn.model
Sequential( (0): Sequential( (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False) (4): Sequential( (0): BasicBlock( (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) (1): BasicBlock( (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (5): Sequential( (0): BasicBlock( (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (6): Sequential( (0): BasicBlock( (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (7): Sequential( (0): BasicBlock( (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (downsample): Sequential( (0): 
Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) (1): BasicBlock( (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (relu): ReLU(inplace=True) (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) ) ) ) (1): Sequential( (0): AdaptiveConcatPool2d( (ap): AdaptiveAvgPool2d(output_size=1) (mp): AdaptiveMaxPool2d(output_size=1) ) (1): Flatten() (2): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (3): Linear(in_features=1024, out_features=512, bias=True) (4): ReLU(inplace=True) (5): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (6): Linear(in_features=512, out_features=4, bias=True) ) )
The following line extracts the penultimate layer, whose output (a 512-dimensional floating point vector) serves as the image representation after running an image through the model.
# Use penultimate layer as image representation
embedding_layer = learn.model[1][-2]
print(embedding_layer)
BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
# Compute DNN features for all validation images
valid_features = compute_features_learner(data, DatasetType.Valid, learn, embedding_layer)
The cell below shows how to find and display the most similar images in the validation set for a given query image (which we also select from the validation set). This example is similar to the one shown in the 00_webcam.ipynb notebook.
We use the L2 distance which is defined as $ \sqrt{\sum_{i=1}^{n}{(F_{q}[i] - F_{r}[i])^{2}}} $ where $F_{q}$ and $F_{r}$ are the features of a query image and a reference image respectively, and $n=512$ is their dimensionality. By default, we normalize the feature vectors $F_{q}$ and $F_{r}$ to be unit-length, i.e. to have a magnitude $||F||$ of 1, before computing the L2 distance. One could also use other distance measures, such as L1 or cosine similarity; however, L2 with unit-length normalized feature vectors seems to work well in practice.
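As an illustration only (the notebook itself uses compute_distances from utils_cv, whose internals may differ), the unit-length normalized L2 distance between two feature vectors could be computed as follows:
import numpy as np

def l2_distance_normalized(f_q, f_r):
    # Scale both feature vectors to unit length before measuring the L2 distance
    f_q = np.asarray(f_q) / np.linalg.norm(f_q)
    f_r = np.asarray(f_r) / np.linalg.norm(f_r)
    return np.linalg.norm(f_q - f_r)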
# Get the DNN feature for the query image
query_im_path = str(data.valid_ds.items[1])
query_feature = valid_features[query_im_path]
print(f"Query image path: {query_im_path}")
print(f"Query feature dimension: {len(query_feature)}")
assert len(query_feature) == 512
# Compute the distances between the query and all reference images
distances = compute_distances(query_feature, valid_features)
plot_distances(distances, num_rows=1, num_cols=7, figsize=(15,5))
Query image path: C:\Users\pabuehle\Desktop\ComputerVision\data\fridgeObjects\carton\47.jpg
Query feature dimension: 512
To measure the accuracy of our image retrieval system, we create so-called comparative sets from the validation images. Each comparative set consists of a query image, a positive image (with the same label as the query image), and 99 negative images (with a different label). When sorting the 100 reference images according to their distance to the query image, a perfect image similarity system would place the positive image at the top, before all negative images, i.e. at rank 1.
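For illustration (the utils_cv helper positive_image_ranks performs this computation for all sets below), the rank of the positive image in a single comparative set could be derived from the query-to-reference distances as follows, assuming the distances are available as (image path, distance) pairs:
def positive_rank(distance_pairs, positive_im_path):
    # Sort reference images by increasing distance to the query image
    sorted_paths = [path for path, _ in sorted(distance_pairs, key=lambda x: x[1])]
    # Rank 1 means the positive image is the closest reference image
    return sorted_paths.index(positive_im_path) + 1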
In the cell below, we construct 1000 comparative sets from the validation set, each with 99 negative images (and one positive image).
# Build multiple sets of comparative images from the validation images
comparative_sets = comparative_set_builder(data.valid_ds, num_sets=1000, num_negatives=99)
print(f"Generated {len(comparative_sets)} comparative image sets.")
Generated 1000 comparative image sets.
# Plot the query image, the positive image, and some of the negative images of the first comparative set
plot_comparative_set(comparative_sets[0], 7, figsize=(15,5))
# For each comparative set compute the distances between the query image and all reference images
for cs in comparative_sets:
    cs.compute_distances(valid_features)
To measure the accuracy of our image retrieval system, we compute two statistics: the median rank of the positive image across all comparative sets, and the recall at k, i.e. the percentage of comparative sets in which the positive image is ranked within the top k.
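As a small illustration of these two statistics (the cells below use the utils_cv helpers positive_image_ranks and recall_at_k), given a list of positive-image ranks, one per comparative set, they could be computed as:
import numpy as np

def median_positive_rank(ranks):
    # Median, over all comparative sets, of the rank of the positive image
    return np.median(ranks)

def recall_at_k_sketch(ranks, k):
    # Percentage of comparative sets whose positive image ranks within the top k
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)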
# Compute the median rank of the positive example over all comparative sets
ranks = positive_image_ranks(comparative_sets)
median_rank = np.median(ranks)
random_rank = np.median([(len(cs.neg_im_paths)+1)/2.0 for cs in comparative_sets])
print(f"The positive example ranks {median_rank}, as a median, \
across our {len(ranks)} comparative sets. Random chance rank is {random_rank}")
The positive example ranks 1.0, as a median, across our 1000 comparative sets. Random chance rank is 50.0
# Compute recall at k=1, 5, and 10
print(f"""The positive image is:
--> {recall_at_k(ranks, 1)}% of the time the most similar to the query
--> {recall_at_k(ranks, 5)}% of the time in the top 5 images
--> {recall_at_k(ranks, 10)}% of the time in the top 10 images""")
# Plot recall versus k
plot_recalls(ranks)
The positive image is:
--> 79.7% of the time the most similar to the query
--> 83.8% of the time in the top 5 images
--> 89.0% of the time in the top 10 images
# Display the distribution of positive ranks among the comparative sets
plot_ranks_distribution(ranks)
# Write trained model to disk
learn.export("image_similarity_01_model")
print(f"Exported model to directory {learn.path}")
Exported model to directory C:\Users\pabuehle\Desktop\ComputerVision\data\fridgeObjects
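As a sketch of how the exported model could be reloaded later, e.g. in a deployment script, one could use fast.ai's load_learner together with a forward hook on the embedding layer to compute the feature for a new query image. This snippet is not part of the original notebook, and the image path below is a placeholder:
from fastai.vision import load_learner, open_image

# Reload the exported learner and locate the penultimate (embedding) layer
learn_loaded = load_learner(learn.path, "image_similarity_01_model")
embedding_layer_loaded = learn_loaded.model[1][-2]

# Capture the 512-dim embedding via a forward hook while running a prediction
features = []
hook = embedding_layer_loaded.register_forward_hook(
    lambda module, inp, out: features.append(out.detach().cpu().numpy().flatten())
)
learn_loaded.predict(open_image("path/to/query_image.jpg"))  # placeholder path
hook.remove()
query_feature_new = features[0]  # comparable to the features in valid_features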
Using the provided default parameters, one can get good results across a wide variety of datasets. However, as in most machine learning projects, getting the best possible results for a new dataset often requires tuning the parameters further.
See the image classification 03_training_accuracy_vs_speed.ipynb notebook for guidelines on optimizing for accuracy, inference speed, or model size for a given dataset. That notebook also goes through the parameters that have the largest impact on your model, as well as the parameters that may not be worth modifying.
The notebook 11_exploring_hyperparameters.ipynb in this directory can be used to run parameter sweeps to find the settings with the best possible image retrieval (i.e. rank) performance. Below is an example where, to identify good default parameters for this repository, different learning rates were tried on diverse datasets. Note that a lower rank is better, and that learning rates between $1e-4$ and $1e-3$ performed best.
# Log some outputs using scrapbook which are used during testing to verify correct notebook execution
sb.glue("median_rank", median_rank)
sb.glue("random_rank", random_rank)