Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
In this notebook, we give an introduction to training an object detection model using torchvision. Using a small dataset, we demonstrate how to train and evaluate a FasterRCNN object detection model. We also cover one of the most common ways to store data on a file system for this type of problem.
To learn more about how object detection works, visit our FAQ.
Import all the functions we need.
import sys
sys.path.append("../../")
import os
import time
import matplotlib.pyplot as plt
from typing import Iterator
from pathlib import Path
from PIL import Image
from random import randrange
from typing import Tuple
import torch
import torchvision
from torchvision import transforms
import scrapbook as sb
from utils_cv.classification.data import Urls as UrlsIC
from utils_cv.common.data import unzip_url, data_path
from utils_cv.detection.data import Urls
from utils_cv.detection.dataset import DetectionDataset, get_transform
from utils_cv.detection.plot import (
plot_grid,
plot_boxes,
plot_pr_curves,
PlotSettings,
plot_counts_curves,
plot_detections
)
from utils_cv.detection.model import (
DetectionLearner,
get_pretrained_fasterrcnn,
)
from utils_cv.common.gpu import which_processor, is_windows
# Change matplotlib backend so that plots are shown for windows
if is_windows():
    plt.switch_backend("TkAgg")
print(f"TorchVision: {torchvision.__version__}")
which_processor()
TorchVision: 0.4.0 ('cudart64_100', 0) Torch is using GPU: Tesla V100-PCIE-16GB
This shows your machine's GPUs (if it has any) and the computing device torch/torchvision
is using.
# Ensure edits to libraries are loaded and plotting is shown in the notebook.
%reload_ext autoreload
%autoreload 2
%matplotlib inline
Next, set some model runtime parameters. We use the unzip_url
helper function to download and unzip the data used in this example notebook.
DATA_PATH = unzip_url(Urls.fridge_objects_path, exist_ok=True)
NEG_DATA_PATH = unzip_url(UrlsIC.fridge_objects_negatives_path, exist_ok=True)
EPOCHS = 10
LEARNING_RATE = 0.005
IM_SIZE = 500
SAVE_MODEL = True
# train on the GPU or on the CPU, if a GPU is not available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"Using torch device: {device}")
Using torch device: cuda
In this notebook, we use a toy dataset called Fridge Objects, which consists of 134 images of 4 classes of beverage containers {can, carton, milk bottle, water bottle} photographed on different backgrounds. The helper function downloads and unzips the dataset to the ComputerVision/data directory.
Set that directory in the path variable for ease of use throughout the notebook.
path = Path(DATA_PATH)
os.listdir(path)
['.DS_Store', 'annotations', 'images', 'models']
You'll notice that we have two different folders inside:
/images
/annotations
This format for object detection is fairly common.
/data
+-- images
| +-- image1.jpg
| +-- image2.jpg
| +-- ...
+-- annotations
| +-- image1.xml
| +-- image2.xml
| +-- ...
+-- ...
Each image has a corresponding xml annotation file. Each xml file records where its image file is located, as well as the bounding boxes and the object labels. In this example, our fridge objects dataset is annotated in Pascal VOC format, which is one of the most common formats for labelling object detection datasets.
<!-- Example Pascal VOC annotation -->
<annotation>
<folder>images</folder>
<filename>1.jpg</filename>
<path>../images/1.jpg</path>
<source>
<database>Unknown</database>
</source>
<size>
<width>499</width>
<height>666</height>
<depth>3</depth>
</size>
<segmented>0</segmented>
<object>
<name>carton</name>
<pose>Unspecified</pose>
<truncated>0</truncated>
<difficult>0</difficult>
<bndbox>
<xmin>100</xmin>
<ymin>173</ymin>
<xmax>233</xmax>
<ymax>521</ymax>
</bndbox>
</object>
</annotation>
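To make the annotation format concrete, here is a minimal sketch that parses a Pascal VOC xml file into (label, bounding box) pairs using only the Python standard library. This is for illustration only; the DetectionDataset class introduced below handles this parsing for you, and the file name 1.xml simply refers to the example annotation shown above.
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_path):
    # Return a list of (label, [xmin, ymin, xmax, ymax]) tuples from a Pascal VOC file.
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall("object"):
        label = obj.find("name").text
        bndbox = obj.find("bndbox")
        box = [int(bndbox.find(t).text) for t in ("xmin", "ymin", "xmax", "ymax")]
        objects.append((label, box))
    return objects

# For the example annotation above this would return: [('carton', [100, 173, 233, 521])]
# parse_voc_annotation(path / "annotations" / "1.xml")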
You'll notice that inside the annotation xml file, we can see which image the file references (<path>), the objects in the image (one <object> element each), the class of each object (<name>), and the bounding box of that object (<bndbox>).
A second popular annotation standard is the COCO format, introduced by the COCO challenge. Annotations in this format need to be converted to Pascal VOC syntax first in order to run this notebook. The function coco2voc does exactly that; see the notebook 04_coco_accuracy_vs_speed.ipynb for a code example.
To load the data, we need to create a Dataset class that Torchvision knows how to use. In short, we'll need to create a class and implement the __getitem__ and __len__ methods. More information here: https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#defining-the-dataset
To make this more convenient, we've created a DetectionDataset class that knows how to extract annotation information from the Pascal VOC format and meets the requirements of the Torchvision dataset interface.
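For reference, here is a minimal sketch of what such a dataset class could look like for the folder layout above: __getitem__ returns the image together with a target dictionary containing boxes and labels, which is the structure torchvision's detection models expect. This is an illustration only (it reuses the hypothetical parse_voc_annotation helper sketched earlier and omits transforms and train/test splitting); DetectionDataset implements all of this for you.
from torch.utils.data import Dataset

class MinimalVocDataset(Dataset):
    # Illustrative Pascal VOC dataset returning (image, target) pairs.

    def __init__(self, root, label_names):
        self.root = Path(root)
        self.label_names = label_names  # e.g. ["can", "carton", "milk_bottle", "water_bottle"]
        self.im_paths = sorted((self.root / "images").glob("*.jpg"))

    def __getitem__(self, idx):
        im_path = self.im_paths[idx]
        annos = parse_voc_annotation(self.root / "annotations" / (im_path.stem + ".xml"))
        im = transforms.ToTensor()(Image.open(im_path).convert("RGB"))
        target = {
            # label 0 is reserved for the background class
            "labels": torch.as_tensor(
                [self.label_names.index(name) + 1 for name, _ in annos], dtype=torch.int64
            ),
            "boxes": torch.as_tensor([box for _, box in annos], dtype=torch.float32),
            "image_id": torch.tensor([idx]),
        }
        return im, target

    def __len__(self):
        return len(self.im_paths)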
data = DetectionDataset(DATA_PATH, train_pct=0.75)
Inspect our Datasets/DataLoaders to make sure the train/test split looks right.
print(
f"Training dataset: {len(data.train_ds)} | Training DataLoader: {data.train_dl} \
\nTesting dataset: {len(data.test_ds)} | Testing DataLoader: {data.test_dl}"
)
Training dataset: 96 | Training DataLoader: <torch.utils.data.dataloader.DataLoader object at 0x0000017BEB6AF9E8> Testing dataset: 32 | Testing DataLoader: <torch.utils.data.dataloader.DataLoader object at 0x0000017BEB832C18>
Next, show some more detailed information, i.e. the counts of ground truth boxes per class, as well as the distribution of absolute and relative widths/heights of the annotations.
data.plot_boxes_stats()
By default, we apply some image transformations to the training set. Let's see how the transformations look:
data.train_ds.dataset.show_im_transformations()
Transformations applied on C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\49.jpg: <utils_cv.detection.dataset.ColorJitterTransform object at 0x0000017BEB3F6D30> <utils_cv.detection.references.transforms.ToTensor object at 0x0000017BEB3F6DD8> <utils_cv.detection.dataset.RandomHorizontalFlip object at 0x0000017BEB400C50>
Let's visualize the annotations to make sure they look correct.
data.show_ims(rows=2)
For the DetectionLearner, we use Faster R-CNN as the default model and Stochastic Gradient Descent as our default optimizer.
Our Faster R-CNN model is pretrained on COCO, a large-scale object detection, segmentation, and captioning dataset that contains over 200K labeled images with over 80 label categories.
When we initialize the DetectionLearner, we can pass in the dataset. By default, the object will set the model to torchvision's Faster R-CNN. Alternatively, we can pass in the model we want to use, for example one customized with the get_pretrained_fasterrcnn function and the parameters we want.
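As a rough sketch of what such a customization could look like, the snippet below builds a COCO-pretrained Faster R-CNN directly with torchvision and swaps its box predictor so that the number of output classes matches our dataset. Note the assumptions: that the dataset exposes its class names via data.labels and that DetectionLearner accepts a pre-built model argument; check get_pretrained_fasterrcnn and the DetectionLearner signature for the repository's supported way of doing this.
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Load a Faster R-CNN pretrained on COCO and replace its box predictor so that
# it outputs our classes plus one background class.
custom_model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = len(data.labels) + 1  # assumption: data.labels lists the class names
in_features = custom_model.roi_heads.box_predictor.cls_score.in_features
custom_model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Assumption: DetectionLearner accepts a custom model, e.g.:
# detector = DetectionLearner(data, model=custom_model, im_size=IM_SIZE)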
detector = DetectionLearner(data, im_size=IM_SIZE)
print(f"Model: {type(detector.model)}")
Model: <class 'torchvision.models.detection.faster_rcnn.FasterRCNN'>
Computation of model accuracy requires a GPU; training does not, although without a GPU, training is too slow for any practical purpose. To also support CPU-only machines, we skip the computation steps which would otherwise throw an error.
skip_evaluation = device.type == 'cpu'
Fine-tune the model using our training data loader (train_dl) and evaluate the results using our testing data loader (test_dl). Here we pass in the learning rate explicitly; if not set, it defaults to 0.005.
detector.fit(EPOCHS, lr=LEARNING_RATE, print_freq=30, skip_evaluation=skip_evaluation)
Visualize the loss and average precision (ap) over time.
if not skip_evaluation:
    detector.plot_precision_loss_curves()
We can simply run the evaluate() method on our detector to evaluate the results.
You'll notice the notation IoU=0.50:0.95 in the results below. This means: the mAP averaged over different IoU thresholds, from 0.5 to 0.95 at a step size of 0.05 (0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95). For more information on IoU and mAP, visit our FAQ.
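As a quick refresher, IoU (intersection over union) measures how much a predicted box overlaps a ground truth box: the area of their intersection divided by the area of their union. Below is a minimal illustrative sketch of that computation for two axis-aligned boxes in [left, top, right, bottom] format; it is not part of the evaluation code, which relies on the standard COCO evaluation tools.
def iou(box_a, box_b):
    # Boxes are [left, top, right, bottom]; returns intersection area / union area.
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

print(iou([0, 0, 100, 100], [50, 0, 150, 100]))  # half-overlapping boxes -> 0.333...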
if not skip_evaluation:
    e = detector.evaluate()
creating index...
index created!
Test:  [ 0/16]  eta: 0:00:00  model_time: 0.0312 (0.0312)  evaluator_time: 0.0000 (0.0000)  time: 0.0625  data: 0.0313  max mem: 959
Test:  [15/16]  eta: 0:00:00  model_time: 0.0312 (0.0332)  evaluator_time: 0.0000 (0.0039)  time: 0.0635  data: 0.0255  max mem: 959
Test: Total time: 0:00:01 (0.0645 s / it)
Averaged stats: model_time: 0.0312 (0.0332)  evaluator_time: 0.0000 (0.0039)
Accumulating evaluation results...
DONE (t=0.02s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.894
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 1.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.894
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.913
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.913
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.913
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.913
Plot precision-recall curves. We'll use the plot_pr_curves function to plot the PR curve for each IoU threshold, as well as the average over the IoU thresholds. Looking at the graph on the left, we can see that the lower the IoU threshold, the better the PR curve looks. This makes sense, as the model is judged more leniently. This is why it is also a good idea to take a look at the mean over the IoU thresholds.
if not skip_evaluation:
    plot_pr_curves(e)
Precision-recall curves can be hard to interpret, and hence we also provide counts of (left) how many images have at least one wrong detection or missed ground truth, and (right) the total number of wrong detections and missed ground truths summed over all images. These counts are shown for different detection score thresholds, i.e. effectively discarding detections with a confidence score lower than a given threshold.
As can be seen, the model performs well on the test dataset for thresholds between 0.5 and 0.9, but especially at a threshold of 0.5 over-fires on the negative images (images which do not contain any of the objects) since similar images were not included in the training set. This information is especially valuable when deciding on a precision-recall operating point. To avoid over-firing, one can either add more images to the training set directly or via hard-negative mining as explained in the 12_hard_negative_sampling.ipynb notebook.
# Run detector on the test images. Set threshold to 0 to keep all low confidence detections.
print(f"Running detector on {len(data.test_ds)} test images...")
detections = detector.predict_dl(data.test_dl, threshold=0)
# Run detector on the negative images
if NEG_DATA_PATH is None or NEG_DATA_PATH == "":
    detections_neg = None
else:
    neg_data = DetectionDataset(
        NEG_DATA_PATH,
        train_pct=1.0,
        im_dir="",
        allow_negatives=True,
        train_transforms=get_transform(train=False),
    )
    print(f"Running detector on {len(neg_data.train_ds)} negative images...")
    detections_neg = detector.predict_dl(neg_data.train_dl, threshold=0)
# Count and plot number of wrong/missing detections
plot_counts_curves(detections, data.test_ds, detections_neg)
Running detector on 32 test images... Running detector on 64 negative images...
We can predict a group of images using the predict_dl function. We pass it a dataloader that references the images we want to predict on, and it will run the predictions in batches of the size specified when the dataloader was constructed (the BATCH_SIZE parameter). In this case, we'll just use the test DataLoader (test_dl) that we created above.
detections = detector.predict_dl(data.test_dl, threshold=0.5)
The predict_dl function returns a list of AnnotationBboxes. Let's print out 5 of the detection results.
detections[:5]
[{'idx': 100, 'det_bboxes': [{Bbox object: [left=348, top=306, right=450, bottom=494] | <can> | label:1 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\74.jpg} | score: 0.9998225569725037, {Bbox object: [left=234, top=196, right=323, bottom=438] | <water_bottle> | label:3 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\74.jpg} | score: 0.9992152452468872, {Bbox object: [left=99, top=177, right=235, bottom=440] | <carton> | label:4 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\74.jpg} | score: 0.9981266856193542, {Bbox object: [left=49, top=422, right=329, bottom=525] | <milk_bottle> | label:2 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\74.jpg} | score: 0.9953067898750305], 'im_path': 'C:\\Users\\pabuehle\\Desktop\\computervision-recipes\\data\\odFridgeObjects\\images\\74.jpg'}, {'idx': 64, 'det_bboxes': [{Bbox object: [left=135, top=177, right=223, bottom=406] | <milk_bottle> | label:2 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\41.jpg} | score: 0.9994804263114929, {Bbox object: [left=195, top=308, right=470, bottom=568] | <carton> | label:4 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\41.jpg} | score: 0.9989302754402161, {Bbox object: [left=257, top=127, right=337, bottom=355] | <water_bottle> | label:3 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\41.jpg} | score: 0.9974021911621094], 'im_path': 'C:\\Users\\pabuehle\\Desktop\\computervision-recipes\\data\\odFridgeObjects\\images\\41.jpg'}, {'idx': 70, 'det_bboxes': [{Bbox object: [left=202, top=302, right=300, bottom=493] | <can> | label:1 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\47.jpg} | score: 0.9997478127479553, {Bbox object: [left=296, top=154, right=475, bottom=513] | <carton> | label:4 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\47.jpg} | score: 0.9994853734970093, {Bbox object: [left=75, top=186, right=190, bottom=486] | <water_bottle> | label:3 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\47.jpg} | score: 0.9983780384063721], 'im_path': 'C:\\Users\\pabuehle\\Desktop\\computervision-recipes\\data\\odFridgeObjects\\images\\47.jpg'}, {'idx': 63, 'det_bboxes': [{Bbox object: [left=16, top=330, right=283, bottom=580] | <carton> | label:4 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\40.jpg} | score: 0.9995437264442444, {Bbox object: [left=245, top=212, right=323, bottom=428] | <milk_bottle> | label:2 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\40.jpg} | score: 0.9994308352470398, {Bbox object: [left=348, top=173, right=431, bottom=390] | <water_bottle> | label:3 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\40.jpg} | score: 0.9942303895950317], 'im_path': 'C:\\Users\\pabuehle\\Desktop\\computervision-recipes\\data\\odFridgeObjects\\images\\40.jpg'}, {'idx': 51, 'det_bboxes': [{Bbox object: [left=258, top=223, right=384, bottom=459] | <can> | label:1 | path:C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\images\3.jpg} | score: 0.9998170733451843], 'im_path': 'C:\\Users\\pabuehle\\Desktop\\computervision-recipes\\data\\odFridgeObjects\\images\\3.jpg'}]
We can also verify that the length of the returned detections roughly equals the number of images in the test dataset.
print(
f"Number of detections: {len(detections)}\nNumber of test images: {len(data.test_ds)}"
)
Number of detections: 32 Number of test images: 32
Using a Generator Instead
Alternatively, we can use the predict_batch function if we'd rather get a generator object instead of all the predictions back in one go. This may be preferable if the dataloader you are passing in contains many images and you don't want to overflow your machine's memory.
detector_generator = detector.predict_batch(data.test_dl)  # pass any DataLoader here
one_batch_of_predictions = next(detector_generator)
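If the dataloader is large, a typical pattern is to consume the generator lazily, processing one batch of predictions at a time. A small sketch (the exact structure of each yielded batch depends on predict_batch, so inspect it first):
for batch_predictions in detector.predict_batch(data.test_dl):
    # Only one batch of predictions is held in memory at a time.
    print(type(batch_predictions))  # inspect what a batch of predictions looks like
    break  # remove this line to run over the entire dataloader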
We now have predictions for the images in our test set. Since the data is from the test set, we also know the ground truth bounding boxes for those images. Let's plot the ground truth vs the predictions.
To do that, we'll create a simple generator (_grid_helper()) that yields the data needed to plot the ground truths vs the detections: each time it is iterated on, it returns the next detection result together with the dataset it came from (plus two placeholder arguments).
Since these are the arguments that the plot_detections() function takes, we can pass the generator to the plot_grid() function. For each plot it produces, it will use a new iteration from the generator.
The red boxes represent the ground truth while the green boxes represent the predicted bounding boxes.
def _grid_helper():
    for detection in detections:
        yield detection, data, None, None

plot_grid(plot_detections, _grid_helper(), rows=2)
We can use the detector to predict on new images. Start by getting a new image. In this case, we'll reuse a random image from our dataset and pretend it is a new image.
new_im_path = data.root / data.im_dir / data.im_paths[randrange(len(data))]
new_im_path
WindowsPath('C:/Users/pabuehle/Desktop/computervision-recipes/data/odFridgeObjects/images/43.jpg')
We can use the predict() function, which accepts either an image or a path to an image as input. Note that we set threshold=0.5 to suppress detections with low confidence.
detections = detector.predict(new_im_path, threshold=0.5)
Now that we have our predictions, let's plot them on the image.
plot_detections(detections)
The final step is to save the model to disk, so that the deployment notebook can use it.
if SAVE_MODEL:
    detector.save("my_drink_detector")
Model is saved to C:\Users\pabuehle\Desktop\computervision-recipes\data\odFridgeObjects\models\my_drink_detector
# Preserve some of the notebook outputs
sb.glue("skip_evaluation", skip_evaluation)
sb.glue("training_losses", detector.losses)
sb.glue("training_average_precision", detector.ap)
Using the concepts introduced in this notebook, you can bring your own dataset and train an object detector to find objects of interest in a given image.