Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.
This notebook analyses how various parameters influence model accuracy and inference speed. For evaluation, the popular COCO dataset is used so that our numbers can be compared to published results. In addition, we show how to reproduce the accuracy of Torchvision's pre-trained Faster R-CNN model reported on their models site.
Familiarity with the 01_training_introduction.ipynb notebook is assumed, and hence no explanation is provided for concepts or code that were already covered there. Instead, we focus on new aspects such as how to evaluate on the COCO dataset, or how to improve speed and accuracy. Training a new model on the COCO dataset, while not covered in this notebook, could easily be added by copying the respective cells from the 01_training_introduction notebook.
Import all the functions we need.
import sys
sys.path.append("../../")
import os
import time
import matplotlib.pyplot as plt
from pathlib import Path
import scrapbook as sb
import torch
import torchvision
from utils_cv.common.data import unzip_url
from utils_cv.common.gpu import is_windows, which_processor
from utils_cv.detection.data import coco_labels, Urls
from utils_cv.detection.dataset import DetectionDataset
from utils_cv.detection.plot import plot_pr_curves
from utils_cv.detection.model import _calculate_ap, DetectionLearner, get_pretrained_fasterrcnn
# Change matplotlib backend so that plots are shown on Windows
if is_windows():
plt.switch_backend("TkAgg")
print(f"TorchVision: {torchvision.__version__}")
which_processor()
TorchVision: 0.4.0 ('cudart64_100', 0) Torch is using GPU: Tesla V100-PCIE-16GB
# Ensure edits to libraries are loaded and plotting is shown in the notebook.
%reload_ext autoreload
%autoreload 2
%matplotlib inline
Check if a GPU is present, since it is required by detector.evaluate().
assert torch.cuda.is_available()
The COCO 2017 validation dataset is used in all our experiments, following common practice for benchmarking object detection models. In particular, two .zip archives need to be downloaded from http://cocodataset.org: the actual images ("2017 val images", 1GB) and the ground truth annotations ("2017 train/val annotations", 241MB).
The two files should be extracted and placed in a root folder as shown below, with subfolders called annotationsCOCO and images:
/coco2017
+-- annotationsCOCO
| +-- captions_train2017.json
| +-- captions_val2017.json
| +-- ...
+-- images
| +-- 000000000139.jpg
| +-- 000000000285.jpg
| +-- ...
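If the archives are not yet on disk, they can be fetched and unpacked with a short script such as the one below. This is a minimal sketch: the download URLs on images.cocodataset.org are an assumption (they are the standard COCO links at the time of writing), and we use the /data/coco2017/ root folder that the rest of this notebook refers to. After extraction, the resulting folders (val2017, annotations) still need to be renamed/moved to match the images and annotationsCOCO layout shown above.
# Hedged sketch: download and extract the two COCO 2017 archives.
import os
import urllib.request
import zipfile

coco_root = "/data/coco2017/"
os.makedirs(coco_root, exist_ok=True)
for url in [
    "http://images.cocodataset.org/zips/val2017.zip",
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip",
]:
    zip_path = os.path.join(coco_root, os.path.basename(url))
    urllib.request.urlretrieve(url, zip_path)   # download the archive
    with zipfile.ZipFile(zip_path) as z:
        z.extractall(coco_root)                 # extract into the root folder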
The COCO dataset comes with annotations in its own format, see this description. Hence, we need to convert the downloaded COCO annotations to Pascal VOC format in order to run this notebook. The function coco2voc does exactly that and takes only seconds to run. In the code below, we assume the COCO images and annotations are in the folder '/data/coco2017/'; the function then creates a new sub-directory called annotations. If activated, the function also downloads the images, provided their URLs are specified in the .json file.
from utils_cv.detection.data import coco2voc
coco2voc(
anno_path = "/data/coco2017/annotationsCOCO/instances_val2017.json",
output_dir = "/data/coco2017/",
download_images = False
)
# COCO dataset
DATA_PATH = "/data/coco2017/"
LABELS = coco_labels()[1:] #ignore first entry which is "__background__"
The DATA_PATH directory should contain the annotations, images, and (albeit not used) annotationsCOCO folders.
os.listdir(DATA_PATH)
['annotations', 'annotationsCOCO', 'images']
Most code in this notebook is taken from 01_training_introduction.ipynb with only small changes, mainly to ensure that the class names (and their ordering) in the DetectionDataset object match those used to train the Torchvision model. Hence, in the cell below, we explicitly provide labels as input to the detection dataset.
Note that:
- The DetectionDataset object requires at least 1 image to be assigned to the training set, hence we set train_pct=0.0001.
- The first entry in coco_labels() is "__background__" and is removed by setting LABELS = coco_labels()[1:].
- We set allow_negatives = True since a few of the COCO images don't contain any annotated objects and hence don't have a corresponding .xml annotation file.
data = DetectionDataset(DATA_PATH,
                        train_pct = 0.0001,
                        labels = LABELS,
                        allow_negatives = True)
                        #max_num_images = 100) # Uncomment to only use a small subset of COCO
print(f"Number of test images: {len(data.test_ds)}")
Number of test images: 4999
The plots below summarize some aspects of the annotations, e.g. the counts of ground truth boxes per class, or the distribution of absolute and relative widths/heights of the objects.
data.plot_boxes_stats(figsize = (18,12))
Let's visualize the annotations to make sure they look correct.
data.show_ims(rows=2)
We now load the Faster R-CNN model (with the Feature Pyramid Network extension) which was trained on COCO. In contrast to the 01_training_introduction.ipynb notebook, we do not create a new classification layer, but instead keep the existing last layer of the pre-trained model.
model = get_pretrained_fasterrcnn()
detector = DetectionLearner(data, model)
print(f"Model: {type(detector.model)}")
Model: <class 'torchvision.models.detection.faster_rcnn.FasterRCNN'>
We can simply run the evaluate() method and observe that the mAP in the first row (for IoU=0.50:0.95) is close to the number reported on Torchvision's models site.
e = detector.evaluate()
creating index... index created! Test: [ 0/2500] eta: 1:30:35 model_time: 2.1428 (2.1428) evaluator_time: 0.0156 (0.0156) time: 2.1741 data: 0.0156 max mem: 982 Test: [ 100/2500] eta: 0:06:05 model_time: 0.0781 (0.1097) evaluator_time: 0.0156 (0.0174) time: 0.1345 data: 0.0234 max mem: 1662 Test: [ 200/2500] eta: 0:05:25 model_time: 0.0781 (0.0993) evaluator_time: 0.0156 (0.0165) time: 0.1283 data: 0.0242 max mem: 1709 Test: [ 300/2500] eta: 0:04:59 model_time: 0.0792 (0.0947) evaluator_time: 0.0156 (0.0157) time: 0.1275 data: 0.0250 max mem: 1730 Test: [ 400/2500] eta: 0:04:42 model_time: 0.0781 (0.0929) evaluator_time: 0.0156 (0.0160) time: 0.1306 data: 0.0242 max mem: 1749 Test: [ 500/2500] eta: 0:04:25 model_time: 0.0781 (0.0915) evaluator_time: 0.0156 (0.0162) time: 0.1259 data: 0.0242 max mem: 1763 Test: [ 600/2500] eta: 0:04:11 model_time: 0.0792 (0.0906) evaluator_time: 0.0156 (0.0168) time: 0.1259 data: 0.0219 max mem: 1800 Test: [ 700/2500] eta: 0:03:57 model_time: 0.0781 (0.0904) evaluator_time: 0.0156 (0.0167) time: 0.1291 data: 0.0227 max mem: 1817 Test: [ 800/2500] eta: 0:03:44 model_time: 0.0793 (0.0903) evaluator_time: 0.0156 (0.0166) time: 0.1299 data: 0.0235 max mem: 1866 Test: [ 900/2500] eta: 0:03:31 model_time: 0.0781 (0.0903) evaluator_time: 0.0156 (0.0170) time: 0.1321 data: 0.0204 max mem: 1871 Test: [1000/2500] eta: 0:03:18 model_time: 0.0937 (0.0903) evaluator_time: 0.0156 (0.0169) time: 0.1314 data: 0.0219 max mem: 1904 Test: [1100/2500] eta: 0:03:04 model_time: 0.0781 (0.0900) evaluator_time: 0.0156 (0.0168) time: 0.1228 data: 0.0235 max mem: 1918 Test: [1200/2500] eta: 0:02:51 model_time: 0.0781 (0.0899) evaluator_time: 0.0156 (0.0166) time: 0.1290 data: 0.0242 max mem: 1918 Test: [1300/2500] eta: 0:02:37 model_time: 0.0781 (0.0897) evaluator_time: 0.0156 (0.0168) time: 0.1290 data: 0.0235 max mem: 1942 Test: [1400/2500] eta: 0:02:24 model_time: 0.0781 (0.0895) evaluator_time: 0.0156 (0.0167) time: 0.1259 data: 0.0243 max mem: 1950 Test: [1500/2500] eta: 0:02:11 model_time: 0.0781 (0.0894) evaluator_time: 0.0156 (0.0166) time: 0.1267 data: 0.0248 max mem: 1978 Test: [1600/2500] eta: 0:01:58 model_time: 0.0781 (0.0892) evaluator_time: 0.0156 (0.0168) time: 0.1243 data: 0.0227 max mem: 1978 Test: [1700/2500] eta: 0:01:44 model_time: 0.0937 (0.0890) evaluator_time: 0.0156 (0.0168) time: 0.1275 data: 0.0211 max mem: 2006 Test: [1800/2500] eta: 0:01:31 model_time: 0.0937 (0.0890) evaluator_time: 0.0156 (0.0167) time: 0.1283 data: 0.0242 max mem: 2015 Test: [1900/2500] eta: 0:01:18 model_time: 0.0792 (0.0890) evaluator_time: 0.0156 (0.0166) time: 0.1228 data: 0.0227 max mem: 2015 Test: [2000/2500] eta: 0:01:05 model_time: 0.0781 (0.0890) evaluator_time: 0.0156 (0.0167) time: 0.1306 data: 0.0243 max mem: 2157 Test: [2100/2500] eta: 0:00:52 model_time: 0.0781 (0.0889) evaluator_time: 0.0156 (0.0167) time: 0.1283 data: 0.0234 max mem: 2170 Test: [2200/2500] eta: 0:00:39 model_time: 0.0781 (0.0888) evaluator_time: 0.0156 (0.0167) time: 0.1291 data: 0.0242 max mem: 2170 Test: [2300/2500] eta: 0:00:26 model_time: 0.0781 (0.0886) evaluator_time: 0.0156 (0.0168) time: 0.1299 data: 0.0255 max mem: 2170 Test: [2400/2500] eta: 0:00:13 model_time: 0.0786 (0.0886) evaluator_time: 0.0156 (0.0169) time: 0.1322 data: 0.0227 max mem: 2170 Test: [2499/2500] eta: 0:00:00 model_time: 0.0781 (0.0886) evaluator_time: 0.0156 (0.0168) time: 0.1313 data: 0.0238 max mem: 2170 Test: Total time: 0:05:26 (0.1306 s / it) Averaged stats: model_time: 0.0781 (0.0886) evaluator_time: 0.0156 (0.0168) 
Accumulating evaluation results... DONE (t=6.55s). IoU metric: bbox Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.354 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.567 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.378 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.144 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.346 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.474 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.300 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.471 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.494 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.269 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.500 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.608
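The table above reports AP averaged over IoU=0.50:0.95 as well as at individual thresholds. To extract a single number programmatically, e.g. the AP at IoU=0.5 used in the experiments below, the _calculate_ap helper imported above can be applied to the evaluator object; a minimal sketch:
ap_at_50 = _calculate_ap(e, iou_thres=0.5)   # dict keyed by IoU type, e.g. "bbox"
print(f"AP at IoU=0.5: {ap_at_50['bbox']:.2f}")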
In addition to the average precision/recall numbers, we can also plot the precision-recall curves for different IoU thresholds.
plot_pr_curves(e)
The remainder of this notebook explores the accuracy vs. inference speed trade-offs for various parameters. As a quantitative measure we use the average precision (AP) at an IoU of 0.5. This corresponds to a relatively tight fit with the ground truth object location and is sufficient for most real-world problems. Note that inference and training speed are strongly correlated, hence e.g. halving the inference time roughly equates to halving the training time.
# Loop over various parameters
aps = []
for size in [100, 300]:
    print(f"\nTesting variable: size = {size}...")

    # Get model
    model = get_pretrained_fasterrcnn(
        min_size = size,
        max_size = size
    )
    detector = DetectionLearner(data, model)

    # Compute inference time
    start_time = time.time()
    detector.predict_dl(data.test_dl)
    inference_time = time.time() - start_time
    print(f"inference time = {inference_time:.2f} seconds")

    # Compute accuracy
    e = detector.evaluate()
    ap = _calculate_ap(e, iou_thres=0.5)
    aps.append(ap)
    print("At size = {} -> AP = {:.2f}".format(size, ap["bbox"]))
Testing variable: size = 100... inference time = 139.17 seconds creating index... index created! Test: [ 0/2500] eta: 0:01:57 model_time: 0.0312 (0.0312) evaluator_time: 0.0000 (0.0000) time: 0.0469 data: 0.0156 max mem: 2170 Test: [ 100/2500] eta: 0:02:22 model_time: 0.0312 (0.0296) evaluator_time: 0.0000 (0.0053) time: 0.0604 data: 0.0228 max mem: 2170 Test: [ 200/2500] eta: 0:02:16 model_time: 0.0312 (0.0289) evaluator_time: 0.0156 (0.0062) time: 0.0595 data: 0.0211 max mem: 2170 Test: [ 300/2500] eta: 0:02:10 model_time: 0.0312 (0.0289) evaluator_time: 0.0000 (0.0058) time: 0.0603 data: 0.0266 max mem: 2170 Test: [ 400/2500] eta: 0:02:04 model_time: 0.0312 (0.0288) evaluator_time: 0.0000 (0.0059) time: 0.0620 data: 0.0275 max mem: 2170 Test: [ 500/2500] eta: 0:01:58 model_time: 0.0312 (0.0288) evaluator_time: 0.0000 (0.0058) time: 0.0587 data: 0.0219 max mem: 2170 Test: [ 600/2500] eta: 0:01:52 model_time: 0.0312 (0.0288) evaluator_time: 0.0000 (0.0057) time: 0.0571 data: 0.0227 max mem: 2170 Test: [ 700/2500] eta: 0:01:46 model_time: 0.0312 (0.0288) evaluator_time: 0.0000 (0.0058) time: 0.0595 data: 0.0227 max mem: 2170 Test: [ 800/2500] eta: 0:01:40 model_time: 0.0312 (0.0287) evaluator_time: 0.0000 (0.0057) time: 0.0595 data: 0.0258 max mem: 2170 Test: [ 900/2500] eta: 0:01:34 model_time: 0.0312 (0.0287) evaluator_time: 0.0000 (0.0057) time: 0.0587 data: 0.0234 max mem: 2170 Test: [1000/2500] eta: 0:01:28 model_time: 0.0312 (0.0287) evaluator_time: 0.0000 (0.0058) time: 0.0571 data: 0.0235 max mem: 2170 Test: [1100/2500] eta: 0:01:22 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0058) time: 0.0579 data: 0.0227 max mem: 2170 Test: [1200/2500] eta: 0:01:16 model_time: 0.0312 (0.0287) evaluator_time: 0.0000 (0.0056) time: 0.0595 data: 0.0258 max mem: 2170 Test: [1300/2500] eta: 0:01:11 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0056) time: 0.0595 data: 0.0219 max mem: 2170 Test: [1400/2500] eta: 0:01:05 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0056) time: 0.0587 data: 0.0227 max mem: 2170 Test: [1500/2500] eta: 0:00:59 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0056) time: 0.0579 data: 0.0234 max mem: 2170 Test: [1600/2500] eta: 0:00:53 model_time: 0.0312 (0.0287) evaluator_time: 0.0000 (0.0056) time: 0.0595 data: 0.0235 max mem: 2170 Test: [1700/2500] eta: 0:00:47 model_time: 0.0312 (0.0287) evaluator_time: 0.0000 (0.0056) time: 0.0618 data: 0.0235 max mem: 2170 Test: [1800/2500] eta: 0:00:41 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0056) time: 0.0595 data: 0.0251 max mem: 2170 Test: [1900/2500] eta: 0:00:35 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0056) time: 0.0579 data: 0.0250 max mem: 2170 Test: [2000/2500] eta: 0:00:29 model_time: 0.0312 (0.0286) evaluator_time: 0.0156 (0.0056) time: 0.0610 data: 0.0227 max mem: 2170 Test: [2100/2500] eta: 0:00:23 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0056) time: 0.0586 data: 0.0243 max mem: 2170 Test: [2200/2500] eta: 0:00:17 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0056) time: 0.0586 data: 0.0266 max mem: 2170 Test: [2300/2500] eta: 0:00:11 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0056) time: 0.0586 data: 0.0266 max mem: 2170 Test: [2400/2500] eta: 0:00:05 model_time: 0.0312 (0.0286) evaluator_time: 0.0000 (0.0057) time: 0.0806 data: 0.0251 max mem: 2170 Test: [2499/2500] eta: 0:00:00 model_time: 0.0312 (0.0287) evaluator_time: 0.0000 (0.0057) time: 0.0594 data: 0.0258 max mem: 2170 Test: Total time: 0:02:29 (0.0596 s / it) Averaged 
stats: model_time: 0.0312 (0.0287) evaluator_time: 0.0000 (0.0057) Accumulating evaluation results... DONE (t=2.95s). IoU metric: bbox Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.024 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.041 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.024 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.004 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.056 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.029 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.034 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.034 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.005 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.091 At size = 100 -> AP = 0.04 Testing variable: size = 300... inference time = 152.32 seconds creating index... index created! Test: [ 0/2500] eta: 0:03:15 model_time: 0.0312 (0.0312) evaluator_time: 0.0156 (0.0156) time: 0.0781 data: 0.0313 max mem: 2170 Test: [ 100/2500] eta: 0:02:52 model_time: 0.0312 (0.0336) evaluator_time: 0.0156 (0.0122) time: 0.0728 data: 0.0250 max mem: 2170 Test: [ 200/2500] eta: 0:02:44 model_time: 0.0312 (0.0340) evaluator_time: 0.0156 (0.0112) time: 0.0712 data: 0.0266 max mem: 2170 Test: [ 300/2500] eta: 0:02:36 model_time: 0.0312 (0.0341) evaluator_time: 0.0156 (0.0107) time: 0.0720 data: 0.0243 max mem: 2170 Test: [ 400/2500] eta: 0:02:29 model_time: 0.0312 (0.0342) evaluator_time: 0.0156 (0.0112) time: 0.0752 data: 0.0266 max mem: 2170 Test: [ 500/2500] eta: 0:02:23 model_time: 0.0312 (0.0341) evaluator_time: 0.0156 (0.0115) time: 0.0743 data: 0.0243 max mem: 2170 Test: [ 600/2500] eta: 0:02:15 model_time: 0.0312 (0.0342) evaluator_time: 0.0156 (0.0114) time: 0.0680 data: 0.0234 max mem: 2170 Test: [ 700/2500] eta: 0:02:08 model_time: 0.0312 (0.0341) evaluator_time: 0.0156 (0.0115) time: 0.0728 data: 0.0271 max mem: 2170 Test: [ 800/2500] eta: 0:02:01 model_time: 0.0312 (0.0340) evaluator_time: 0.0156 (0.0114) time: 0.0735 data: 0.0235 max mem: 2170 Test: [ 900/2500] eta: 0:01:55 model_time: 0.0312 (0.0340) evaluator_time: 0.0156 (0.0121) time: 0.0728 data: 0.0227 max mem: 2170 Test: [1000/2500] eta: 0:01:47 model_time: 0.0312 (0.0339) evaluator_time: 0.0000 (0.0121) time: 0.0681 data: 0.0258 max mem: 2170 Test: [1100/2500] eta: 0:01:40 model_time: 0.0312 (0.0339) evaluator_time: 0.0009 (0.0120) time: 0.0698 data: 0.0250 max mem: 2170 Test: [1200/2500] eta: 0:01:33 model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0120) time: 0.0712 data: 0.0211 max mem: 2170 Test: [1300/2500] eta: 0:01:26 model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0119) time: 0.0727 data: 0.0242 max mem: 2170 Test: [1400/2500] eta: 0:01:18 model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0118) time: 0.0689 data: 0.0243 max mem: 2170 Test: [1500/2500] eta: 0:01:11 model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0118) time: 0.0711 data: 0.0266 max mem: 2170 Test: [1600/2500] eta: 0:01:04 model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0118) time: 0.0705 data: 0.0242 max mem: 2170 Test: [1700/2500] eta: 0:00:57 model_time: 0.0312 (0.0340) evaluator_time: 0.0156 (0.0118) time: 0.0720 data: 0.0250 max mem: 2170 Test: [1800/2500] eta: 0:00:50 model_time: 
0.0312 (0.0340) evaluator_time: 0.0000 (0.0117) time: 0.0674 data: 0.0242 max mem: 2170 Test: [1900/2500] eta: 0:00:42 model_time: 0.0312 (0.0339) evaluator_time: 0.0000 (0.0118) time: 0.0665 data: 0.0243 max mem: 2170 Test: [2000/2500] eta: 0:00:35 model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0120) time: 0.0923 data: 0.0258 max mem: 2170 Test: [2100/2500] eta: 0:00:28 model_time: 0.0312 (0.0339) evaluator_time: 0.0000 (0.0119) time: 0.0697 data: 0.0242 max mem: 2170 Test: [2200/2500] eta: 0:00:21 model_time: 0.0313 (0.0339) evaluator_time: 0.0156 (0.0119) time: 0.0720 data: 0.0235 max mem: 2170 Test: [2300/2500] eta: 0:00:14 model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0120) time: 0.0720 data: 0.0234 max mem: 2170 Test: [2400/2500] eta: 0:00:07 model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0120) time: 0.0720 data: 0.0227 max mem: 2170 Test: [2499/2500] eta: 0:00:00 model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0120) time: 0.0688 data: 0.0227 max mem: 2170 Test: Total time: 0:02:59 (0.0717 s / it) Averaged stats: model_time: 0.0312 (0.0339) evaluator_time: 0.0156 (0.0120) Accumulating evaluation results... DONE (t=4.52s). IoU metric: bbox Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.189 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.316 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.197 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.013 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.114 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.358 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.190 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.266 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.273 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.025 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.206 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.486 At size = 300 -> AP = 0.32
If not otherwise specified, we use the default parameters stated in get_pretrained_fasterrcnn() with a batch size of 2, running on an Azure Linux DSVM with a single V100 GPU. Using these default parameters, the AP is 0.57 at an inference time of 46.1 ms per image. On a near-identical Windows DSVM, the inference speed would be 66.4 ms due to PyTorch's dataloader being slow on Windows. Note that these speeds will differ depending on the exact machine configuration, especially the type of GPU.
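The per-image numbers quoted above can be measured by timing predict_dl over the test set and dividing by the number of test images; a minimal sketch, assuming detector and data are set up as earlier in this notebook:
start_time = time.time()
detector.predict_dl(data.test_dl)            # run inference over all test images
ms_per_image = 1000 * (time.time() - start_time) / len(data.test_ds)
print(f"Inference time: {ms_per_image:.1f} ms per image")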
The plot below summarizes the results of running multiple experiments; for an explanation of all parameters see Torchvision's faster_rcnn.py. We observe that:
- box_detections_per_img, box_nms_thresh and box_score_thresh do not affect inference speed and hence are omitted from the plot.
By combining these findings, we can define a sweet spot for COCO at min_size = max_size = 750 and rpn_pre_nms_top_n_test = rpn_post_nms_top_n_test = 250, which achieves an AP of 0.52 at 24.5 ms per image. Increasing the batch_size to 8 improves the inference time only modestly, to 23.1 ms.
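For reference, a minimal sketch of constructing this sweet-spot configuration: min_size and max_size are passed to get_pretrained_fasterrcnn as in the loop above, and we assume the rpn_* parameters (standard keyword arguments of Torchvision's FasterRCNN, see faster_rcnn.py) are forwarded in the same way.
# Hedged sketch of the sweet-spot parameters discussed above; the rpn_* keyword
# arguments are assumed to be forwarded to torchvision's FasterRCNN constructor.
sweet_spot_model = get_pretrained_fasterrcnn(
    min_size = 750,
    max_size = 750,
    rpn_pre_nms_top_n_test = 250,
    rpn_post_nms_top_n_test = 250,
)
sweet_spot_detector = DetectionLearner(data, sweet_spot_model)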
# Plot accuracy vs inference speed
colors = ("green", "red", "orange", "blue", "black")
markers = ('X', 'o', 'P', 'v', '>', 'D')
plt.figure(figsize=(16,6))
plt.scatter(46.1, 0.57, c=colors[3], marker=markers[0], s=500, alpha=0.8, label="Default params (batch_size = 2)")
plt.scatter(24.5, 0.52, c=colors[3], marker=markers[1], s=500, alpha=0.8, label="Sweet-spot params (batch_size = 2)")
plt.scatter(23.1, 0.52, c=colors[3], marker=markers[2], s=500, alpha=0.8, label="Sweet-spot params (batch_size = 8)")
plt.plot([17.6,19.9,28.1,39.7,116.], [0.32,0.45,0.53,0.56,0.49], c=colors[0], alpha=0.5)
plt.scatter(17.6, 0.32, c=colors[0], marker=markers[1], s=100, alpha=0.8, label="min_size = max_size = 300")
plt.scatter(19.9, 0.45, c=colors[0], marker=markers[2], s=100, alpha=0.8, label="min_size = max_size = 500")
plt.scatter(28.1, 0.53, c=colors[0], marker=markers[3], s=100, alpha=0.8, label="min_size = max_size = 750")
plt.scatter(39.7, 0.56, c=colors[0], marker=markers[4], s=100, alpha=0.8, label="min_size = max_size = 1000")
plt.scatter(116., 0.49, c=colors[0], marker=markers[5], s=100, alpha=0.8, label="min_size = max_size = 2000")
plt.plot([41.5,41.7,42.2,46.2,82.4], [0.41,0.54,0.56,0.57,0.57], c=colors[1], alpha=0.5)
plt.scatter(41.5, 0.41, c=colors[1], marker=markers[0], s=100, alpha=0.8, label="rpn_pre_nms_top_n_test = rpn_post_nms_top_n_test = 10")
plt.scatter(41.7, 0.54, c=colors[1], marker=markers[1], s=100, alpha=0.8, label="rpn_pre_nms_top_n_test = rpn_post_nms_top_n_test = 100")
plt.scatter(42.2, 0.56, c=colors[1], marker=markers[2], s=100, alpha=0.8, label="rpn_pre_nms_top_n_test = rpn_post_nms_top_n_test = 250")
plt.scatter(46.2, 0.57, c=colors[1], marker=markers[3], s=100, alpha=0.8, label="rpn_pre_nms_top_n_test = rpn_post_nms_top_n_test = 1000")
plt.scatter(82.4, 0.57, c=colors[1], marker=markers[4], s=100, alpha=0.8, label="rpn_pre_nms_top_n_test = rpn_post_nms_top_n_test = 5000")
plt.plot([42.3,45.8,46.6], [0.52,0.57,0.56], c=colors[2], alpha=0.5)
plt.scatter(42.3, 0.52, c=colors[2], marker=markers[0], s=100, alpha=0.8, label="rpn_nms_thresh = 0.1")
plt.scatter(45.8, 0.57, c=colors[2], marker=markers[1], s=100, alpha=0.8, label="rpn_nms_thresh = 0.7")
plt.scatter(46.6, 0.56, c=colors[2], marker=markers[2], s=100, alpha=0.8, label="rpn_nms_thresh = 0.95")
plt.grid()
plt.xlim(15, 120)
plt.ylim(0.3, 0.6)
plt.xlabel("Inference speed per image [ms]")
plt.ylabel("Accuracy on COCO validation 2017 at IoU=0.5 [mAP]")
plt.legend(loc=4)
plt.show()
This notebook illustrated the trade-off between inference time and accuracy induced by different parameter settings, using COCO as the benchmark dataset. Additionally, it showed how to reproduce the accuracy of Torchvision's pre-trained Faster R-CNN model as published on their models site.
# Preserve some of the notebook outputs
sb.glue("aps", aps)
sb.glue("num_test_images", len(data.test_ds))