In this tutorial, we will cover:
This tutorial is heavily based on the Technion EE046746 tutorial.
# Setup
%matplotlib inline
import os
import sys
import torch
import torchvision
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 20
# Directory where torchvision datasets are downloaded/cached
data_dir = os.path.expanduser('~/.pytorch-datasets')
Recall:
What challenging setups can it solve for us?
What do we see here?
Classification and localization of multiple instances is called Object Detection.
What are the problems with this approach?
Problems
How do you know the size of the window so that it always contains the object?
Let's say that the window predicted the class; how do we know it's a good rectangle?
The output of such algorithms usually consists of 5 values per detection:
one that determines the class, and 4 that determine the bounding box:
top_left_x
top_left_y
width
height
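To make this representation concrete, here is a minimal sketch (the function name is ours, for illustration) converting a box given as (top_left_x, top_left_y, width, height) into corner coordinates, the (x1, y1, x2, y2) format that many libraries, including torchvision, expect:

```python
# Convert (top_left_x, top_left_y, width, height) -> (x1, y1, x2, y2).
def xywh_to_xyxy(box):
    x, y, w, h = box
    return (x, y, x + w, y + h)

print(xywh_to_xyxy((10, 20, 50, 30)))  # (10, 20, 60, 50)
```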
How can we tell if the predicted bounding box is good with respect to the ground truth (labeled) bounding box?
The evaluation score is usually mean Average Precision (mAP)
For that we're also going to talk about intersection over union (IoU)
First, let's recall some definitions.
| confusion matrix | precision and recall |
|---|---|
Precision is a measure of "when your model makes a positive prediction, how often is it correct?" It indicates how much we can rely on the model's positive predictions.
$$ \text{Precision} = \frac{TP}{TP + FP} $$
Recall is a measure of "did your model predict every instance that it should have?" It indicates how many of the objects it should have detected were missed.
$$ \text{Recall} = \frac{TP}{TP + FN} $$
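As a small sketch of these two formulas (the counts below are made up for illustration):

```python
# Precision and recall from raw confusion-matrix counts.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# e.g. 8 correct detections, 2 false alarms, 4 missed objects:
print(precision_recall(tp=8, fp=2, fn=4))  # (0.8, 0.666...)
```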
Also called the Jaccard Index, $$ IoU = \frac{TP}{TP + FP+ FN} = \frac{\mid X \cap Y \mid}{\mid X \mid + \mid Y \mid - \mid X \cap Y \mid} $$
IoU yields a value between 0 and 1; the higher the IoU, the better the predicted location of the box for a given object.
Typical threshold for detection: 0.5
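Here is a minimal sketch of IoU for axis-aligned boxes in (x1, y1, x2, y2) format; torchvision also ships a batched, tensor-based equivalent, `torchvision.ops.box_iou`:

```python
# IoU for two axis-aligned boxes in (x1, y1, x2, y2) format.
def iou(box_a, box_b):
    # Intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```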
We can compute the precision (P) and recall (R) at each confidence threshold:
and define the Average Precision (AP) as the area under the precision-recall curve (AP-AUC).
In practice, we choose a small set of thresholds and approximate the area as follows:
We do that for all the instances of a class in the dataset!
The mean Average Precision (mAP) is the mean of the Average Precisions computed over all the classes of the challenge.
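As a concrete illustration, here is a hedged sketch of the classic 11-point interpolated AP (the PASCAL VOC style approximation); the function name and its inputs are ours for illustration:

```python
import numpy as np

# 11-point interpolated AP: at each recall level r in {0, 0.1, ..., 1.0},
# take the maximum precision achieved at any recall >= r, then average.
def average_precision_11pt(recalls, precisions):
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0, 1, 11):
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / 11
    return ap

# mAP is then the mean of the per-class APs:
# mAP = sum(ap_per_class) / num_classes
```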
Typically, OD models are divided into two categories: two-stage detectors, which first generate region proposals (e.g., the R-CNN family), and one-stage detectors, which predict boxes directly (e.g., YOLO, SSD).
| | R-CNN | Fast R-CNN | Faster R-CNN |
|---|---|---|---|
| Test Time Per Image (sec) | 50 | 2 | 0.2 |
| Speed Up | 1x | 25x | 250x |
There is no change in test time.
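Since torchvision is already imported above, here is a minimal inference sketch with its pretrained Faster R-CNN (note: newer torchvision versions take a `weights=` argument instead of `pretrained=True`; the dummy input is ours for illustration):

```python
# Pretrained Faster R-CNN from torchvision, in inference mode.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# The model expects a list of 3xHxW tensors with values in [0, 1].
dummy_image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([dummy_image])

# Each prediction is a dict with 'boxes' (x1, y1, x2, y2), 'labels', 'scores'.
print(predictions[0]['boxes'].shape, predictions[0]['scores'].shape)
```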
YOLO sees the complete image at once, as opposed to looking only at generated region proposals.
One limitation of YOLO is that it only predicts one class per grid cell. Hence, it struggles with groups of very small objects.
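To make the grid idea concrete, here is a small sketch of the YOLOv1 output shape: for an S×S grid with B boxes per cell (each box has x, y, w, h, confidence) and C classes, the network predicts S×S×(B·5 + C) values:

```python
# YOLOv1-style output tensor: 7x7 grid, 2 boxes per cell, 20 VOC classes.
S, B, C = 7, 2, 20
output = torch.zeros(S, S, B * 5 + C)
print(output.shape)  # torch.Size([7, 7, 30])
```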
https://soundcloud.com/tsirifein/daft-punk-harder-better-faster-stronger-trfn-remix
BatchNorm: adding batch normalization to the convolutional layers improves convergence and acts as a regularizer.
Image resolution matters: Fine-tuning the base model with high resolution images improves the detection performance.
Convolutional anchor box detection: like in Faster R-CNN.
K-means clustering of box dimensions: Different from Faster R-CNN, which uses hand-picked sizes of anchor boxes, YOLOv2 runs k-means clustering on the training data to find good priors on anchor box dimensions.
Add fine-grained features: YOLOv2 adds a passthrough layer to bring fine-grained features from an earlier layer to the last output layer. The mechanism of this passthrough layer is similar to identity mappings in ResNet to extract higher-dimensional features from previous layers.
Multi-scale training.
Light-weight base model: To make prediction even faster, YOLOv2 adopts a light-weight base model, DarkNet-19, which has 19 conv layers and 5 max-pooling layers. The key point is to insert avg poolings and 1x1 conv filters between 3x3 conv layers (see the sketch after this list).
Rich data: YOLOv2 also made use of the ImageNet dataset, using a graph distance over the label hierarchy to determine the class.
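Here is a hedged sketch of the DarkNet-19 pattern mentioned above: 1x1 conv layers squeezed between 3x3 convs to reduce channels (and compute). The channel counts are illustrative, not the paper's full configuration:

```python
import torch.nn as nn

# 3x3 -> 1x1 bottleneck -> 3x3, each followed by BatchNorm + LeakyReLU.
darknet_block = nn.Sequential(
    nn.Conv2d(128, 256, kernel_size=3, padding=1),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.1),
    nn.Conv2d(256, 128, kernel_size=1),   # 1x1 bottleneck
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.1),
    nn.Conv2d(128, 256, kernel_size=3, padding=1),
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.1),
)
print(darknet_block(torch.rand(1, 128, 32, 32)).shape)  # [1, 256, 32, 32]
```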
We do not have time to cover all the existing algorithms.
If you want to dive deeper, I recommend looking at:
EfficientDet - NAS-based detector
SSD in PyTorch - Single Shot MultiBox Detector model for object detection
After we cover vision transformers, you can also look at
At any time, you can find the state-of-the-art models here
Credits
This tutorial was written by Moshe Kimhi.
To re-use, please provide attribution and link to the original.
Some content from:
Images without specific credit: