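Object Detection

1. Object Localization

Object detection is one of the fastest-moving problems in computer vision since the rise of deep learning. Object localization is the first step toward it.

Object localization means finding a target in an image and marking it with a bounding box. It is usually combined with a classifier and involves a single target. Object detection builds on this by marking multiple targets in the same image.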

![What are localization and detection.png](img/What are localization and detection.png)

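A classifier with localization builds on a convolutional neural network classifier: in addition to the softmax output, the final layer also outputs the coordinates of the bounding box (the red box in the figure). These are written $b_x, b_y, b_h, b_w$ and denote the x-coordinate of the box center, the y-coordinate of the box center, the box height, and the box width.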

![Classification with localization.png](img/Classification with localization.png)

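The predicted output $\hat{y}$ has three parts: $p_c$, whether the image contains an object; $b_x, b_y, b_h, b_w$, the bounding box coordinates; and $c_1, c_2, c_3$, the class information. When the image contains no object, we do not care about the box coordinates or the class information, so the loss function is adjusted accordingly: when $p_c = 0$ in the label, only the error on $p_c$ is counted. The loss function in the figure below uses squared error as an example. Because both the object indicator and the class information are classification targets, the loss for those two parts can also be computed with log loss.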

![Defining the target label y.png](img/Defining the target label y.png)
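
The case split in the loss can be illustrated with a minimal sketch. It assumes the label is laid out as $[p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]$ and uses plain squared error throughout; the function name `localization_loss` and the sample numbers are made up for illustration.

```python
def localization_loss(y_true, y_pred):
    """Squared-error loss for a label laid out as
    [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]."""
    if y_true[0] == 1:
        # Object present: every component contributes to the loss.
        return sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    # No object: only the p_c prediction is penalized,
    # the remaining components are "don't care" values.
    return (y_pred[0] - y_true[0]) ** 2


y_car = [1, 0.5, 0.7, 0.3, 0.4, 0, 1, 0]
y_background = [0, 0, 0, 0, 0, 0, 0, 0]
prediction = [0.9, 0.5, 0.6, 0.3, 0.4, 0.1, 0.8, 0.1]

loss_with_object = localization_loss(y_car, prediction)
loss_without_object = localization_loss(y_background, prediction)
print(loss_with_object, loss_without_object)
```

With the background label, the same prediction is penalized only for its confident $p_c = 0.9$, no matter what box or class values it outputs.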

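2. Landmark Detection

Much like the classifier with localization above, the output layer of a neural network can directly output the coordinates of landmarks. The number of landmarks must be fixed in advance, and the position of each landmark in the output vector (for example the inner corner of the left eye, the chin, or a knee) must be consistent across samples.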

![Landmark detection.png](img/Landmark detection.png)

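3. Sliding Windows Object Detection

Sliding-window detection starts by training a classifier. Taking car detection as an example, the positive examples in the input $X$ should be cropped tightly, so that the car fills almost the entire image. The classifier itself can be a neural network, but it does not have to be.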

![Car detection example.png](img/Car detection example.png)

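The algorithm first picks a small window and, preferably, a small stride, slides the window across a new image, and feeds each crop to the classifier trained above. After one full pass, the window is enlarged and the procedure repeats.

One problem with sliding windows is the computational cost. A larger stride reduces the cost but loses precision. Before the rise of neural networks, sliding-window detection was usually paired with hand-engineered features and a simple linear classifier, so the computation was cheap. In the deep learning era, naively combining sliding windows with a convolutional neural network is still very slow, even with current computing power.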

![Sliding windows detection.png](img/Sliding windows detection.png)
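
To get a feel for the cost, the sketch below enumerates window positions and counts how many classifier calls a naive implementation would need. The image size, window sizes and strides are arbitrary illustrative choices, and `sliding_windows` and `count_crops` are hypothetical helper names.

```python
def sliding_windows(image_h, image_w, window, stride):
    """Yield (top, left, bottom, right) for every square window that fits."""
    for top in range(0, image_h - window + 1, stride):
        for left in range(0, image_w - window + 1, stride):
            yield top, left, top + window, left + window


def count_crops(image_h, image_w, window_sizes, stride):
    """Number of classifier calls for one pass per window size."""
    return sum(
        sum(1 for _ in sliding_windows(image_h, image_w, w, stride))
        for w in window_sizes
    )


# A 64x64 image scanned with 16x16 and 32x32 windows.
small_stride = count_crops(64, 64, window_sizes=(16, 32), stride=2)
large_stride = count_crops(64, 64, window_sizes=(16, 32), stride=8)
print(small_stride, large_stride)
```

Even on this tiny image the small stride needs more than ten times as many classifier evaluations as the large one, which is the trade-off between cost and precision described above.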

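The next section implements sliding windows with convolutions, which brings the computation down to an acceptable level.

4. Convolutional Implementation of Sliding Windows

Before describing the convolutional implementation, we first introduce one idea: a fully connected (FC) layer can be expressed equivalently as a convolutional layer, as shown below: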

![Turning FC layer into convolutional layers.png](img/Turning FC layer into convolutional layers.png)
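
The equivalence can be checked numerically. The sketch below assumes a 5×5×16 feature volume feeding an FC layer with 400 units (sizes chosen for illustration) and uses random weights; it is a NumPy check of the idea, not the network from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 5x5x16 feature volume and an FC layer with 400 units.
features = rng.standard_normal((5, 5, 16))
fc_weights = rng.standard_normal((5 * 5 * 16, 400))

# Fully connected layer: flatten the volume, then one matrix product.
fc_output = features.reshape(-1) @ fc_weights              # shape (400,)

# The same weights viewed as 400 filters of size 5x5x16.
filters = fc_weights.reshape(5, 5, 16, 400)

# A "valid" convolution of a 5x5 input with a 5x5 filter has exactly one
# output position, so each filter reduces to a sum of element-wise products.
conv_output = np.einsum("hwc,hwcf->f", features, filters)  # a 1x1x400 volume

same = np.allclose(fc_output, conv_output)
print(same)
```

Because the two computations agree, the FC layers of a trained classifier can be replaced by convolutions with the same weights, which is what allows the network to be applied to larger inputs.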

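Suppose we have already trained a convolutional classifier that accepts 14×14×3 images. For a larger input image, we do not need to crop windows and classify them one by one: applying the same network, with its FC layers expressed as convolutions, to the whole image classifies all sliding-window positions in a single pass and shares the computation for overlapping regions.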

![Convolution implementation of sliding windows.png](img/Convolution implementation of sliding windows.png)

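5. Bounding Box Prediction

The convolutional implementation of sliding windows from the previous section is computationally efficient, but it still has a problem: it cannot determine bounding boxes accurately. Using bounding boxes of different sizes requires training several different classifiers. In addition, the bounding boxes in real training samples may not be square, and they vary from sample to sample. The YOLO (You Only Look Once) algorithm predicts bounding boxes more precisely.

In YOLO, each image is divided into a grid of cells, 3×3 in the figure and usually finer in practice (for example 19×19). Each object is assigned to exactly one cell, the one containing its center point. The target variable $y$ of each cell has the same form as in object localization. The network thus outputs, for every cell, the precise position of the bounding box (if there is one). The case of several objects in one cell is covered later.

YOLO is implemented convolutionally, so it is very efficient and is often used for real-time object detection.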

![YOLO algorithm.png](img/YOLO algorithm.png)

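Note the value ranges of the bounding box parameters. Following the usual image convention, the top-left corner of each grid cell is defined as (0, 0) and its bottom-right corner as (1, 1), and $b_x$ and $b_y$ are measured in this coordinate system. Because they are the coordinates of the center point, which lies inside the cell, both values are between 0 and 1. An object can span several cells, however, so $b_h$ and $b_w$ can be greater than 1.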

![Specify the bounding boxes.png](img/Specify the bounding boxes.png)
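
The following sketch shows one way to turn a box given in whole-image fractions into the cell-relative encoding described above. The function name `encode_box`, the 3×3 grid and the sample box are illustrative assumptions.

```python
def encode_box(center_x, center_y, width, height, grid_size=3):
    """Map a box given in whole-image fractions (0..1) to
    (row, col, b_x, b_y, b_h, b_w) relative to its grid cell."""
    # The cell that contains the center point owns the object.
    col = min(int(center_x * grid_size), grid_size - 1)
    row = min(int(center_y * grid_size), grid_size - 1)
    # Offsets of the center inside that cell, between 0 and 1.
    b_x = center_x * grid_size - col
    b_y = center_y * grid_size - row
    # Sizes measured in cell widths/heights, so they may exceed 1.
    b_w = width * grid_size
    b_h = height * grid_size
    return row, col, b_x, b_y, b_h, b_w


# A box centered at (0.5, 0.75) that is half the image wide and a quarter high.
row, col, b_x, b_y, b_h, b_w = encode_box(0.5, 0.75, 0.5, 0.25)
print(row, col, b_x, b_y, b_h, b_w)
```

Here the box lands in the bottom-middle cell, and its width comes out as 1.5 cell widths, an example of $b_w > 1$.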

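6. Intersection over Union

Intersection over union (IoU) is a measure of how good a predicted bounding box is. In the next section it also serves as a component that improves the detection algorithm.

By convention in computer vision, a bounding box is judged correct when $IoU \ge 0.5$. The value 0.5 is a human-chosen convention with no theoretical justification; for a stricter criterion, the threshold can be set to 0.6.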

![Evaluating object localization.png](img/Evaluating object localization.png)

7. Non-Max Suppression

For a single object, several grid cells may each decide that the object belongs to them, so YOLO can produce multiple bounding boxes for the same object.

![Non-max suppression example.png](img/Non-max suppression example.png)

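IoU makes it possible to remove multiple bounding boxes for the same object: pick the box with the highest probability, then discard every remaining box whose IoU with it satisfies $IoU \ge 0.5$, since such a box is considered to cover the same object. The same step is then repeated on the boxes that remain.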

![Non-max suppression algorithm.png](img/Non-max suppression algorithm.png)
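
The procedure can be sketched in plain Python for a single class. Boxes are assumed to be given as corner coordinates (x1, y1, x2, y2); the helper names and the sample boxes are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the chosen one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


# Two heavily overlapping boxes on one object, plus a separate third box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)
```

The second box overlaps the first with an IoU of about 0.82 and is suppressed, while the third box does not overlap and is kept.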

8. Anchor Boxes

The detection algorithm described so far cannot handle several objects in the same grid cell. Anchor boxes solve this problem.

![Overlapping objects.png](img/Overlapping objects.png)

![Archor box algorithm.png](img/Archor box algorithm.png)

![Archor box example.png](img/Archor box example.png)

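Anchor boxes do not handle the following cases well:
- two anchor boxes are predefined but a cell contains three objects
- two objects in the same cell have anchor boxes of very similar shape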
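
A common way to decide which anchor box an object belongs to is to compare shapes with IoU and pick the best match. The sketch below assumes that rule and uses two made-up anchors; `shape_iou` and `best_anchor` are hypothetical helper names.

```python
def shape_iou(box_wh, anchor_wh):
    """IoU of two boxes that share the same center,
    so only width and height matter."""
    inter = min(box_wh[0], anchor_wh[0]) * min(box_wh[1], anchor_wh[1])
    union = box_wh[0] * box_wh[1] + anchor_wh[0] * anchor_wh[1] - inter
    return inter / union


def best_anchor(box_wh, anchors):
    """Index of the anchor whose shape overlaps the box the most."""
    return max(range(len(anchors)), key=lambda i: shape_iou(box_wh, anchors[i]))


# Hypothetical anchors as (width, height): one tall, one wide.
anchors = [(1.0, 3.0), (3.0, 1.0)]
pedestrian = (0.8, 2.5)
car = (2.8, 1.2)

pedestrian_anchor = best_anchor(pedestrian, anchors)
car_anchor = best_anchor(car, anchors)
print(pedestrian_anchor, car_anchor)
```

The second limitation listed above is visible here: if both objects in a cell were tall and narrow, both would pick anchor 0 and compete for the same slot in the label.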

9. The YOLO Algorithm

![YOLO Training.png](img/YOLO Training.png)

![YOLO Making Predictions.png](img/YOLO Making Predictions.png)

![YOLO Outputing the non- max supressed outputs.png](img/YOLO Outputing the non- max supressed outputs.png)

10. Region Proposal

A traditional image segmentation algorithm splits the image into regions, and a classifier is then run on each proposed region.

![Region proposal R-CNN.png](img/Region proposal R-CNN.png)

This algorithm is rather slow, and the community has been working continuously on making it faster.

![Faster algorithms.png](img/Faster algorithms.png)

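11. Autonomous Driving - Car Detection

In this section we use the YOLO model for object detection. Most of the ideas come from two papers: Redmon et al., 2016 (https://arxiv.org/abs/1506.02640) and Redmon and Farhadi, 2016 (https://arxiv.org/abs/1612.08242).

In this section, we will:

- Apply object detection to a car detection dataset
- Deal with bounding boxes

Run the following cell to import the packages and dependencies.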

In [1]:

import argparse
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import scipy.io
import scipy.misc
import numpy as np
import pandas as pd
import PIL
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Lambda, Conv2D
from keras.models import load_model, Model
from yolo_utils import read_classes, read_anchors, generate_colors, preprocess_image, draw_boxes, scale_boxes
from yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners, preprocess_true_boxes, yolo_loss, yolo_body

%matplotlib inline

Using TensorFlow backend.

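11.1 Problem Statement

As a core component of a self-driving car project, we first need to build a car detection system. To collect data, a camera was mounted on the front of the car, taking a picture of the road ahead every few seconds while driving.

All of the pictures have been gathered into one folder and labeled: every car in every picture has a bounding box drawn around it. Here is an example of a bounding box.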

**Figure 1** : **Definition of a box**

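Suppose YOLO needs to recognize 80 classes. The class label $c$ can then be represented either as an integer from 1 to 80 or as an 80-dimensional vector in which one element is 1 and the rest are 0. This section uses whichever representation is more convenient at each step.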
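
The two representations carry the same information and are easy to convert between. A minimal sketch, assuming 1-based class indices as in the text (`to_one_hot` and `to_index` are illustrative names):

```python
def to_one_hot(class_index, num_classes=80):
    """Convert a 1-based class index into a one-hot list."""
    if not 1 <= class_index <= num_classes:
        raise ValueError("class_index must be between 1 and num_classes")
    vector = [0] * num_classes
    vector[class_index - 1] = 1
    return vector


def to_index(one_hot):
    """Recover the 1-based class index from a one-hot list."""
    return one_hot.index(1) + 1


label = to_one_hot(3)
print(label[:5], to_index(label))
```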

In this exercise we learn how YOLO works and then apply it to car detection. Because training a YOLO model is computationally expensive, we load a pretrained model.

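11.2 YOLO

YOLO ("you only look once") is a very popular object detection algorithm: it achieves high accuracy while still running in real time. "Only looking once" means that the algorithm needs just one forward pass through the network to make its predictions. After non-max suppression, it outputs the recognized objects together with their bounding boxes.

11.2.1 Model Details

About the data:

- The input is a batch of images of shape (m, 608, 608, 3).
- The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers $(p_c, b_x, b_y, b_h, b_w, c)$, as explained above. If $c$ is expanded into an 80-dimensional vector, each bounding box is represented by 85 numbers.

We use 5 anchor boxes, so the YOLO architecture can be thought of as: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85). The encoding is shown in the figure below: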

**Figure 2** : **Encoding architecture for YOLO**

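If the center point of an object falls into a grid cell, that grid cell is responsible for detecting the object.

Since we use 5 anchor boxes, each of the 19 × 19 cells encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we flatten the last two dimensions of the (19, 19, 5, 85) encoding, so the output of the deep CNN is (19, 19, 425).

**Figure 3** : **Flattening the last two dimensions**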
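
The flattening is a pure reshape and loses no information. The NumPy sketch below uses a dummy array in place of the real network output to show where each anchor box ends up in the 425-long axis.

```python
import numpy as np

# Stand-in for one image's encoding: 19x19 cells, 5 anchor boxes, 85 numbers each.
encoding = np.arange(19 * 19 * 5 * 85, dtype=np.float32).reshape(19, 19, 5, 85)

# Flatten the anchor and per-box dimensions into one axis of length 5 * 85 = 425.
flat = encoding.reshape(19, 19, 425)

# The 85 numbers of anchor box k in a cell occupy the slice [85 * k, 85 * (k + 1)).
k = 2
same_box = np.array_equal(flat[4, 7, 85 * k:85 * (k + 1)], encoding[4, 7, k])

# Reshaping back recovers the original layout exactly.
restored = flat.reshape(19, 19, 5, 85)
print(flat.shape, same_box)
```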

Then, for each box (of each cell), we compute the following element-wise product to extract the probability that the box contains each class.

**Figure 4** : **Find the class detected by each box**

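Here is one way to visualize what YOLO predicts on an image:

- For each of the 19×19 grid cells, find the maximum probability score (taking the max over the 5 anchor boxes and over the 80 classes).
- Color each grid cell according to the object class it considers most likely.

The visualization looks like this: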

**Figure 5** : Each of the 19x19 grid cells colored according to which class has the largest predicted probability in that cell.
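
The per-cell maximum behind this visualization can be sketched in NumPy. Random arrays stand in for the real network outputs, so only the shapes and the indexing logic are meaningful here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the network outputs of a single image.
box_confidence = rng.random((19, 19, 5, 1))     # p_c for each box
box_class_probs = rng.random((19, 19, 5, 80))   # class probabilities for each box

# Score of every (box, class) pair, shape (19, 19, 5, 80).
box_scores = box_confidence * box_class_probs

# Merge the 5 boxes and 80 classes of a cell into one axis of 400 candidates.
per_cell = box_scores.reshape(19, 19, 5 * 80)

# Highest score in each cell and the class that achieved it.
cell_best_score = per_cell.max(axis=-1)           # shape (19, 19)
cell_best_class = per_cell.argmax(axis=-1) % 80   # shape (19, 19), values 0..79
print(cell_best_class.shape)
```

Mapping each entry of `cell_best_class` to a color would produce a picture like Figure 5.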

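This visualization is not a core part of YOLO's prediction algorithm, but it is a good way to show an intermediate result.

Another way to visualize YOLO's output is to draw the predicted bounding boxes directly on the original image: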

**Figure 6** : Each cell gives you 5 boxes. In total, the model predicts: 19x19x5 = 1805 boxes just by looking once at the image (one forward pass through the network)! Different colors denote different classes.

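The figure above shows only the boxes to which the model assigned a high probability, yet there are still too many. The output of the algorithm needs to be filtered down to a much smaller set of detected objects, and this is what non-max suppression is for. Specifically, starting from the 19×19×5 boxes produced by the model, we need to:

- Get rid of boxes with a low score (meaning the box is not very confident about detecting a class)
- Select only one box, the most confident one, when several boxes overlap each other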

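11.2.2 Filtering with a threshold on class scores

The first filter is a threshold. Each of the 5 anchor boxes in each cell has a score, and any box whose score is below the chosen threshold is discarded.

The model returns 19x19x5x85 numbers in total, with each box described by 85 numbers. It is convenient to rearrange the (19,19,5,85) (or (19,19,425)) tensor into the following variables:

- box_confidence: tensor of shape $(19 \times 19, 5, 1)$ containing $p_c$ (the probability that the box contains some object) for each of the 5 boxes in each of the 19×19 cells
- boxes: tensor of shape $(19 \times 19, 5, 4)$ containing $(b_x, b_y, b_h, b_w)$ for each of the 5 boxes in each of the 19×19 cells
- box_class_probs: tensor of shape $(19 \times 19, 5, 80)$ containing the detection probabilities $(c_1, c_2, ... c_{80})$ of the 80 classes for each of the 5 boxes in each of the 19×19 cells

Exercise: implement yolo_filter_boxes().

Compute the score of each box for each class as described in Figure 4. The following Python code may help: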

a = np.random.randn(19*19, 5, 1)
b = np.random.randn(19*19, 5, 80)
c = a * b # shape of c will be (19*19, 5, 80)

For each box, find:
- the index of the class with the highest score (be careful with what axis you choose; consider using axis=-1)
- the corresponding score (be careful with what axis you choose; consider using axis=-1)
Create a mask using the threshold. As a reminder: ([0.9, 0.3, 0.4, 0.5, 0.1] < 0.4) returns: [False, True, False, False, True]. The mask should be True for the boxes you want to keep.
Use TensorFlow to apply the mask to box_class_scores, boxes and box_classes, filtering out the boxes we do not want and leaving only the subset to keep.

Reminder: to call a Keras function, write K.function(...).
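
The steps above can be prototyped in NumPy without a TensorFlow session, which is handy for checking shapes by hand. This is a sketch of the same logic under the assumption that the inputs are ordinary arrays; `filter_boxes_np` is a hypothetical name, not the graded function.

```python
import numpy as np


def filter_boxes_np(box_confidence, boxes, box_class_probs, threshold=0.6):
    """NumPy counterpart of the thresholding step for arrays shaped
    (..., 1), (..., 4) and (..., C)."""
    box_scores = box_confidence * box_class_probs   # broadcasts to (..., C)
    box_classes = box_scores.argmax(axis=-1)        # best class per box
    box_class_scores = box_scores.max(axis=-1)      # score of that class
    mask = box_class_scores >= threshold            # True for boxes to keep
    # Boolean indexing flattens the leading dimensions, like tf.boolean_mask.
    return box_class_scores[mask], boxes[mask], box_classes[mask]


# Two boxes with three classes each; only the first one is confident enough.
confidence = np.array([[0.9], [0.4]])
corners = np.array([[0.1, 0.1, 0.5, 0.5], [0.2, 0.2, 0.6, 0.6]])
class_probs = np.array([[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])

scores, kept_boxes, classes = filter_boxes_np(confidence, corners, class_probs)
print(scores, kept_boxes, classes)
```

The first box scores 0.9 × 0.8 = 0.72 for class 1 and is kept; the second reaches only 0.16 and is dropped.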

In [2]:

# GRADED FUNCTION: yolo_filter_boxes

def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
    """Filters YOLO boxes by thresholding on object and class confidence.
    
    Arguments:
    box_confidence -- tensor of shape (19, 19, 5, 1)
    boxes -- tensor of shape (19, 19, 5, 4)
    box_class_probs -- tensor of shape (19, 19, 5, 80)
    threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
    
    Returns:
    scores -- tensor of shape (None,), containing the class probability score for selected boxes
    boxes -- tensor of shape (None, 4), containing (b_x, b_y, b_h, b_w) coordinates of selected boxes
    classes -- tensor of shape (None,), containing the index of the class detected by the selected boxes
    
    Note: "None" is here because you don't know the exact number of selected boxes, as it depends on the threshold. 
    For example, the actual output size of scores would be (10,) if there are 10 boxes.
    """
    
    # Step 1: Compute box scores
    ### START CODE HERE ### (≈ 1 line)
    box_scores = box_confidence * box_class_probs
    ### END CODE HERE ###
    
    # Step 2: Find the box_classes thanks to the max box_scores, keep track of the corresponding score
    ### START CODE HERE ### (≈ 2 lines)
    box_classes = K.argmax(box_scores, axis=-1)
    box_class_scores = K.max(box_scores, axis=-1)
    ### END CODE HERE ###
    
    # Step 3: Create a filtering mask based on "box_class_scores" by using "threshold". The mask should have the
    # same dimension as box_class_scores, and be True for the boxes you want to keep (with probability >= threshold)
    ### START CODE HERE ### (≈ 1 line)
    filtering_mask = box_class_scores >= threshold
    ### END CODE HERE ###
    
    # Step 4: Apply the mask to scores, boxes and classes
    ### START CODE HERE ### (≈ 3 lines)
    scores = tf.boolean_mask(box_class_scores, filtering_mask)
    boxes = tf.boolean_mask(boxes, filtering_mask)
    classes = tf.boolean_mask(box_classes, filtering_mask)
    ### END CODE HERE ###
    
    return scores, boxes, classes

In [3]:

with tf.Session() as test_a:
    box_confidence = tf.random_normal([19, 19, 5, 1], mean=1, stddev=4, seed = 1)
    boxes = tf.random_normal([19, 19, 5, 4], mean=1, stddev=4, seed = 1)
    box_class_probs = tf.random_normal([19, 19, 5, 80], mean=1, stddev=4, seed = 1)
    scores, boxes, classes = yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = 0.5)
    print("scores[2] = " + str(scores[2].eval()))
    print("boxes[2] = " + str(boxes[2].eval()))
    print("classes[2] = " + str(classes[2].eval()))
    print("scores.shape = " + str(scores.shape))
    print("boxes.shape = " + str(boxes.shape))
    print("classes.shape = " + str(classes.shape))

scores[2] = 10.7506
boxes[2] = [ 8.42653275  3.27136683 -0.5313437  -4.94137383]
classes[2] = 7
scores.shape = (?,)
boxes.shape = (?, 4)
classes.shape = (?,)

Expected output:

| Variable | Value |
|---|---|
| **scores[2]** | 10.7506 |
| **boxes[2]** | [ 8.42653275 3.27136683 -0.5313437 -4.94137383] |
| **classes[2]** | 7 |
| **scores.shape** | (?,) |
| **boxes.shape** | (?, 4) |
| **classes.shape** | (?,) |

11.2.3 Non-max suppression

Even after filtering by class score, many overlapping boxes remain. The second filter uses non-max suppression (NMS) to select the right boxes.

**Figure 7** : In this example, the model has predicted 3 cars, but it's actually 3 predictions of the same car. Running non-max suppression (NMS) will select only the most accurate (highest probability) one of the 3 boxes.

Non-max suppression relies on a very important function called Intersection over Union, or IoU.

**Figure 8** : Definition of "Intersection over Union".

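Exercise: implement iou(). Some hints:

In this exercise only, a box is defined by its two corners (upper left and lower right), (x1, y1, x2, y2), rather than by its midpoint and height/width.
To compute the area of a rectangle, multiply its height (y2 - y1) by its width (x2 - x1).
The coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes are found as follows:
- xi1 = maximum of the x1 coordinates of the two boxes
- yi1 = maximum of the y1 coordinates of the two boxes
- xi2 = minimum of the x2 coordinates of the two boxes
- yi2 = minimum of the y2 coordinates of the two boxes
To compute the intersection area, make sure the height and width of the intersection are positive; otherwise the intersection area should be 0. Use max(height, 0) and max(width, 0).

In the code below, by convention, (0,0) is the top-left corner of the image, (1,0) is the upper-right corner, and (1,1) is the lower-right corner.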

In [4]:

# GRADED FUNCTION: iou

def iou(box1, box2):
    """Implement the intersection over union (IoU) between box1 and box2
    
    Arguments:
    box1 -- first box, list object with coordinates (x1, y1, x2, y2)
    box2 -- second box, list object with coordinates (x1, y1, x2, y2)
    """

    # Calculate the (xi1, yi1, xi2, yi2) coordinates of the intersection of box1 and box2. Calculate its Area.
    ### START CODE HERE ### (≈ 5 lines)
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    inter_area = max(yi2 - yi1, 0) * max(xi2 - xi1, 0)
    ### END CODE HERE ###    

    # Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)
    ### START CODE HERE ### (≈ 3 lines)
    box1_area = max(box1[3] - box1[1], 0) * max(box1[2] - box1[0], 0)
    box2_area = max(box2[3] - box2[1], 0) * max(box2[2] - box2[0], 0)
    union_area = box1_area + box2_area - inter_area
    ### END CODE HERE ###
    
    # compute the IoU
    ### START CODE HERE ### (≈ 1 line)
    iou = inter_area / union_area
    ### END CODE HERE ###
    
    return iou

In [5]:

box1 = (2, 1, 4, 3)
box2 = (1, 2, 3, 4) 
print("iou = " + str(iou(box1, box2)))

iou = 0.14285714285714285

Expected output:

**iou =** 0.14285714285714285
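
As a check on this value, the computation for box1 = (2, 1, 4, 3) and box2 = (1, 2, 3, 4) can be worked by hand, following the hints above:

$$
\begin{aligned}
(x_{i1}, y_{i1}, x_{i2}, y_{i2}) &= (\max(2, 1),\ \max(1, 2),\ \min(4, 3),\ \min(3, 4)) = (2, 2, 3, 3) \\
\text{inter} &= (3 - 2) \times (3 - 2) = 1 \\
\text{area}_1 &= (3 - 1) \times (4 - 2) = 4 \\
\text{area}_2 &= (4 - 2) \times (3 - 1) = 4 \\
\text{union} &= 4 + 4 - 1 = 7 \\
IoU &= \frac{1}{7} \approx 0.1429
\end{aligned}
$$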

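We are now ready to implement non-max suppression. The key steps are:

1. Select the box that has the highest score.
2. Compute its overlap with all other boxes, and remove every box whose IoU with it is higher than iou_threshold.
3. Go back to step 1 and repeat until no box with a lower score than the currently selected box remains.

This removes all boxes that overlap heavily with a selected box, so only the best boxes remain.

Exercise: implement yolo_non_max_suppression() using TensorFlow. TensorFlow has a built-in function for non-max suppression, tf.image.non_max_suppression(), so our own iou() implementation is not needed here.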

In [6]:

# GRADED FUNCTION: yolo_non_max_suppression

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
    """
    Applies Non-max suppression (NMS) to set of boxes
    
    Arguments:
    scores -- tensor of shape (None,), output of yolo_filter_boxes()
    boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
    classes -- tensor of shape (None,), output of yolo_filter_boxes()
    max_boxes -- integer, maximum number of predicted boxes you'd like
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering
    
    Returns:
    scores -- tensor of shape (None,), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box
    
    Note: The "None" dimension of the output tensors is at most max_boxes.
    """
    
    max_boxes_tensor = K.variable(max_boxes, dtype='int32')     # tensor to be used in tf.image.non_max_suppression()
    K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor
    
    # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep
    ### START CODE HERE ### (≈ 1 line)
    nms_indices = tf.image.non_max_suppression(boxes, scores, max_boxes_tensor, iou_threshold=iou_threshold)
    ### END CODE HERE ###
    
    # Use K.gather() to select only nms_indices from scores, boxes and classes
    ### START CODE HERE ### (≈ 3 lines)
    scores = K.gather(scores, nms_indices)
    boxes = K.gather(boxes, nms_indices)
    classes = K.gather(classes, nms_indices)
    ### END CODE HERE ###
    
    return scores, boxes, classes

In [7]:

with tf.Session() as test_b:
    scores = tf.random_normal([54,], mean=1, stddev=4, seed = 1)
    boxes = tf.random_normal([54, 4], mean=1, stddev=4, seed = 1)
    classes = tf.random_normal([54,], mean=1, stddev=4, seed = 1)
    scores, boxes, classes = yolo_non_max_suppression(scores, boxes, classes)
    print("scores[2] = " + str(scores[2].eval()))
    print("boxes[2] = " + str(boxes[2].eval()))
    print("classes[2] = " + str(classes[2].eval()))
    print("scores.shape = " + str(scores.eval().shape))
    print("boxes.shape = " + str(boxes.eval().shape))
    print("classes.shape = " + str(classes.eval().shape))

scores[2] = 6.9384
boxes[2] = [-5.299932    3.13798141  4.45036697  0.95942086]
classes[2] = -2.24527
scores.shape = (10,)
boxes.shape = (10, 4)
classes.shape = (10,)

Expected output:

| Variable | Value |
|---|---|
| **scores[2]** | 6.9384 |
| **boxes[2]** | [-5.299932 3.13798141 4.45036697 0.95942086] |
| **classes[2]** | -2.24527 |
| **scores.shape** | (10,) |
| **boxes.shape** | (10, 4) |
| **classes.shape** | (10,) |

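11.2.4 Wrapping up the filtering

Next we implement a wrapper function that takes the output of the deep CNN (the 19x19x5x85 encoding) and filters all the boxes.

Exercise: implement yolo_eval(), which takes the output of the YOLO encoding and filters the boxes using the score threshold and non-max suppression. One last implementation detail: there are several ways to represent a box, such as by its corners or by its midpoint and height/width. YOLO converts between these formats at various points, using the following functions: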

boxes = yolo_boxes_to_corners(box_xy, box_wh)

which converts the midpoint and width/height representation (x, y, w, h) into the corner representation (x1, y1, x2, y2) expected as input by yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape)

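The YOLO network was trained on 608×608 images. If the test images have a different size, for example 720×1280, this function rescales the predicted boxes so that they can be drawn on top of the original image.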
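
The two conversions can be illustrated on a single box in plain Python. This is only a sketch of the idea: the real yolo_boxes_to_corners and scale_boxes helpers operate on tensors and may order the coordinates differently, and `center_to_corners` and `scale_to_image` are hypothetical names.

```python
def center_to_corners(x, y, w, h):
    """Convert (center x, center y, width, height) to (x1, y1, x2, y2)."""
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


def scale_to_image(box, image_height, image_width):
    """Rescale normalized (x1, y1, x2, y2) corners to pixel coordinates."""
    x1, y1, x2, y2 = box
    return x1 * image_width, y1 * image_height, x2 * image_width, y2 * image_height


# A centered box covering a quarter of the width and half of the height.
corners = center_to_corners(0.5, 0.5, 0.25, 0.5)
pixels = scale_to_image(corners, image_height=720, image_width=1280)
print(corners, pixels)
```

Because the boxes are expressed as fractions of the image, rescaling to a 720×1280 picture is a multiplication of x values by the width and y values by the height.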

In [8]:

# GRADED FUNCTION: yolo_eval

def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
    """
    Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.
    
    Arguments:
    yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
                    box_confidence: tensor of shape (None, 19, 19, 5, 1)
                    box_xy: tensor of shape (None, 19, 19, 5, 2)
                    box_wh: tensor of shape (None, 19, 19, 5, 2)
                    box_class_probs: tensor of shape (None, 19, 19, 5, 80)
    image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
    max_boxes -- integer, maximum number of predicted boxes you'd like
    score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering
    
    Returns:
    scores -- tensor of shape (None, ), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box
    """
    
    ### START CODE HERE ### 
    
    # Retrieve outputs of the YOLO model (≈1 line)
    box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs

    # Convert boxes to be ready for filtering functions 
    boxes = yolo_boxes_to_corners(box_xy, box_wh)

    # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold (≈1 line)
    scores, boxes, classes = yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold=score_threshold)
    
    # Scale boxes back to original image shape.
    boxes = scale_boxes(boxes, image_shape)

    # Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold (≈1 line)
    scores, boxes, classes = yolo_non_max_suppression(scores, boxes, classes, max_boxes=max_boxes, iou_threshold=iou_threshold)
    
    ### END CODE HERE ###
    
    return scores, boxes, classes

In [9]:

with tf.Session() as test_b:
    yolo_outputs = (tf.random_normal([19, 19, 5, 1], mean=1, stddev=4, seed = 1),
                    tf.random_normal([19, 19, 5, 2], mean=1, stddev=4, seed = 1),
                    tf.random_normal([19, 19, 5, 2], mean=1, stddev=4, seed = 1),
                    tf.random_normal([19, 19, 5, 80], mean=1, stddev=4, seed = 1))
    scores, boxes, classes = yolo_eval(yolo_outputs)
    print("scores[2] = " + str(scores[2].eval()))
    print("boxes[2] = " + str(boxes[2].eval()))
    print("classes[2] = " + str(classes[2].eval()))
    print("scores.shape = " + str(scores.eval().shape))
    print("boxes.shape = " + str(boxes.eval().shape))
    print("classes.shape = " + str(classes.eval().shape))

scores[2] = 138.791
boxes[2] = [ 1292.32971191  -278.52166748  3876.98925781  -835.56494141]
classes[2] = 54
scores.shape = (10,)
boxes.shape = (10, 4)
classes.shape = (10,)

Expected output:

| Variable | Value |
|---|---|
| **scores[2]** | 138.791 |
| **boxes[2]** | [ 1292.32971191 -278.52166748 3876.98925781 -835.56494141] |
| **classes[2]** | 54 |
| **scores.shape** | (10,) |
| **boxes.shape** | (10, 4) |
| **classes.shape** | (10,) |

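**Summary for YOLO**:
- The input image has shape (608, 608, 3).
- The input image goes through a CNN, producing an output of shape (19,19,5,85).
- After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
    - Each cell of a 19×19 grid over the input image yields 425 numbers.
    - 425 = 5 x 85, because each cell contains predictions for 5 boxes, corresponding to the 5 predefined anchor boxes.
    - 85 = 5 + 80, where 5 stands for $(p_c, b_x, b_y, b_h, b_w)$ and 80 is the number of classes we want to detect.
- Only a few of the output boxes are then selected, based on:
    - Score thresholding: discard boxes whose class score is below the threshold
    - Non-max suppression: compute the IoU and avoid selecting overlapping boxes
- The result of these steps is YOLO's final output.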

11.3 Test YOLO pretrained model on images

In this part, we use a pretrained model and test it on the car detection dataset. As before, we start by creating a session to run the graph. Run the following cell.

In [10]:

sess = K.get_session()

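11.3.1 Defining classes, anchors and image shape

We want to detect 80 classes using 5 anchor boxes. The information about the 80 classes and the 5 boxes is stored in two files, "coco_classes.txt" and "yolo_anchors.txt". Run the following cell to load it.

The car detection dataset has 720x1280 images, which have been preprocessed into 608x608 images.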

In [ ]:

class_names = read_classes("datasets/coco_classes.txt")
anchors = read_anchors("datasets/yolo_anchors.txt")
image_shape = (720., 1280.)    

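11.3.2 Loading a pretrained model

Training a YOLO model takes a very long time and requires a very large dataset of labeled bounding boxes. We load an existing pretrained Keras YOLO model stored in "yolo.h5". (These weights come from the official YOLO website and were converted to a Keras-compatible format using a function written by Allan Zelener. Strictly speaking, these are the parameters of the YOLOv2 model, but for simplicity we refer to it as YOLO here.) Run the cell below to load the model.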

In [ ]:

yolo_model = load_model("datasets/yolo.h5")

This loads the weights of the pretrained YOLO model. Here is a summary of its layers:

In [ ]:

yolo_model.summary()

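Reminder: this model converts a preprocessed batch of input images (shape: (m, 608, 608, 3)) into a tensor of shape (m, 19, 19, 5, 85), as explained in Figure 2.

11.3.3 Convert output of the model to usable bounding box tensors

The output of yolo_model is a (m, 19, 19, 5, 85) tensor that needs some processing and conversion. The following cell does that.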

In [ ]:

yolo_outputs = yolo_head(yolo_model.output, anchors, len(class_names))

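This adds yolo_outputs to the graph. This set of 4 tensors is ready to be used as input to the yolo_eval function.

11.3.4 Filtering boxes

yolo_outputs holds the converted results of yolo_model. We now call the yolo_eval function implemented earlier to filter the boxes.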

In [ ]:

scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

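11.3.5 Run the graph on an image

We have created a graph (held in sess) that can be summarized as follows:

- yolo_model.input is given to yolo_model, which computes yolo_model.output
- yolo_model.output is processed by yolo_head, which gives yolo_outputs
- yolo_outputs goes through the filtering function yolo_eval, which outputs the predictions: scores, boxes, classes

Exercise: implement predict(), which runs the graph on an image from the test set to test YOLO. We need to run a TensorFlow session to compute scores, boxes, classes.

The code below also uses the following image processing function: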

image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))

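which outputs:

- image: a Python (PIL) representation of the image, used for drawing boxes.
- image_data: a numpy array representing the image, used as input to the CNN.

Note: when a model uses BatchNorm (as YOLO does), an additional placeholder must be passed in the feed_dict: {K.learning_phase(): 0}.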

In [ ]:

def predict(sess, image_file):
    """
    Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the predictions.
    
    Arguments:
    sess -- your tensorflow/Keras session containing the YOLO graph
    image_file -- name of an image stored in the "images" folder.
    
    Returns:
    out_scores -- tensor of shape (None, ), scores of the predicted boxes
    out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
    out_classes -- tensor of shape (None, ), class index of the predicted boxes
    
    Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes. 
    """

    # Preprocess your image
    image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))

    # Run the session with the correct tensors and choose the correct placeholders in the feed_dict.
    # You'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})
    ### START CODE HERE ### (≈ 1 line)
    out_scores, out_boxes, out_classes = sess.run((scores, boxes, classes), feed_dict={yolo_model.input: image_data, K.learning_phase(): 0})
    ### END CODE HERE ###

    # Print predictions info
    print('Found {} boxes for {}'.format(len(out_boxes), image_file))
    # Generate colors for drawing bounding boxes.
    colors = generate_colors(class_names)
    # Draw bounding boxes on the image file
    draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
    # Save the predicted bounding box on the image
    image.save(os.path.join("out", image_file), quality=90)
    # Display the results in the notebook
    output_image = scipy.misc.imread(os.path.join("out", image_file))
    imshow(output_image)
    
    return out_scores, out_boxes, out_classes

In [ ]:

out_scores, out_boxes, out_classes = predict(sess, "test.jpg")

Expected output:

Found 7 boxes for test.jpg
car	0.60 (925, 285) (1045, 374)
car	0.66 (706, 279) (786, 350)
bus	0.67 (5, 266) (220, 407)
car	0.70 (947, 324) (1280, 705)
car	0.74 (159, 303) (346, 440)
car	0.80 (761, 282) (942, 412)
car	0.89 (367, 300) (745, 648)

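**Summary**:
- YOLO is a state-of-the-art object detection algorithm that is both fast and accurate.
- YOLO runs an input image through a deep CNN, which outputs a 19x19x5x85 volume.
- This volume is an encoding that can be seen as a 19×19 grid in which each cell contains 5 bounding boxes.
- All the boxes are filtered using non-max suppression. Specifically, this involves:
    - a score threshold
    - an IoU threshold
- Training a YOLO model from random weights requires a lot of computation and a large dataset, so we used a pretrained model here.

References: the ideas behind YOLO come mainly from the two papers below, and the implementation drew on Allan Zelener's GitHub repository. The pretrained weights come from the official YOLO website.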

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi - You Only Look Once: Unified, Real-Time Object Detection (2015)
Joseph Redmon, Ali Farhadi - YOLO9000: Better, Faster, Stronger (2016)
Allan Zelener - YAD2K: Yet Another Darknet 2 Keras
The official YOLO website (https://pjreddie.com/darknet/yolo/)
