

An In-Depth Look at the CornerNet Network Architecture

极市平台

2020-12-27

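Summary: Part one of this article walks through CornerNet in detail alongside the code, starting from its network architecture. Part two focuses on how ground-truth labels are mapped to supervision signals (in a format that mirrors the network outputs) and examines the definition of the loss functions in detail.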


Author: 周威 @ Zhihu (reposted with permission)

Source: https://zhuanlan.zhihu.com/p/188587434

Editor: 极市平台


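1. Preface

I have recently come back to studying anchor-free object detection models, in particular network architectures such as CornerNet and CenterNet.

My reasons for studying anchor-free detectors are:

1. As a beginner who entered deep learning through object detection, never having touched anchor-free methods would make me look amateurish and unprofessional (this is purely psychological, of course).
2. While reading papers on multi-object tracking (MOT), I kept running into keywords such as "Objects as Points" and "anchor-free", so this is also groundwork for learning MOT.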

Figure 1: Predicting a bounding box position from a pair of points

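So this article summarizes, in writing, some of the insights I gained while reading the CornerNet paper and its code.

Links to the paper and code:

paper: https://arxiv.org/abs/1808.01244

code: https://github.com/princeton-vl/CornerNet

When I study a new model or network, I like to proceed as follows.

1. Find related (explanatory) blog posts on Zhihu/CSDN to form a first impression
2. Read the original paper closely
3. Run the code and read it for a deeper understanding

However, after reading a pile of CornerNet articles on Zhihu, they all felt much the same (even though everyone wrote carefully): mostly translations of the paper, with little personal understanding added.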

So this time I want to analyze CornerNet in more depth, to help both my future self and readers understand it better.

2. Some Background

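The paper is titled CornerNet: Detecting Objects as Paired Keypoints. As the name says, CornerNet detects an object from a pair of keypoints.

The paper's main contributions are

1. a new anchor-free approach to object detection
2. the corner pooling operation
3. the CornerNet network itself

This naturally raises a few questions:

(1) Question 1: which pair of keypoints exactly?

The paper states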

We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network.

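In other words, we only need to predict the coordinates of the top-left corner and the bottom-right corner of an object's bounding box to detect the object, as in Figure 1 above.

So methods that previously required many anchors for region proposal are replaced by the detection of pairs of keypoints.

As we know, some anchor-based methods already meet real-time requirements (YOLO, SSD, and so on), but they still spend considerable computation on a large number of anchors and depend on many anchor-related hyperparameters. The authors therefore took a different route and proposed the anchor-free CornerNet, which the paper reports outperforms earlier one-stage detectors in accuracy on MS COCO.

(2) Question 2: how are the top-left and bottom-right corners of the same object's bounding box matched?

The paper states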

The network also predicts an embedding vector for each detected corner [27] such that the distance between the embeddings of two corners from the same object is small.

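That is, at prediction time CornerNet assigns an embedding vector to each corner, and corners belonging to the same object have embeddings that are close together, as shown in Figure 2 below.

Figure 2: Corner embeddings

(3) Question 3: what is corner pooling, what does it do, and why is it useful?

We know max pooling and average pooling, but corner pooling is new. The paper proposes corner pooling as a layer suited to CornerNet; its purpose is to relate a corner to the position of the object.

In general, knowing a bounding box's top-left and bottom-right corners determines the extent and region that the object occupies.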

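Now look at the object from the top-left corner of its bounding box. Looking horizontally, the object lies below the line of sight (so the line of sight sits at the topmost boundary of the object). Looking vertically, the object lies to the right of the line of sight, which sits at the leftmost boundary. See Figure 3 below.

Figure 3

Top-left corner pooling is then carried out as follows:

Figure 4: How top-left corner pooling is computed

That is, the top-left corner pooling value at a point is the sum of two maxima: the largest value encountered looking horizontally to the right from that point, and the largest value encountered looking vertically downward. Doing this independently for every pixel of a feature map looks expensive. Is there a more efficient way?

The authors propose an efficient method, shown in Figure 5 below.

Figure 5: An efficient implementation of corner pooling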

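Why does the figure scan from right to left and from bottom to top? Is this bottom-right corner pooling?

It is not; do not be misled by the direction. This is still top-left corner pooling.

With the scan direction reversed, each position is simply filled with the largest value encountered so far along that direction, which computes corner pooling quickly. Each row or column then needs only one running-max pass, rather than searching the rest of the row and column again for every point, which greatly improves efficiency.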
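The reversed scan can be sketched in a few lines of plain Python. This is only an illustration of the running-max idea, not the repository's implementation; the function name top_left_corner_pool and the list-of-lists feature maps are assumptions made for the example.

```python
def top_left_corner_pool(top_feat, left_feat):
    """Top-left corner pooling on two H x W feature maps (lists of lists).

    Hypothetical helper for illustration; the real layers operate on tensors.
    """
    h, w = len(top_feat), len(top_feat[0])

    # Vertical pass: scan each column from bottom to top with a running max,
    # so top[i][j] becomes the max of top_feat[i..h-1][j] (looking downward).
    top = [row[:] for row in top_feat]
    for i in range(h - 2, -1, -1):
        for j in range(w):
            top[i][j] = max(top[i][j], top[i + 1][j])

    # Horizontal pass: scan each row from right to left with a running max,
    # so left[i][j] becomes the max of left_feat[i][j..w-1] (looking right).
    left = [row[:] for row in left_feat]
    for i in range(h):
        for j in range(w - 2, -1, -1):
            left[i][j] = max(left[i][j], left[i][j + 1])

    # The pooled value is the sum of the two directional maxima.
    return [[top[i][j] + left[i][j] for j in range(w)] for i in range(h)]


feat = [[1, 0, 2],
        [0, 3, 0],
        [4, 0, 1]]
pooled = top_left_corner_pool(feat, feat)
```

Each pass touches every element once, so the cost is linear in the number of pixels instead of growing with the row and column length at every point.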

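By now you should have a basic understanding of corner pooling.

(4) What does the network output, and what is each output for?

The network has six outputs in total, three per branch. The figure below shows the overall CornerNet architecture; each branch consists of the following three outputs.

1. heatmaps
2. embeddings
3. offsets

What does each of these three outputs do?

The paper states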

We predict two sets of heatmaps, one for top-left corners and one for bottom-right corners. Each set of heatmaps has C channels, where C is the number of categories, and is of size H × W. There is no background channel. Each channel is a binary mask indicating the locations of the corners for a class.

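So the two sets of heatmaps encode, per category, where the top-left and bottom-right corners are and how confident the network is at each location.

The embedding output was covered above: it measures the distance between a top-left corner and a bottom-right corner, and so decides whether a given pair of corners belongs to the same object.

As for the offsets output, the paper's reasoning is that once the heatmap has been downsampled to 1/n of the input resolution, mapping locations back up to the input loses precision, which can seriously hurt the localization of small bounding boxes. The authors therefore predict offsets to mitigate this.

That covers the role of the three outputs and concludes the background.

Next, alongside the code, we analyze CornerNet in detail, starting from its concrete network architecture and loss functions.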

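3. CornerNet Network Architecture

The CornerNet architecture consists of the following parts

1. backbone: hourglass network
2. head: two output branches, top-left corners and bottom-right corners; each branch has its own corner pooling and three outputs

These parts are shown in Figure 6.

Figure 6: Overall architecture

The paper explains each of them in detail.

(1) backbone: hourglass network

CornerNet uses the hourglass network as its backbone feature extractor. The hourglass network is commonly used for pose estimation and is an hourglass-shaped combination of downsampling and upsampling. Figure 7 below shows two hourglass modules connected end to end.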

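Figure 7: Two hourglass modules

Let us borrow the architecture diagram from the original hourglass paper (https://arxiv.org/abs/1603.06937).

Figure 8: Structure of an hourglass module

Figure 8 shows the structure of one hourglass module. The network first downsamples its input with a series of convolution and max pooling layers, then restores the downsampled feature maps to the module's input resolution through upsampling layers.

Because max pooling discards some detail, the hourglass module also uses skip layers to fuse features and reduce the information loss.

The feature extraction capacity of a single hourglass is limited, so hourglass modules can be stacked to extract stronger features; CornerNet uses two hourglass modules.

On top of the original hourglass, the authors made several modifications, which can be summarized as follows: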

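1. Before the hourglass modules, the image resolution is reduced to 1/4 of the original, using a 7x7 convolution with stride=2 followed by a residual block with stride=2.
2. Convolution layers with stride=2 replace max pooling for downsampling.
3. Downsampling is performed 5 times in total; the feature channels after these 5 steps are [256, 384, 384, 384, 512].
4. On the way back up, two residual blocks are applied, followed by nearest-neighbor upsampling.

For item 1, the input image is processed by a 7x7 convolution with stride=2 and a residual block with stride=2, shrinking its resolution to 1/4 of the original. The code is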

 
        # before the first hourglass module: reduce the image resolution to 1/4 of the original
        self.pre = nn.Sequential(
            convolution(7, 3, 128, stride=2),
            residual(3, 128, 256, stride=2)
        ) if pre is None else pre

Next, the authors write

We apply a 3 × 3 Conv-BN module to both the input and output of the first hourglass module. We then merge them by element-wise addition followed by a ReLU and a residual block with 256 channels, which is then used as the input to the second hourglass module. The depth of the hourglass network is 104. Unlike many other state-of-the-art detectors, we only use the features from the last layer of the whole network to make predictions.

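In other words

1. A 3x3 convolution + BN layer is applied to both the input and the output of the first hourglass module
2. The two results are merged by element-wise addition, followed by a ReLU and a residual block with 256 channels
3. The result of step 2 is the input to the second hourglass module
4. For prediction, only the features from the last layer of the whole network are used

Let us take a brief look at the code, in the file models/py_utils/kp.py.

The class kp_module defined there is the hourglass module.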

 
class kp_module(nn.Module):
    """
    A simple hourglass module
    """
    def __init__(
        self, n, dims, modules, layer=residual,
        make_up_layer=make_layer, make_low_layer=make_layer,
        make_hg_layer=make_layer, make_hg_layer_revr=make_layer_revr,
        make_pool_layer=make_pool_layer, make_unpool_layer=make_unpool_layer,
        make_merge_layer=make_merge_layer, **kwargs
    ):
        super(kp_module, self).__init__()

        self.n   = n  # 5 for the outermost module

        # modules = [2, 2, 2, 2, 2, 4]: number of residual layers at each level
        curr_mod = modules[0]
        next_mod = modules[1]

        # dims=[256, 256, 384, 384, 384, 512]
        curr_dim = dims[0]
        next_dim = dims[1]

        self.up1  = make_up_layer(
            3, curr_dim, curr_dim, curr_mod, 
            layer=layer, **kwargs
        )  # curr_mod residual layers (2 here), kernel_size=3

        self.max1 = make_pool_layer(curr_dim)  # default: MaxPool2d(kernel_size=2, stride=2)

        self.low1 = make_hg_layer(
            3, curr_dim, next_dim, curr_mod,
            layer=layer, **kwargs
        )  # curr_mod residual layers, kernel_size=3

        self.low2 = kp_module(
            n - 1, dims[1:], modules[1:], layer=layer, 
            make_up_layer=make_up_layer, 
            make_low_layer=make_low_layer,
            make_hg_layer=make_hg_layer,
            make_hg_layer_revr=make_hg_layer_revr,
            make_pool_layer=make_pool_layer,
            make_unpool_layer=make_unpool_layer,
            make_merge_layer=make_merge_layer,
            **kwargs
        ) if self.n > 1 else \
        make_low_layer(
            3, next_dim, next_dim, next_mod,
            layer=layer, **kwargs
        )  # recursion: n decreases at each level until n > 1 no longer holds

        self.low3 = make_hg_layer_revr(
            3, next_dim, curr_dim, curr_mod,
            layer=layer, **kwargs
        )

        #  nn.Upsample(scale_factor=2)
        self.up2  = make_unpool_layer(curr_dim)

        self.merge = make_merge_layer(curr_dim)

    def forward(self, x):
        up1  = self.up1(x)
        max1 = self.max1(x)
        low1 = self.low1(max1)
        low2 = self.low2(low1)
        low3 = self.low3(low2)
        up2  = self.up2(low3)
        return self.merge(up1, up2) #element-wise add

Note how self.low2 is defined

 
        self.low2 = kp_module(
            n - 1, dims[1:], modules[1:], layer=layer, 
            make_up_layer=make_up_layer, 
            make_low_layer=make_low_layer,
            make_hg_layer=make_hg_layer,
            make_hg_layer_revr=make_hg_layer_revr,
            make_pool_layer=make_pool_layer,
            make_unpool_layer=make_unpool_layer,
            make_merge_layer=make_merge_layer,
            **kwargs
        ) if self.n > 1 else \
        make_low_layer(
            3, next_dim, next_dim, next_mod,
            layer=layer, **kwargs
        )  # recursion: n decreases at each level until n > 1 no longer holds

Here kp_module uses itself inside its own class definition, which is clearly recursion. With this recursive view, the hourglass module becomes easy to implement and understand.
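To see what the recursion does to the feature map, the following toy sketch traces only the spatial size and channel count at each level. It is not the real module: hourglass_trace is a hypothetical helper, and it assumes each level halves the resolution on the way down and restores it on the way up, with the dims list taken from the comment in the code above.

```python
def hourglass_trace(n, size, dims):
    """Return the (spatial size, channels) pairs visited down and back up."""
    curr_dim, next_dim = dims[0], dims[1]
    # low1: downsample by 2 and switch to next_dim channels
    down = [(size // 2, next_dim)]
    if n > 1:
        # low2: recurse into a smaller hourglass
        inner = hourglass_trace(n - 1, size // 2, dims[1:])
    else:
        # low2 at the innermost level: plain layers, no further recursion
        inner = [(size // 2, next_dim)]
    # low3 + up2: back to curr_dim channels and the original size
    up = [(size, curr_dim)]
    return down + inner + up


trace = hourglass_trace(5, 128, [256, 256, 384, 384, 384, 512])
```

With n=5 and a 128x128 input, the trace goes down through five levels to 4x4 and comes back to 128x128 with 256 channels, which matches the five downsampling steps and channel list described earlier.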

With the hourglass module defined, the hourglass network consisting of two hourglass modules is defined as follows:

 
        self.kps  = nn.ModuleList([
            kp_module(
                n, dims, modules, layer=kp_layer,
                make_up_layer=make_up_layer,
                make_low_layer=make_low_layer,
                make_hg_layer=make_hg_layer,
                make_hg_layer_revr=make_hg_layer_revr,
                make_pool_layer=make_pool_layer,
                make_unpool_layer=make_unpool_layer,
                make_merge_layer=make_merge_layer
            ) for _ in range(nstack)  # the hourglass network contains nstack modules
        ])

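This completes the definition of the feature extraction network (backbone), the hourglass.

(2) head: two output branches, top-left corners and bottom-right corners

The feature map produced by the two hourglass modules passes through a further 3x3 convolution before it is split into the two corner branches, as shown below.

Figure 9: The two branches

The code is: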

 
            # obtain the two branch feature maps, used to predict the top-left and bottom-right corners respectively
            tl_cnv = tl_cnv_(cnv)
            br_cnv = br_cnv_(cnv)

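With the two branch modules for predicting the top-left and bottom-right corners (the prediction modules in the figure), each prediction module performs the following

1. corner pooling
2. three output branches

Take the top-left prediction module as an example, shown in Figure 10.

Figure 10

In more detail, again following the paper's figures, Figure 11 below is the concrete implementation of Figure 10.

Figure 11

The tl_cnv_ in the code above is an instance of the tl_pool class (top-left corner pooling), defined as follows: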

 
class tl_pool(pool):
    def __init__(self, dim):
        super(tl_pool, self).__init__(dim, TopPool, LeftPool)

It is a subclass of pool, which is defined as follows

 
class pool(nn.Module):
    def __init__(self, dim, pool1, pool2):
        super(pool, self).__init__()
        self.p1_conv1 = convolution(3, dim, 128)
        self.p2_conv1 = convolution(3, dim, 128)

        self.p_conv1 = nn.Conv2d(128, dim, (3, 3), padding=(1, 1), bias=False)
        self.p_bn1   = nn.BatchNorm2d(dim)

        self.conv1 = nn.Conv2d(dim, dim, (1, 1), bias=False)
        self.bn1   = nn.BatchNorm2d(dim)
        self.relu1 = nn.ReLU(inplace=True)

        self.conv2 = convolution(3, dim, dim)

        self.pool1 = pool1()
        self.pool2 = pool2()

    def forward(self, x):
        # pool 1
        p1_conv1 = self.p1_conv1(x)
        pool1    = self.pool1(p1_conv1)

        # pool 2
        p2_conv1 = self.p2_conv1(x)
        pool2    = self.pool2(p2_conv1)

        # pool 1 + pool 2
        p_conv1 = self.p_conv1(pool1 + pool2)
        p_bn1   = self.p_bn1(p_conv1)

        # residual connection
        conv1 = self.conv1(x)
        bn1   = self.bn1(conv1)
        relu1 = self.relu1(p_bn1 + bn1)

        conv2 = self.conv2(relu1)
        return conv2

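Clearly, this code implements part of Figure 11; that part is shown in Figure 12.

Figure 12

The result of the code above is then passed through a 3x3 conv-BN-ReLU for each head, giving the three outputs. The code is: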

 
            # each of the two branches above produces three prediction outputs
            tl_heat, br_heat = tl_heat_(tl_cnv), br_heat_(br_cnv)
            tl_tag,  br_tag  = tl_tag_(tl_cnv),  br_tag_(br_cnv)
            tl_regr, br_regr = tl_regr_(tl_cnv), br_regr_(br_cnv)

This completes the walkthrough of the CornerNet network. Based on the code, we draw an overall model diagram, shown in Figure 13.

Figure 13

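1. An In-Depth Look at the CornerNet Loss Functions

Some implementation details of CornerNet, and its loss functions

Defining the network architecture is far from enough to train and test the model. We also need to

(1) map the ground-truth labels (object categories and locations) to supervision signals (in a format that mirrors the network outputs)
(2) build the loss functions from the network's forward outputs and the supervision signals from (1)
(3) run gradient descent on the loss to update the network parameters

The analysis below concentrates on points (1) and (2).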

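2. Implementation Details

2.1 Mapping ground-truth labels to supervision signals (in a format that mirrors the network outputs)

In the previous part we presented the CornerNet architecture, shown again in Figure 1 below.

Figure 1: The CornerNet architecture

As the figure shows, CornerNet has 6 outputs: three for predicting the position of the bounding box's top-left corner

(1) top-left corner heatmaps
(2) top-left corner embedding
(3) top-left corner offsets

and three for predicting the position of the bounding box's bottom-right corner

(1) bottom-right corner heatmaps
(2) bottom-right corner embedding
(3) bottom-right corner offsets

So what is the size of each output?

The paper states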

During training, we set the input resolution of the network to 511×511, which leads to an output resolution of 128×128.

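So the network takes a 511x511x3 RGB image as input and returns 6 output feature maps, each with width and height 128. The shapes below are written channel-last for readability; the arrays in the code are allocated channel-first, e.g. (batch size, 80, 128, 128). That is

(1) top-left corner heatmaps, of size (batch size, 128, 128, 80)
(2) top-left corner embedding, of size (batch size, 128, 128, 1)
(3) top-left corner offsets, of size (batch size, 128, 128, 2)
(4) bottom-right corner heatmaps, of size (batch size, 128, 128, 80)
(5) bottom-right corner embedding, of size (batch size, 128, 128, 1)
(6) bottom-right corner offsets, of size (batch size, 128, 128, 2)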
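The step from 511 to 128 can be checked with the standard convolution output-size formula. The sketch below assumes that both stride-2 layers in the pre-processing stage use "same"-style padding of (kernel - 1) // 2 and that the residual block's strided convolution is 3x3; these padding and kernel values are assumptions for the arithmetic, and conv_out_size is a hypothetical helper.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output width/height of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# 7x7 convolution, stride 2, assumed padding 3
after_conv = conv_out_size(511, kernel=7, stride=2, padding=3)
# stride-2 residual block, assumed 3x3 kernel with padding 1
after_res = conv_out_size(after_conv, kernel=3, stride=2, padding=1)
```

Under these assumptions the first layer gives 256 and the second gives 128, which is why an odd input size such as 511 still lands exactly on a 128x128 output.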

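We then need to convert the ground-truth bounding box information available during training (corner coordinates and category) into supervision signals in a format similar to the network outputs above, which makes the losses easy to compute.

The paper does not describe this conversion in detail, so we have to explore it through the code.

The conversion is defined in the function kp_detection in sample/coco.py (it took me a long time to find). The full code is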

 
def kp_detection(db, k_ind, data_aug, debug):
    data_rng   = system_configs.data_rng
    batch_size = system_configs.batch_size

    categories   = db.configs["categories"] #80
    input_size   = db.configs["input_size"] #511
    output_size  = db.configs["output_sizes"][0] #[[128,128]]

    border        = db.configs["border"] #128
    lighting      = db.configs["lighting"] #True
    rand_crop     = db.configs["rand_crop"] #False
    rand_color    = db.configs["rand_color"] #False
    rand_scales   = db.configs["rand_scales"] #False
    gaussian_bump = db.configs["gaussian_bump"] #True
    gaussian_iou  = db.configs["gaussian_iou"] #0.7
    gaussian_rad  = db.configs["gaussian_radius"] #-1

    max_tag_len = 128  # maximum possible number of targets in one image

    # allocating memory
    images      = np.zeros((batch_size, 3, input_size[0], input_size[1]), dtype=np.float32)
    tl_heatmaps = np.zeros((batch_size, categories, output_size[0], output_size[1]), dtype=np.float32)
    br_heatmaps = np.zeros((batch_size, categories, output_size[0], output_size[1]), dtype=np.float32)
    tl_regrs    = np.zeros((batch_size, max_tag_len, 2), dtype=np.float32)
    br_regrs    = np.zeros((batch_size, max_tag_len, 2), dtype=np.float32)
    tl_tags     = np.zeros((batch_size, max_tag_len), dtype=np.int64)
    br_tags     = np.zeros((batch_size, max_tag_len), dtype=np.int64)
    tag_masks   = np.zeros((batch_size, max_tag_len), dtype=np.uint8)
    tag_lens    = np.zeros((batch_size, ), dtype=np.int32) # store the num of targets for every image in a batch images

    db_size = db.db_inds.size
    for b_ind in range(batch_size): #b_ind means the index of image in a batch
        if not debug and k_ind == 0:
            db.shuffle_inds()

        db_ind = db.db_inds[k_ind]
        k_ind  = (k_ind + 1) % db_size

        # reading image
        image_file = db.image_file(db_ind)
        image      = cv2.imread(image_file)

        # reading detections
        detections = db.detections(db_ind)

        # cropping an image randomly
        if not debug and rand_crop:
            image, detections = random_crop(image, detections, rand_scales, input_size, border=border)
        else:
            image, detections = _full_image_crop(image, detections)

        image, detections = _resize_image(image, detections, input_size)
        detections = _clip_detections(image, detections)

        width_ratio  = output_size[1] / input_size[1]  # scaling ratio (width)
        height_ratio = output_size[0] / input_size[0]  # scaling ratio (height)

        # flipping an image randomly
        if not debug and np.random.uniform() > 0.5:
            image[:] = image[:, ::-1, :]
            width    = image.shape[1]
            detections[:, [0, 2]] = width - detections[:, [2, 0]] - 1

        if not debug:
            image = image.astype(np.float32) / 255.
            if rand_color:
                color_jittering_(data_rng, image)
                if lighting:
                    lighting_(data_rng, image, 0.1, db.eig_val, db.eig_vec)
            normalize_(image, db.mean, db.std)
        images[b_ind] = image.transpose((2, 0, 1))

        for ind, detection in enumerate(detections):
            # prepare the ground_truth heatmap 
            category = int(detection[-1]) - 1  #get the detected target's category

            xtl, ytl = detection[0], detection[1] # the coordinate of the left-top corner
            xbr, ybr = detection[2], detection[3] # the coordinate of the right-bottom corner

            fxtl = (xtl * width_ratio) # reflect the coordinate to the size of output feature map
            fytl = (ytl * height_ratio)
            fxbr = (xbr * width_ratio)
            fybr = (ybr * height_ratio)

            xtl = int(fxtl)  # integer cell on the output map where the corner lands
            ytl = int(fytl)
            xbr = int(fxbr)
            ybr = int(fybr)

            if gaussian_bump:
                # use a Gaussian-shaped heatmap
                # execute
                width  = detection[2] - detection[0]
                height = detection[3] - detection[1]

                width  = math.ceil(width * width_ratio)  # round up
                height = math.ceil(height * height_ratio)

                if gaussian_rad == -1:
                    radius = gaussian_radius((height, width), gaussian_iou) #calculate the radius
                    radius = max(0, int(radius))
                else:
                    radius = gaussian_rad

                draw_gaussian(tl_heatmaps[b_ind, category], [xtl, ytl], radius)
                draw_gaussian(br_heatmaps[b_ind, category], [xbr, ybr], radius)
            else: 
                # without the Gaussian bump, only the corner location is 1 and everything else is 0
                tl_heatmaps[b_ind, category, ytl, xtl] = 1
                br_heatmaps[b_ind, category, ybr, xbr] = 1

            # the index of target that be detected in current image, a value
            tag_ind = tag_lens[b_ind]
            # the offset between the true coordinate of corner and the actual coordinate of it
            tl_regrs[b_ind, tag_ind, :] = [fxtl - xtl, fytl - ytl]
            br_regrs[b_ind, tag_ind, :] = [fxbr - xbr, fybr - ybr]
            
            # embedding: a neat trick, flatten the feature map and store each corner as its index in the flattened space
            tl_tags[b_ind, tag_ind] = ytl * output_size[1] + xtl
            br_tags[b_ind, tag_ind] = ybr * output_size[1] + xbr

            # for each additional target (detection), increment this image's tag_lens by 1
            tag_lens[b_ind] += 1

    for b_ind in range(batch_size):
        # record how many targets each image in the batch has: the number of 1s equals the number of targets
        tag_len = tag_lens[b_ind]
        tag_masks[b_ind, :tag_len] = 1

    images      = torch.from_numpy(images)
    tl_heatmaps = torch.from_numpy(tl_heatmaps)
    br_heatmaps = torch.from_numpy(br_heatmaps)
    tl_regrs    = torch.from_numpy(tl_regrs)
    br_regrs    = torch.from_numpy(br_regrs)
    tl_tags     = torch.from_numpy(tl_tags)
    br_tags     = torch.from_numpy(br_tags)
    tag_masks   = torch.from_numpy(tag_masks)

    return {
        "xs": [images, tl_tags, br_tags],
        "ys": [tl_heatmaps, br_heatmaps, tag_masks, tl_regrs, br_regrs]
    }, k_ind

It is a bit long, so let us go through it step by step.

I have added some comments; here is the first excerpt

 
def kp_detection(db, k_ind, data_aug, debug):
    data_rng   = system_configs.data_rng
    batch_size = system_configs.batch_size

    categories   = db.configs["categories"] #80
    input_size   = db.configs["input_size"] #511
    output_size  = db.configs["output_sizes"][0] #[[128,128]]

    border        = db.configs["border"] #128
    lighting      = db.configs["lighting"] #True
    rand_crop     = db.configs["rand_crop"] #False
    rand_color    = db.configs["rand_color"] #False
    rand_scales   = db.configs["rand_scales"] #False
    gaussian_bump = db.configs["gaussian_bump"] #True
    gaussian_iou  = db.configs["gaussian_iou"] #0.7
    gaussian_rad  = db.configs["gaussian_radius"] #-1

    max_tag_len = 128  # maximum possible number of targets in one image

    # allocating memory
    images      = np.zeros((batch_size, 3, input_size[0], input_size[1]), dtype=np.float32)
    tl_heatmaps = np.zeros((batch_size, categories, output_size[0], output_size[1]), dtype=np.float32)
    br_heatmaps = np.zeros((batch_size, categories, output_size[0], output_size[1]), dtype=np.float32)
    tl_regrs    = np.zeros((batch_size, max_tag_len, 2), dtype=np.float32)
    br_regrs    = np.zeros((batch_size, max_tag_len, 2), dtype=np.float32)
    tl_tags     = np.zeros((batch_size, max_tag_len), dtype=np.int64)
    br_tags     = np.zeros((batch_size, max_tag_len), dtype=np.int64)
    tag_masks   = np.zeros((batch_size, max_tag_len), dtype=np.uint8)
    tag_lens    = np.zeros((batch_size, ), dtype=np.int32) # store the num of targets for every image in a batch images

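This code allocates all-zero arrays as containers into which the image and label data are converted.

Notably, tl_heatmaps and br_heatmaps have the same shape as the network output, (batch size, 128, 128, 80) in the channel-last notation used above (the arrays themselves are allocated channel-first).

But the offsets (the position refinement mentioned in the paper), which carry the regrs suffix in the code (tl_regrs and br_regrs above), have size (batch size, 128, 2), unlike the network's offsets output of size (batch size, 128, 128, 2)!

The embedding supervision used to measure corner similarity is represented in the code by tl_tags, br_tags and tag_masks, of size (batch size, 128), which also differs from the network's embedding output!

No matter: the two are reconciled when the loss is computed.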

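A. heatmaps

Both heatmaps have 80 channels, and 80 is the number of categories in the COCO dataset. The heatmaps reflect where the top-left or bottom-right corners of objects of each category lie. So how exactly are the corner positions of the ground-truth bounding boxes mapped onto the heatmap?

The paper argues that mapping a bounding box corner to exactly one location on the heatmap is unreasonable, or at least too strict, because the locations neighboring that corner may also map back to a box that fits the object well.

It is therefore more reasonable to map a bounding box corner to a small circular region on the heatmap.

Figure 2: The red box is the ground-truth box; the corners of the green box fall within range of the red box's corners on the heatmap; each orange circle is centered on a ground-truth corner

As Figure 2 shows, the red box is the ground-truth box and each orange circle is a region centered on one of its corners. The corners of the green box lie inside the orange circles, i.e. on the heatmap they fall within range of the red box's corners, in what I call the neighborhood of the ground-truth corner (my own term). The paper refers to the ground-truth corner itself as the positive location and reduces the penalty for negative locations within a radius of it.

Following this idea, we need to map each ground-truth corner to a circular region on the heatmap as one of the supervision signals. The code does it like this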

 
        for ind, detection in enumerate(detections):
            # prepare the ground_truth heatmap 
            category = int(detection[-1]) - 1  #get the detected target's category

            xtl, ytl = detection[0], detection[1] # the coordinate of the left-top corner
            xbr, ybr = detection[2], detection[3] # the coordinate of the right-bottom corner

            fxtl = (xtl * width_ratio) # reflect the coordinate to the size of output feature map
            fytl = (ytl * height_ratio)
            fxbr = (xbr * width_ratio)
            fybr = (ybr * height_ratio)

            xtl = int(fxtl)  # integer cell on the output map where the corner lands
            ytl = int(fytl)
            xbr = int(fxbr)
            ybr = int(fybr)

            if gaussian_bump:
                # use a Gaussian-shaped heatmap
                # execute
                width  = detection[2] - detection[0]
                height = detection[3] - detection[1]

                width  = math.ceil(width * width_ratio)  # round up
                height = math.ceil(height * height_ratio)

                if gaussian_rad == -1:
                    radius = gaussian_radius((height, width), gaussian_iou) #calculate the radius
                    radius = max(0, int(radius))
                else:
                    radius = gaussian_rad

                draw_gaussian(tl_heatmaps[b_ind, category], [xtl, ytl], radius)
                draw_gaussian(br_heatmaps[b_ind, category], [xbr, ybr], radius)
            else: 
                # without the Gaussian bump, only the corner location is 1 and everything else is 0
                tl_heatmaps[b_ind, category, ytl, xtl] = 1
                br_heatmaps[b_ind, category, ybr, xbr] = 1

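For the bounding box of each object in an image, this code does the following:

(1) gets the top-left corner coordinates (xtl, ytl) and the bottom-right corner coordinates (xbr, ybr)
(2) maps the coordinates onto the heatmap of the object's category (call this the integer mapped position); these are (xtl, ytl) and (xbr, ybr) in the code after the int() conversion
(3) records the exact mapped points (call this the exact mapped position, floating point); these are (fxtl, fytl) and (fxbr, fybr) in the code
(4) centers a Gaussian on the integer corners (xtl, ytl) and (xbr, ybr) to fill in their neighborhood (values are larger closer to the true corner and smaller farther away)
(5) obtains the heatmap supervision signal

In this way, the corner positions of all objects in an image are mapped onto the heatmaps according to category.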
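Step (4) can be illustrated with a small pure-Python sketch. It is not the repository's draw_gaussian; draw_gaussian_sketch is a hypothetical stand-in that assumes an unnormalized Gaussian with sigma equal to one sixth of the diameter, written into a square window and merged with an element-wise maximum so that overlapping corners do not overwrite each other.

```python
import math


def draw_gaussian_sketch(heatmap, center, radius):
    """Splat an unnormalized Gaussian around center=(x, y) onto a 2D list."""
    h, w = len(heatmap), len(heatmap[0])
    cx, cy = center
    sigma = (2 * radius + 1) / 6.0  # assumption: sigma = diameter / 6
    # Only touch the square window of side 2*radius+1, clipped to the map.
    for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
            dist2 = (x - cx) ** 2 + (y - cy) ** 2
            value = math.exp(-dist2 / (2 * sigma ** 2))
            # Keep the larger value where two bumps overlap.
            heatmap[y][x] = max(heatmap[y][x], value)
    return heatmap


hm = [[0.0] * 5 for _ in range(5)]
draw_gaussian_sketch(hm, center=(2, 2), radius=1)
```

The center cell gets the value 1, its neighbors get values strictly between 0 and 1, and cells outside the window stay 0, which is the soft target that the focal loss later uses.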

B. Offset and embedding

As noted above, the shapes of the offset and embedding supervision signals do not match the network outputs. The code sets a maximum length of 128

 
    max_tag_len = 128  # maximum possible number of targets in one image

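which specifies the maximum number of objects that may appear in one image. This length is the second dimension of the following 5 arrays and indexes the current object.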

 
    tl_regrs    = np.zeros((batch_size, max_tag_len, 2), dtype=np.float32)
    br_regrs    = np.zeros((batch_size, max_tag_len, 2), dtype=np.float32)
    tl_tags     = np.zeros((batch_size, max_tag_len), dtype=np.int64)
    br_tags     = np.zeros((batch_size, max_tag_len), dtype=np.int64)
    tag_masks   = np.zeros((batch_size, max_tag_len), dtype=np.uint8)

The code also defines an array tag_lens that stores the number of objects in each image of a batch.

 
  tag_lens    = np.zeros((batch_size, ), dtype=np.int32) # store the num of targets for every image in a batch images

It is updated as follows

 
            # for each additional target (detection), increment this image's tag_lens by 1
            tag_lens[b_ind] += 1

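That is, the count is incremented by 1 for each ground-truth object processed.

To obtain the offset and embedding supervision for the objects in an image, say the image with index b_ind, we only need the following for the offsets of all its objects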

 
            # the index of the current target within the current image (a scalar)
            tag_ind = tag_lens[b_ind]
            # the offset between the true coordinate of corner and the actual coordinate of it
            tl_regrs[b_ind, tag_ind, :] = [fxtl - xtl, fytl - ytl]
            br_regrs[b_ind, tag_ind, :] = [fxbr - xbr, fybr - ybr]

That is, the supervision signal is simply the exact mapped position (floating point) minus the integer mapped position.
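A short worked example makes the subtraction concrete. The helper corner_targets below is hypothetical and only mirrors the scaling and truncation shown in the code above, assuming a 511-pixel input and a 128-cell output.

```python
def corner_targets(x, y, in_size=511, out_size=128):
    """Map an input-image corner to its heatmap cell and sub-cell offset."""
    ratio = out_size / in_size
    fx, fy = x * ratio, y * ratio      # exact (floating-point) position
    ix, iy = int(fx), int(fy)          # integer cell on the output map
    return (ix, iy), (fx - ix, fy - iy)


cell, offset = corner_targets(100, 200)
```

For a corner at (100, 200) the cell is (25, 50) and the offset is roughly (0.049, 0.098). Each offset component always lies in [0, 1), so the regression target is small and bounded no matter how large the object is.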

The embedding values are handled as follows

 
            # embedding: a neat trick, flatten the feature map and store each corner as its index in the flattened space
            tl_tags[b_ind, tag_ind] = ytl * output_size[1] + xtl
            br_tags[b_ind, tag_ind] = ybr * output_size[1] + xbr

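Here

(1) ytl * output_size[1] + xtl
(2) ybr * output_size[1] + xbr

amount to flattening the 2D plane into a 1D vector and computing the index in that vector of the point with coordinates (x, y) on the plane, as shown in Figure 3.

Figure 3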
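The flattening and its inverse can be written out directly. The two helper names below are hypothetical; the formula is the row-major index used in the code above.

```python
def flatten_index(x, y, width):
    """Row-major index of the cell (x, y) on a map with the given width."""
    return y * width + x


def unflatten_index(index, width):
    """Recover (x, y) from a row-major index."""
    y, x = divmod(index, width)
    return x, y


idx = flatten_index(5, 3, 128)
xy = unflatten_index(idx, 128)
```

On a 128-wide map the cell (x=5, y=3) becomes index 389, and divmod recovers the original coordinates, so a single integer per corner is enough to address the feature map.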

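What is the point of this conversion? We will come back to it when we look at the loss functions.

The code also defines an array named tag_masks, declared as follows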

 
      tag_masks   = np.zeros((batch_size, max_tag_len), dtype=np.uint8)

and filled in as follows:

 
    for b_ind in range(batch_size):
        # record how many targets each image in the batch has: the number of 1s equals the number of targets
        tag_len = tag_lens[b_ind]
        tag_masks[b_ind, :tag_len] = 1

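tag_masks is a 0-1 mask of shape (batch_size, 128). For the image with index b_ind containing N objects, the first N entries of its 128-dimensional vector are 1 and the remaining 128-N are 0. The mask distinguishes the slots that hold real objects from the unused padding slots.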
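As a small illustration of this padding scheme, the sketch below builds such a mask from per-image object counts. The helper build_tag_masks is hypothetical and uses plain lists and a short maximum length so the result is easy to read.

```python
def build_tag_masks(tag_lens, max_tag_len=128):
    """One row per image: 1 for slots that hold an object, 0 for padding."""
    return [[1 if i < n else 0 for i in range(max_tag_len)] for n in tag_lens]


# Three images with 2, 0 and 3 objects, padded to 4 slots each.
masks = build_tag_masks([2, 0, 3], max_tag_len=4)
```

Because every image is padded to the same length, the per-object targets of a whole batch fit in one fixed-size array, and the mask tells the loss which entries to ignore.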

Finally, these arrays are converted to torch tensors, and xs and ys are returned.

 
    images      = torch.from_numpy(images)
    tl_heatmaps = torch.from_numpy(tl_heatmaps)
    br_heatmaps = torch.from_numpy(br_heatmaps)
    tl_regrs    = torch.from_numpy(tl_regrs)
    br_regrs    = torch.from_numpy(br_regrs)
    tl_tags     = torch.from_numpy(tl_tags)
    br_tags     = torch.from_numpy(br_tags)
    tag_masks   = torch.from_numpy(tag_masks)

    return {
        "xs": [images, tl_tags, br_tags],
        "ys": [tl_heatmaps, br_heatmaps, tag_masks, tl_regrs, br_regrs]
    }, k_ind

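Somewhat puzzlingly, the inputs (xs) contain not only the images but also the embedding supervision tl_tags and br_tags described above, while the supervision (ys), besides tl_heatmaps, br_heatmaps and tl_regrs, br_regrs, contains only tag_masks as embedding supervision.

We leave this as an open question and resolve it later.

That concludes how ground-truth labels are mapped to supervision signals (in a format that mirrors the network outputs).

2.2 The loss functions in detail

Above we covered how ground-truth labels are converted to supervision signals (in a format that mirrors the network outputs), which makes the loss easy to compute. This subsection examines the loss functions in detail.

The loss is defined mainly in the class AELoss in models/py_utils/kp.py, as follows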

 
class AELoss(nn.Module):
    def __init__(self, pull_weight=1, push_weight=1, regr_weight=1, focal_loss=_neg_loss):
        super(AELoss, self).__init__()

        self.pull_weight = pull_weight
        self.push_weight = push_weight
        self.regr_weight = regr_weight
        self.focal_loss  = focal_loss
        self.ae_loss     = _ae_loss
        self.regr_loss   = _regr_loss

    def forward(self, outs, targets):
        stride = 6

        tl_heats = outs[0::stride]  # stride is the slice step; no end index is given
        br_heats = outs[1::stride]
        tl_tags  = outs[2::stride]
        br_tags  = outs[3::stride]
        tl_regrs = outs[4::stride]
        br_regrs = outs[5::stride]

        gt_tl_heat = targets[0]
        gt_br_heat = targets[1]
        gt_mask    = targets[2]
        gt_tl_regr = targets[3]
        gt_br_regr = targets[4]

        # focal loss
        focal_loss = 0

        tl_heats = [_sigmoid(t) for t in tl_heats]
        br_heats = [_sigmoid(b) for b in br_heats]

        focal_loss += self.focal_loss(tl_heats, gt_tl_heat)
        focal_loss += self.focal_loss(br_heats, gt_br_heat)

        # tag loss
        pull_loss = 0
        push_loss = 0

        for tl_tag, br_tag in zip(tl_tags, br_tags):
            pull, push = self.ae_loss(tl_tag, br_tag, gt_mask)
            pull_loss += pull
            push_loss += push
        pull_loss = self.pull_weight * pull_loss
        push_loss = self.push_weight * push_loss

        # regression loss
        regr_loss = 0
        for tl_regr, br_regr in zip(tl_regrs, br_regrs):
            regr_loss += self.regr_loss(tl_regr, gt_tl_regr, gt_mask)
            regr_loss += self.regr_loss(br_regr, gt_br_regr, gt_mask)
        regr_loss = self.regr_weight * regr_loss

        loss = (focal_loss + pull_loss + push_loss + regr_loss) / len(tl_heats)
        return loss.unsqueeze(0)

In the class's forward method, the code takes the network outputs and the supervision signals as input. The network outputs are

 
        stride = 6

        tl_heats = outs[0::stride]  # stride is the slice step; no end index is given
        br_heats = outs[1::stride]
        tl_tags  = outs[2::stride]
        br_tags  = outs[3::stride]
        tl_regrs = outs[4::stride]
        br_regrs = outs[5::stride]
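The stepped slices work because the forward pass appends six outputs per hourglass stack to one flat list. The toy example below uses strings in place of tensors to show which entries each slice picks up; the labels and the value nstack = 2 are made up for the illustration.

```python
names = ["tl_heat", "br_heat", "tl_tag", "br_tag", "tl_regr", "br_regr"]
nstack = 2

# One flat list: six outputs for stack 0, then six outputs for stack 1.
outs = [f"{name}@stack{s}" for s in range(nstack) for name in names]

stride = 6
tl_heats = outs[0::stride]   # every 6th entry starting at position 0
br_regrs = outs[5::stride]   # every 6th entry starting at position 5
```

Each slice therefore holds one entry per stack, which is also why the total loss is later divided by len(tl_heats).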

The supervision signals are the ys described above, with 5 elements, corresponding to

 
        gt_tl_heat = targets[0]
        gt_br_heat = targets[1]
        gt_mask    = targets[2]
        gt_tl_regr = targets[3]
        gt_br_regr = targets[4]

(1) focal loss

Here the focal loss is computed from the output heatmaps and the supervision heatmaps.

 
        # focal loss
        focal_loss = 0

        tl_heats = [_sigmoid(t) for t in tl_heats]
        br_heats = [_sigmoid(b) for b in br_heats]

        focal_loss += self.focal_loss(tl_heats, gt_tl_heat)
        focal_loss += self.focal_loss(br_heats, gt_br_heat)

The paper gives the definition of this focal loss.
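For reference, that definition can be sketched as follows, written to be consistent with the code in this section, where the exponents 2 and 4 appear explicitly. Here p_{cij} is the predicted score at location (i, j) for class c, y_{cij} is the ground-truth heatmap after the Gaussian bumps have been drawn, and N is the number of objects in the image; the values α = 2 and β = 4 are read off the code rather than quoted.

```latex
L_{det} = \frac{-1}{N} \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W}
\begin{cases}
(1 - p_{cij})^{\alpha} \, \log(p_{cij}) & \text{if } y_{cij} = 1 \\
(1 - y_{cij})^{\beta} \, (p_{cij})^{\alpha} \, \log(1 - p_{cij}) & \text{otherwise}
\end{cases}
\qquad \alpha = 2,\; \beta = 4
```

The factor (1 - y_{cij})^{\beta} is what reduces the penalty for negative locations close to a ground-truth corner, since y_{cij} is close to 1 there.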

In the code it is defined as follows:

 
def _neg_loss(preds, gt):
    pos_inds = gt.eq(1)
    neg_inds = gt.lt(1)

    neg_weights = torch.pow(1 - gt[neg_inds], 4)  # reduce the penalty for negatives close to a ground-truth corner

    loss = 0
    for pred in preds:
        pos_pred = pred[pos_inds]
        neg_pred = pred[neg_inds]

        pos_loss = torch.log(pos_pred) * torch.pow(1 - pos_pred, 2)
        neg_loss = torch.log(1 - neg_pred) * torch.pow(neg_pred, 2) * neg_weights

        num_pos  = pos_inds.float().sum()
        pos_loss = pos_loss.sum()
        neg_loss = neg_loss.sum()

        if pos_pred.nelement() == 0:
            loss = loss - neg_loss
        else:
            loss = loss - (pos_loss + neg_loss) / num_pos
    return loss

Read together with the paper's formula, this is quite clear.

(2) embedding loss

For the embedding loss, the paper defines a pull term and a push term.
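As a sketch consistent with the _ae_loss code in this section, the two terms can be written as below. Here e_{t_k} and e_{b_k} are the embeddings of the top-left and bottom-right corners of object k, e_k is their average, and N is the number of objects; Δ = 1 is read off the code (the line dist = 1 - torch.abs(dist)).

```latex
L_{pull} = \frac{1}{N} \sum_{k=1}^{N} \left[ (e_{t_k} - e_k)^2 + (e_{b_k} - e_k)^2 \right]

L_{push} = \frac{1}{N(N-1)} \sum_{k=1}^{N} \sum_{\substack{j=1 \\ j \neq k}}^{N} \max\left(0,\; \Delta - \lvert e_k - e_j \rvert\right),
\qquad \Delta = 1
```

The pull term draws the two corners of one object toward their mean, and the push term separates the means of different objects until they are at least Δ apart.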

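The paper uses this loss to reduce the distance between the top-left and bottom-right embeddings of the same object's bounding box and to increase the distance between the embeddings of different objects' boxes. As mentioned earlier, the network's embedding output (shape=(batch size, 128, 128, 1)) and the corresponding supervision (shape=(batch size, 128)) do not even have the same dimensions, so how is the loss computed?

Let us go to the network's forward pass. The network produces six outputs:

(1) top-left corner heatmaps, of size (batch size, 128, 128, 80)
(2) top-left corner embedding, of size (batch size, 128, 128, 1)
(3) top-left corner offsets, of size (batch size, 128, 128, 2)
(4) bottom-right corner heatmaps, of size (batch size, 128, 128, 80)
(5) bottom-right corner embedding, of size (batch size, 128, 128, 1)
(6) bottom-right corner offsets, of size (batch size, 128, 128, 2)

After this, the authors transform 4 of the outputs

(1) top-left corner embedding, of size (batch size, 128, 128, 1)
(2) top-left corner offsets, of size (batch size, 128, 128, 2)
(3) bottom-right corner embedding, of size (batch size, 128, 128, 1)
(4) bottom-right corner offsets, of size (batch size, 128, 128, 2)

The code is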

 
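            # each of the two branches above produces three prediction outputs
            tl_heat, br_heat = tl_heat_(tl_cnv), br_heat_(br_cnv)  # bs x 128 x 128 x 80
            tl_tag,  br_tag  = tl_tag_(tl_cnv),  br_tag_(br_cnv)   # bs x 128 x 128 x 1
            tl_regr, br_regr = tl_regr_(tl_cnv), br_regr_(br_cnv)  # bs x 128 x 128 x 2

            # on the output feature maps, take the values at the gt bbox corner locations (for both the embeddings and the regr offsets)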
            tl_tag  = _tranpose_and_gather_feat(tl_tag, tl_inds)
            br_tag  = _tranpose_and_gather_feat(br_tag, br_inds)
            tl_regr = _tranpose_and_gather_feat(tl_regr, tl_inds)
            br_regr = _tranpose_and_gather_feat(br_regr, br_inds)

            outs += [tl_heat, br_heat, tl_tag, br_tag, tl_regr, br_regr]

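The transforming function is _tranpose_and_gather_feat. It takes two inputs; the tl_inds and br_inds here are the tl_tags and br_tags from xs described above, so do not confuse them with the network's tag outputs!

The meaning of this function can be summed up by one sentence from the paper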

we only apply the losses at the ground-truth corner location.

That is, from the network's embedding and offset outputs we take only the values at the ground-truth corner locations, which are given by tl_inds and br_inds (the flattened positions from Figure 3).
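The gather step can be pictured with a single-channel toy example. The helper gather_at_indices below is hypothetical and works on plain lists; it only shows the idea of flattening a map and picking the entries at the stored indices, not the tensor operations the repository uses.

```python
def gather_at_indices(feature_map, inds):
    """Pick values from an H x W map (list of lists) at row-major indices."""
    flat = [value for row in feature_map for value in row]
    return [flat[i] for i in inds]


# A 2 x 3 map whose values equal their own row-major index.
fmap = [[0, 1, 2],
        [3, 4, 5]]
width = 3
# Two ground-truth corners at (x=2, y=1) and (x=0, y=0).
inds = [1 * width + 2, 0 * width + 0]
picked = gather_at_indices(fmap, inds)
```

This is what turns a dense (128, 128) output into a short per-object list of at most 128 entries, the same layout as the supervision arrays, which answers the earlier question of why the index arrays are passed in as part of xs.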

After that, the code computes the loss as follows

 
        # tag loss
        pull_loss = 0
        push_loss = 0

        for tl_tag, br_tag in zip(tl_tags, br_tags):
            pull, push = self.ae_loss(tl_tag, br_tag, gt_mask)
            pull_loss += pull
            push_loss += push
        pull_loss = self.pull_weight * pull_loss
        push_loss = self.push_weight * push_loss

The key line is

 
              pull, push = self.ae_loss(tl_tag, br_tag, gt_mask)

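This computes the embedding loss over all corners in a batch. The function takes the following two kinds of input

(1) the network's two embedding outputs tl_tag and br_tag, gathered at the ground-truth corner locations
(2) the tag_masks described in 2.1

tag_masks is a 0-1 mask of shape (batch_size, 128). For the image with index b_ind containing N objects, the first N entries of its 128-dimensional vector are 1 and the remaining 128-N are 0.

The function _ae_loss is defined as follows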

 
def _ae_loss(tag0, tag1, mask):
    num  = mask.sum(dim=1, keepdim=True).float()
    tag0 = tag0.squeeze()
    tag1 = tag1.squeeze()

    tag_mean = (tag0 + tag1) / 2

    tag0 = torch.pow(tag0 - tag_mean, 2) / (num + 1e-4)
    tag0 = tag0[mask].sum()
    tag1 = torch.pow(tag1 - tag_mean, 2) / (num + 1e-4)
    tag1 = tag1[mask].sum()
    pull = tag0 + tag1

    mask = mask.unsqueeze(1) + mask.unsqueeze(2)
    mask = mask.eq(2)
    num  = num.unsqueeze(2)
    num2 = (num - 1) * num
    dist = tag_mean.unsqueeze(1) - tag_mean.unsqueeze(2)
    dist = 1 - torch.abs(dist)
    dist = nn.functional.relu(dist, inplace=True)
    dist = dist - 1 / (num + 1e-4)
    dist = dist / (num2 + 1e-4)
    dist = dist[mask]
    push = dist.sum()
    return pull, push

This follows the paper's formula closely, so I will not explain it line by line.
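For readers who prefer numbers, here is a scalar sketch of the pull and push terms for a single image. It is a simplified illustration, not a drop-in replacement for _ae_loss: pull_push is a hypothetical helper, it ignores batching and masking, and it assumes a margin of 1.

```python
def pull_push(tl_emb, br_emb, delta=1.0):
    """Pull/push terms for one image, given one scalar embedding per corner."""
    n = len(tl_emb)
    # Mean embedding of each object's two corners.
    means = [(t + b) / 2 for t, b in zip(tl_emb, br_emb)]
    # Pull: each corner should sit close to its object's mean.
    pull = sum((t - m) ** 2 + (b - m) ** 2
               for t, b, m in zip(tl_emb, br_emb, means)) / n
    # Push: means of different objects should be at least delta apart.
    push = 0.0
    if n > 1:
        push = sum(max(0.0, delta - abs(means[k] - means[j]))
                   for k in range(n) for j in range(n) if j != k) / (n * (n - 1))
    return pull, push


# Two well-separated objects: small pull, zero push.
pull_a, push_a = pull_push([0.0, 2.0], [0.2, 2.4])
# Two objects whose means are only 0.5 apart: zero pull, positive push.
pull_b, push_b = pull_push([0.0, 0.5], [0.0, 0.5])
```

In the first case the object means are 2.1 apart, so only the pull term is non-zero; in the second case the corners agree perfectly within each object but the objects are too close, so only the push term is non-zero.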

(3) offset loss

The last loss is the offset loss: an offset is added to the original prediction to make it more precise. It is computed as follows

 
        # regression loss
        regr_loss = 0
        for tl_regr, br_regr in zip(tl_regrs, br_regrs):
            regr_loss += self.regr_loss(tl_regr, gt_tl_regr, gt_mask)
            regr_loss += self.regr_loss(br_regr, gt_br_regr, gt_mask)
        regr_loss = self.regr_weight * regr_loss

        loss = (focal_loss + pull_loss + push_loss + regr_loss) / len(tl_heats)
        return loss.unsqueeze(0)

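The paper uses a smooth L1 loss for the offsets, applied at the ground-truth corner locations.

The regr_loss used in the code above is defined as follows: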

 
def _regr_loss(regr, gt_regr, mask):
    num  = mask.float().sum()
    mask = mask.unsqueeze(2).expand_as(gt_regr)

    regr    = regr[mask]
    gt_regr = gt_regr[mask]
    
    regr_loss = nn.functional.smooth_l1_loss(regr, gt_regr, size_average=False)
    regr_loss = regr_loss / (num + 1e-4)
    return regr_loss

A very simple definition!
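The same computation can be followed by hand with a scalar sketch. Both helpers below are hypothetical; smooth_l1 uses the usual form with a threshold of 1, and masked_offset_loss mirrors the masking and the division by the number of objects (plus 1e-4) seen in _regr_loss.

```python
def smooth_l1(pred, target):
    """Smooth L1 with threshold 1: quadratic near zero, linear elsewhere."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1 else d - 0.5


def masked_offset_loss(preds, targets, mask):
    """Sum smooth L1 over the (x, y) offsets of valid slots, then normalize."""
    num = sum(mask)
    total = 0.0
    for (px, py), (tx, ty), m in zip(preds, targets, mask):
        if m:  # padding slots are skipped entirely
            total += smooth_l1(px, tx) + smooth_l1(py, ty)
    return total / (num + 1e-4)


preds = [(0.5, 0.5), (9.0, 9.0)]
targets = [(0.0, 0.5), (0.0, 0.0)]
mask = [1, 0]                      # only the first slot holds a real object
loss = masked_offset_loss(preds, targets, mask)
```

The second slot has a large error but is masked out, so only the first slot contributes 0.5 * 0.5^2 = 0.125 before normalization.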

Finally, we take the weighted sum of all the losses

 
          loss = (focal_loss + pull_loss + push_loss + regr_loss) / len(tl_heats)

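That concludes the analysis of the losses, and brings this article to a close.

3. Summary

The CornerNet network architecture and loss functions have now both been covered.

Overall, both parts are fairly long but substantial. I hope they help you in learning CornerNet!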
