
A Detailed Walkthrough of Deep Learning Object Detection, with a Mask R-CNN Example

Posted: 2023-01-04 13:36:19


This article walks through the evolution from R-CNN to the end-to-end Faster R-CNN, and then through a representative follow-up, the Mask R-CNN model: how these algorithms became faster and stronger.

How do we make detection faster? There are two main lines of thought:

Take the accurate methods and make them faster.

We saw earlier that the main technical idea from R-CNN to Faster R-CNN is to avoid wasted feature computation: move the ConvNet feature computation to the front so it runs only once, and push the region operations to the back. We also noted that Faster R-CNN still performs part of the ConvNet computation after the ROI. Can the computation above the ROI be moved forward as well? R-FCN (Region-based Fully Convolutional Networks) does exactly this, which makes it faster still; in a sense it is an even faster Faster R-CNN.

R-FCN

Take the fast methods and make them better.

We mentioned earlier that Overfeat performs only moderately and misses many overlapping objects. How can a regression-based approach get close to the accuracy of region proposals? YOLO brings in the divide-and-conquer and IoU ideas, and SSD brings in multi-scale anchor boxes.

Besides speed, what else matters? Naturally, making the methods better and stronger.

Faster R-CNN has three main components: the RPN for region proposals; ROI Pooling, which behaves like a feature pyramid and copes with very large and very small overlapping objects; and the classification and box-regression losses (log loss plus smoothed L1) that refine localization. How can these be made better and stronger? (The smoothed L1 term is sketched below.)
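For intuition, here is a minimal NumPy sketch of that smoothed L1 box-regression term, assuming the standard Fast/Faster R-CNN definition (the classification term is an ordinary log loss):

import numpy as np

def smooth_l1(diff):
    # Smoothed L1: quadratic near zero (like L2), linear for large
    # errors (like L1), so outlier boxes do not dominate the gradient.
    diff = np.abs(diff)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)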

Can we do better than the RPN?

We saw earlier that the RPN matches the quality of Selective Search. If we want to do even better, how?

AttractioNet exploits NMS (non-maximum suppression), and AttentionNet uses a weak attention-narrowing mechanism.

Can we do better than ROI Pooling?

We saw earlier that ROI Pooling achieves an SPM-like effect, similar to the spatial constraints of HOG pyramids and DPM. If we want to do even better, how? ION (Inside-Outside Net) proposes four-direction contextual features, and FPN proposes a feature pyramid network.

Can we make ROI Pooling stronger?

ROI Pooling is built on ROIs, which correspond to region proposals. How can we align it further, all the way down to individual pixels?

Mask R-CNN proposes ROI Align, and in the loss it adds a pixel-level mask-branch term on top of the classification and box-regression terms.

So what are FCN (Fully Convolutional Networks), IoU, NMS, weak attention narrowing, ION, FPN, ROI Align, and the mask branch? Once you understand these ideas, the following lineup of models will no longer look unfamiliar:

R-CNN, Overfeat, DetectionNet, DeepMultiBox, SPP-net, Fast R-CNN, MR-CNN, SSD, YOLO, YOLOv2, G-CNN, AttractioNet, Mask R-CNN, R-FCN, RPN, FPN, Faster R-CNN…

Now, on to the second half of the journey!

R-FCN

As mentioned earlier, when Faster R-CNN was made end to end, the ConvNet backbone was also switched to VGG-16. In GoogLeNet and ResNet, however, only a single fully connected (FC) layer remains, the last one, which serves the softmax classifier.

So if we want to use GoogLeNet or ResNet in Faster R-CNN, we face a problem: the last FC layer must go, because it only does classification, and a new head network is needed that supports both classification and box regression.

Looking at how ROI Pooling is used, the FC layers after ROI Pooling would then also have to become convolutional layers. But that splits the convolutional stack in two at the ROI Pooling layer, and the split forces the ConvNet after ROI Pooling to be recomputed for every ROI.

A natural question: can we simply drop the FC layers and put ROI Pooling straight after the ResNet ConvNet? No, the accuracy drops. Why? As discussed when Fast R-CNN inherited SPPNet's SPM idea and turned it into ROI Pooling: ROI Pooling only provides the finest spatial partition, and the coarser levels are approximated by the FC layers that follow. Remove them, and the pyramid structure, or equivalently the depth of the head, is lost.

So how can the convolutional computation after the ROI be moved to the front as well? That is exactly the problem R-FCN solves: keep the spatial constraints on the one hand, and keep some feature hierarchy on the other. R-FCN proposes Position-Sensitive ROI Pooling.

The idea of Position-Sensitive ROI Pooling is precisely to parallelize over positions: it is an attention mechanism that incorporates spatial information, where each small bin draws from a ConvNet score map bound to a specific relative position.

Once features are bound to positions, feature computation changes from looking at one central crop into a set of feature maps, each viewing the object from a different sub-box (upper-left, lower-right, and so on). Combining them implicitly encodes spatial information; in other words, first look at the mountain from every side, then stitch the views together to judge whether you have the right mountain. The position-specific features are assembled, pooled, and the pooled responses vote for the final score.

Figure 3: Visualization of R-FCN (k × k = 3 × 3) for the person category.

The net effect is that the feature computation happens up front, while the position-wise assembly and voting happen at the very end, instead of first cropping features per position, fusing them, and then running classification and regression; here the voting is done directly over positions. Note that PS ROI Pooling and ROI Pooling are not the same pooling operation. A simplified sketch follows.
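Below is a minimal NumPy sketch of position-sensitive ROI pooling with k × k relative positions and C classes (average pooling per bin, then average voting across positions). It assumes the ROI is large enough that every bin is non-empty; the real R-FCN also handles sub-pixel bin boundaries and a background class.

import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    # score_maps: (k*k*C, H, W) -- one group of C class scores per
    #             relative position of the k x k grid.
    # roi:        (x1, y1, x2, y2) in feature-map coordinates.
    num_pos = k * k
    C = score_maps.shape[0] // num_pos
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / k
    bin_h = (y2 - y1) / k
    votes = np.zeros((num_pos, C))
    for i in range(k):              # grid row
        for j in range(k):          # grid column
            xs = int(np.floor(x1 + j * bin_w))
            xe = int(np.ceil(x1 + (j + 1) * bin_w))
            ys = int(np.floor(y1 + i * bin_h))
            ye = int(np.ceil(y1 + (i + 1) * bin_h))
            pos = i * k + j         # channels bound to this position
            group = score_maps[pos * C:(pos + 1) * C, ys:ye, xs:xe]
            votes[pos] = group.mean(axis=(1, 2))    # pool inside the bin
    return votes.mean(axis=0)       # positions vote: one score per class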

Advantages of R-FCN:

1) It clearly ties the speed gain to ConvNet feature sharing.

2) Position-wise parallel feature computation, followed by pooling-based voting, removes the heavy per-ROI computation that used to follow ROI Pooling.

3) A recommended choice when balancing speed against accuracy.

Problems with R-FCN:

1) Still not real-time for video (24 frames per second).

2) Does not address pixel-level instance segmentation.

YOLO

As mentioned earlier, one big reason Overfeat underperforms is that it has no region-proposal mechanism dedicated to raising recall.

Conversely, an important reason Faster R-CNN, which has the RPN for proposals, is slow is that the RPN alone costs roughly as much computation as all of Overfeat; hence it is a two-stage method.

Overfeat started the story of one-stage, end-to-end detection, but its accuracy was poor. Without a region-proposal mechanism, relying only on classification and regression, how can recall be pushed up further?

How do we improve on sliding windows?

1) Divide and conquer for class prediction

2) Divide and conquer for box prediction

3) Merge the boxes across different classes

One question remains: how to choose among the boxes? This is where IoU (intersection over union) comes in, in two steps:

1) First predict one box per class; for example, with 3 objects (dog, bicycle, car) there are 3 corresponding boxes.

2) Which box should an object be assigned to? The ratio of the intersection to the union decides which box to use.

Error analysis on the VOC data gives the following breakdown (a minimal IoU computation is sketched after this list):

Correct: correct class and IoU > 0.5
Localization: correct class, 0.1 < IoU < 0.5
Similar: class is similar, IoU > 0.1
Other: class is wrong, IoU > 0.1
Background: IoU < 0.1 for any object
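A minimal IoU helper, assuming boxes are given as (x1, y1, x2, y2) corners:

def iou(box_a, box_b):
    # Intersection over union of two axis-aligned boxes.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as "Correct" in the VOC analysis above when the
# class matches and iou(prediction, ground_truth) > 0.5.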

Accordingly, YOLO's loss function accounts for:

(1) Box regression

(2) Whether an object is present in a cell (objectness)

(3) Which object it is (class probabilities)

(4) The box that best fits the object in each region

As noted before, Faster R-CNN is already fast but not real-time, since video needs at least 24 frames per second. YOLO is fast, so it naturally found its way into video. For video, YOLO can be further optimized into the even faster Fast YOLO. Faster here means sharing: the class probability map is shared across frames and only corrected rather than relearned, which is why it is faster.

Advantages of YOLO:

1) A classic regression-plus-classification model in one single CNN

2) The divide-and-conquer idea works well

3) Good real-time behavior, essentially approaching the 24 frames-per-second standard

4) Far fewer boxes than Selective Search (region proposals put more weight on recall)

Problems with YOLO:

1) Accuracy is lower than Faster R-CNN and R-FCN

2) Poor on small and irregularly shaped objects

3) Localization is not precise

YOLOv2

How can YOLO's accuracy be raised further? Remember how the RPN uses prior knowledge about box aspect ratios, the anchor boxes? Roughly 5 box shapes already cover about 60% of the cases.

So plain box prediction becomes box prediction with priors, that is, the width and height carry a prior.

A series of further tricks makes YOLOv2 better than YOLOv1. The biggest single gain comes from the dimension priors: computing a prior distribution over box sizes helps a great deal (a clustering sketch follows).
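A rough sketch of that dimension-prior clustering, assuming ground-truth (width, height) pairs and the 1 - IoU distance that YOLOv2 describes; the function names here are illustrative rather than from any particular codebase:

import numpy as np

def iou_wh(box, clusters):
    # IoU between one (w, h) box and k cluster shapes, with all boxes
    # assumed to share the same center.
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100, seed=0):
    # Cluster ground-truth (w, h) pairs with distance = 1 - IoU.
    rng = np.random.default_rng(seed)
    clusters = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in wh])
        new = np.array([wh[assign == c].mean(axis=0) if np.any(assign == c)
                        else clusters[c] for c in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters    # k anchor (w, h) priors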

YOLOv2 also adopts the Darknet-19 network, which makes it faster still.

YOLO9000 adds hierarchical object labels through a WordTree.

Hierarchical class labels: WordTree

Advantages of YOLOv2:

1) Batch normalization (BN) (+2% mAP)

2) Higher-resolution input (448×448), improving small-object detection (+4% mAP)

3) A finer grid (13×13) (+1% mAP)

4) Anchor boxes chosen by k-means (recall improves from 81% to 88%)

Advantages of YOLO9000:

1) Hierarchical labels combining COCO and ImageNet

Problems with YOLOv2:

1) No instance segmentation.

SSD

In line with the anchor-box and pyramid ideas, SSD introduces multiple scales and multiple default aspect ratios.

The multi-scale CNN uses a layered-output pattern similar to GoogLeNet.

Putting these together gives the SSD network.

As the SSD network shows, the multiple scales are computed in parallel.

Advantages of SSD:

1) Builds on YOLO by adding multi-scale feature maps, computed in parallel inside the ConvNet

2) Adopts the anchor-box mechanism

3) Better accuracy and higher speed than YOLO

Problems with SSD:

1) Accuracy is hard to push past R-FCN and Faster R-CNN

AttentionNet

The attention-narrowing idea is fairly simple.

Compared with region proposals, it has certain advantages.

The attention-shifting process can be read as the top-left and bottom-right corner points moving toward each other as closely as possible.

The whole process iterates until the detection is sufficiently precise.

This attention-moving process must also be tied to a specific target before it can handle multiple objects.

Different classes can be arranged as parallel branches.

Each object instance then needs its own attention-moving process, and multiple instances can also run in parallel.

So a two-stage procedure is used: first find a rough box for each instance, then refine it into an accurate box.

Advantages of AttentionNet:

1) A brand-new way of locating regions

2) Better accuracy than R-CNN

Problems with AttentionNet:

1) The multi-instance scheme is fairly complex

2) The iterative box movement adds overhead

AttractioNet

AttractioNet stands for (Act)ive Box Proposal Generation via (I)n-(O)ut Localization. How does it optimize the boxes?

1) Focus attention more tightly.

2) Localize more finely.

How is the localization refined?

By trimming the object's probability distribution along the horizontal and vertical axes.

The corresponding network module is the ARN (Attend & Refine), iterated repeatedly, with NMS applied at the end for correction. Doesn't this resemble the iterative combination of an RPN with ROI Pooling? The difference is that the box proposals from every ARN iteration are kept and then corrected with NMS.

The ARN also differs from the RPN in structure: it refines the horizontal and vertical axes separately, defined through in-out maximum likelihood, as in the refinement illustration above.

Having explained the ARN, what is NMS? It is simply a local maximum-finding procedure.

The NMS correction picks the best-fitting box out of many overlapping ones, somewhat like voting (a minimal sketch follows).
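A minimal greedy NMS sketch, reusing the iou() helper sketched in the YOLO section above:

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # Keep the highest-scoring box, drop every box that overlaps it by
    # more than iou_thresh, and repeat on the remainder.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        overlaps = np.array([iou(boxes[i], boxes[best]) for i in rest])
        order = rest[overlaps <= iou_thresh]
    return keep    # indices of the surviving boxes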

Advantages of AttractioNet:

1) Introduces the idea of iteratively refining regions

2) Better proposals than Selective Search

3) Region proposals built directly on a CNN

Problems with AttractioNet:

1) Repeated iteration slows it down

2) The network structure is more complex than the RPN

G-CNN

Grid-CNN absorbs YOLO's divide-and-conquer idea and then merges regions.

It is not a simple merge, but an iterative optimization.

The process is quite unlike NMS: it optimizes iteratively through repeated IoU computation.

To avoid recomputing features, it makes feature extraction a one-time global step and leaves only the box regression to be optimized repeatedly.

The movement of the regression boxes over the iterations can be visualized.

Advantages of G-CNN:

1) Iterative optimization replaces NMS-style simple merging

2) Slightly better accuracy than Faster R-CNN

3) Divide and conquer makes it slightly faster than Faster R-CNN

Problems with G-CNN:

Still too slow to be practical.

ION

Inside-Outside Net proposes RNN-based context for object detection: sweeping across the image pixel by pixel in the up, down, left, and right directions, an RNN encodes what it has seen, which is called the context in that direction.

This gives four-direction RNN context for extracting contextual features.

A stack of RNNs reinforces context at different granularities.

This resembles how R-FCN iteratively encodes spatial constraints, except that the box positions are not partitioned by hand; instead an IRNN encodes them directly.

Comparing network configurations with and without context shows that the IRNN brings roughly a 2-point mAP gain (a one-direction sweep is sketched below).
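A sketch of one directional IRNN sweep over a feature map, under the usual IRNN assumptions (identity-initialized recurrent weights, ReLU activation); ION runs four such sweeps (left, right, up, down) and concatenates the outputs:

import numpy as np

def irnn_left_to_right(feat):
    # feat: (C, H, W) feature map. Each output column accumulates
    # context from everything to its left.
    C, H, W = feat.shape
    W_hh = np.eye(C)                 # identity-initialized recurrence
    out = np.zeros_like(feat)
    h = np.zeros((C, H))             # one hidden vector per image row
    for x in range(W):
        h = np.maximum(0.0, W_hh @ h + feat[:, :, x])   # ReLU(W_hh h + x_t)
        out[:, :, x] = h
    return out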

Advantages of ION:

1) Introduces the notion of RNN context

2) Improves detection of small objects

3) Better accuracy than R-CNN

Problems with ION:

1) The RNN adds computation, so it is slower

FPN

How can the feature pyramid be folded into the neural network itself, so that repeated computation is avoided?

FPN answers this: through convolutions and merging it builds a feature pyramid network.

What does the pyramid buy us? Objects of different sizes can be handled at different scales.

At each pyramid level, target objects can then be found at a comparable relative size.

This makes all scales compatible (a top-down pathway sketch follows).
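A minimal Keras sketch of the FPN top-down pathway; the names c2…c5 for the backbone stage outputs and the 256-channel width are illustrative assumptions, not the exact code of any implementation:

from keras.layers import Conv2D, UpSampling2D, Add

def fpn_top_down(c2, c3, c4, c5, channels=256):
    # Lateral 1x1 convs project each backbone stage to a common width,
    # then the coarser level is upsampled 2x and added in.
    p5 = Conv2D(channels, 1, padding='same')(c5)
    p4 = Add()([UpSampling2D(2)(p5), Conv2D(channels, 1, padding='same')(c4)])
    p3 = Add()([UpSampling2D(2)(p4), Conv2D(channels, 1, padding='same')(c3)])
    p2 = Add()([UpSampling2D(2)(p3), Conv2D(channels, 1, padding='same')(c2)])
    # A 3x3 conv on each merged map smooths upsampling artifacts.
    return [Conv2D(channels, 3, padding='same')(p) for p in (p2, p3, p4, p5)]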

Advantages of FPN:

1) Jointly accounts for multiple scales and small objects

2) Balances speed and accuracy

3) Can be combined widely to improve other models

Problems with FPN:

1) Requires computation at multiple levels, which adds cost

Mask R-CNN

To recap: the ROI was first introduced in R-CNN.

ROI Pooling was first introduced in Fast R-CNN.

What does Mask R-CNN change? It proposes ROI Align, which lets the newly added mask branch map back to exact pixels.

What is a mask?

With a mask, instance segmentation becomes possible.

So where do ROI Pooling and ROI Align differ? How can we map precisely back to pixel boundaries? The answer is that the pooling bins must not be snapped to integer pooling boundaries; they have to follow the exact, scaled pixel-level boundaries.

With ordinary pooling there is a quantization offset, and mapped onto a pixel mask that offset blurs the boundary. Earlier work corrected it with ROI Warping and interpolation.

The mask, classification, and regression heads can then be learned either on FPN features or directly on the ROI Align features (a bilinear-sampling sketch follows).
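The essential difference is bilinear sampling at real-valued locations instead of rounding to integer bins. A minimal sketch, assuming sample points stay inside the feature map:

import numpy as np

def bilinear_sample(feature, y, x):
    # Sample a (H, W) feature map at a real-valued point (y, x).
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feature.shape[0] - 1)
    x1 = min(x0 + 1, feature.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0, x0] * (1 - dx) + feature[y0, x1] * dx
    bottom = feature[y1, x0] * (1 - dx) + feature[y1, x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align_bin(feature, y1, x1, y2, x2, samples=2):
    # Average `samples` x `samples` bilinear samples inside one output
    # bin whose floating-point extent is [y1, y2] x [x1, x2]; no
    # coordinate is ever rounded, unlike ROI Pooling.
    ys = np.linspace(y1, y2, samples + 2)[1:-1]   # interior sample points
    xs = np.linspace(x1, x2, samples + 2)[1:-1]
    return np.mean([bilinear_sample(feature, y, x) for y in ys for x in xs])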

Precursors of the mask computation:

1) RoI Warping in MNC (Multi-task Network Cascades), which estimates values by interpolation

2) The position-aware sliding masks of FCIS (Fully Convolutional Instance Segmentation)

RoI Align works considerably better than the segment-based alternatives.

It can also be applied to the 17 keypoints of human pose estimation.

Advantages of Mask R-CNN:

1) From ROI Pooling to ROI Align (drawing on ROI Warping)

2) Mask prediction (drawing on MNC and FCIS)

3) State-of-the-art results

4) With minor adjustments it can do human pose estimation

Problems with Mask R-CNN:

1) Not fast enough

2) Pixel-level prediction needs a large amount of training data

Mask X R-CNN

Mask X R-CNN is Mask R-CNN with weight transfer learning.

It improves the results further.

Summary

A quick takeaway:

1) Speed first: SSD

2) Balanced speed and accuracy: R-FCN

3) Accuracy first: Faster R-CNN, Mask R-CNN

4) One network, many uses: Mask R-CNN

Next comes a deep learning object detection example:

Model example: Mask R-CNN principles and implementation

Mask R-CNN for Object Detection and Segmentation

This is an implementation of Mask R-CNN on Python 3, Keras, and TensorFlow. The model generates bounding boxes and segmentation masks for each instance of an object in the image. It’s based on Feature Pyramid Network (FPN) and a ResNet101 backbone.

The repository includes:

Source code of Mask R-CNN built on FPN and ResNet101.

Training code for MS COCO

Pre-trained weights for MS COCO

Jupyter notebooks to visualize the detection pipeline at every step

ParallelModel class for multi-GPU training

Evaluation on MS COCO metrics (AP)

Example of training on your own dataset

The code is documented and designed to be easy to extend. If you use it in your research, please consider citing this repository (bibtex below). If you work on 3D vision, you might find our recently released Matterport3D dataset useful as well.

This dataset was created from 3D-reconstructed spaces captured by our customers who agreed to make them publicly available for academic use. You can see more examples here.

Getting Started

demo.ipynb is the easiest way to start. It shows an example of using a model pre-trained on MS COCO to segment objects in your own images.

It includes code to run object detection and instance segmentation on arbitrary images.
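To give a feel for what the notebook does, here is a minimal inference sketch (assuming the mrcnn package installed by setup.py and the COCO weights from the releases page; the exact Config subclass used in the notebook may differ):

import skimage.io
import mrcnn.model as modellib
from mrcnn.config import Config

class InferenceConfig(Config):
    NAME = "coco_inference"
    NUM_CLASSES = 1 + 80      # COCO: background + 80 classes
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1        # one image at a time for inference

config = InferenceConfig()
model = modellib.MaskRCNN(mode="inference", config=config, model_dir="logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True)

image = skimage.io.imread("your_image.jpg")
results = model.detect([image], verbose=1)
r = results[0]    # dict with 'rois', 'class_ids', 'scores', 'masks'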

train_shapes.ipynb shows how to train Mask R-CNN on your own dataset. This notebook introduces a toy dataset (Shapes) to demonstrate training on a new dataset.

(model.py, utils.py, config.py): These files contain the main Mask RCNN implementation.

inspect_data.ipynb. This notebook visualizes the different pre-processing steps to prepare the training data.

inspect_model.ipynb This notebook goes in depth into the steps performed to detect and segment objects. It provides visualizations of every step of the pipeline.

inspect_weights.ipynb This notebook inspects the weights of a trained model and looks for anomalies and odd patterns.

Step by Step Detection

To help with debugging and understanding the model, there are 3 notebooks (inspect_data.ipynb, inspect_model.ipynb, inspect_weights.ipynb) that provide a lot of visualizations and allow running the model step by step to inspect the output at each point. Here are a few examples:

1. Anchor sorting and filtering

Visualizes every step of the first stage Region Proposal Network and displays positive and negative anchors along with anchor box refinement.

2. Bounding Box Refinement

This is an example of final detection boxes (dotted lines) and the refinement applied to them (solid lines) in the second stage.

3. Mask Generation

Examples of generated masks. These then get scaled and placed on the image in the right location.

4. Layer activations

Often it’s useful to inspect the activations at different layers to look for signs of trouble (all zeros or random noise).

5. Weight Histograms

Another useful debugging tool is to inspect the weight histograms. These are included in the inspect_weights.ipynb notebook.

6. Logging to TensorBoard

TensorBoard is another great debugging and visualization tool. The model is configured to log losses and save weights at the end of every epoch.

7. Composing the different pieces into a final result

Training on MS COCO

We’re providing pre-trained weights for MS COCO to make it easier to start. You can use those weights as a starting point to train your own variation on the network. Training and evaluation code is in samples/coco/coco.py. You can import this module in Jupyter notebook (see the provided notebooks for examples) or you can run it directly from the command line as such:

Train a new model starting from pre-trained COCO weights

python3 samples/coco/coco.py train --dataset=/path/to/coco/ --model=coco

Train a new model starting from ImageNet weights

python3 samples/coco/coco.py train --dataset=/path/to/coco/ --model=imagenet

Continue training a model that you had trained earlier

python3 samples/coco/coco.py train --dataset=/path/to/coco/ --model=/path/to/weights.h5

Continue training the last model you trained. This will find the last trained weights in the model directory.

python3 samples/coco/coco.py train --dataset=/path/to/coco/ --model=last

You can also run the COCO evaluation code with:

Run COCO evaluation on the last trained model

python3 samples/coco/coco.py evaluate --dataset=/path/to/coco/ --model=last

The training schedule, learning rate, and other parameters should be set in samples/coco/coco.py.

Training on Your Own Dataset

Start by reading this blog post about the balloon color splash sample. It covers the process starting from annotating images to training to using the results in a sample application.

In summary, to train the model on your own dataset you’ll need to extend two classes:

Config

This class contains the default configuration. Subclass it and modify the attributes you need to change.

Dataset

This class provides a consistent way to work with any dataset. It allows you to use new datasets for training without having to change the code of the model. It also supports loading multiple datasets at the same time, which is useful if the objects you want to detect are not all available in one dataset.

See examples in samples/shapes/train_shapes.ipynb, samples/coco/coco.py, samples/balloon/balloon.py, and samples/nucleus/nucleus.py.
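As a minimal, hypothetical sketch of those two subclasses for a one-class dataset (the attribute values and the 'balloon' class name are illustrative; see the sample scripts above for the real versions):

from mrcnn.config import Config
from mrcnn import utils

class BalloonConfig(Config):
    # Override only the attributes that need to change.
    NAME = "balloon"
    NUM_CLASSES = 1 + 1        # background + 1 object class
    IMAGES_PER_GPU = 2
    STEPS_PER_EPOCH = 100

class BalloonDataset(utils.Dataset):
    def load_balloon(self, dataset_dir, subset):
        # Register the class and add one entry per image; the actual
        # annotation parsing depends on your data format.
        self.add_class("balloon", 1, "balloon")
        ...

    def load_mask(self, image_id):
        # Return (masks of shape [H, W, instance_count], class_ids).
        ...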

Differences from the Official Paper

This implementation follows the Mask RCNN paper for the most part, but there are a few cases where we deviated in favor of code simplicity and generalization. These are some of the differences we’re aware of. If you encounter other differences, please do let us know.

Image Resizing: To support training multiple images per batch we resize all images to the same size. For example, 1024x1024px on MS COCO. We preserve the aspect ratio, so if an image is not square we pad it with zeros. In the paper the resizing is done such that the smallest side is 800px and the largest is trimmed at 1000px.

Bounding Boxes: Some datasets provide bounding boxes and some provide masks only. To support training on multiple datasets we opted to ignore the bounding boxes that come with the dataset and generate them on the fly instead. We pick the smallest box that encapsulates all the pixels of the mask as the bounding box. This simplifies the implementation and also makes it easy to apply image augmentations that would otherwise be harder to apply to bounding boxes, such as image rotation.

To validate this approach, we compared our computed bounding boxes to those provided by the COCO dataset. We found that ~2% of bounding boxes differed by 1px or more, ~0.05% differed by 5px or more, and only 0.01% differed by 10px or more.
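The on-the-fly box generation amounts to taking the extent of the positive mask pixels; a minimal sketch:

import numpy as np

def bbox_from_mask(mask):
    # Smallest (y1, x1, y2, x2) box enclosing all positive mask pixels.
    rows = np.any(mask, axis=1)
    cols = np.any(mask, axis=0)
    if not rows.any():
        return None                      # empty mask: no box
    y1, y2 = np.where(rows)[0][[0, -1]]
    x1, x2 = np.where(cols)[0][[0, -1]]
    return y1, x1, y2 + 1, x2 + 1        # end coordinates are exclusive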

Learning Rate: The paper uses a learning rate of 0.02, but we found that to be too high, and it often causes the weights to explode, especially when using a small batch size. It might be related to differences between how Caffe and TensorFlow compute gradients (sum vs mean across batches and GPUs). Or, maybe the official model uses gradient clipping to avoid this issue. We do use gradient clipping, but don’t set it too aggressively. We found that smaller learning rates converge faster anyway so we go with that.

Citation

Use this bibtex to cite this repository:

@misc{matterport_maskrcnn_,
  title={Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow},
  author={Waleed Abdulla},
  year={},
  publisher={Github},
  journal={GitHub repository},
  howpublished={\url{/matterport/Mask_RCNN}},
}

Contributing

Contributions to this repository are welcome. Examples of things you can contribute:

Speed Improvements. Like re-writing some Python code in TensorFlow or Cython.

Training on other datasets.

Accuracy Improvements.

Visualizations and examples.

You can also join our team and help us build even more projects like this one.

Requirements

Python 3.4, TensorFlow 1.3, Keras 2.0.8 and other common packages listed in requirements.txt.

MS COCO Requirements:

To train or test on MS COCO, you’ll also need:

pycocotools (installation instructions below)

MS COCO Dataset

Download the 5K minival and the 35K validation-minus-minival subsets. More details in the original Faster R-CNN implementation.

If you use Docker, the code has been verified to work on this Docker container.

Installation

Clone this repository

Install dependencies

pip3 install -r requirements.txt

Run setup from the repository root directory

python3 setup.py install

Download pre-trained COCO weights (mask_rcnn_coco.h5) from the releases page.

(Optional) To train or test on MS COCO install pycocotools from one of these repos. They are forks of the original pycocotools with fixes for Python3 and Windows (the official repo doesn’t seem to be active anymore).

Linux: /waleedka/coco

Windows: /philferriere/cocoapi.

You must have the Visual C++ build tools on your path (see the repo for additional details)

Projects Using this Model

If you extend this model to other datasets or build projects that use it, we’d love to hear from you.

4K Video Demo by Karol Majek.

Images to OSM: Improve OpenStreetMap by adding baseball, soccer, tennis, football, and basketball fields.

Splash of Color. A blog post explaining how to train this model from scratch and use it to implement a color splash effect.

Segmenting Nuclei in Microscopy Images. Built for the Data Science Bowl

Code is in the samples/nucleus directory.

Detection and Segmentation for Surgery Robots by the NUS Control & Mechatronics Lab.

Reconstructing 3D buildings from aerial LiDAR

A proof of concept project by Esri, in collaboration with Nvidia and Miami-Dade County. Along with a great write up and code by Dmitry Kudinov, Daniel Hedges, and Omar Maher.

Usiigaci: Label-free Cell Tracking in Phase Contrast Microscopy

A project from Japan to automatically track cells in a microfluidics platform. Paper is pending, but the source code is released.

Characterization of Arctic Ice-Wedge Polygons in Very High Spatial Resolution Aerial Imagery

Research project to understand the complex processes between degradations in the Arctic and climate change. By Weixing Zhang, Chandi Witharana, Anna Liljedahl, and Mikhail Kanevskiy.

Mask-RCNN Shiny

A computer vision class project by HU Shiyu to apply the color pop effect on people with beautiful results.

Mapping Challenge: Convert satellite imagery to maps for use by humanitarian organisations.

GRASS GIS Addon to generate vector masks from geospatial imagery. Based on a Master’s thesis by Ondřej Pešek.
