Detection frameworks

# Deformable parts models

Uses a sliding window for detection. A disjoint pipeline extracts static features, classifies regions, and predicts bounding boxes for high-scoring regions.


# R-CNN

Use region proposals before classification. Selective Search generates candidate bounding boxes. The CNN only extracts features from each candidate, and an SVM then scores the boxes.

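Problems: Slow, because the CNN runs separately on every proposal, and each proposal is warped to a fixed size before it is sent to the network. Selective Search is a fixed algorithm that cannot be trained from data, so it can produce bad candidates.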


# Fast R-CNN

Run the convolutional network only once per image to generate a feature map. Project each region proposal onto the feature map and use RoI pooling to convert it into a fixed-size feature, which is then classified.

Region proposals become the bottleneck for speed.
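A minimal sketch of RoI max pooling on a single-channel feature map, assuming the RoI is already given in integer feature-map coordinates. The function name `roi_max_pool` and the 2*2 output size are illustrative choices, not taken from any library.

```python
import math

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one RoI into a fixed out_size x out_size grid.

    feature_map: list of rows (H x W), one channel for simplicity.
    roi: (x0, y0, x1, y1) in feature-map cells, end-exclusive (assumed format).
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_size):
        # Bin edges are rounded to whole cells, so neighbouring bins may overlap.
        r_start = y0 + math.floor(i * h / out_size)
        r_end = y0 + math.ceil((i + 1) * h / out_size)
        row = []
        for j in range(out_size):
            c_start = x0 + math.floor(j * w / out_size)
            c_end = x0 + math.ceil((j + 1) * w / out_size)
            row.append(max(feature_map[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled

# Toy 4 x 6 feature map whose value encodes its position.
fmap = [[r * 6 + c for c in range(6)] for r in range(4)]
print(roi_max_pool(fmap, (0, 0, 6, 4)))  # 4 x 6 region -> 2 x 2
print(roi_max_pool(fmap, (1, 1, 4, 4)))  # 3 x 3 region -> 2 x 2
```

Both regions have different shapes but produce the same 2*2 output, which is what lets the later layers accept proposals of any size.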

# Faster R-CNN

Eliminates the Selective Search algorithm and lets the network learn the region proposals. A Region Proposal Network (RPN), which shares convolutional features with the detection network, predicts the region proposals.


# Mask R-CNN

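Mask R-CNN extends Faster R-CNN by adding a branch that predicts a segmentation mask for each detected object (instance segmentation).

RoIAlign replaces RoI pooling to convert each region into a fixed size. It samples the feature map with bilinear interpolation instead of rounding coordinates to the grid, which keeps the masks aligned with the image.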
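RoIAlign samples the feature map at fractional coordinates using bilinear interpolation. The sketch below shows only that sampling step on a single-channel map; the helper name `bilinear_sample` is hypothetical, and a full RoIAlign would additionally average several such samples per output bin.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Interpolate feature_map (list of rows) at a fractional position (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    x0, y0 = math.floor(x), math.floor(y)
    # Clamp the neighbouring cell so positions on the border stay valid.
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

fmap = [[0.0, 10.0],
        [20.0, 30.0]]
print(bilinear_sample(fmap, 0.5, 0.5))  # midpoint of the four cells
```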


# YOLO

Directly regress bounding boxes from the whole image. Divide the image into a 7*7 grid. Each grid cell proposes two bounding boxes (7*7*2 = 98 per image) and is responsible for detecting an object if the center of that object falls inside the cell. Boxes are classified using convolutional features.

The network ends in fully connected layers, so the input size must be 448*448. The output is 7*7*30. The 30 values per grid cell are: the probabilities of 20 classes, predicted per cell rather than per bounding box and conditioned on the cell containing an object, P(Ci|object); the confidences of the two boxes (probability of containing an object * IOU between the box and the ground truth); and x/y/w/h for each of the two boxes (20 + 2 + 8 = 30).
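The sketch below illustrates how a ground-truth center maps to its responsible grid cell and how one cell's 30 values can be split. The ordering inside the 30-vector (class probabilities first, then the two boxes) is an assumption made for illustration, and the helper names are hypothetical.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (from the text)

def responsible_cell(cx, cy, s=S):
    """Return (row, col) of the cell containing a center given in [0, 1] image coordinates."""
    col = min(int(cx * s), s - 1)  # clamp so cx == 1.0 stays inside the grid
    row = min(int(cy * s), s - 1)
    return row, col

def split_cell(vec, b=B, c=C):
    """Split one cell's vector into class probabilities and boxes.

    Assumed layout: [c class probabilities | b * (x, y, w, h, confidence)].
    """
    assert len(vec) == c + 5 * b
    class_probs = vec[:c]
    boxes = [tuple(vec[c + 5 * k: c + 5 * (k + 1)]) for k in range(b)]
    return class_probs, boxes

print(C + 5 * B)                   # values per cell
print(S * S * B)                   # boxes per image
print(responsible_cell(0.5, 0.5))  # center of the image
probs, boxes = split_cell(list(range(30)))
print(len(probs), boxes)
```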

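Loss: a sum of squared errors over the x/y deviation, the deviation of sqrt(w) and sqrt(h), the confidence error for boxes in cells with an object, the confidence error for boxes in cells without an object (weighted lower), and the classification error.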
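Written out in the notation of the YOLO paper, the five terms are as follows. Here S is the grid size, B the number of boxes per cell, the indicator is 1 when box j of cell i is responsible for an object, C is the box confidence, and p_i(c) is the class probability; the paper sets λ_coord = 5 and λ_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}}
    \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
         + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
    \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
    \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on w and h make the same absolute error count more for small boxes than for large ones.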

Non-maximum suppression removes redundant boxes. Boxes are ranked by P(Ci|object) * Confidence_j, which is the probability of class i in box j.
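A minimal sketch of greedy non-maximum suppression for one class, using the class-specific score P(Ci|object) * Confidence_j. The corner box format `(x1, y1, x2, y2)`, the 0.5 IOU threshold and the sample numbers are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

class_prob = 0.8                 # P(Ci|object) of the cell (made-up value)
confidences = [0.9, 0.6, 0.7]    # Confidence_j of three boxes (made-up values)
scores = [class_prob * conf for conf in confidences]
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores)
print(kept)
```

The second box overlaps the first with an IOU of about 0.68 and has a lower score, so it is suppressed; the third box does not overlap and is kept.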

Con: performs poorly on small objects and on objects that are close together, because each grid cell predicts only two boxes and both must share the same class.



# YOLOv2 / YOLO9000

Better, Faster, Stronger. 

Several ideas to improve mAP. Use batch normalization. Use higher-resolution images for the classifier. Use anchor boxes: predefined boxes in each grid cell, for which the network predicts size adjustments. The anchor boxes are not chosen by hand but derived from the statistics of objects in the training samples, using the k-means clustering algorithm to find the centroid boxes most representative of object sizes. Restrict the predicted box location to lie within its grid cell. Add a passthrough layer so that small objects can be detected. Train on multiple image scales.
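The anchor clustering can be sketched as k-means over (width, height) pairs with 1 - IOU as the distance, as described in the YOLOv2 paper. This is a simplified illustration: the deterministic initialization, the fixed iteration count and the toy box sizes are assumptions, and `kmeans_anchors` is a hypothetical helper name.

```python
def wh_iou(a, b):
    """IOU of two boxes given as (w, h), aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=20):
    """Cluster (w, h) pairs; the nearest centroid is the one with the highest IOU."""
    # Deterministic start for this sketch: evenly spaced boxes after sorting by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[best].append(box)
        # New centroid is the mean width and height; empty clusters keep the old one.
        centroids = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return sorted(centroids, key=lambda b: b[0] * b[1])

# Toy training boxes: three small objects and three large ones.
train_boxes = [(10, 12), (11, 13), (12, 11), (100, 90), (110, 95), (95, 105)]
anchors = kmeans_anchors(train_boxes, k=2)
print(anchors)
```

Using IOU rather than Euclidean distance keeps large boxes from dominating the clustering error.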

Output: 13*13 (grid size) * 5 (number of anchors) * 25 (20 classes, 4 bounding box parameters, 1 confidence).

Construct a WordTree to classify 9000 classes. Each class probability is a product of conditional probabilities along the path to the root, such as P(Norfolk terrier) = P(Norfolk terrier|terrier) * P(terrier|hunting dog) * P(hunting dog|dog) *......* P(animal|physical object) * P(physical object). For training, the labels of all parent nodes of the ground-truth class are set to 1. The model is trained on both ImageNet (a classification dataset, larger) and COCO (a detection dataset, smaller).
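The chained conditional probabilities can be sketched as a walk from a leaf to the root. The tree below is a shortened toy path and the probabilities are made-up numbers, not the actual WordTree or any model output.

```python
# Toy tree: child -> parent (None marks the root). Hypothetical, shortened path.
PARENT = {
    "Norfolk terrier": "terrier",
    "terrier": "dog",
    "dog": "animal",
    "animal": "physical object",
    "physical object": None,
}

# Made-up values of P(node | parent); the root entry is P(physical object).
COND = {
    "Norfolk terrier": 0.5,
    "terrier": 0.5,
    "dog": 0.5,
    "animal": 0.5,
    "physical object": 1.0,
}

def path_to_root(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def absolute_prob(node):
    """Multiply the conditional probabilities along the path to the root."""
    prob = 1.0
    for name in path_to_root(node):
        prob *= COND[name]
    return prob

print(path_to_root("terrier"))
print(absolute_prob("Norfolk terrier"))
```

`path_to_root` also gives the set of nodes whose labels are set to 1 when the ground truth is a given leaf.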

# YOLOv3

Incremental improvement on performance. 

New network structure that borrows residual connections from ResNet, so it can be deeper. Objects are detected at three scales (each with a different grid size), with three anchors per scale. Classes are scored with independent logistic classifiers instead of a softmax.

Output: 13*13 (grid size at scale 1) * 3 (anchors) + 26*26*3 + 52*52*3 = 10647 predictions (845 in V2). Each prediction has 4 + 1 + 80 = 85 dimensions, for the 80 classes in COCO.
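The prediction counts can be checked with a few lines of arithmetic. The 416*416 input size and the strides of 32, 16 and 8 are assumptions that reproduce the grid sizes quoted above; `num_predictions` is a hypothetical helper.

```python
def num_predictions(grid_sizes, anchors_per_cell):
    """Total number of predicted boxes over all square grids."""
    return sum(g * g * anchors_per_cell for g in grid_sizes)

input_size = 416                                  # assumed input resolution
grids = [input_size // s for s in (32, 16, 8)]    # one grid per scale

v2_boxes = num_predictions([13], 5)
v3_boxes = num_predictions(grids, 3)
values_per_box = 4 + 1 + 80                       # box, confidence, COCO classes

print(grids)
print(v2_boxes, v3_boxes)
print(v3_boxes * values_per_box)                  # total output values in V3
```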

# Single Shot MultiBox Detector (SSD)

Also a one-stage detector, but it predicts with conv layers instead of the fully connected layers used in YOLO. The SSD paper reports higher mAP than YOLO and Faster R-CNN, with lower computation time.

After the VGG network, add 5 conv layers with different feature map sizes to extract multi-scale features. Each feature map is divided into an m*n grid, and each grid cell outputs K default boxes (like anchors), each with c class scores and 4 box offsets. This takes K*(c+4) 3*3 conv filters per feature map; applied at every location, they produce m*n*K*(c+4) outputs. The predictions from all feature maps are concatenated.
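Because a convolution shares its filters across all locations, the number of filters in a prediction head does not depend on m and n, while the number of outputs does. The sketch below computes both for one feature map; the 19*19 map with 6 default boxes and 21 classes is an illustrative choice, and `ssd_head_size` is a hypothetical helper.

```python
def ssd_head_size(m, n, k, c):
    """Return (number of 3*3 filters, number of outputs) for one m x n feature map.

    k: default boxes per location, c: number of classes.
    """
    filters = k * (c + 4)      # shared across all m * n locations
    outputs = m * n * filters  # one set of filter responses per location
    return filters, outputs

filters, outputs = ssd_head_size(m=19, n=19, k=6, c=21)
print(filters, outputs)
```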



# Deep MultiBox

Uses a CNN to predict bounding boxes, but still requires further classification of each image patch.

# OverFeat

Trains a CNN to perform localization and adapts it for detection. Detection performance is not optimized directly. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

# MultiGrasp

MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping.


Yolo paper
