R-CNN was proposed by Ross Girshick et al. in 2014 to address the problem of efficient object localization in object detection. Earlier methods relied on exhaustive search, which slides windows of different scales across the image to generate region proposals. Instead, R-CNN uses the selective search algorithm, which combines segmentation of objects with exhaustive search to determine region proposals efficiently. Selective search produces approximately 2000 region proposals per image. These are then passed to a CNN model (here, AlexNet).
The CNN outputs a (1, 4096) feature vector for each region proposal. This vector is then passed to an SVM model for object classification and to a bounding-box regressor for localization.
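To make the localization step more concrete, here is a minimal sketch of how a bounding-box regressor's output can be turned into a refined box. It assumes the parameterization used in the R-CNN paper: the regressor predicts four offsets relative to the proposal, with the centre shift scaled by the proposal size and the width/height change expressed in log-space. The function names `box_deltas` and `apply_box_deltas` are illustrative, not taken from any library.

```python
import math

def box_deltas(proposal, target):
    """Regression targets for one proposal / ground-truth pair.

    Both boxes are (x_center, y_center, width, height).
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = target
    return (
        (gx - px) / pw,       # horizontal shift, relative to proposal width
        (gy - py) / ph,       # vertical shift, relative to proposal height
        math.log(gw / pw),    # log of the width ratio
        math.log(gh / ph),    # log of the height ratio
    )

def apply_box_deltas(proposal, deltas):
    """Inverse of box_deltas: refine a proposal with predicted offsets."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (
        px + pw * dx,         # move the centre
        py + ph * dy,
        pw * math.exp(dw),    # rescale; exp keeps the size positive
        ph * math.exp(dh),
    )

proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 46.0, 30.0, 20.0)
deltas = box_deltas(proposal, ground_truth)
refined = apply_box_deltas(proposal, deltas)   # recovers ground_truth
```

During training the regressor learns to predict `deltas` from the CNN features; at test time only `apply_box_deltas` is needed.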
Problems with R-CNN:
- Each image requires about 2000 region proposals to be classified, so training the network takes a lot of time.
- It requires about 49 seconds to detect the objects in an image on a GPU.
- Storing the feature maps of the region proposals also requires a lot of disk space.
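A rough back-of-the-envelope calculation shows why disk space becomes an issue. The sketch below uses the 2000 proposals and 4096-dimensional vectors mentioned above; the 32-bit float storage and the 5,000-image dataset size are assumptions chosen only for illustration.

```python
PROPOSALS_PER_IMAGE = 2000   # from the text
FEATURE_DIM = 4096           # from the text
BYTES_PER_VALUE = 4          # assumption: uncompressed 32-bit floats

def feature_cache_bytes(num_images):
    """Bytes needed to cache one feature vector per proposal."""
    return num_images * PROPOSALS_PER_IMAGE * FEATURE_DIM * BYTES_PER_VALUE

print(f"{feature_cache_bytes(1) / 1e6:.1f} MB per image")             # 32.8 MB
print(f"{feature_cache_bytes(5000) / 1e9:.1f} GB for 5,000 images")   # 163.8 GB
```

Even under these simple assumptions, a few thousand images already need well over a hundred gigabytes.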
Fast R-CNN:
In R-CNN, each region proposal is passed through the CNN one by one, and selective search generates around 2000 region proposals per image. This makes R-CNN computationally expensive both to train and to test. Fast R-CNN was proposed to deal with this problem. It takes the whole image and the region proposals as input and processes them in a single forward pass of its CNN. It also combines the different parts of the pipeline (the ConvNet, RoI pooling, and the classification layer) into one architecture. This removes the need to store feature maps and so saves disk space. For classifying region proposals it uses a softmax layer instead of an SVM, which proved to be faster and more accurate.
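The key trick that lets one shared feature map serve all proposals is RoI pooling: each proposal, whatever its size, is max-pooled to a fixed-size grid so it can feed the same fully connected layers. The following is a simplified single-channel sketch in plain Python, not the paper's implementation. The function name `roi_max_pool`, the floor/ceil bin boundaries, and the 2x2 output grid are assumptions made for readability (the paper uses a larger grid, such as 7x7 with VGG16).

```python
import math

def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2-D feature map to an out_h x out_w grid.

    feature_map: list of rows (single channel)
    roi: (row0, col0, row1, col1) in feature-map cells, end-exclusive
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Row range of bin i; floor/ceil guarantees a non-empty bin
        rs = r0 + (i * h) // out_h
        re = r0 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = c0 + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled

fmap = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
print(roi_max_pool(fmap, (0, 0, 4, 6)))  # [[9, 12], [21, 24]]
print(roi_max_pool(fmap, (1, 1, 4, 4)))  # [[15, 16], [21, 22]]
```

Both regions differ in size, yet both come out as 2x2 grids, which is what allows the classification and regression layers to have a fixed input size.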
Fast R-CNN drastically reduces both training time (8.75 hrs vs 84 hrs) and detection time compared to R-CNN. It also improves mean Average Precision (mAP) marginally compared to R-CNN.
Problems with Fast R-CNN:
- Most of the time Fast R-CNN spends during detection goes to generating region proposals with selective search. This is the bottleneck of the architecture, and it was addressed in Faster R-CNN.
Faster R-CNN:
Faster R-CNN was introduced in 2015 by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. In Fast R-CNN, the bottleneck of the architecture is selective search: because it has to generate about 2000 proposals per image, it accounts for a major part of the running time of the whole architecture. Faster R-CNN replaces it with a region proposal network (RPN). The image is first passed through a backbone network, which generates a convolutional feature map. This feature map is then passed to the region proposal network, which slides a small window over it and places a set of anchors at each window position (reference boxes centred on the window, each with its own scale and aspect ratio). These anchors are then passed to a classification layer (which predicts whether an anchor contains an object or not) and a regression layer (which localizes the bounding box associated with the object).
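Anchor generation can be sketched in a few lines. The example below assumes the defaults reported in the Faster R-CNN paper (three scales of 128, 256, and 512 pixels, three aspect ratios, and a feature-map stride of 16). The function name `generate_anchors` and the choice to place each anchor at the centre of its feature-map cell are illustrative assumptions.

```python
import math

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes in image coordinates."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell, projected back onto the image
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio = height / width; the area stays scale ** 2
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = generate_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 54 = 2 * 3 positions * 9 anchors each
```

Anchors may extend beyond the image border; how those are handled (clipped or ignored) is a separate design decision not covered by this sketch.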
In terms of detection time, Faster R-CNN is faster than both R-CNN and Fast R-CNN. It also achieves a better mAP than both of its predecessors.
Comparison between R-CNN, Fast R-CNN and Faster R-CNN:

| | R-CNN | Fast R-CNN | Faster R-CNN |
|---|---|---|---|
| Method for generating region proposals | Selective Search | Selective Search | Region Proposal Network |
| mAP on Pascal VOC 2007 test dataset (%) | 58.5 | 66.9 (when trained with VOC 2007 only); 70.0 (when trained with VOC 2007 and 2012 both) | 69.9 (when trained with VOC 2007 only); 73.2 (when trained with VOC 2007 and 2012 both); 78.8 (when trained with VOC 2007 and 2012 and COCO) |
| mAP on Pascal VOC 2012 test dataset (%) | 53.3 | 65.7 (when trained with VOC 2012 only); 68.4 (when trained with VOC 2007 and 2012 both) | 67.0 (when trained with VOC 2012 only); 70.4 (when trained with VOC 2007 and 2012 both); 75.9 (when trained with VOC 2007 and 2012 and COCO) |
| Detection time (sec) | ~49 (with region proposal generation) | ~2.32 (with region proposal generation) | 0.2 (with VGG); 0.059 (with ZF) |

The above detection time results are from the research papers. They can vary depending upon the machine configuration.
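To put the timing figures in perspective, the short script below converts them into speedup factors and images per second. It uses only the numbers quoted in this article; the helper name `speedup` is illustrative.

```python
# Figures quoted in the article
DETECTION_TIME_S = {
    "R-CNN": 49.0,
    "Fast R-CNN": 2.32,
    "Faster R-CNN (VGG)": 0.2,
    "Faster R-CNN (ZF)": 0.059,
}
TRAINING_TIME_H = {"R-CNN": 84.0, "Fast R-CNN": 8.75}

def speedup(baseline, new):
    """How many times faster `new` is than `baseline` (same units)."""
    return baseline / new

for name, seconds in DETECTION_TIME_S.items():
    factor = speedup(DETECTION_TIME_S["R-CNN"], seconds)
    print(f"{name:<20} {seconds:>6.3f} s/image "
          f"{1 / seconds:>6.2f} images/s {factor:>6.1f}x vs R-CNN")

train = speedup(TRAINING_TIME_H["R-CNN"], TRAINING_TIME_H["Fast R-CNN"])
print(f"Fast R-CNN trains {train:.1f}x faster than R-CNN")
```

With these figures, Fast R-CNN detects roughly 21 times faster than R-CNN, and Faster R-CNN with VGG is about 245 times faster.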