
Fast R-CNN | ML

Before discussing Fast R-CNN, let’s look at the challenges faced by R-CNN.

Fast R-CNN works to solve these problems. Let’s look at the architecture of Fast R-CNN.



Fast R-CNN architecture

First, a selective search algorithm generates the region proposals, up to approximately 2000 per image. The input image is passed through a CNN once, which produces a convolutional feature map, and each region proposal is projected onto that feature map (the RoI projection). Then, for each proposal, a Region of Interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is passed through fully connected layers to two sibling output layers: a softmax classifier, which classifies the region proposal, and a bounding-box (Bbox) regressor, which refines the position of the bounding box for that object.
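The two sibling outputs can be illustrated with a minimal sketch. It assumes the box parameterization from the original R-CNN paper, where a proposal is given as (center x, center y, width, height) and the regressor predicts offsets (tx, ty, tw, th). The function names `softmax` and `refine_box` and the sample numbers are hypothetical, chosen only for illustration.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities (index 0 is assumed to be background)."""
    highest = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def refine_box(proposal, offsets):
    """Apply predicted offsets to a proposal (assumed R-CNN style parameterization)."""
    cx, cy, w, h = proposal
    tx, ty, tw, th = offsets
    # The center shifts are scaled by the proposal size; sizes are scaled in log space.
    return (cx + w * tx, cy + h * ty, w * math.exp(tw), h * math.exp(th))

# Hypothetical outputs for one RoI: scores for background plus two object classes.
probs = softmax([1.0, 3.0, 0.5])
predicted_class = probs.index(max(probs))
refined = refine_box((50.0, 40.0, 20.0, 10.0), (0.1, -0.2, 0.0, 0.0))
```

Here the classifier picks class 1, and the regressor moves the box center by a fraction of the proposal's width and height while leaving its size unchanged.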

CNN Network of Fast R-CNN

Fast R-CNN was evaluated with three pre-trained ImageNet networks, each with 5 max-pooling layers and between 5 and 13 convolution layers (such as VGG-16). Some changes are made to the pre-trained network. These changes are:



VGG-16 architecture

The convolutional part of this network takes the image (size = 224 x 224 x 3 for VGG-16) and outputs the convolutional feature map (size = 14 x 14 x 512 for VGG-16). The region proposals are then applied to this feature map.
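The 14 x 14 size can be checked with a little arithmetic. The sketch below assumes that the VGG-16 convolutions preserve spatial size, that each 2 x 2 max-pooling layer halves it, and that only four pooling layers precede the feature map. The helper names and the floor-based rounding of RoI coordinates are assumptions for illustration; implementations differ in how they round.

```python
def feature_map_size(input_size, num_pool_layers, pool_stride=2):
    """Spatial size after a stack of stride-2 pooling layers (convs assumed size-preserving)."""
    size = input_size
    for _ in range(num_pool_layers):
        size //= pool_stride
    return size

def project_roi(box, stride):
    """Map an image-space box (x1, y1, x2, y2) onto the feature map by dividing by the stride."""
    return tuple(coord // stride for coord in box)

size = feature_map_size(224, num_pool_layers=4)   # 224 / 2**4
stride = 224 // size                              # total downsampling factor
roi_on_feature_map = project_roi((32, 48, 160, 128), stride)
```

With these assumptions `size` is 14 and `stride` is 16, so a hypothetical proposal with corners (32, 48) and (160, 128) in the image lands on (2, 3) and (10, 8) in the feature map.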

Region of Interest (RoI) pooling:

(Source: Fast R-CNN slides)

RoI pooling is a layer that was introduced in the Fast R-CNN paper. Its purpose is to produce uniform, fixed-size feature maps from non-uniform inputs (RoIs). It takes two values as inputs:

Suppose we have an 8 x 8 feature map and need to extract an output of size 2 x 2. We follow the steps below.

 

Suppose the RoI's top-left corner is at (0, 3) and its height and width are 5 and 7.

 

We need to convert this region proposal into a 2 x 2 output block, but the dimensions of the RoI (5 x 7) are not evenly divisible by the output dimensions. We therefore divide the RoI into a 2 x 2 grid of unequal sections, for example by splitting the height of 5 into 2 and 3 rows and the width of 7 into 3 and 4 columns.

 

Now we apply the max pooling operator to select the maximum value from each of the sections we divided the RoI into.

Max pooling output
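The steps above can be sketched in plain Python. This is a minimal illustration, not the layer as implemented in any particular framework. It assumes (0, 3) means column 0 and row 3, and it uses floor-rounded section boundaries so the sections do not overlap. The function name `roi_max_pool` and the feature-map values are made up for the example.

```python
def roi_max_pool(feature_map, x, y, height, width, out_h=2, out_w=2):
    """Max-pool one RoI of a 2D feature map into an out_h x out_w grid."""
    # Crop the RoI: rows y .. y+height-1, columns x .. x+width-1.
    roi = [row[x:x + width] for row in feature_map[y:y + height]]
    # Floor-rounded section boundaries, e.g. height 5 -> [0, 2, 5], width 7 -> [0, 3, 7].
    row_edges = [(i * height) // out_h for i in range(out_h + 1)]
    col_edges = [(j * width) // out_w for j in range(out_w + 1)]
    pooled = []
    for i in range(out_h):
        out_row = []
        for j in range(out_w):
            section = [
                value
                for row in roi[row_edges[i]:row_edges[i + 1]]
                for value in row[col_edges[j]:col_edges[j + 1]]
            ]
            out_row.append(max(section))       # keep the largest activation per section
        pooled.append(out_row)
    return pooled

# Hypothetical 8 x 8 feature map whose entries are 8 * row + column.
feature_map = [[8 * r + c for c in range(8)] for r in range(8)]
pooled = roi_max_pool(feature_map, x=0, y=3, height=5, width=7)
```

For this feature map, `pooled` is `[[34, 38], [58, 62]]`. The sketch does not handle RoIs smaller than the output grid, which would produce empty sections.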

Training and Loss Function

Each training region of interest is labeled with a ground-truth class u and a ground-truth bounding box v. We take the outputs of the softmax classifier and the bounding-box regressor and apply the loss function to them. The loss function is defined so that it accounts for both classification and bounding-box localization. This loss function is called the multi-task loss. It is defined as follows:

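$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda [u \geq 1] L_{loc}(t^u, v)$$

Multi-task loss (p is the predicted class probability distribution and t^u is the predicted bounding-box offset for class u)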

where L_cls is the classification loss and L_loc is the localization loss. lambda is a balancing parameter, and [u ≥ 1] is an indicator function (its value is 0 when u = 0, the background class, and 1 otherwise) that makes sure the localization loss is only calculated when there is a bounding box to define. Here, L_cls is the log loss for the true class u, and L_loc is defined as

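$$L_{loc}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t^u_i - v_i), \qquad \mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

Loss function of Fast R-CNN model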
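As a rough sketch, the multi-task loss for a single RoI can be computed as below. It assumes the formulation from the Fast R-CNN paper: a log loss on the true class, plus a smooth L1 localization loss summed over the four box coordinates and switched off for background RoIs. The function names and the sample values are hypothetical.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for larger errors."""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5

def multi_task_loss(p, u, t_u, v, lam=1.0):
    """p: class probabilities, u: true class (0 = background),
    t_u: predicted box offsets for class u, v: ground-truth box targets."""
    l_cls = -math.log(p[u])                               # log loss for the true class
    l_loc = sum(smooth_l1(t - g) for t, g in zip(t_u, v))
    indicator = 1.0 if u >= 1 else 0.0                    # no box loss for background
    return l_cls + lam * indicator * l_loc

p = [0.1, 0.7, 0.2]                                       # hypothetical softmax output
loss_object = multi_task_loss(p, 1, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
loss_background = multi_task_loss(p, 0, (0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))
```

For the object RoI the localization term contributes 0.125 + 1.5 on top of the log loss. For the background RoI only the classification term remains.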

Results and Conclusion

Performance comparison between R-CNN and Fast R-CNN

Advantages of Fast R-CNN over R-CNN
