Faster R-CNN and YOLO are good at detecting the objects in an input image. They also have low detection times and can be used in real-time systems. However, object detection alone leaves one challenge unaddressed: the bounding boxes generated by YOLO and Faster R-CNN give no indication of the shape of the object.
Instance segmentation addresses this. It identifies each instance (each occurrence of an object present in the image) and colours the pixels of each instance differently. It works by classifying each pixel location and generating a segmentation mask for each of the objects in the image. This approach tells us more about the objects in the image because it preserves the shape of those objects while recognizing them.
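The difference between a box and a mask can be seen on a toy example. The sketch below is illustrative only: the two 4x4 binary masks and the helper `bounding_box` are made up for this example, and it uses plain Python lists rather than any detection library.

```python
def bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max) of the nonzero pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]


# Two hypothetical objects: a filled square and a thin diagonal stroke.
square = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
diagonal = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Both objects yield the same box, yet they cover very different pixels.
same_box = bounding_box(square) == bounding_box(diagonal)
square_area = sum(map(sum, square))
diagonal_area = sum(map(sum, diagonal))
```

Here the box alone cannot tell the two objects apart, while the per-pixel masks can.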
Mask R-CNN architecture:
Mask R-CNN was proposed by Kaiming He et al. in 2017. It is very similar to Faster R-CNN, except that it adds a branch to predict segmentation masks. The region proposal stage is the same in both architectures. The second stage, in parallel with predicting the class and the bounding box, also outputs a binary mask for each RoI.
It comprises:
- Backbone Network
- Region Proposal Network
- Mask Representation
- RoI Align
The authors of Mask R-CNN experimented with two kinds of backbone network. The first is the standard ResNet architecture (ResNet-C4) and the other is ResNet with a feature pyramid network (ResNet-FPN). The ResNet-C4 backbone is similar to that of Faster R-CNN, while ResNet-FPN introduces a modification: RoIs are generated from multiple layers. This multi-layer feature pyramid network generates RoIs at different scales, which improves accuracy over the plain ResNet backbone.
At every layer the feature map size is reduced by half and the number of feature maps is doubled. We take the output from four layers (layers 1, 2, 3 and 4). To generate the final feature maps, we use an approach called the top-down pathway. We start from the top feature map (w/32, h/32, 256) and work our way down to the bigger ones through upsampling operations. Before the merge, we also apply a 1×1 convolution to bring the number of channels of each layer's output down to 256. This is then added element-wise to the upsampled output from the previous iteration. All the outputs are passed through a 3×3 convolution layer to create the final four feature maps (P2, P3, P4, P5). The fifth feature map (P6) is generated by a max pooling operation on P5.
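The shape bookkeeping and the merge step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes the common ResNet-50/101 layout, in which the first of the four tapped layers (called C2 here) sits at stride 4 with 256 channels. The names `pyramid_shapes`, `upsample2x` and `top_down_merge` are hypothetical, and the merge works on a single channel with nearest-neighbour upsampling.

```python
def pyramid_shapes(width, height, c2_channels=256, fpn_channels=256):
    """Return (w, h, channels) for bottom-up maps C2..C5 and pyramid maps P2..P6."""
    bottom_up = {}
    stride, channels = 4, c2_channels  # assumption: C2 sits at stride 4
    for level in range(2, 6):
        bottom_up[f"C{level}"] = (width // stride, height // stride, channels)
        stride, channels = stride * 2, channels * 2  # halve size, double channels
    # After the 1x1 and 3x3 convolutions every pyramid level has fpn_channels.
    pyramid = {
        name.replace("C", "P"): (w, h, fpn_channels)
        for name, (w, h, _) in bottom_up.items()
    }
    w5, h5, _ = pyramid["P5"]
    pyramid["P6"] = (w5 // 2, h5 // 2, fpn_channels)  # stride-2 pooling of P5
    return bottom_up, pyramid


def upsample2x(grid):
    """Nearest-neighbour 2x upsampling of a single-channel 2D list."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(coarser, lateral):
    """Add the upsampled coarser map to the lateral map, element-wise.

    `lateral` stands for a layer output after its 1x1 convolution.
    """
    up = upsample2x(coarser)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


bottom_up, pyramid = pyramid_shapes(1024, 1024)
merged = top_down_merge([[1, 2], [3, 4]], [[1] * 4 for _ in range(4)])
```

For a 1024×1024 input, the top bottom-up map comes out at 32×32 (w/32, h/32), and every pyramid level carries 256 channels.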
Region Proposal Network:
Every convolutional feature map generated by the previous layer is passed through a 3×3 convolution layer. The output is then passed into two parallel branches that determine the objectness score and regress the bounding box coordinates.
Here, we use only one anchor scale per pyramid level and 3 anchor ratios (because the feature pyramid already provides feature maps of different sizes to check for objects of different sizes).
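To make the anchor setup concrete, here is a rough sketch of how anchor shapes and anchor counts could be computed for a pyramid. The scales (32 to 512 pixels, one per level), the strides (4 to 64) and the ratios (0.5, 1, 2) are assumptions taken from common FPN configurations, not values stated in this article. The function names are hypothetical.

```python
import math


def anchor_shapes(scale, ratios=(0.5, 1.0, 2.0)):
    """Return (width, height) pairs with area scale**2 and height/width = ratio."""
    shapes = []
    for ratio in ratios:
        width = scale / math.sqrt(ratio)
        height = scale * math.sqrt(ratio)
        shapes.append((width, height))
    return shapes


def anchors_per_level(image_size, strides=(4, 8, 16, 32, 64),
                      ratios=(0.5, 1.0, 2.0)):
    """Count anchors on each pyramid level of a square image.

    One anchor per ratio is placed at every feature-map cell.
    """
    counts = []
    for stride in strides:
        cells = (image_size // stride) ** 2
        counts.append(cells * len(ratios))
    return counts


LEVEL_SCALES = (32, 64, 128, 256, 512)  # assumed: one scale per level P2..P6
shapes_p2 = anchor_shapes(LEVEL_SCALES[0])
counts = anchors_per_level(1024)
total_anchors = sum(counts)
```

Because each level handles a single scale, the two RPN branches only need to predict 3 objectness scores and 3 sets of box offsets per feature-map cell.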
A mask contains spatial information about the object. Thus, unlike the classification and bounding box regression layers, we cannot collapse the output into a fully connected layer, since the mask requires pixel-to-pixel correspondence with the layer above. Mask R-CNN uses a fully convolutional network to predict the mask. This ConvNet takes an RoI as input and outputs the m×m mask representation. We also upscale this mask for inference on the input image and reduce the channels to 256 using a 1×1 convolution.
To generate the input for this fully convolutional network that predicts the mask, we use RoIAlign. The purpose of RoIAlign is to convert the different-sized feature maps of the regions proposed by the region proposal network into fixed-size feature maps. The Mask R-CNN paper suggested two variants of the architecture. In one variant, the input of the mask generation CNN is passed after RoIAlign is applied (ResNet-C4), but in the other variant, the input is passed just before the fully connected layer (FPN network).
This mask generation branch is a fully convolutional network and it outputs K masks of size m×m, where K is the number of classes (one mask for each class), with m = 14 for ResNet-C4 and m = 28 for ResNet-FPN.
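As a rough illustration of how the K × m × m output might be turned into a mask for one detection, the sketch below picks the map of the predicted class, applies a per-pixel sigmoid, resizes it to the RoI size and binarises it at 0.5, following the inference procedure described in the Mask R-CNN paper. The nearest-neighbour resize is a simplification chosen for brevity. The function names and the toy logits (K = 2, m = 2) are made up for the example.

```python
import math


def sigmoid(x):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-x))


def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2D list to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    return [[grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def mask_for_detection(mask_logits, class_id, roi_h, roi_w, threshold=0.5):
    """Select the predicted class's m x m map and turn it into a binary RoI mask."""
    probabilities = [[sigmoid(v) for v in row] for row in mask_logits[class_id]]
    resized = resize_nearest(probabilities, roi_h, roi_w)
    return [[1 if p >= threshold else 0 for p in row] for row in resized]


# Toy output with K = 2 classes and m = 2 (real models use m = 14 or 28).
toy_logits = [
    [[-2.0, -2.0], [-2.0, -2.0]],  # class 0
    [[3.0, -1.0], [2.0, 0.5]],     # class 1
]
binary_mask = mask_for_detection(toy_logits, class_id=1, roi_h=4, roi_w=4)
```

Only the map of the predicted class is used, so the masks of the other K - 1 classes do not influence the result.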
RoIAlign has the same purpose as RoIPool: to generate fixed-size regions of interest from region proposals. It works in the following steps:
Given the feature map of the previous convolution layer of size h×w, divide this feature map into an M×N grid of equal-sized cells (the cell boundaries are NOT rounded to integer values).
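A minimal single-channel sketch of this idea follows. It assumes the procedure described in the Mask R-CNN paper: each grid cell is sampled at a few regularly spaced fractional locations, each sample is computed by bilinear interpolation, and the samples are averaged. The function names, the coordinate convention (the value at index (y, x) sits at point (y, x)) and the default of 2×2 samples per cell are choices made for this example.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x) point."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_h, out_w, samples=2):
    """Average bilinear samples inside each cell of an out_h x out_w grid.

    roi = (y_start, x_start, y_end, x_end) in feature-map coordinates.
    The coordinates are kept fractional; nothing is rounded.
    """
    y_start, x_start, y_end, x_end = roi
    cell_h = (y_end - y_start) / out_h
    cell_w = (x_end - x_start) / out_w
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    y = y_start + (i + (sy + 0.5) / samples) * cell_h
                    x = x_start + (j + (sx + 0.5) / samples) * cell_w
                    total += bilinear(feature, y, x)
            row.append(total / (samples * samples))
        output.append(row)
    return output


# Toy 5x5 feature map with value 10*y + x, and a fractional RoI.
ramp = [[10.0 * y + x for x in range(5)] for y in range(5)]
pooled = roi_align(ramp, roi=(0.5, 0.5, 3.5, 3.5), out_h=2, out_w=2)
```

Because the cell boundaries stay fractional, the pooled values track the RoI exactly instead of snapping to the nearest feature-map cell as RoIPool does.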
Mask R-CNN runs inference at around 2 fps, which is good considering the addition of the segmentation branch to the architecture.
Because of its additional capability to generate segmentation masks, it is used in many computer vision applications, such as:
- Human Pose Estimation
- Self Driving Car
- Drone Image Mapping etc.