OpenPose : Human Pose Estimation Method

OpenPose is the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (135 keypoints in total) on single images. It was proposed by researchers at Carnegie Mellon University, who have released it as Python code, a C++ implementation, and a Unity plugin. These resources can be downloaded from the OpenPose repository.

Architecture:

In the first step, the image is passed through a baseline CNN to extract feature maps of the input. In the paper, the authors used the first 10 layers of the VGG-19 network.
The feature maps are then processed by a multi-stage CNN pipeline to generate two outputs:
- Part Confidence Maps
- Part Affinity Fields
In the last step, the Confidence Maps and Part Affinity Fields that are generated above are processed by a greedy bipartite matching algorithm to obtain the poses for each person in the image.

Confidence Maps and Part Affinity Fields

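Confidence Maps: A Confidence Map is a 2D representation of the belief that a particular body part is located at any given pixel. The set of Confidence Maps is described by the following equation:

$$S = (S_1, S_2, \ldots, S_J), \quad S_j \in \mathbb{R}^{w \times h}, \quad j \in \{1, \ldots, J\}$$

where J is the number of body part locations and w × h is the size of the input image.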

Part Affinity Fields: Part Affinity Fields (PAFs) are a set of 2D vector fields that encode the location and orientation of the limbs of the different people in the image. They encode this data as pairwise connections between body parts.

Multi Stage CNN:

The multi-stage CNN architecture has three major steps:

The first set of stages predicts and then refines the Part Affinity Fields L_t from the feature maps F of the base network.

The second set of stages uses the Part Affinity Fields output by the previous stages to predict and refine the Confidence Maps.

The final S (Confidence Maps) and L (Part Affinity Fields) are then passed to the greedy algorithm for further processing.
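The data flow between the stages can be sketched as follows. This is a minimal illustration, not the authors' network: `make_stub_stage` is a hypothetical stand-in for a block of convolutions, the channel counts are arbitrary, and the sketch assumes each stage receives the base features concatenated with the most recent PAF (and, in the second set, Confidence Map) prediction.

```python
import numpy as np

def make_stub_stage(out_channels):
    """Hypothetical stand-in for a CNN stage: maps (C, H, W) to (out_channels, H, W)."""
    def stage(x):
        mean_map = x.mean(axis=0, keepdims=True)          # (1, H, W)
        return np.repeat(mean_map, out_channels, axis=0)  # (out_channels, H, W)
    return stage

def run_stages(features, paf_stages, cm_stages):
    """Run PAF stages first, then Confidence Map stages (assumes at least one PAF stage)."""
    paf_outputs, cm_outputs = [], []
    paf = None
    for stage in paf_stages:
        # First PAF stage sees only F; later ones see F plus the previous PAF.
        inputs = [features] if paf is None else [features, paf]
        paf = stage(np.concatenate(inputs, axis=0))
        paf_outputs.append(paf)
    cm = None
    for stage in cm_stages:
        # Confidence Map stages see F, the final PAF, and the previous map if any.
        inputs = [features, paf] if cm is None else [features, paf, cm]
        cm = stage(np.concatenate(inputs, axis=0))
        cm_outputs.append(cm)
    return paf_outputs, cm_outputs

num_limbs, num_parts = 3, 4            # illustrative sizes only
features = np.random.rand(8, 16, 16)   # stand-in for the base network output F
paf_stages = [make_stub_stage(2 * num_limbs) for _ in range(2)]  # x and y per limb
cm_stages = [make_stub_stage(num_parts) for _ in range(2)]
paf_outputs, cm_outputs = run_stages(features, paf_stages, cm_stages)
```

The per-stage outputs are kept in lists because every stage's prediction is supervised during training, not just the last one.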

Loss functions:

An L2 loss is used to measure the difference between the predicted Confidence Maps and Part Affinity Fields and the ground-truth maps and fields. At PAF stage t_i and Confidence Map stage t_k, the losses are:

$$f_L^{t_i} = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \lVert L_c^{t_i}(p) - L_c^*(p) \rVert_2^2$$

$$f_S^{t_k} = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \lVert S_j^{t_k}(p) - S_j^*(p) \rVert_2^2$$

where L_c^* is the ground-truth Part Affinity Field, S_j^* is the ground-truth part Confidence Map, and W is a binary mask with W(p) = 0 when the annotation is missing at pixel p. The mask avoids penalizing correct predictions at locations that were never annotated. Intermediate supervision at each stage addresses the vanishing gradient problem by replenishing the gradient periodically.
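A masked L2 loss of this kind can be sketched in NumPy. The sketch assumes each stage's loss is the sum, over channels and pixels, of W(p) times the squared difference between prediction and ground truth, and that the total loss adds up the losses of all stages. The function names and array layout are illustrative choices, not taken from the OpenPose code.

```python
import numpy as np

def masked_l2_loss(pred, target, mask):
    """pred, target: (channels, H, W); mask: (H, W), 0 where annotation is missing."""
    sq_err = (pred - target) ** 2
    # Broadcasting the mask over channels zeroes out unannotated pixels.
    return float(np.sum(mask[None, :, :] * sq_err))

def total_loss(paf_preds, paf_gt, cm_preds, cm_gt, mask):
    """Sum the masked loss over every PAF stage and every Confidence Map stage."""
    loss = sum(masked_l2_loss(p, paf_gt, mask) for p in paf_preds)
    loss += sum(masked_l2_loss(s, cm_gt, mask) for s in cm_preds)
    return loss

pred = np.ones((2, 2, 2))
target = np.zeros((2, 2, 2))
mask = np.array([[1.0, 0.0],
                 [1.0, 1.0]])             # one pixel has no annotation
stage_loss = masked_l2_loss(pred, target, mask)
overall = total_loss([pred, pred], target, [pred], target, mask)
```

Storing the x and y components of each PAF as separate channels makes the channel sum equal to the squared vector norm, so the same function serves both outputs.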

Confidence Maps:

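The Confidence Map for each person k and each body part j is defined by:

$$S_{j,k}^*(p) = \exp\left(-\frac{\lVert p - x_{j,k} \rVert_2^2}{\sigma^2}\right)$$

where x_{j,k} is the ground-truth position of body part j for person k. It is a Gaussian-shaped peak with a gradual falloff, where sigma controls the spread of the peak. The ground-truth map that the network learns to predict is an aggregation of the individual Confidence Maps via a max operator:

$$S_j^*(p) = \max_k S_{j,k}^*(p)$$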
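A ground-truth Confidence Map for one body part can be sketched as below, assuming each person contributes a peak of the form exp(-||p - x||² / σ²) and the peaks are combined with a pixel-wise max. The function name, image size, and sigma value are illustrative.

```python
import numpy as np

def confidence_map(keypoints, height, width, sigma=2.0):
    """Ground-truth map for one body part; keypoints holds one (x, y) per person."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float64)
    for x, y in keypoints:
        sq_dist = (xs - x) ** 2 + (ys - y) ** 2
        person_map = np.exp(-sq_dist / sigma ** 2)
        # Max (not sum or average) keeps nearby peaks distinct.
        heatmap = np.maximum(heatmap, person_map)
    return heatmap

# Two people whose keypoints for this part are 4 pixels apart.
heatmap = confidence_map([(10, 12), (14, 12)], height=32, width=32)
```

Taking the max rather than the average means that two people standing close together still produce two separate peaks of height 1 instead of one blurred, lower peak.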

Part Affinity Fields:

Part Affinity Fields are needed especially in multi-person pose detection, where each detected body part must be assigned to the correct person. With multiple people there are multiple heads, hands, shoulders, and so on, and it can be difficult to tell them apart when they are closely grouped together. A PAF provides a connection between the different body parts that belong to the same person. A stronger PAF link between two body parts indicates a higher chance that they belong to the same person.

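The ground-truth PAF for limb c of person k at a point p is:

$$L_{c,k}^*(p) = \begin{cases} v & \text{if } p \text{ is on limb } c, k \\ 0 & \text{otherwise} \end{cases}$$

That is, if p is on the limb, L* is the unit vector v that points from one body part of the limb to the other, v = (x_{j_2,k} - x_{j_1,k}) / ||x_{j_2,k} - x_{j_1,k}||_2; otherwise it is 0.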
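The rule above can be sketched for a single limb of a single person. The sketch assumes a pixel counts as "on the limb" when its projection falls between the two joints and it lies within a distance threshold of the segment; `limb_width` is a hypothetical name for that threshold.

```python
import numpy as np

def limb_paf(x1, x2, height, width, limb_width=1.0):
    """Ground-truth PAF of shape (2, H, W) for one limb from joint x1 to joint x2, both (x, y)."""
    x1 = np.asarray(x1, dtype=np.float64)
    x2 = np.asarray(x2, dtype=np.float64)
    length = np.linalg.norm(x2 - x1)
    v = (x2 - x1) / length                      # unit vector along the limb
    ys, xs = np.mgrid[0:height, 0:width]
    dx, dy = xs - x1[0], ys - x1[1]
    along = dx * v[0] + dy * v[1]               # projection onto the limb direction
    across = np.abs(-dx * v[1] + dy * v[0])     # distance from the limb axis
    on_limb = (along >= 0) & (along <= length) & (across <= limb_width)
    paf = np.zeros((2, height, width))
    paf[0][on_limb] = v[0]                      # x component
    paf[1][on_limb] = v[1]                      # y component
    return paf

# Horizontal limb from (2, 5) to (8, 5) in a 10 x 12 image.
paf = limb_paf((2, 5), (8, 5), height=10, width=12)
```

When several people overlap, the per-person fields would still need to be combined into one field per limb type, which this sketch leaves out.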

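The predicted Part Affinity Field L_c is integrated along the line segment to measure the association confidence for two candidate part locations d_{j_1} and d_{j_2}:

$$E = \int_{u=0}^{u=1} L_c(p(u)) \cdot \frac{d_{j_2} - d_{j_1}}{\lVert d_{j_2} - d_{j_1} \rVert_2} \, du$$

where p(u) = (1 - u) d_{j_1} + u d_{j_2} interpolates between the two candidates.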
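In practice a line integral like this is approximated by sampling. The sketch below assumes uniformly spaced samples along the segment, nearest-pixel lookup in the PAF, and the mean of the dot products as the score; the function name and `num_samples` are illustrative.

```python
import numpy as np

def association_score(paf, d1, d2, num_samples=10):
    """paf: (2, H, W) field for one limb type; d1, d2: candidate (x, y) locations."""
    d1 = np.asarray(d1, dtype=np.float64)
    d2 = np.asarray(d2, dtype=np.float64)
    direction = d2 - d1
    norm = np.linalg.norm(direction)
    if norm == 0:
        return 0.0
    unit = direction / norm
    us = np.linspace(0.0, 1.0, num_samples)
    points = (1 - us)[:, None] * d1 + us[:, None] * d2   # p(u) for each sample
    cols = np.rint(points[:, 0]).astype(int)
    rows = np.rint(points[:, 1]).astype(int)
    vectors = paf[:, rows, cols].T                       # (num_samples, 2)
    # Average alignment between the field and the candidate limb direction.
    return float(np.mean(vectors @ unit))

field = np.zeros((2, 10, 12))
field[0, 5, :] = 1.0                                     # field points in +x along row 5
aligned = association_score(field, (2, 5), (8, 5))
reversed_score = association_score(field, (8, 5), (2, 5))
crossing = association_score(field, (5, 2), (5, 8))
```

A candidate pair aligned with the field scores close to 1, a pair pointing the opposite way scores close to -1, and a pair that merely crosses the limb scores close to 0.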

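For multiple people, the total E needs to be maximized:

$$\max_{Z_c} E_c = \max_{Z_c} \sum_{m \in D_{j_1}} \sum_{n \in D_{j_2}} E_{mn} \cdot z_{j_1 j_2}^{mn}$$

$$\text{s.t.} \quad \forall m \in D_{j_1}: \sum_{n \in D_{j_2}} z_{j_1 j_2}^{mn} \le 1, \qquad \forall n \in D_{j_2}: \sum_{m \in D_{j_1}} z_{j_1 j_2}^{mn} \le 1$$

where D_{j_1} and D_{j_2} are the sets of candidate detections for the two body parts, E_{mn} is the association confidence between candidates m and n, and z_{j_1 j_2}^{mn} is 1 if the two candidates are connected and 0 otherwise. The constraints ensure that no two limbs of the same type share a body part.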

There are multiple approaches to connecting the body parts, as shown in the image below:

Association using the body part detections only.
Association by considering all edges, which generates a k-partite graph.
Association using a tree structure.
Association by generating a set of bipartite graphs that are matched with a greedy algorithm.
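For a single limb type, the bipartite matching between two sets of part candidates can be sketched with a simple greedy heuristic: sort candidate pairs by score and accept a pair only if neither endpoint is already used. This is an illustrative approximation under that assumption, not necessarily the exact matching procedure the authors use; `greedy_match` and `min_score` are hypothetical names.

```python
def greedy_match(scores, min_score=0.0):
    """scores[m][n] is the association score between candidate m of the first part
    and candidate n of the second part. Returns a list of (m, n, score)."""
    candidates = [
        (score, m, n)
        for m, row in enumerate(scores)
        for n, score in enumerate(row)
        if score > min_score
    ]
    candidates.sort(reverse=True)               # strongest connections first
    used_m, used_n, matches = set(), set(), []
    for score, m, n in candidates:
        # Each candidate part may belong to at most one limb of this type.
        if m not in used_m and n not in used_n:
            matches.append((m, n, score))
            used_m.add(m)
            used_n.add(n)
    return matches

# Two candidates for the first part, three for the second.
scores = [[0.9, 0.2, 0.0],
          [0.3, 0.8, 0.1]]
matches = greedy_match(scores, min_score=0.05)
```

Repeating this per limb type and merging limbs that share a part candidate would then assemble full-body poses.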

Changes from CMUPose:

CMUPose is the earlier version of OpenPose. It is the architecture that won the COCO 2016 keypoint detection challenge.

In the multi-stage CNN architecture that refines the Confidence Maps and Part Affinity Fields, each convolution of kernel size 7 is replaced by 3 convolutions of kernel size 3 whose outputs are concatenated at the end. This reduces the number of operations from 97 to 51. Concatenating the convolutions allows the network to keep both lower-level and higher-level features.
The authors also concluded that Part Affinity Field (PAF) refinement is more important than Confidence Map refinement and leads to higher accuracy even without Confidence Maps. Therefore the multi-stage CNN architecture first refines the PAFs and then the Confidence Maps.
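The operation counts of 97 and 51 quoted above are consistent with one simple counting convention, assumed here: a k × k kernel needs k² multiplications and k² - 1 additions per output value. The short check below also confirms that three stacked 3 × 3 convolutions cover the same 7 × 7 receptive field.

```python
def conv_ops(kernel_size):
    """Operations per output value: k*k multiplications plus k*k - 1 additions."""
    return 2 * kernel_size ** 2 - 1

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

single_7x7 = conv_ops(7)                   # one 7 x 7 convolution
three_3x3 = 3 * conv_ops(3)                # three 3 x 3 convolutions
field = stacked_receptive_field(3, 3)      # receptive field of the stack
```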

Foot Detection:

OpenPose also introduced foot keypoint detection, which makes it the first combined body and foot keypoint dataset and detector. With the added foot keypoints, it is also able to detect the ankles more accurately.

Vehicle Detection:

Similar to body pose detection, the authors of OpenPose also applied this algorithm to vehicle detection, where it records high Average Precision and Recall.

Results:

On the MPII Multi-Person dataset, OpenPose obtained state-of-the-art mAP on the 288-image subset as well as on the full testing set.

In the COCO keypoints challenge, there are two types of approaches. Top-down approaches detect each person first and then detect that person's keypoints, while bottom-up approaches such as OpenPose detect the keypoints first and then assemble them into person skeletons. A higher drop in accuracy is observed when considering only people of higher scales.

The table below records the Average Precision on the COCO keypoints dataset when using different numbers of PAF and Confidence Map (CM) stages. It can be concluded from the table that increasing the number of PAF stages increases the Average Precision and Average Recall, but the same is not true for Confidence Maps.

The graph below shows that the runtime of OpenPose is almost unaffected by the number of people present in the image, unlike top-down approaches such as Mask R-CNN and AlphaPose.

Caveats:

OpenPose has problems estimating poses that are atypical or upside down.
In highly crowded images where people overlap, the approach tends to merge annotations from different people while missing others, because the overlapping PAFs make the greedy multi-person parsing fail.
