Pose estimation refers to computer vision techniques that detect people or objects in images and video, making it possible to determine, for example, where someone's elbow appears in an image. Pose estimation has many applications, such as gesture control, action recognition, and augmented reality. In this article, we discuss PoseNet, which uses a convolutional neural network (CNN) model to regress pose from a single RGB image. It can also be used in real-time systems, running at roughly 5 ms per frame.
Deep Learning Regression Model:
The goal of the convolutional neural network (ConvNet) we train is to estimate camera pose directly from a monocular image, I. The network outputs a pose vector p, given by a 3-D camera position x and an orientation represented by quaternion q:

p = [x, q]
where pose p is defined relative to an arbitrary global reference frame. We chose quaternions as our orientation representation because arbitrary 4-D values are easily mapped to legitimate rotations by normalizing them to unit length. The loss function of our regressor is:

loss(I) = ‖x̂ − x‖₂ + β ‖q̂ − q/‖q‖‖₂
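The normalization step that maps an arbitrary 4-D regressor output to a legitimate rotation is simple; a minimal NumPy sketch:

```python
import numpy as np

def to_unit_quaternion(q):
    """Map an arbitrary 4-D vector to a valid rotation (unit quaternion)
    by dividing by its Euclidean norm."""
    q = np.asarray(q, dtype=float)
    return q / np.linalg.norm(q)

# Any raw 4-D regressor output becomes a legitimate rotation:
q = to_unit_quaternion([2.0, 0.0, 0.0, 0.0])  # array([1., 0., 0., 0.])
```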
where β is a scale factor chosen to keep the expected values of the position and orientation errors approximately equal. For indoor scenes it was between 120 and 750, and for outdoor scenes between 250 and 2000.
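Under these definitions the loss is straightforward to compute; here is a NumPy sketch (the default β of 500 is just an illustrative value within the quoted ranges):

```python
import numpy as np

def posenet_loss(x_hat, q_hat, x, q, beta=500.0):
    """Position error plus beta-weighted orientation error:
    loss = ||x_hat - x||_2 + beta * ||q_hat - q/||q||||_2.
    beta balances the two terms (120-750 indoors, 250-2000 outdoors)."""
    q_unit = q / np.linalg.norm(q)            # ground-truth quaternion, unit length
    pos_err = np.linalg.norm(x_hat - x)       # ||x_hat - x||_2
    ori_err = np.linalg.norm(q_hat - q_unit)  # ||q_hat - q/||q||||_2
    return pos_err + beta * ori_err
```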
The authors use the GoogLeNet architecture to build the pose regression network. The original GoogLeNet is a 22-layer architecture containing six inception modules and two additional classifiers. The authors made the following changes to the architecture:
- Replace all three softmax classifiers with affine regressors. The softmax layers were removed, and each final fully connected layer was modified to output a 7-dimensional pose vector representing position and orientation.
- Insert another fully connected layer of feature size 2048 before the final regressor. This forms a localization feature vector that may then be explored for generalization.
- At test time, also normalize the quaternion orientation vector to unit length.
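The modified head amounts to plain affine layers on top of the GoogLeNet features. Below is a rough NumPy illustration; the 1024-d input feature size and the random weights are assumptions for illustration, while the 2048-d localization feature and the 7-D pose output come from the changes listed above:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b):
    """Affine layer: x @ w + b (no softmax -- the classifiers are
    replaced with plain affine regressors)."""
    return x @ w + b

# Hypothetical 1024-d GoogLeNet feature entering the new head:
feat = rng.standard_normal(1024)

# Extra fully connected layer -> 2048-d localization feature vector
w1, b1 = rng.standard_normal((1024, 2048)), np.zeros(2048)
loc_feat = dense(feat, w1, b1)

# Final affine regressor -> 7-D pose (3-D position + 4-D quaternion)
w2, b2 = rng.standard_normal((2048, 7)), np.zeros(7)
pose = dense(loc_feat, w2, b2)
x, q = pose[:3], pose[3:] / np.linalg.norm(pose[3:])  # normalize q at test time
```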
In the code below, we use a PoseNet model created and trained by TensorFlow. These models are available for various devices: they can run in the browser or on an Android or iOS device. To run them in Python, we will use this implementation.
This will generate a video output file. We have tested the model on this video from the OpenPose GitHub repository. The result cannot be uploaded here as it exceeds the size limit; you can see the resultant video here.
The dataset was generated using structure-from-motion (SfM) techniques, which the authors use as ground-truth measurements for this paper. A pedestrian captured HD video around each scene using a Google LG Nexus 5 smartphone. Below are some results on this dataset.