
PoseNet Pose Estimation

Last Updated : 23 Jun, 2022

Pose estimation refers to computer vision techniques that detect people or objects in images and videos so that one can determine, for example, where someone's elbow appears in an image. Pose estimation techniques have many applications, such as gesture control, action recognition, and augmented reality. In this article, we will discuss PoseNet, which uses a convolutional neural network (CNN) model to regress pose from a single RGB image. It can also be used in real-time systems, taking about 5 ms per frame.

Deep Learning Regression Model: 

A convolutional neural network (ConvNet) is trained to estimate camera pose directly from a monocular image I. The network outputs a pose vector p, given by a 3-D camera position x and an orientation represented by a quaternion q:

p = \left [ x, q \right ]

The pose p is defined relative to an arbitrary global reference frame. Quaternions are chosen as the orientation representation because arbitrary 4-D values are easily mapped to legitimate rotations by normalizing them to unit length. The loss function of the regressor is defined as:

loss\left ( I \right ) = \left \| \hat{x} - x \right \|_2 + \beta\left \| \hat{q} - \frac{q}{\left \| q \right \|} \right \|_2

Here β is a scale factor chosen to keep the expected values of the position and orientation errors approximately equal. For indoor scenes it was between 120 and 750, and for outdoor scenes between 250 and 2000.
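As a minimal sketch (not the authors' code), the snippet below computes this loss with NumPy, assuming x, q are the network's predicted position and quaternion and x_gt, q_gt are the ground truth; beta=500 is only an illustrative placeholder value.

Python3

# Sketch of the PoseNet loss: position error plus beta-weighted orientation error
import numpy as np

def posenet_loss(x, q, x_gt, q_gt, beta=500.0):
    # beta = 500 is only an example value; the paper tunes it per scene
    pos_err = np.linalg.norm(x_gt - x)            # position error
    q_unit = q / np.linalg.norm(q)                # map the raw 4-D output to a unit quaternion
    ori_err = np.linalg.norm(q_gt - q_unit)       # orientation error
    return pos_err + beta * ori_err

# toy example with made-up values
x_pred = np.array([1.0, 2.0, 0.5])
q_pred = np.array([0.9, 0.1, 0.0, 0.4])
x_gt = np.array([1.1, 1.9, 0.4])
q_gt = np.array([1.0, 0.0, 0.0, 0.0])
print(posenet_loss(x_pred, q_pred, x_gt, q_gt))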

Architecture:

The authors use the GoogLeNet architecture to develop the pose regression network. The original GoogLeNet architecture contains 22 layers with six inception modules and two additional intermediate classifiers. The authors made the following changes to this architecture:

  • Replace each of the three softmax classifiers with an affine regressor. The softmax layers were removed, and each final fully connected layer was modified to output a 7-dimensional pose vector representing position and orientation.
  • Insert another fully connected layer of feature size 2048 before the final regressor. This forms a localization feature vector that may then be explored for generalization.
  • At test time, the quaternion orientation vector is also normalized to unit length (a sketch of such a regression head follows this list).
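As an illustration only, the snippet below sketches what such a regression head could look like using tf.keras layers. This is not the authors' original GoogLeNet implementation, and the 1024-dimensional backbone feature size is an assumption.

Python3

# Hypothetical sketch of the pose regression head: a 2048-d localization
# feature layer followed by a 7-d affine regressor (3-D position + 4-D quaternion)
import numpy as np
import tensorflow as tf

backbone_features = tf.keras.Input(shape=(1024,))  # assumed pooled backbone feature size
loc = tf.keras.layers.Dense(2048, activation='relu')(backbone_features)  # localization feature vector
pose = tf.keras.layers.Dense(7)(loc)                # affine regressor: [x (3), q (4)]
head = tf.keras.Model(backbone_features, pose)

# At test time, normalize the predicted quaternion to unit length
p = head.predict(np.random.rand(1, 1024).astype('float32'))
x, q = p[:, :3], p[:, 3:]
q = q / np.linalg.norm(q, axis=1, keepdims=True)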

Implementation:

  • In this code, we will be using the PoseNet model created and trained by the TensorFlow team. These models are available for various platforms; for example, they can run in the browser or on an Android or iOS device. To run the model in Python, we will use this implementation (rwightman/posenet-python), which is cloned in the code below.

Python3

# Necessary imports
%tensorflow_version 1.x
!pip3 install scipy pyyaml ipykernel opencv-python==3.4.5.20
 
# Clone some Code from GitHub
!git clone https://www.github.com/rwightman/posenet-python
 
import os
import cv2
import time
import argparse
import posenet
import tensorflow as tf
import matplotlib.pyplot as plt
 
print('Initializing')
input_file = '/content/posenet-python/video.avi'
output_file = '/content/posenet-python/output.mp4'
 
# Load the input video file and read its properties
cap = cv2.VideoCapture(input_file)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
# create a video writer to write the output file
fourcc = cv2.VideoWriter_fourcc('M','J','P','G')
video = cv2.VideoWriter(output_file, fourcc, fps, (width, height))
 
model = 101         # PoseNet model variant passed to posenet.load_model
scale_factor = 0.4  # downscale input frames before inference (use 1.0 for full resolution)
 
with tf.Session() as sess:
    # Load the PoseNet model
    model_cfg, model_outputs = posenet.load_model(model, sess)
    output_stride = model_cfg['output_stride']
    start = time.time()
 
    incnt = 0
    # Process the whole video frame by frame
    while True:
        # Increase frame count by one
        incnt = incnt + 1
        try:
            # read_cap is a utility function that reads a frame from the
            # video and preprocesses it for the network
            input_image, draw_image, output_scale = posenet.read_cap(
                cap, scale_factor=scale_factor, output_stride=output_stride)
        except:
            # read_cap raises an exception when no more frames can be read
            break
        # run the model on the image and generate output results
        heatmaps_result, offsets_result, displacement_fwd_result, displacement_bwd_result = sess.run(
            model_outputs,
            feed_dict={'image:0': input_image}
        )
        # here we filter the poses generated by the model above and output
        # the pose scores, keypoint scores and keypoint coordinates;
        # at most 10 poses are returned by default (this limit is configurable)
        pose_scores, keypoint_scores, keypoint_coords = posenet.decode_multiple_poses(
            heatmaps_result.squeeze(axis=0),
            offsets_result.squeeze(axis=0),
            displacement_fwd_result.squeeze(axis=0),
            displacement_bwd_result.squeeze(axis=0),
            output_stride=output_stride,
            min_pose_score=0.25)
        # scale the keypoint coordinates by the output scale
        keypoint_coords *= output_scale
        # draw pose on input frame to obtain output frame
        draw_image = posenet.draw_skel_and_kp(
                draw_image, pose_scores, keypoint_scores, keypoint_coords,
                min_pose_score=0.25, min_part_score=0.25)
        video.write(draw_image)
# release the videoreader and writer
video.release()
cap.release()

                    
  • This will generate an output video file. We tested the model on a video from the OpenPose GitHub repository; we cannot upload it here as it exceeds the size limit, but you can see the resultant video here.

Dataset Used:

The dataset was generated using structure from motion (SfM) techniques, which the authors use as ground-truth measurements for this paper. A pedestrian captured HD video around each scene using a Google LG Nexus 5 smartphone. Below are some results on this dataset.

Figure: Relocalization results. The predicted camera pose of the visual reconstruction (middle) is shown overlaid in red on the original image.

References:

  • Kendall, A., Grimes, M., Cipolla, R. "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization", ICCV 2015.
  • rwightman/posenet-python: https://github.com/rwightman/posenet-python
