Understanding 3D Human Pose and Shape Estimation From HybrIK

Recently, Jiefeng Li et al. proposed HybrIK, which combines 3D keypoint estimation with human pose and shape estimation. It introduces inverse kinematics to recover the rotation of joints and achieves new state-of-the-art performance on several public benchmarks. Here I summarize the HybrIK paradigm to build an understanding of 3D human pose and shape estimation.

Network structure

From a technical perspective, the task of human pose and shape estimation with neural networks is clearly defined. Given an image represented by a tensor $x$ of shape $3 \times H \times W$, where $H$ and $W$ denote the height and width, a typical neural network first produces two tensors representing the pose and the shape, respectively.

The network can be formulated as:

$$P, (\beta, \theta) = \mathcal{F}(x),$$

where $P$ is the heatmap of shape $(J \cdot D) \times H \times W$, and $J$ and $D$ are the number of joints and the range of depth, respectively. $\mathcal{F}$ is composed of two parts: a backbone that extracts high-dimensional features, and two prediction heads that produce the pose heatmap $P$ and the shape parameters.

Prediction of $P$

The prediction of the pose heatmap is similar to image segmentation in that it applies (de)convolutional layers to produce spatially corresponding outputs. Specifically, in HybrIK, 3 deconvolutional layers with kernel size 4 and stride 2, each followed by BN, are stacked to up-sample the features. The final prediction layer, a $3 \times 3$ convolution, yields $J \cdot D$ channels.
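To make the up-sampling concrete, here is a small sketch of the output-size arithmetic for a stack of transposed convolutions. The kernel size and stride come from the text; the padding of 1, the zero output padding, and the example input size of 8 are assumptions made for illustration, not values confirmed above.

```python
def deconv_out_size(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1).

    padding=1 and output_padding=0 are assumed here; with kernel 4 and
    stride 2 they make each layer exactly double the spatial size.
    """
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def upsampled_size(size, num_layers=3):
    """Spatial size after stacking `num_layers` such deconvolutional layers."""
    for _ in range(num_layers):
        size = deconv_out_size(size)
    return size


if __name__ == "__main__":
    # Hypothetical 8x8 backbone feature map, up-sampled by three layers.
    print(upsampled_size(8))
```

Under these assumptions, three layers enlarge the feature map by a factor of 8 along each spatial axis.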

The initial heatmap $P$ should be normalized, for example with a softmax, over all spatial locations and then reshaped into 3D space as $\hat{P} \in \mathbb{R}^{J \times D \times H \times W}$. The coordinates $Q = \{q_i \in \mathbb{R}^3\}_{i=1}^{J}$ of the joints can then be computed by integrating over each dimension. Note that each coordinate in $Q$ lies in the normalized range $(-0.5, 0.5)$.
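The normalization and integration step can be sketched in NumPy as follows. This is a minimal illustration rather than HybrIK's actual code: the function name, the channel layout (channel index equal to `j * D + d`), and the choice to divide the expected index by the axis length before subtracting 0.5 are all assumptions.

```python
import numpy as np


def soft_argmax_3d(heatmap, num_joints, depth):
    """Map raw scores of shape (J*D, H, W) to (J, 3) coordinates (u, v, d)."""
    _, h, w = heatmap.shape
    # Softmax over all D*H*W locations of each joint.
    flat = heatmap.reshape(num_joints, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    prob = np.exp(flat)
    prob /= prob.sum(axis=1, keepdims=True)
    prob = prob.reshape(num_joints, depth, h, w)
    # Marginal distribution along each axis.
    p_d = prob.sum(axis=(2, 3))  # (J, D)
    p_v = prob.sum(axis=(1, 3))  # (J, H)
    p_u = prob.sum(axis=(1, 2))  # (J, W)
    # Expected index along each axis, rescaled to [-0.5, 0.5).
    u = p_u @ np.arange(w) / w - 0.5
    v = p_v @ np.arange(h) / h - 0.5
    d = p_d @ np.arange(depth) / depth - 0.5
    return np.stack([u, v, d], axis=1)
```

Because the coordinate is an expectation rather than a hard argmax, the operation stays differentiable, which is what allows the heatmap head to be trained directly from coordinate losses.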

Prediction of shape parameters

The prediction of the shape parameters is even more straightforward. Akin to image classification, the features are squeezed into a vector by average pooling, on top of which fully connected layers predict the difference between the estimated shape and the average shape provided by the SMPL model. SMPL proposes two versions of the shape representation, with 10 and 1000 degrees of freedom, respectively. Both are linear combinations of orthonormal principal components of shape displacements. HybrIK uses the 10-dimensional version, while SMPL has since been updated to support only the 1000-dimensional version.
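The linear shape model described here can be illustrated with a toy example. The sketch below uses random arrays in place of the real SMPL template and shape basis, and the names `mean_shape`, `shape_basis`, and the vertex count of 100 are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np


def shaped_vertices(mean_shape, shape_basis, beta):
    """Add a linear combination of shape displacements to the mean shape.

    mean_shape:  (V, 3) average mesh vertices
    shape_basis: (V, 3, K) principal components of shape displacements
    beta:        (K,) predicted shape coefficients
    """
    return mean_shape + shape_basis @ beta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_vertices, num_betas = 100, 10  # toy vertex count; 10 betas as in HybrIK
    mean_shape = rng.normal(size=(num_vertices, 3))
    shape_basis = rng.normal(size=(num_vertices, 3, num_betas))
    beta = np.zeros(num_betas)
    # With all coefficients at zero, the result is the average shape itself.
    print(np.allclose(shaped_vertices(mean_shape, shape_basis, beta), mean_shape))
```

Predicting only the coefficients keeps the output of the shape head small (10 numbers here) while still describing a full mesh.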

Post-processing

Since the ultimate goal of the task is a mesh representation of the human body, the predicted values need post-processing to form the final results.

Coordinate transformation

Note that the predicted pose coordinates are based on the volumetric representation, i.e., UVD coordinates. Thus, one needs to transform the UVD coordinates back to image space in order to use 2D keypoint annotations.
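As a rough illustration of this transformation, the sketch below maps the normalized $u$ and $v$ coordinates to pixel positions. It assumes the network input was an axis-aligned crop described by a hypothetical `bbox = (x_min, y_min, width, height)`, and that the normalized coordinates span that crop linearly. The actual procedure in HybrIK may differ, for example by inverting an affine crop transform.

```python
import numpy as np


def uvd_to_image(uvd, bbox):
    """Map normalized (u, v) in [-0.5, 0.5) to pixel coordinates.

    uvd:  (J, 3) array of normalized (u, v, d) coordinates
    bbox: (x_min, y_min, width, height) of the assumed crop, in pixels
    Returns a (J, 2) array of (x, y) pixel coordinates; depth is left untouched.
    """
    x_min, y_min, width, height = bbox
    uvd = np.asarray(uvd, dtype=float)
    x = (uvd[:, 0] + 0.5) * width + x_min
    y = (uvd[:, 1] + 0.5) * height + y_min
    return np.stack([x, y], axis=1)
```

With this convention, a joint at $(u, v) = (0, 0)$ lands at the center of the crop.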

TO BE CONTINUED...