HPRNet: Hierarchical point regression for whole-body human pose estimation

https://doi.org/10.1016/j.imavis.2021.104285

Abstract

In this paper, we present a new bottom-up one-stage method for whole-body pose estimation, which we call “hierarchical point regression,” or HPRNet for short. In standard body pose estimation, the locations of ~17 major joints on the human body are estimated. In whole-body pose estimation, by contrast, the locations of fine-grained keypoints (68 on the face, 21 on each hand and 3 on each foot) are estimated as well, which creates a scale variance problem that needs to be addressed. To handle the scale variance among different body parts, we build a hierarchical point representation of body parts and jointly regress them. The relative locations of fine-grained keypoints in each part (e.g. face) are regressed in reference to the center of that part, whose location itself is estimated relative to the person center. In addition, unlike the existing two-stage methods, our method predicts whole-body pose in a constant time independent of the number of people in an image. On the COCO WholeBody dataset, HPRNet significantly outperforms all previous bottom-up methods on the keypoint detection of all whole-body parts (i.e. body, foot, face and hand); it also achieves state-of-the-art results on face (75.4 AP) and hand (50.4 AP) keypoint detection. Code and models are available at https://github.com/nerminsamet/HPRNet.git.

Introduction

As a challenging computer vision task, human pose estimation aims to localize human body keypoints in images and videos. Human pose estimation has an important role in several vision tasks and applications such as action recognition [1], [2], [3], [4], [5], human mesh recovery [6], [7], [8], [9], augmented/virtual reality [10], [11], [12], animation and gaming [13], [14], [15], [16]. Unlike the standard human pose estimation task, whole-body pose estimation aims to detect face, hand and foot keypoints in addition to the standard human body keypoints. The challenge in this problem is the extreme scale variance, or imbalance, among different whole-body parts. For example, the relatively small scale of face and hand keypoints makes their accurate localization more difficult compared to standard body keypoints such as the elbow, knee and hip. Direct application of existing human pose estimation methods does not yield satisfactory results due to this scale variance problem.
Even though human pose estimation has been well studied for the past few decades, the whole-body pose estimation task has not been sufficiently explored, mainly due to the lack of large-scale, fully annotated whole-body keypoint datasets. The few previous methods [17], [18] trained several deep networks separately on different face, hand and body datasets, and ensembled them during inference. These methods suffer from dataset biases, variations in illumination, pose and scale, and complex training and inference pipelines.
Recently, in order to address the missing benchmark issue, Jin et al. [19] introduced a novel dataset for whole-body pose estimation, called COCO WholeBody. COCO WholeBody extends the COCO keypoints dataset [20] by further annotating face, hand and foot keypoints. In addition to the standard 17 human body keypoints from the COCO keypoints dataset, 68 facial landmarks, 42 hand keypoints and 6 foot keypoints are annotated (Fig. 1). Along with these 133 whole-body keypoint annotations, the dataset also has face and hand bounding box annotations that were automatically computed from the extreme keypoints of the corresponding part. They also proposed a strong baseline, called ZoomNet, which has set the state of the art. ZoomNet is a top-down, two-stage method based on the human pose estimation model HRNet [21]. Given an image, ZoomNet first detects person instances using the FasterRCNN [22] person detector, then it predicts 17 body and 6 foot keypoints using a CNN model. Later, to overcome the scale variance between whole-body parts, ZoomNet crops the hand and face areas that it detected and transforms them to higher resolutions using separate CNNs to further perform face and hand keypoint estimation.
There are two main approaches for human pose and whole-body pose estimation: bottom-up [23], [24], [25], [26], [27], [28], [29], [30], [18], [31], [32], [33], [34], [35] and top-down [36], [37], [38], [39], [21], [40]. Bottom-up methods directly detect human body keypoints and later group them to obtain final poses for each person in a given image. On the other hand, top-down methods (e.g. ZoomNet) first detect and extract person instances, then apply pose estimation on each instance separately. The grouping stage of bottom-up methods is more efficient than repeating pose estimation for each person instance. As a result, top-down methods slow down with an increasing number of people (Fig. 5). However, top-down approaches generally obtain better accuracy than bottom-up methods.
In this paper, we propose a new bottom-up method, HPRNet, that explicitly handles the hierarchical nature of whole-body pose estimation by regressing keypoints hierarchically. To this end, in addition to estimating the standard body keypoints, we define the bounding box centers of relatively small body parts such as the face and hands with offsets to the person instance center (Fig. 3). Concurrently, we build another level of regression where we define each hand and face keypoint with an offset to its corresponding hand or face bounding box center. We jointly train each level of the regression hierarchy and regress all whole-body keypoints with respect to their defined center points. This hierarchical bottom-up approach brings two benefits. First, the scale variance among different body parts is handled naturally, as the relative distances within each part are in a similar range and each part type is processed by a separate sub-network. Second, being a bottom-up method, HPRNet's inference speed is minimally affected by the number of persons in the input image. This is in contrast to top-down methods such as ZoomNet, which slow down significantly with more person instances (65.7 ms for an image containing 1 person vs. 668.2 ms for an image with 10 persons). Our method is based on the center-point based bottom-up object detection methods [41], [42], [43], [44]. These methods can easily be extended to the keypoint estimation task [41], [45].
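The two-level regression described above can be decoded by simple offset arithmetic: a part (e.g. face) center is recovered relative to the person center, and each fine-grained keypoint is recovered relative to that part center. The following is a minimal sketch of this decoding under assumed names and toy coordinates; it is not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of HPRNet-style hierarchical decoding (function and
# variable names are illustrative, not from the released codebase).
# A person center comes from a detected heatmap peak; a part center (face,
# hand) is regressed as an offset from the person center; fine-grained
# keypoints are regressed as offsets from their part center.

def decode_hierarchical(person_center, part_center_offset, keypoint_offsets):
    """Recover absolute keypoint coordinates from hierarchical offsets.

    person_center:      (2,)   absolute (x, y) of the person center
    part_center_offset: (2,)   part center relative to the person center
    keypoint_offsets:   (K, 2) keypoints relative to the part center
    """
    part_center = person_center + part_center_offset
    return part_center + keypoint_offsets

# Toy example: a face center above the torso, with two eye landmarks.
person = np.array([120.0, 80.0])
face_offset = np.array([4.0, -30.0])
kps = np.array([[-10.0, -5.0], [10.0, -5.0]])
abs_kps = decode_hierarchical(person, face_offset, kps)
print(abs_kps)  # [[114.  45.] [134.  45.]]
```

Because the keypoint offsets are expressed relative to the part center, their magnitudes stay in a similar, small range regardless of where the person appears in the image, which is how the hierarchy absorbs the scale variance between parts.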
We validated the effectiveness of our method through ablation experiments and comparisons with the state of the art (SOTA) on the COCO WholeBody dataset. Our method significantly outperforms all bottom-up methods. It also outperforms the SOTA top-down method ZoomNet in the detection of face and hand keypoints, while being significantly faster than ZoomNet.
Our major contribution in this paper is a one-stage, bottom-up method that closes the performance gap between bottom-up and top-down methods. In contrast to top-down methods, our method runs in almost constant time, independent of the number of persons in the input image.

Section snippets

Human body pose estimation

We can categorize the current approaches for multi-person pose estimation into two: bottom-up and top-down. In the bottom-up methods [23], [24], [25], [26], [27], [28], [29], [30], [18], [31], [32], [33], [34], [35], given an image, body keypoints are detected first, without knowing the number or locations of person instances or which person instance these keypoints belong to. Later, detected keypoints are grouped and assigned to person instances. Recently, center-based object detection methods [41]

Model

HPRNet is a one-stage, end-to-end trainable network that learns to regress whole-body keypoints. In HPRNet, the input image first passes through a backbone network, and the output of the backbone is fed to 8 separate branches, namely: Person Center Heatmap, Person Center Correction, Person W & H, Body Keypoint Offsets, Body Keypoint Heatmaps, Hand Keypoint Offsets, Face Keypoint Offsets and Face Box W & H. We show the network architecture of HPRNet in Fig. 2.
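The branch layout above can be sketched as a mapping from the shared backbone feature map to one output map per head. The sketch below is illustrative only: the channel counts (e.g. grouping the 6 foot keypoints with the 17 body keypoints, and 136 channels for the 68 two-dimensional face offsets) are our assumptions, and the heads are mocked as random maps purely to show shapes, not as the paper's sub-networks.

```python
import numpy as np

# Assumed channel counts per branch (illustrative, not from the paper's
# released configuration). Offsets have 2 channels (dx, dy) per keypoint.
BRANCHES = {
    "person_center_heatmap": 1,       # one class: person
    "person_center_correction": 2,    # sub-pixel (x, y) refinement
    "person_w_h": 2,                  # person box width & height
    "body_keypoint_offsets": 2 * 23,  # assumes 17 body + 6 foot keypoints
    "body_keypoint_heatmaps": 23,
    "hand_keypoint_offsets": 2 * 42,  # 21 keypoints per hand
    "face_keypoint_offsets": 2 * 68,  # 68 facial landmarks
    "face_box_w_h": 2,
}

def forward_heads(features):
    """Given a backbone feature map of shape (C, H, W), emit one output
    map per branch. Heads are mocked as random maps for shape illustration."""
    _, h, w = features.shape
    rng = np.random.default_rng(0)
    return {name: rng.standard_normal((out_ch, h, w))
            for name, out_ch in BRANCHES.items()}

outs = forward_heads(np.zeros((64, 128, 128)))
assert outs["face_keypoint_offsets"].shape == (136, 128, 128)
```

The point of the separate branches is that each part type (body, hand, face) gets its own regression sub-network operating in its own offset range, which is what lets the hierarchy handle the scale imbalance between parts.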

Experiments

This section describes the experiments we conducted to show the effectiveness of our proposed method. First, we present ablation experiments to compare hierarchical models I and II shown in Fig. 4. Next, we compare our method with our baseline CenterNet [41] (Fig. 4b). Finally, we provide a performance comparison with the state of the art and a run-time analysis.

Conclusion

In this work, we introduced HPRNet as a bottom-up, one-stage method for whole-body keypoint detection. HPRNet handles scale variance among whole-body parts by hierarchically regressing whole-body keypoints. We evaluated the effectiveness of our method through baseline comparisons and ablation experiments on the hierarchical structure of whole-body keypoints. Our method achieves state-of-the-art results in the detection of face and hand keypoints on the COCO WholeBody dataset; it also outperforms all previous bottom-up methods.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The numerical calculations reported in this paper were fully performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).

References (61)

  • Y. Du et al.

    Hierarchical recurrent neural network for skeleton based action recognition

  • M. Li et al.

    Actional-structural graph convolutional networks for skeleton-based action recognition

  • A. Yan et al.

PA3D: pose-action 3D machine for video recognition

  • L. Huang et al.

    Part-aligned pose-guided recurrent network for action recognition

    Pattern Recognit.

    (2019)
  • D.C. Luvizon et al.

2D/3D pose estimation and action recognition using multitask deep learning

  • H. Choi et al.

Pose2Mesh: graph convolutional network for 3D human pose and mesh recovery from a 2D human pose

  • J.N. Kundu et al.

    Appearance consensus driven self-supervised human mesh recovery

  • U. Iqbal et al.

KAMA: 3D Keypoint Aware Body Mesh Articulation

    (2021)
  • A. Kanazawa et al.

    End-to-end recovery of human shape and pose

  • G. Cimen et al.

AR Poser: automatically augmenting mobile pictures with digital avatars imitating poses

  • A. Elhayek et al.

Fully automatic multi-person human motion capture for VR applications

  • W. Xu et al.

Mo2Cap2: real-time mobile 3D motion capture with a cap-mounted fisheye camera

    IEEE Trans. Vis. Comput. Graph.

    (2019)
  • Azure Kinect Body Tracking Joints

    (2019)
3D Skeletal Tracking on Azure Kinect

    (2019)
How Huawei ML Kit's Face Detection and Hand Keypoint Detection Capabilities Helped With Creating the Game Crazy Rockets

    (2020)
  • L. Kumarapu et al.

AnimePose: multi-person 3D pose estimation and animation

    Pattern Recognit. Lett.

    (2020)
  • Z. Cao et al.

OpenPose: realtime multi-person 2D pose estimation using part affinity fields

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2021)
  • G. Hidalgo et al.

    Single-network whole-body pose estimation

  • S. Jin et al.

    Whole-body human pose estimation in the wild

  • T.-Y. Lin et al.

    Microsoft COCO: common objects in context

  • K. Sun et al.

    Deep high-resolution representation learning for human pose estimation

  • S. Ren et al.

    Faster R-CNN: towards real-time object detection with region proposal networks

    Advances in Neural Information Processing Systems

    (2015)
  • Z. Cao et al.

Realtime multi-person 2D pose estimation using part affinity fields

  • G. Ning et al.

    Knowledge-guided deep fractal neural networks for human pose estimation

    IEEE Trans. Multimed.

    (2017)
  • A. Newell et al.

    Associative Embedding: End-to-End Learning for Joint Detection and Grouping

    (2016)
  • A. Newell et al.

    Stacked hourglass networks for human pose estimation

  • M. Kocabas et al.

MultiPoseNet: fast multi-person pose estimation using pose residual network

  • G. Papandreou et al.

PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model

  • A. Bulat et al.

    Human pose estimation via convolutional part heatmap regression

  • L. Pishchulin et al.

DeepCut: joint subset partition and labeling for multi person pose estimation
