HPRNet: Hierarchical point regression for whole-body human pose estimation

https://doi.org/10.1016/j.imavis.2021.104285

Abstract

In this paper, we present a new bottom-up one-stage method for whole-body pose estimation, which we call “hierarchical point regression,” or HPRNet for short. In standard body pose estimation, the locations of ~17 major joints on the human body are estimated. In whole-body pose estimation, by contrast, the locations of fine-grained keypoints (68 on the face, 21 on each hand and 3 on each foot) are estimated as well, which creates a scale variance problem that needs to be addressed. To handle the scale variance among different body parts, we build a hierarchical point representation of body parts and jointly regress them. The relative locations of fine-grained keypoints in each part (e.g. face) are regressed in reference to the center of that part, whose location itself is estimated relative to the person center. In addition, unlike the existing two-stage methods, our method predicts whole-body pose in a constant time independent of the number of people in an image. On the COCO WholeBody dataset, HPRNet significantly outperforms all previous bottom-up methods on the keypoint detection of all whole-body parts (i.e. body, foot, face and hand); it also achieves state-of-the-art results on face (75.4 AP) and hand (50.4 AP) keypoint detection. Code and models are available at https://github.com/nerminsamet/HPRNet.git.

Introduction

As a challenging computer vision task, human pose estimation aims to localize human body keypoints in images and videos. Human pose estimation has an important role in several vision tasks and applications such as action recognition [1], [2], [3], [4], [5], human mesh recovery [6], [7], [8], [9], augmented/virtual reality [10], [11], [12], animation and gaming [13], [14], [15], [16]. Unlike the standard human pose estimation task, whole-body pose estimation aims to detect face, hand and foot keypoints in addition to the standard human body keypoints. The challenge in this problem is the extreme scale variance, or imbalance, among different whole-body parts. For example, the relatively small scale of face and hand keypoints makes their accurate localization more difficult compared to standard body keypoints such as the elbow, knee and hip. Direct application of existing human pose estimation methods does not yield satisfactory results due to this scale variance problem.
Even though human pose estimation has been well studied for the past few decades, the whole-body pose estimation task has not been sufficiently explored, mainly due to the lack of large-scale, fully annotated whole-body keypoint datasets. The few previous methods [17], [18] trained several deep networks separately on different face, hand and body datasets, and ensembled them during inference. These methods suffer from dataset biases, variations in illumination, pose and scale, and complex training and inference pipelines.
Recently, in order to address the missing benchmark issue, Jin et al. [19] introduced a novel dataset for whole-body pose estimation, called COCO WholeBody. COCO WholeBody extends the COCO keypoints dataset [20] by further annotating face, hand and foot keypoints. In addition to the standard 17 human body keypoints from the COCO keypoints dataset, 68 facial landmarks, 42 hand keypoints and 6 foot keypoints are annotated (Fig. 1). Along with these 133 whole-body keypoint annotations, the dataset also has face and hand bounding box annotations that were automatically computed from the extreme keypoints of the corresponding part. They also proposed a strong baseline, called ZoomNet, which has set the state of the art. ZoomNet is a top-down, two-stage method based on the human pose estimation model HRNet [21]. Given an image, ZoomNet first detects person instances using the FasterRCNN [22] person detector, then it predicts 17 body and 6 foot keypoints using a CNN model. Later, to overcome the scale variance between whole-body parts, ZoomNet crops the hand and face areas that it detected and transforms them to higher resolutions using separate CNNs to further perform face and hand keypoint estimation.
There are two main approaches for human pose and whole-body pose estimation: bottom-up [23], [24], [25], [26], [27], [28], [29], [30], [18], [31], [32], [33], [34], [35] and top-down [36], [37], [38], [39], [21], [40]. Bottom-up methods directly detect human body keypoints and later group them to obtain final poses for each person in a given image. On the other hand, top-down methods (e.g. ZoomNet) first detect and extract person instances, then apply pose estimation on each instance separately. The grouping stage of bottom-up methods is more efficient than repeating pose estimation for each person instance. As a result, top-down methods slow down with an increasing number of people (Fig. 5). However, top-down approaches generally obtain better accuracy than bottom-up methods.
In this paper, we propose a new bottom-up method, HPRNet, that explicitly handles the hierarchical nature of whole-body pose estimation by regressing keypoints hierarchically. To this end, in addition to estimating the standard body keypoints, we define the bounding box centers of relatively small body parts such as the face and hands with offsets to the person instance center (Fig. 3). Concurrently, we build another level of regression where we define each hand and face keypoint with an offset to its corresponding hand or face bounding box center. We jointly train each level of the regression hierarchy and regress all whole-body keypoints with respect to their defined center points. This hierarchical bottom-up approach brings two benefits. First, the scale variance among different body parts is handled naturally, as the relative distances within each part are in a similar range and each part type is processed by a separate sub-network. Second, being a bottom-up method, HPRNet's inference speed is minimally affected by the number of persons in the input image. This is in contrast to top-down methods such as ZoomNet, which slow down significantly with more person instances (65.7 ms for an image containing 1 person vs. 668.2 ms for an image with 10 persons). Our method is based on the center-point based bottom-up object detection methods [41], [42], [43], [44]. These methods can easily be extended to the keypoint estimation task [41], [45].
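The two-level regression described above can be decoded by simple offset arithmetic: a part (e.g. face) center is recovered relative to the person center, and each fine-grained keypoint is recovered relative to that part center. The following is a minimal sketch of this decoding under assumed names and toy coordinates; it is not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of HPRNet-style hierarchical decoding (function and
# variable names are illustrative, not from the released codebase).
# A person center comes from a detected heatmap peak; a part center (face,
# hand) is regressed as an offset from the person center; fine-grained
# keypoints are regressed as offsets from their part center.

def decode_hierarchical(person_center, part_center_offset, keypoint_offsets):
    """Recover absolute keypoint coordinates from hierarchical offsets.

    person_center:      (2,)   absolute (x, y) of the person center
    part_center_offset: (2,)   part center relative to the person center
    keypoint_offsets:   (K, 2) keypoints relative to the part center
    """
    part_center = person_center + part_center_offset
    return part_center + keypoint_offsets

# Toy example: a face center above the torso, with two eye landmarks.
person = np.array([120.0, 80.0])
face_offset = np.array([4.0, -30.0])
kps = np.array([[-10.0, -5.0], [10.0, -5.0]])
abs_kps = decode_hierarchical(person, face_offset, kps)
print(abs_kps)  # [[114.  45.] [134.  45.]]
```

Because the keypoint offsets are expressed relative to the part center, their magnitudes stay in a similar, small range regardless of where the person appears in the image, which is how the hierarchy absorbs the scale variance between parts.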
We validated the effectiveness of our method through ablation experiments and comparisons with the state of the art (SOTA) on the COCO WholeBody dataset. Our method significantly outperforms all bottom-up methods. It also outperforms the SOTA top-down method ZoomNet in the detection of face and hand keypoints, while being significantly faster than ZoomNet.
Our major contribution in this paper is a one-stage, bottom-up method that closes the performance gap between bottom-up and top-down methods. In contrast to top-down methods, our method runs in almost constant time, independent of the number of persons in the input image.

Section snippets

Human body pose estimation

We can categorize the current approaches for multi-person pose estimation into two: bottom-up and top-down. In the bottom-up methods [23], [24], [25], [26], [27], [28], [29], [30], [18], [31], [32], [33], [34], [35], given an image, body keypoints are detected first, without knowing the number or locations of person instances or which person instance these keypoints belong to. Later, detected keypoints are grouped and assigned to person instances. Recently, center-based object detection methods [41]

Model

HPRNet is a one-stage, end-to-end trainable network that learns to regress whole-body keypoints. In HPRNet, the input image first passes through a backbone network, and the output of the backbone is fed to 8 separate branches, namely: Person Center Heatmap, Person Center Correction, Person W & H, Body Keypoint Offsets, Body Keypoint Heatmaps, Hand Keypoint Offsets, Face Keypoint Offsets and Face Box W & H. We show the network architecture of HPRNet in Fig. 2.
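The branch layout above can be sketched as a mapping from the shared backbone feature map to one output map per head. The sketch below is illustrative only: the channel counts (e.g. grouping the 6 foot keypoints with the 17 body keypoints, and 136 channels for the 68 two-dimensional face offsets) are our assumptions, and the heads are mocked as random maps purely to show shapes, not as the paper's sub-networks.

```python
import numpy as np

# Assumed channel counts per branch (illustrative, not from the paper's
# released configuration). Offsets have 2 channels (dx, dy) per keypoint.
BRANCHES = {
    "person_center_heatmap": 1,       # one class: person
    "person_center_correction": 2,    # sub-pixel (x, y) refinement
    "person_w_h": 2,                  # person box width & height
    "body_keypoint_offsets": 2 * 23,  # assumes 17 body + 6 foot keypoints
    "body_keypoint_heatmaps": 23,
    "hand_keypoint_offsets": 2 * 42,  # 21 keypoints per hand
    "face_keypoint_offsets": 2 * 68,  # 68 facial landmarks
    "face_box_w_h": 2,
}

def forward_heads(features):
    """Given a backbone feature map of shape (C, H, W), emit one output
    map per branch. Heads are mocked as random maps for shape illustration."""
    _, h, w = features.shape
    rng = np.random.default_rng(0)
    return {name: rng.standard_normal((out_ch, h, w))
            for name, out_ch in BRANCHES.items()}

outs = forward_heads(np.zeros((64, 128, 128)))
assert outs["face_keypoint_offsets"].shape == (136, 128, 128)
```

The point of the separate branches is that each part type (body, hand, face) gets its own regression sub-network operating in its own offset range, which is what lets the hierarchy handle the scale imbalance between parts.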

Experiments

This section describes the experiments we conducted to show the effectiveness of our proposed method. First, we present ablation experiments to compare hierarchical models I and II shown in Fig. 4. Next, we compare our method with our baseline CenterNet [41] (Fig. 4b). Finally, we provide a performance comparison with the state of the art and a run-time analysis.

Conclusion

In this work, we introduced HPRNet as a bottom-up, one-stage method for whole-body keypoint detection. HPRNet handles scale variance among whole-body parts by hierarchically regressing whole-body keypoints. We evaluated the effectiveness of our method through baseline comparisons and ablation experiments on the hierarchical structure of whole-body keypoints. Our method achieves state-of-the-art results in the detection of face and hand keypoints on the COCO WholeBody dataset; it also outperforms all previous bottom-up methods.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The numerical calculations reported in this paper were fully performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).

References (61)

  • Y. Du et al.

    Hierarchical recurrent neural network for skeleton based action recognition

  • M. Li et al.

    Actional-structural graph convolutional networks for skeleton-based action recognition

  • A. Yan et al.

PA3D: pose-action 3D machine for video recognition

  • L. Huang et al.

    Part-aligned pose-guided recurrent network for action recognition

    Pattern Recognit.

    (2019)
  • D.C. Luvizon et al.

2D/3D pose estimation and action recognition using multitask deep learning

  • H. Choi et al.

Pose2Mesh: graph convolutional network for 3D human pose and mesh recovery from a 2D human pose

  • J.N. Kundu et al.

    Appearance consensus driven self-supervised human mesh recovery

  • U. Iqbal et al.

KAMA: 3D Keypoint Aware Body Mesh Articulation

    (2021)
  • A. Kanazawa et al.

    End-to-end recovery of human shape and pose

  • G. Cimen et al.

AR Poser: automatically augmenting mobile pictures with digital avatars imitating poses

  • A. Elhayek et al.

Fully automatic multi-person human motion capture for VR applications

  • W. Xu et al.

Mo2Cap2: real-time mobile 3D motion capture with a cap-mounted fisheye camera

    IEEE Trans. Vis. Comput. Graph.

    (2019)
  • Azure Kinect Body Tracking Joints

    (2019)
3D Skeletal Tracking on Azure Kinect

    (2019)
How Huawei ML Kit's Face Detection and Hand Keypoint Detection Capabilities Helped With Creating the Game Crazy Rockets

    (2020)
  • L. Kumarapu et al.

AnimePose: multi-person 3D pose estimation and animation

    Pattern Recognit. Lett.

    (2020)
  • Z. Cao et al.

OpenPose: realtime multi-person 2D pose estimation using part affinity fields

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2021)
  • G. Hidalgo et al.

    Single-network whole-body pose estimation

  • S. Jin et al.

    Whole-body human pose estimation in the wild

  • T.-Y. Lin et al.

    Microsoft COCO: common objects in context

  • K. Sun et al.

    Deep high-resolution representation learning for human pose estimation

  • S. Ren et al.

    Faster R-CNN: towards real-time object detection with region proposal networks

    Advances in Neural Information Processing Systems

    (2015)
  • Z. Cao et al.

Realtime multi-person 2D pose estimation using part affinity fields

  • G. Ning et al.

    Knowledge-guided deep fractal neural networks for human pose estimation

    IEEE Trans. Multimed.

    (2017)
  • A. Newell et al.

    Associative Embedding: End-to-End Learning for Joint Detection and Grouping

    (2016)
  • A. Newell et al.

    Stacked hourglass networks for human pose estimation

  • M. Kocabas et al.

MultiPoseNet: fast multi-person pose estimation using pose residual network

  • G. Papandreou et al.

PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model

  • A. Bulat et al.

    Human pose estimation via convolutional part heatmap regression

  • L. Pishchulin et al.

DeepCut: joint subset partition and labeling for multi person pose estimation
