My research interests lie in computer vision (e.g., detection, segmentation, and pose estimation), often combined with machine learning techniques such as multi-task learning, domain adaptation, domain generalization, and semi-supervised learning. Lately, I have been exploring vision-language-action based generalizable robotic grasping and manipulation. I am committed to pursuing simple yet efficient designs and data/label-efficient learning. Most of my work is about inferring the physical world (location, association, affordance, pose, shape, etc.) from RGB images.
● [2025.11] Our work SemiUHPE for Semi-Supervised Unconstrained Head Pose Estimation in the Wild is accepted by TPAMI 2025.
● [2025.09] Our work VLBiMan for Learning Generalizable Bimanual Manipulation from One-Shot Demonstration is released.
● [2025.09] Our work BiNoMaP for Learning Category-Level Bimanual Non-Prehensile Manipulation Primitives is released.
● [2025.06] Our work GAT-Grasp for Gesture-Driven and Affordance-based Task-Aware Robotic Grasping is accepted by IROS 2025.
● [2025.05] Our work DexScale for Automating Data Scaling for Sim2Real Generalizable Robot Skills is accepted by ICML 2025.
● [2025.04] Our work YOTO for Bimanual Robotic Manipulation is accepted by Robotics: Science and Systems (RSS) 2025.
● [2025.01] Our YOTO for Learning One-Shot Bimanual Robotic Manipulation from Video Demonstrations is released.
● [2024.07] I have joined the School of Data Science at CUHK-SZ as a postdoctoral researcher.
● [2024.05] I passed my doctoral dissertation defense on May 29, 2024.
● [2024.04] Our SemiUHPE for Semi-Supervised Unconstrained Head Pose Estimation in the Wild is released.
● [2024.02] Our MultiAugs for boosting Semi-Supervised 2D Human Pose Estimation is released.
● [2024.01] PBADet: A One-Stage Anchor-Free Approach for Part-Body Association is accepted by ICLR 2024.
● [2024.01] BPJDet: Extended Object Representation for Generic Body-Part Joint Detection is accepted by TPAMI 2024.
● [2023.07] I received the ICME 2023 Best Student Paper Runner-Up Award on 14 July 2023 in Brisbane, Australia.
● [2023.04] CONFETI for Domain Adaptive Semantic Segmentation has been accepted by CVPR Workshop 2023.
● [2023.04] BPJDet for Human Body-Part Joint Detection and Association has been accepted by ICME 2023.
● [2023.03] StuArt for Individualized Classroom Observation of Students has been accepted by ICASSP 2023.
● [2023.03] The project AIClass: Automatic Teaching Assistance System Towards Classrooms for K-12 Education is released.
● [2023.03] SSDA-YOLO for YOLOv5-based Domain Adaptive Object Detection has been accepted by CVIU in 2023.
● [2023.02] Our work DirectMHP: Direct 2D Multi-Person Head Pose Estimation with Full-range Angles is released.
Activities
● Conference Reviewer: ICLR(2026/2025), NeurIPS(2025), CVPR(2026/2025/2024/2023), ICCV(2025/2023), ECCV(2024/2022), AAAI(2026/2025/2024), ACMMM(2025/2024), ICRA(2026), ICME(2026/2025/2024), ACCV(2024), WACV(2026)
● Journal Reviewer: Transactions on Image Processing (TIP), Transactions on Circuits and Systems for Video Technology (TCSVT), Transactions on Intelligent Vehicles (TIV)
TLDR: VLBiMan is a framework that derives reusable skills from a single human example through task-aware decomposition, preserving invariant primitives as anchors while dynamically adapting adjustable components via vision-language grounding, and takes a step toward practical and versatile dual-arm manipulation in unstructured settings.
TLDR: This work proposes a geometry-aware post-optimization algorithm that refines raw motions into executable manipulation primitives conforming to specific motion patterns, and validates BiNoMaP across a range of representative bimanual tasks and diverse object categories, demonstrating its effectiveness, efficiency, versatility, and superior generalization capability.
TLDR: We introduce GAT-Grasp, a gesture-driven grasping framework that directly utilizes human hand gestures to guide the generation of task-specific grasp poses with appropriate positioning and orientation, enabling zero-shot generalization to novel objects and cluttered environments.
TLDR: As is well known, a critical prerequisite for achieving generalizable robot control is the availability of a large-scale robot training dataset. In this work, we introduce DexScale, a data engine designed to perform automatic skill simulation and scaling for learning deployable robot manipulation policies.
TLDR: This work proposes YOTO (You Only Teach Once), which can extract and then inject patterns of bimanual actions from as few as a single binocular observation of hand movements, and teach dual robot arms various complex tasks.
TLDR: We propose the first semi-supervised unconstrained head pose estimation (SemiUHPE) method, which can leverage a large amount of unlabeled wild head images. SemiUHPE robustly estimates challenging wild heads (e.g., heavy blur, extreme illumination, severe occlusion, atypical pose, and invisible face).
TLDR: Our method MultiAugs contains two vital components: (1) new advanced collaborative augmentation combinations; (2) multi-path predictions of strongly augmented inputs with diverse augmentations. Either of them can help boost the performance of semi-supervised 2D human pose estimation.
TLDR: This paper presents PBADet, a novel one-stage, anchor-free approach for part-body association detection. Building upon the anchor-free object representation across multi-scale feature maps, it introduces a singular part-to-body center offset that effectively encapsulates the relationship between parts and their parent bodies.
TLDR: The journal version of BPJDet. It adds various new capabilities: multiple body-part joint detection, and two downstream applications, Body-Head for accurate crowd counting and Body-Hand for hand contact estimation.
TLDR: To overcome the domain gap between synthetic and real-world datasets for semantic segmentation, this paper presents CONtrastive FEaTure and pIxel alignment (CONFETI), which bridges the gap at both the pixel and feature levels.
TLDR: A novel extended object representation that integrates the center location offsets of a body or its parts, used to construct a dense one-stage Body-Part Joint Detector (BPJDet). This design is simple yet efficient.
TLDR: StuArt is a novel automatic system designed for individualized classroom observation. We propose several pedagogical approaches in signal processing for K-12 education. (Note: StuArt is one of the key parts of the AIClass project.)
TLDR: This paper focuses on the full-range Multi-Person Head Pose Estimation (MPHPE) problem. We first construct two benchmarks by extracting GT labels for head detection and head orientation from the public datasets AGORA and CMU Panoptic. Then, we propose a simple, direct, end-to-end baseline named DirectMHP based on YOLOv5.
TLDR: This paper presents a novel semi-supervised domain adaptive YOLO (SSDA-YOLO) based method to improve cross-domain detection performance by integrating the compact one-stage stronger detector YOLOv5 with domain adaptation.
TLDR: An automatic hand-raiser recognition algorithm that identifies who raises their hands in real classroom scenarios, which is of great importance for further analyzing the learning states of individuals.
Experience
The Chinese University of Hong Kong, Shenzhen,
Postdoctoral researcher in School of Data Science
Advisor: Prof. Kui Jia
2024.7 - Present
Shanghai Jiao Tong University,
Ph.D. student in Computer Science and Engineering
Advisor: Prof. Hongtao Lu
2020.9 - 2024.6
Qualcomm Wireless Communication Technologies (Shenzhen, China),
Engineering Intern at AI Department, Machine Learning Group (MLGCN)
Reporting to Dongyong Zhou, Senior Software Engineer
2019.6 - 2019.10
Shanghai Jiao Tong University,
Academic Master student in Computer Science and Engineering
Advisor: Prof. Ruimin Shen
2017.9 - 2020.3
Hunan University,
Bachelor of Engineering in Computer Science and Technology
2013.9 - 2017.6