Hengkai Guo
I currently lead a team at ByteDance, focusing on the development of advanced 3D vision and augmented reality (AR) technologies. In this role, I have spearheaded 3D vision projects aimed at enhancing AR interaction, 3D editing, and content creation. Prior to joining ByteDance, I completed both my bachelor's and master's degrees at Tsinghua University. I have also interned at Google, Hulu, SenseTime, and Microsoft.
Hiring: We are seeking talented individuals (both FTEs and interns) with expertise in generative models and 3D vision. If you're interested, please feel free to email me (Email: guohengkaighk [AT] gmail [DOT] com).
Email / Scholar / GitHub
Research
My primary interest lies in 3D vision and its practical applications, including 3D localization, perception, reconstruction, understanding, and generation. (* indicates equal contribution, † indicates project lead.)
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Sili Chen, Hengkai Guo†, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, Bingyi Kang
CVPR, 2025 (Highlight Presentation)
project page / code / arXiv
An accurate, consistent, efficient, and generalizable video depth estimator, capable of supporting videos of any length.
ZeroPlane: Towards In-the-wild 3D Plane Reconstruction from a Single Image
Jiachen Liu*, Rui Yu*, Sili Chen, Sharon X Huang, Hengkai Guo†
CVPR, 2025 (Highlight Presentation)
code / arXiv
A zero-shot monocular plane estimator over diverse domains.
MonoPlane: Exploiting Monocular Geometric Cues for Generalizable 3D Plane Reconstruction
Wang Zhao, Jiachen Liu, Sheng Zhang, Yishu Li, Sili Chen, Sharon X Huang, Yong-Jin Liu, Hengkai Guo†
IROS, 2024 (Oral Presentation)
code (Coming soon) / arXiv
Leverage pretrained depth and normal models to facilitate zero-shot monocular plane reconstruction.
Lazy Visual Localization via Motion Averaging
Siyan Dong*, Shaohui Liu*, Hengkai Guo, Baoquan Chen, Marc Pollefeys
arXiv, 2024
arXiv
Integrate motion averaging into visual localization to eliminate the need for pre-built 3D maps.
ParticleSfM: Exploiting Dense Point Trajectories for Localizing Moving Cameras
Wang Zhao, Shaohui Liu, Hengkai Guo†, Wenping Wang, Yong-Jin Liu
ECCV, 2022
project page / arXiv / code
A video structure-from-motion system with dense point trajectories that generalizes well to in-the-wild sequences with dynamic motion.
A Confidence-based Iterative Solver of Depths and Surface Normals for Deep Multi-view Stereo
Wang Zhao*, Shaohui Liu*, Yi Wei, Hengkai Guo, Yong-Jin Liu
ICCV, 2021
project page / arXiv / code
A differentiable solver for multi-view stereo that iteratively solves for per-view depth and normal maps based on a local planarity assumption.
Iterative Feature Matching for Self-supervised Indoor Depth Estimation
Yi Wei, Hengkai Guo†, Jiwen Lu, Jie Zhou
TCSVT, 2021
paper
A self-supervised depth estimation framework designed for low-texture indoor scenes, which operates without PoseNet by utilizing iterative feature matching.
GPO: Global Plane Optimization for Fast and Accurate Monocular SLAM Initialization
Sicong Du*, Hengkai Guo*†, Yao Chen, Yilun Lin, Xiangbing Meng, Linfu Wen, Fei-Yue Wang
ICRA, 2020
arXiv
A monocular initialization method for SLAM using multi-frame planar homographies.
Geometric Pretraining for Monocular Depth Estimation
Kaixuan Wang, Yao Chen, Hengkai Guo, Linfu Wen, Shaojie Shen
ICRA, 2020
paper / code
Use a self-supervised flow pretraining task to improve monocular depth estimation.
Pose Guided Structured Region Ensemble Network for Cascaded Hand Pose Estimation
Xinghao Chen, Guijin Wang, Hengkai Guo, Cairong Zhang
Neurocomputing, 2019
arXiv / code
Enhance RGB-D-based 3D hand pose estimation by fusing features from initial pose guidance, estimated through a region ensemble network.
Two-Stream Binocular Network: Accurate Near Field Finger Detection Based On Binocular Images
Yi Wei, Guijin Wang, Cairong Zhang, Hengkai Guo, Xinghao Chen, Huazhong Yang
VCIP, 2017 (Best Student Paper Award)
arXiv / dataset
A two-stream model designed to regress the 3D positions of fingertips from binocular images.
Region Ensemble Network: Improving Convolutional Network for Hand Pose Estimation
Hengkai Guo, Guijin Wang, Xinghao Chen, Cairong Zhang, Fei Qiao, Huazhong Yang
ICIP, 2017
arXiv / code
A state-of-the-art RGB-D-based 3D hand pose model that utilizes a convolutional network and regional feature ensemble.
Motion Feature Augmented Recurrent Neural Network for Skeleton-based Dynamic Hand Gesture Recognition
Xinghao Chen, Hengkai Guo, Guijin Wang, Cairong Zhang, Li Zhang
ICIP, 2017
arXiv
Use a recurrent neural network with augmented motion features to recognize skeleton-based dynamic hand gestures.