My research interests are computer vision and computer graphics. Most of my work aims at capturing, modeling and understanding our dynamic 3D world. This involves analyzing the geometry, motion and appearance of dynamic humans, objects and scenes.
GeoMan produces accurate and temporally stable geometric predictions (depth and surface normals) for human videos. The root-relative depth representation preserves human scale information, enabling metric depth estimation and 3D reconstruction.
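As a minimal sketch of why such a parameterization keeps metric scale (my reading, not GeoMan's released code): assuming "root-relative" means per-pixel depth expressed as an offset from a root joint's depth, metric depth is recovered by adding back the root's metric depth. All values below are placeholder numbers.

```python
import numpy as np

# Hypothetical illustration (not GeoMan's code): per-pixel depth is assumed to be
# predicted as an offset from a root joint (e.g., the pelvis), so adding back the
# root's metric depth restores a scale-preserving, metric depth map.
root_depth_m = 3.2                               # metric depth of the root joint, in meters
relative_depth = np.array([[-0.05, 0.00, 0.12],  # predicted per-pixel offsets w.r.t. the root
                           [-0.02, 0.03, 0.18]])
metric_depth = root_depth_m + relative_depth     # metric depth map
print(metric_depth)
```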
GAIA generates animation-ready Gaussian avatars by learning solely from in-the-wild image datasets. GAIA supports photorealistic novel view synthesis, independent control of identity and expression, and interactive animation and editing. GAIA is featured in the Emerging Technologies demo at SIGGRAPH 2025.
BLADE is a human mesh recovery method that accurately recovers perspective parameters from a single image. BLADE outperforms existing methods at estimating subject depth, focal parameters, 3D pose, and 2D alignment.
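For context (this is background on the problem, not BLADE's method), the small sketch below illustrates the standard pinhole projection that couples focal length and subject depth: a distant subject seen through a long focal length projects to nearly the same 2D points as a close subject seen through a short one, which is why jointly recovering depth and focal parameters from a single image is hard. The point coordinates and camera values are illustrative only.

```python
import numpy as np

def project(points_xyz, f, tz):
    """Pinhole projection of 3D points after translating them to depth tz along the optical axis."""
    X, Y, Z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2] + tz
    return np.stack([f * X / Z, f * Y / Z], axis=-1)

joints = np.array([[0.0, -0.8, 0.0], [0.2, 0.4, 0.1]])   # toy 3D body points (meters)
print(project(joints, f=1000.0, tz=3.0))    # short focal length, subject close to the camera
print(project(joints, f=5000.0, tz=15.0))   # long focal length, subject far away: similar 2D points
```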
We develop efficient representations for streamable free-viewpoint videos with dynamic Gaussians. QUEEN captures dynamic scenes at high visual quality while reducing the model size to just 0.7 MB per frame, training in under 5 seconds and rendering at ∼350 FPS. QUEEN is featured at GTC 2025 (San Jose and Paris) and SIGGRAPH 2025.
TEMPEH reconstructs 3D heads in dense semantic correspondence directly from calibrated multi-view images. It goes one step beyond ToFu: it uses self-supervised training from scans to resolve ambiguous and imperfect dense correspondences, and adds head localization in a large capture volume and occlusion-aware feature fusion.
We propose DyNeRF, a novel and compact dynamic neural radiance field that captures complex dynamic scenes and enables photorealistic 3D video synthesis from wide viewing angles and at arbitrary times. We also present the Neural 3D Video Synthesis dataset.
We propose ToFu, a framework that uses volumetric sampling to predict accurate base meshes in a consistent topology directly from multi-view images in only 0.385 seconds. ToFu also infers high-resolution skin appearance and detail maps, enabling photorealistic rendering.
Soft Rasterizer (SoftRas) is a rasterization-based differentiable renderer for 3D meshes that supports reasoning about geometry, texture, lighting conditions and camera poses from 2D images. An extended version of SoftRas has been incorporated into the PyTorch3D library.
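As a usage sketch of soft, differentiable rasterization through PyTorch3D's implementation (rather than the original SoftRas code), the example below renders a soft silhouette of a placeholder mesh and backpropagates an image-space loss to per-vertex offsets. The mesh, camera pose, image size and blending parameters are illustrative choices, not values from the paper.

```python
import math
import torch
from pytorch3d.renderer import (
    BlendParams, FoVPerspectiveCameras, MeshRasterizer, MeshRenderer,
    RasterizationSettings, SoftSilhouetteShader, look_at_view_transform,
)
from pytorch3d.utils import ico_sphere

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mesh = ico_sphere(level=2, device=device)                     # unit sphere as a stand-in mesh
verts_offset = torch.zeros_like(mesh.verts_packed(), requires_grad=True)

R, T = look_at_view_transform(dist=2.7, elev=10.0, azim=30.0)
cameras = FoVPerspectiveCameras(device=device, R=R, T=T)

blend = BlendParams(sigma=1e-4, gamma=1e-4)
raster_settings = RasterizationSettings(
    image_size=128,
    blur_radius=math.log(1.0 / 1e-4 - 1.0) * blend.sigma,    # soften face boundaries
    faces_per_pixel=50,                                       # aggregate many faces per pixel
)
renderer = MeshRenderer(
    rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
    shader=SoftSilhouetteShader(blend_params=blend),
)

# Differentiable rendering: gradients of an image-space loss reach the mesh vertices.
silhouette = renderer(mesh.offset_verts(verts_offset))[..., 3]   # alpha channel = soft silhouette
loss = silhouette.mean()
loss.backward()
print(verts_offset.grad.shape)                                # (num_verts, 3)
```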
Given a face portrait, this system corrects perspective distortion, easing subsequent face recognition and reconstruction and reducing bias in human perception.
Using a learned implicit representation for geometry, the method captures dynamic performances of human actors from very sparse camera setups (3 or 4 views), enabling high-quality volumetric videos.
We propose FLAME, a lightweight yet expressive generic face model, learned from large, high-quality datasets with an appropriate separation of identity, expression and pose. The FLAME model has been incorporated into the SMPL-X model.
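As a simplified illustration of the identity/expression separation (not the released FLAME code, which additionally applies pose-corrective blendshapes and linear blend skinning driven by jaw, neck and eye joints), the sketch below composes a template mesh with linear identity and expression blendshapes. The vertex and coefficient counts follow the published model; the basis data here is random placeholder data.

```python
import numpy as np

N_VERTS = 5023                                            # FLAME template vertex count
template = np.zeros((N_VERTS, 3))                         # mean face (placeholder data)
shape_basis = np.random.randn(N_VERTS, 3, 300) * 1e-3     # identity blendshapes (placeholder)
expr_basis = np.random.randn(N_VERTS, 3, 100) * 1e-3      # expression blendshapes (placeholder)

def flame_shape(betas, psis):
    """Linear identity + expression blendshape composition (pose-corrective
    blendshapes and skinning of the full model are omitted here)."""
    return (template
            + shape_basis @ betas      # identity component
            + expr_basis @ psis)       # expression component

verts = flame_shape(np.random.randn(300) * 0.5, np.random.randn(100) * 0.5)
print(verts.shape)  # (5023, 3)
```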
A real-time facial performance capture system from a single RGB camera that is robust to occlusions, thanks to an effective, real-time facial segmentation network.
Play4D unlocks efficient dynamic 4D neural scene reconstruction for Physical AI testing in Isaac Sim, along with compressed streaming and free-viewpoint immersive viewing on VR and light field displays.
We present novel immersive 3D experiences that allow users to move around in a streaming volumetric video, in 3D, in real time. This makes for a highly immersive viewing experience, especially when paired with 3D displays such as light field displays or virtual/mixed reality headsets.