Forge4D is the first feed-forward model for 4D human Gaussian reconstruction in real world metric scale, and enables novel-view and novel-time synthesis from uncalibrated sparse-view videos in an efficient streaming manner.
PanoLAM is a large avatar model for Gaussian full-head reconstruction from a single unposed image. It utilize coarse-to-fine and dual-branch frameworks that creates Gaussian full-head within a second.
LAM creates animatable Gaussian heads with one-shot images in a single forward pass, which can be reenacted and rendered on various platforms (including mobile phones) in real time.
LaMP is a language-motion pretraining model that advances text-to-motion generation, motion-text retrieval, and motion captioning through aligned language-motion representation learning.
We build a bidirectional control flow between the style and the content for stylized motion generation and enable multimodal style control including text, image, and style motions.
We introduce a 2D joint VQVAE to quantize each joint instead of all joints into tokens. A spatial-temporal modeling framework with temporal-spatial 2D masking and 2D attention is also proposed for motion generation.
We introduce a new problem: open-vocabulary 9D object pose and size estimation, a new dataset: OO3D-9D, and a new framework based on vision foundation model to tackle this problem.
A new open-set few-shot 6D object pose estimation problem:
estimating the 6D pose of an unknown object by a few support views without CAD models and extra training.
A large-scale synthesis dataset for pre-training and benchmarks for future research.
A generic full flow bidirectional fusion framework for RGBD representation learning,
applied to joint instance semantic segmentation and 3D keypoint-based 6D pose estimation.
The first deep learning 3D keypoint-based 6D pose estimation algorithm and an overall framework for joint instance semantic segmantation and 3D keypoint detection.