I am a lyric poet, a passionate lover of life, and a PhD candidate at the Institute of Automation, Chinese Academy of Sciences (CASIA).
I'm supervised by Prof. Xiaolong Zheng. My research interests include embodied AI and computer vision.
We are excited to introduce RoboBrain2.0, the most powerful open-source embodied brain model to date. Compared to its predecessor, RoboBrain1.0, our latest version significantly advances multi-agent task planning, spatial reasoning, and closed-loop execution.
We developed RoboBrain, a VLM-based model that combines robotic and general multi-modal data, employs a multi-stage training strategy,
and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves SOTA performance across various robotic tasks,
highlighting its potential to advance robotic brain capabilities.
We developed Reason-RFT, a novel reinforcement fine-tuning framework that enhances visual reasoning capabilities in Vision-Language Models (VLMs).
Reason-RFT employs a two-phase training strategy: (1) supervised fine-tuning (SFT) with curated chain-of-thought (CoT) data to activate reasoning potential, followed by
(2) Group Relative Policy Optimization (GRPO)-based reinforcement learning to generate diverse reasoning-response pairs.
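As a rough illustration of the second phase, the sketch below shows how GRPO-style group-relative advantages can be computed for a group of responses sampled for the same prompt; the function name and reward values are illustrative, not taken from the Reason-RFT implementation.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each sampled response's reward by the
    mean and standard deviation of its group (all samples for one prompt)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled reasoning-response pairs for one prompt, scored by a
# task-specific reward (values are illustrative).
print(group_relative_advantages([1.0, 0.0, 1.0, 0.5]))
```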
We present RoboOS, a unified memory-based framework for multi-robot collaboration. At its core, the Spatio-Temporal–Embodiment Memory (STEM) integrates spatial, temporal, and embodiment information to support long-horizon learning, heterogeneous coordination, and fault recovery. Experiments on diverse tasks show that RoboOS enables lifelong, scalable, and robust collaboration.
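To make the memory idea concrete, here is a hypothetical sketch of what a single spatio-temporal-embodiment record could contain; the `MemoryEntry` class and its fields are my own illustration, not RoboOS's actual schema.

```python
from dataclasses import dataclass, field
import time

@dataclass
class MemoryEntry:
    """One hypothetical record combining embodiment, spatial, and temporal
    information about an observation."""
    robot_id: str       # embodiment: which robot wrote the entry
    pose: tuple         # spatial: (x, y, z, yaw) in a shared map frame
    observation: dict   # e.g. {"object": "mug", "state": "on_table"}
    timestamp: float = field(default_factory=time.time)

# A shared memory can then be an append-only log that any robot queries by
# object, region, or time window, e.g. when recovering from a failure.
memory: list[MemoryEntry] = []
memory.append(MemoryEntry("arm_01", (1.2, 0.4, 0.0, 90.0),
                          {"object": "mug", "state": "on_table"}))
```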
VisualTrans is the first real-world benchmark for Visual Transformation Reasoning (VTR), evaluating spatial, procedural, and quantitative reasoning across 12 human-object interaction tasks. While current models perform well on static tasks, they show significant limitations in dynamic, multi-step reasoning, revealing critical gaps in temporal and causal understanding for intelligent systems.
MathSticks is a benchmark for Visual Symbolic Compositional Reasoning (VSCR) that unifies visual perception, symbolic manipulation, and arithmetic consistency. Each task presents an incorrect matchstick equation in a seven-segment style. The goal is to move exactly one or two sticks—under strict stick-conservation and digit-legibility constraints—to make the equation mathematically correct.
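For intuition, here is a minimal sketch of the stick-conservation and correctness checks; the seven-segment digit counts are standard, while the operator counts, helper names, and example equation are illustrative assumptions rather than the benchmark's official evaluation code.

```python
# Seven-segment stick counts per digit; '+', '-', '=' use 2, 1, and 2 sticks
# in the usual matchstick convention (operator counts are assumptions).
STICKS = {'0': 6, '1': 2, '2': 5, '3': 5, '4': 4, '5': 5,
          '6': 6, '7': 3, '8': 7, '9': 6, '+': 2, '-': 1, '=': 2}

def stick_count(expr: str) -> int:
    """Total number of matchsticks needed to draw an equation string."""
    return sum(STICKS[ch] for ch in expr if ch in STICKS)

def is_valid_fix(before: str, after: str) -> bool:
    """A candidate fix must conserve the total stick count (a necessary
    condition for moving, rather than adding or removing, sticks) and make
    the equation arithmetically true."""
    if stick_count(before) != stick_count(after):
        return False
    lhs, rhs = after.split('=')
    return eval(lhs) == eval(rhs)   # simple arithmetic check for the sketch

print(is_valid_fix("6+4=4", "0+4=4"))   # True: one stick moves, 0 + 4 == 4
```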
Synthesizing over 1,200 publications, this survey fundamentally restructures the landscape of robotic manipulation with a unified taxonomy for planning and control. Critically, we also provide the first systematic dissection of the key bottlenecks (data, utilization, and generalization) poised to define the next era of progress.
VLMs enhance robotic manipulation but rely on costly annotated data, limiting OOD adaptability. We propose ManipLVM-R1, a framework based on Reinforcement Learning with Verifiable Rewards (RLVR) that replaces costly supervision by directly optimizing task outcomes for better generalization. Two rule-based rewards drive physical reasoning, achieving strong performance with only 50% of the training data and in OOD scenarios.
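To illustrate what rule-based verifiable rewards can look like, here is a generic sketch with a format rule and an IoU-based outcome rule; the tags, reward components, and equal weighting are assumptions and not necessarily ManipLVM-R1's exact reward design.

```python
import re

def format_reward(response: str) -> float:
    """Rule 1 (illustrative): reasoning must be wrapped in <think> tags and
    the final answer in <answer> tags; checkable without human labels."""
    pattern = r"\s*<think>.*</think>\s*<answer>.*</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0

def box_iou(pred, gt) -> float:
    """Rule 2 (illustrative): IoU between predicted and ground-truth
    [x1, y1, x2, y2] boxes, rewarding spatially grounded answers."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union if union > 0 else 0.0

def total_reward(response, pred_box, gt_box) -> float:
    # Equal weights are an assumption, not the paper's setting.
    return 0.5 * format_reward(response) + 0.5 * box_iou(pred_box, gt_box)
```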
EgoPrompt is a prompt-learning framework for egocentric action recognition that jointly models verbs and nouns by capturing their semantic relationships. It introduces a Unified Prompt Pool and a Diverse Pool Criteria to encourage rich, disentangled representations. EgoPrompt achieves state-of-the-art performance on Ego4D, EPIC-Kitchens, and EGTEA across various generalization benchmarks.
Spatiotemporal Graph Learning (SGL) under Zero-Inflated Distribution (ZID) is vital for urban risk management but is susceptible to adversarial attacks.
Traditional adversarial training (AT) increases performance disparities between classes.
We propose the MinGRE framework to mitigate these disparities, yielding models that are both more equitable and more robust.
We propose AdvLoRA, a parameter-efficient adversarial adaptation method that uses low-rank adaptation to improve the adversarial robustness of vision-language models.
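The underlying low-rank adaptation idea can be sketched as follows (a minimal PyTorch illustration; AdvLoRA's specific initialization and adversarial training objective are not reproduced here).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with only A and B updated."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # keep the pretrained weights frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

During adversarial adaptation, perturbed inputs would pass through the frozen backbone while only `lora_A` and `lora_B` receive gradients, which keeps the number of trainable parameters small.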
Road Surface Reconstruction (RSR) is crucial for autonomous driving, enabling the understanding of road surface conditions.
Traditional methods for transforming perspective views into bird's-eye-view (BEV) representations face challenges such as information loss and representation sparsity.
We present two innovative BEV-based RSR models, FastRSR-mono and FastRSR-stereo, which offer superior efficiency and accuracy and achieve state-of-the-art results in elevation absolute error and processing speed.
This paper enhances the robustness of HD map construction via data augmentation, a new fusion module, and modality dropout. It improves performance under sensor corruptions and achieves SOTA accuracy on nuScenes.
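As a generic illustration of modality dropout (not the paper's exact implementation), one can randomly suppress one sensor's features during training so the fusion module learns to tolerate missing or corrupted inputs:

```python
import torch

def modality_dropout(cam_feats: torch.Tensor, lidar_feats: torch.Tensor,
                     p_drop: float = 0.25, training: bool = True):
    """Randomly zero out one modality's features during training so fusion
    cannot over-rely on a single sensor (the drop rate is illustrative)."""
    if training and torch.rand(1).item() < p_drop:
        if torch.rand(1).item() < 0.5:
            cam_feats = torch.zeros_like(cam_feats)
        else:
            lidar_feats = torch.zeros_like(lidar_feats)
    return cam_feats, lidar_feats
```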
This work introduces the Multi-Sensor Corruption Benchmark (MSC-Bench),
the first comprehensive benchmark for evaluating the robustness of multi-sensor autonomous driving perception models against various sensor corruptions.
We propose a cross-modal hashing framework called CCMH (CLIP-based Cross-Modal Hashing), which enables transferring a well-trained real-valued semantic subspace to a hash semantic subspace.
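The general pattern of mapping real-valued features to binary codes can be sketched as below; the layer size, the `HashHead` name, and the tanh relaxation of sign() are common conventions in deep hashing rather than CCMH's exact design.

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Maps a real-valued semantic embedding (e.g. a CLIP image or text
    feature) to an n_bits code; tanh is a differentiable surrogate for
    sign() during training."""
    def __init__(self, dim: int = 512, n_bits: int = 64):
        super().__init__()
        self.proj = nn.Linear(dim, n_bits)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.proj(z))       # relaxed codes in (-1, 1)

    @torch.no_grad()
    def binarize(self, z: torch.Tensor) -> torch.Tensor:
        return torch.sign(self.forward(z))    # {-1, +1} codes for retrieval
```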