I'm a third-year Ph.D. Candidate in MMLab at The Chinese University of Hong Kong, supervised by Prof. Dahua Lin. I work on scene-level visual creation and manipulation using structured representations and multimodal generative models.
In my research, I consistently leverage the inherent structure of visual data.
In video understanding, I exploit the natural temporal and semantic structure of videos to enable efficient perception.
In visual scene generation, I study how structured scene representations support reliable and controllable creation and manipulation of visual content.
We show, in the context of geometric object transformations in scene images, that RL is better suited than SFT for solving visual editing problems with verifiable rewards, in terms of both data efficiency and editing accuracy.
Imagine360 lifts standard perspective video into 360-degree video with rich and structured motion, unlocking a dynamic scene experience across the full 360 degrees.
LayerPano3D generates a full-view, explorable panoramic 3D scene from a single text prompt.
Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection
Jing Tan, Yuhong Wang, Gangshan Wu, Limin Wang
T-PAMI, 2023
arXiv / code / blog
We present Temporal Perceiver (TP), a general architecture based on Transformer decoders as a unified solution to detect arbitrary generic boundaries, including shot-level, event-level and scene-level temporal boundaries.
PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points
Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, Limin Wang
NeurIPS, 2022
arXiv / code / blog
PointTAD effectively tackles multi-label TAD by introducing a set of learnable query points to represent the action keyframes.
A dual-level query-based TAD framework that precisely detects actions at both the instance level and the boundary level.
Experience
AWS Agentic AI
Applied Scientist Intern, June 2025 – Nov. 2025, Bellevue, WA, US
Conducted research in collaboration with Prof. Zhuowen Tu and Prof. Jiajun Wu on reinforcement learning–guided spatial-aware image editing, with Yantao Shen as the internship manager.
Leveraged RLVR to enable precise geometric object transformations in images following text instructions.
Tencent, PCG
Research Intern, Dec. 2021 – Mar. 2023, Beijing, China
Worked on multi-label temporal action detection via learnable query points.
Developed a more general video action detection framework capable of localizing specific actions among multiple simultaneous actions.
Selected Honors & Awards
CSIG Master’s Thesis Incentive Program Awardee, China Society of Image and Graphics, 2025
Outstanding Master’s Thesis Award (3/226), Nanjing University, 2024
National Scholarship (6/226), Ministry of Education of China, 2022