I am Xiangtai Li. I work on computer vision, multi-modal learning, and related problems.
I am working as a Research Scientist at TikTok (ByteDance), Singapore.
Our team works on applications and research for TikTok Live. Topics cover multi-modal large language models, diffusion models, and LLM reasoning.
Previously, I worked as a Research Fellow at MMLab@NTU, S-Lab, advised by Prof. Chen Change Loy.
I obtained my PhD degree from Peking University (PKU) under the supervision of Prof. Yunhai Tong, and my bachelor's degree from Beijing University of Posts and Telecommunications (BUPT).
My research focuses on two main directions:
- Multi-modal learning with LLMs (MLLM): unified modeling, benchmarking, dataset pipeline building, and RL-based post-training.
- Image/video generation and editing, including controllable image/video generation.

Previously, I worked on image/video segmentation and detection, and open-vocabulary learning.
Moreover, the code and models for about 98% of my work, including the projects I have contributed to deeply, are open-sourced on GitHub.
I serve as a regular reviewer for many conferences and journals, including CVPR, ICCV, ECCV, ICLR, AAAI, NeurIPS, ICML, IJCAI, IEEE-TIP, IEEE-TPAMI, IJCV, IEEE-TSCVT, IEEE-TMM, and IEEE-TGRS.
I also serve as an Area Chair for ICLR-2025/2026, CVPR-2026, ICML-2025, ICCV-2025, NeurIPS-2025, AAAI-2025/2026, WACV-2026, and ECCV-2026.
In addition, I serve as an Associate Editor for T-PAMI.
I am looking for interns with strong coding skills and an interest in MLLMs, diffusion models, and diffusion language models.
My email addresses are [email protected] and [email protected]. Feel free to reach out for a discussion.
📝 Publications
* means equal contribution.
Recent Works
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence,
Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang
Tech Report The first non-agentic spatio-temporal RL post-training framework. | Code Data
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs,
Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang
Tech Report SOTA region captioning models. | Code Model and Data
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation,
Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li
Tech Report The first unified parallel generation model | Code Model and Data
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World,
Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi
Tech Report A strong data engine for dense grounded captioning. | Code Data
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer,
Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, Zilong Huang
ICCV 2025 Scaling Single Transformer backbone training for both VLM and vision tasks. | Code
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos,
Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
Marrying SAM2 with LLaVA-like MLLM for open-ended spatial temporal understanding. | Project Page
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding,
Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng
The simplest architecture for pixel-grounding tasks. | Code
Several Other Recent Works
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs,
Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Xiangtai Li, Lu Qi
ICCV 2025 The first MLLM visual matching benchmark and a simple contrastive token solution. | Project Page
DreamRelation: Bridging Customization and Relation Generation,
Qingyu Shi, Lu Qi, Jianzong Wu, Jinbin Bai, Jingbo Wang, Yunhai Tong, Xiangtai Li
CVPR 2025 (oral) The first relation-aware customization approach. | Github
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation,
Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xiangtai Li, Yunhai Tong
CVPR 2025 The first MLLM-based generation method for customized manga generation. | Project Page
RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything,
Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang
ICLR 2025 (oral) A new real-time multi-task segmentation setting, a benchmark, and a simple, efficient baseline. | Github
OMG-Seg: Is One Model Good Enough For All Segmentation?,
Xiangtai Li, Haobo Yuan, Wei Li, Henghui Ding, Size Wu, Wenwei Zhang, Yining Li, Kai Chen, Chen Change Loy
CVPR 2024 One model to perform image/video/open-vocabulary/multi-dataset/interactive segmentation in one shot. | Project Page
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding,
Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
NeurIPS 2024 The first end-to-end MLLM that unifies image-level, object-level, pixel-level understanding and reasoning. | Github
MotionBooth: Motion-Aware Customized Text-to-Video Generation,
Jianzong Wu, Xiangtai Li, Yanhong Zeng, Jiangning Zhang, Qianyu Zhou, Yining Li, Yunhai Tong, Kai Chen
NeurIPS 2024 (spotlight) The first customized T2V with motion control. | Github
Several Previous Works
Transformer-Based Visual Segmentation: A Survey,
Xiangtai Li, Henghui Ding, Haobo Yuan, Wenwei Zhang, Jiangmiao Pang, Guangliang Cheng, Kai Chen, Ziwei Liu, Chen Change Loy
T-PAMI 2024 The first survey to summarize transformer-based segmentation methods from a technical perspective. (PAMI popular paper) | Github
Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation,
Xiangtai Li, Haobo Yuan, Wenwei Zhang, Guangliang Cheng, Jiangmiao Pang, Chen Change Loy
ICCV 2023 The first unified universal video segmentation model with SOTA results. | Project
TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers,
Qianyu Zhou*, Xiangtai Li*, Lu He, Yibo Yang, Guangliang Cheng, Yunhai Tong, Lizhuang Ma, Dacheng Tao
T-PAMI 2023 The first end-to-end vision transformer for video object detection, with SOTA results. | Code
Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation,
Xiangtai Li*, Wenwei Zhang*, Jiangmiao Pang*, Kai Chen, Guangliang Cheng, Yunhai Tong, Chen Change Loy,
CVPR 2022 (Oral, top 2%) The first unified video segmentation model and codebase for VPS, VIS, and VSS. | Code
Semantic Flow for Fast and Accurate Scene Parsing,
Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Yunhai Tong,
ECCV 2020 (Oral, top 2%) The first real-time model to exceed 80% mIoU on the Cityscapes test set. | Code
Code can be found here.
🎖 Honors and Awards
- National Scholarship, Ministry of Education of China, at PKU (2019-2020, 2020-2021).
- President Scholarship of PKU (2020-2021).
- Beijing Excellent Graduate (2017, 2022).
- BUPT Excellent Graduate (2017) and PKU Excellent Graduate (2022).
📖 Education
- 2017.09 - 2022.07, PhD, Peking University (PKU).
- 2013.09 - 2017.07, Bachelor's degree, Beijing University of Posts and Telecommunications (BUPT).
💬 Invited Talks
- 2024.03 Invited talk on Open-Vocabulary Segmentation and Segment Anything at VALSE, online. Slides, Video.
- 2023.08 Invited talk on Video Segmentation at VALSE, online. Slides, Video.
- 2022.05 Invited talk on Panoptic Segmentation and Beyond at the Baidu PaddleSeg Group.
- 2021.12 Invited talk on Video Segmentation at the DiDi Auto-Driving Group.
- 2021.10 Invited talk on Aligned Segmentation at the Huawei Noah Auto-Driving Group.
💻 Internships and Work Experience
- SenseTime Research, mentored by Dr. Guangliang Cheng and Dr. Jianping Shi.
- JD AI Lab (remote collaboration), mentored by Dr. Yibo Yang and Prof. Dacheng Tao.
- DeepMotion (now Xiaomi Car), mentored by Dr. Kuiyuan Yang.
- During my PhD study, I was mentored by Dr. Kuiyuan Yang, Prof. Li Zhang, Dr. Guangliang Cheng, Dr. Yibo Yang, Prof. Dacheng Tao, Prof. Zhouchen Lin, and Dr. Jiangmiao Pang.
- I previously served as a research consultant at Shanghai AI Lab, working with Dr. Yining Li, Dr. Kai Chen, Dr. Jingbo Wang, and Dr. Yanhong Zeng.