Hi there! I am Teng Wang, a senior researcher at Tencent ARC Lab, where I focus on multimodal foundation models.
I obtained my Ph.D. in Computer Science from MMLab at the University of Hong Kong (HKU), supervised by Prof. Ping Luo and Prof. Feng Zheng.
Prior to that, I received my B.E. and M.E. degrees from Sun Yat-sen University (SYSU) under the supervision of Prof. Huicheng Zheng.
Prospective Interns:
We are actively seeking self-motivated research interns to work on advanced multimodal models for video understanding and embodied scenarios. Please feel free to reach out via email.
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs. Jun Zhang, Teng Wang+, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang. CVPR, 2026.
AudioStory: Generating Long-Form Narrative Audio with Large Language Models. Yuxin Guo, Teng Wang+, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou, Ying Shan. CVPR, 2026.
R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios. Lu Zhu, Tiantian Geng, Yangye Chen, Teng Wang, Ping Lu, Feng Zheng. AAAI, 2026.
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model. Yatai Ji, Teng Wang+, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, Ping Luo. arXiv, 2025.
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts. Yuying Ge*, Yixiao Ge*, Chen Li*, Teng Wang*, Junfu Pu*, Yizhuo Li*, Lu Qiu*, et al. arXiv, 2025.
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries. Junfu Pu*, Teng Wang*, Yixiao Ge, Yuying Ge, Chen Li, Ying Shan. arXiv, 2025.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? Junhao Cheng, Yuying Ge+, Teng Wang+, Yixiao Ge, Jing Liao, Ying Shan. arXiv, 2025.
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation. Haokun Lin*, Teng Wang*, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan. arXiv, 2025.
Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors. Xiangchen Wang*, Jinrui Zhang*, Teng Wang*, Haigang Zhang, Feng Zheng. EMNLP, 2025. (Oral Presentation)
Sample Then Identify: A General Framework for Risk Control and Assessment in Multimodal Large Language Models. Qingni Wang, Tiantian Geng, Zhiyuan Wang, Teng Wang, Bo Fu, Feng Zheng. ICLR, 2025. (Spotlight)
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos. Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng. CVPR, 2025.
Video Understanding with Large Language Models: A Survey. Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, et al. IEEE TCSVT, 2025. (2k GitHub stars)
UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization. Tiantian Geng, Teng Wang, Yanfu Zhang, Jinming Duan, Weili Guan, Feng Zheng. IEEE TPAMI, 2025.
Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models. Jinrui Zhang, Teng Wang, Haigang Zhang, Ping Lu, Feng Zheng. ECCV, 2024.
Two in One Go: Single-Stage Emotion Recognition with Decoupled Subject-Context Transformer. Xinpeng Li, Teng Wang, Jian Zhao, Shuyi Mao, Jinbao Wang, Feng Zheng, Xiaojiang Peng, Xuelong Li. ACM MM, 2024.
MCoCa: Towards Fine-Grained Multimodal Control in Image Captioning. Shijie Zhao, Teng Wang, Jinrui Zhang, Xiaowei Wang, Feng Zheng. Pattern Recognition (PR), 2024.
Caption Anything: Interactive Image Description with Diverse Multimodal Controls. Teng Wang*, Jinrui Zhang*, Junjie Fei*, Yixiao Ge, Hao Zheng, et al. arXiv, 2023. (1.7k GitHub stars)
Transferable Decoding with Visual Entities for Zero-Shot Image Captioning. Junjie Fei*, Teng Wang*, Jinrui Zhang, Zhenyu He, Chengjie Wang, Feng Zheng. ICCV, 2023.
Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos. Teng Wang*, Jinrui Zhang*, Feng Zheng, Wenhao Jiang, Ran Cheng, Ping Luo. arXiv, 2023. (Rank 1 in PIC Challenge 2022, Tracks 1 & 2)
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline. Tiantian Geng, Teng Wang, Jinming Duan, Runmin Cong, Feng Zheng. CVPR, 2023.
Accelerating Vision-Language Pretraining with Free Language Modeling. Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, Xiaohu Qie, Ping Luo. CVPR, 2023.
π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-Task Interpolation. Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo. ICML, 2023.
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix. Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo. ICML, 2022.
Show, Tell and Rephrase: Diverse Video Captioning via Two-Stage Progressive Training. Zhu Liu, Teng Wang, Jinrui Zhang, Feng Zheng, Wenhao Jiang, Ke Lu. IEEE TMM, 2022.
End-to-End Dense Video Captioning with Parallel Decoding. Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, Ping Luo. ICCV, 2021.
Event-Centric Hierarchical Representation for Dense Video Captioning. Teng Wang, Huicheng Zheng, Mingjing Yu, Qian Tian, Haifeng Hu. IEEE TCSVT, 2020.