Proposed VoCo-LLaMA, an attention-distilled video token compression method enabling video LLMs to train on and run inference over million-token (1+ hour) videos within a 4k-context LLM.
Proposed ATP-LLaVA, an efficient MLLM that performs adaptive instance-wise and decoder-layer-wise token pruning with negligible performance degradation.
Proposed LAVT, a Transformer-based universal framework for referring image and video segmentation (RIS and RVOS) that performs language-aware visual encoding in place of cross-modal fusion after feature extraction.
Selected Honors and Awards
Nanhu Elite Scholarship of Tsinghua University, 2025. (Tsinghua University Comprehensive Excellence Scholarship, University-Level First Prize)
Zhaoyi Scholarship of Tsinghua University, 2024. (Tsinghua University Comprehensive Excellence Scholarship, University-Level First Prize)
First Prize Scholarship of Tongji University, 2023. (Tongji University Comprehensive Excellence Scholarship, University-Level First Prize)
Second Prize Scholarship of Tongji University, 2021, 2022. (Tongji University Comprehensive Excellence Scholarship, University-Level Second Prize)
Industrial Experience
ByteDance Seed Application, Beijing, China. December 2024 - April 2025.