Shilin Yan1†, Jiaming Han2, Joey Tsai3, Hongwei Xue1, Rongyao Fang2,
Lingyi Hong, Ziyu Guo2, Ray Zhang2‡
1Accio Team, Alibaba Group 2CUHK MMLab 3Tsinghua University
†Project Leader ‡Corresponding author
- [2025-05-23] 🔥🔥🔥 We release the paper (arXiv:2505.17020).
We present CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism, substantially reducing the number of visual tokens with minimal performance degradation. Specifically, we first apply a pooling method to significantly reduce the tokens produced by the pretrained visual encoder. Then, within the LLM layers, we employ a visual-to-visual cross-attention mechanism, in which the pooled visual tokens serve as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which the text tokens are enhanced through interaction with the original visual tokens, enriching their visual comprehension.
- Significantly compresses the number of tokens produced by pretrained visual encoders for efficient representation.
- Applies a simple pooling strategy to retain critical visual information while reducing token count.
- Novel architecture design: pooled visual tokens act as queries attending over the original visual token set (see the sketch after this list).
- Enables the model to capture fine-grained visual details, maintaining fidelity even under strong token compression.
- Enhances text token representations through interaction with the original visual tokens.
- Deepens text-visual alignment, offering richer contextual understanding for multimodal downstream tasks.
- The dual cross-attention mechanism maximizes model efficiency while preserving the ability to handle long-form video content.
- Achieves a strong balance between computational efficiency and fine-grained multimodal understanding, empowering advanced video-language applications.
This architecture enables efficient and scalable video-text modeling while maintaining state-of-the-art accuracy.
We would like to thank LLaVA-NeXT, upon which our repo is built.
@article{yan2025crosslmm,
  title={CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms},
  author={Yan, Shilin and Han, Jiaming and Tsai, Joey and Xue, Hongwei and Fang, Rongyao and Hong, Lingyi and Guo, Ziyu and Zhang, Ray},
  journal={arXiv preprint arXiv:2505.17020},
  year={2025}
}
If you have any questions about this project, please feel free to contact [email protected].

