
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms

Shilin Yan1†, Jiaming Han2, Joey Tsai3, Hongwei Xue1, Rongyao Fang2, Lingyi Hong, Ziyu Guo2, Ray Zhang2‡

1Accio Team, Alibaba Group  2CUHK MMLab  3Tsinghua University

†Project Leader  ‡Corresponding author


CrossLMM Framework

Paper | Introduction | Model

🔥 News

  • [2025-05-23] 🔥🔥🔥 We release the paper.

🧠 Introduction

We present CrossLMM, which decouples long video sequences from LMMs via a dual cross-attention mechanism, substantially reducing the number of visual tokens with minimal performance degradation. Specifically, we first apply a pooling methodology to significantly reduce the number of tokens produced by pretrained visual encoders. Then, within the LLM layers, we employ a visual-to-visual cross-attention mechanism in which the pooled visual tokens serve as queries against the original visual token set. This module enables more efficient token utilization while retaining fine-grained informational fidelity. In addition, we introduce a text-to-visual cross-attention mechanism, in which text tokens are enhanced through interaction with the original visual tokens, enriching their visual comprehension.
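To make the scale of the reduction concrete, here is a back-of-the-envelope token count. All numbers (frame count, patch grid, pool factor) are illustrative assumptions, not the paper's exact configuration:

```python
# Illustrative token-count arithmetic (assumed numbers, not the
# paper's configuration): spatial pooling shrinks the visual
# sequence fed to the LLM by the square of the pool factor.
frames = 64           # sampled video frames (assumed)
patches = 24 * 24     # patch tokens per frame from a ViT encoder (assumed)
pool = 4              # 4x4 spatial average pooling (assumed)

original_tokens = frames * patches             # 36864
pooled_tokens = frames * (patches // pool**2)  # 2304
print(f"{original_tokens} -> {pooled_tokens} tokens "
      f"({original_tokens // pooled_tokens}x reduction)")
```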

👀 Model

CrossLMM Architecture

🚩 Main Innovations

1. 🌟 Token Reduction via Pooling

  • Significantly compress the number of tokens from pretrained visual encoders for efficient representation.
  • Apply a simple pooling strategy to retain critical visual information while reducing token count.
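A minimal PyTorch sketch of one plausible pooling step. The function name, tensor layout, and average-pooling choice are assumptions for illustration, not the repository's exact implementation:

```python
import torch
import torch.nn.functional as F

def pool_visual_tokens(tokens: torch.Tensor, grid: int, factor: int) -> torch.Tensor:
    """Average-pool per-frame patch tokens over their 2D grid.

    tokens: (B, T, grid*grid, D) patch tokens from a pretrained encoder.
    Returns: (B, T, (grid // factor) ** 2, D) pooled tokens.
    """
    B, T, N, D = tokens.shape
    x = tokens.view(B * T, grid, grid, D).permute(0, 3, 1, 2)  # (B*T, D, H, W)
    x = F.avg_pool2d(x, kernel_size=factor)                    # spatial pooling
    return x.permute(0, 2, 3, 1).reshape(B, T, -1, D)
```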

2. 🚀 Visual-to-Visual Cross-Attention

  • Novel architecture design: Pooled visual tokens act as queries attending over the original visual token set.
  • Enables the model to capture fine-grained visual details, maintaining fidelity even under strong token compression.
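A minimal sketch of the idea in PyTorch, assuming a standard multi-head cross-attention layer with pre-normalization and a residual connection. The class name and layer details are illustrative; in CrossLMM the module sits inside the LLM layers:

```python
import torch
import torch.nn as nn

class VisualToVisualCrossAttention(nn.Module):
    """Pooled visual tokens (queries) attend over the original
    visual tokens (keys/values) to recover fine-grained detail."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, pooled: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
        # pooled:   (B, M, D) compressed tokens fed through the LLM
        # original: (B, N, D) full-resolution visual tokens, N >> M
        kv = self.norm_kv(original)
        out, _ = self.attn(self.norm_q(pooled), kv, kv)
        return pooled + out  # residual update of the pooled tokens
```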

3. 🔮 Text-to-Visual Cross-Attention

  • Enhances text token representations through interaction with the original visual tokens.
  • Deepens text-visual alignment, offering richer contextual understanding for multimodal downstream tasks.
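The text-to-visual direction mirrors the module above, with text tokens as queries over the original visual tokens. A compact functional sketch with toy shapes (all sizes assumed; learned q/k/v projections omitted for brevity):

```python
import torch
import torch.nn.functional as F

# Toy shapes (assumed): batch 2, 32 text tokens, 2304 original
# visual tokens, hidden size 1024, 8 attention heads.
B, L, N, D, H = 2, 32, 2304, 1024, 8
text = torch.randn(B, L, D)
visual = torch.randn(B, N, D)

def split_heads(x: torch.Tensor) -> torch.Tensor:
    # (B, S, D) -> (B, H, S, D // H)
    return x.view(x.shape[0], x.shape[1], H, D // H).transpose(1, 2)

# Text queries attend over visual keys/values (a real block would
# learn the projections and normalization around this call).
attended = F.scaled_dot_product_attention(
    split_heads(text), split_heads(visual), split_heads(visual))
text = text + attended.transpose(1, 2).reshape(B, L, D)  # residual enrichment
```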

🔗 Framework Benefits

  • The dual cross-attention mechanism maximizes model efficiency while preserving the ability to handle long-form video content.
  • Achieves a strong balance between computational efficiency and fine-grained multimodal understanding, empowering advanced video-language applications.

This architecture enables efficient and scalable video-text modeling while maintaining state-of-the-art accuracy.

🥳 Acknowledgements

We would like to thank LLaVA-NeXT, upon which our repo is built.

📄 Cite

@article{yan2025crosslmm,
  title={CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms},
  author={Yan, Shilin and Han, Jiaming and Tsai, Joey and Xue, Hongwei and Fang, Rongyao and Hong, Lingyi and Guo, Ziyu and Zhang, Ray},
  journal={arXiv preprint arXiv:2505.17020},
  year={2025}
}

📧 Contact

If you have any questions about this project, please feel free to contact [email protected].
