VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

Yinghao Wu*, Zhuoyan Luo*, Yiyao Yu, Zhaojian Yu, Yujiu Yang, Xiao-Ping Zhang

Tsinghua University

🔥 Updates

[2026/05/25] 🔥🔥🔥 The code is coming soon.

📖 Abstract

Despite the remarkable progress of recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on a high compression ratio for a single visual clue, together with their reliance on heuristic pruning strategies with coarse attention alignment, creates a bottleneck in the information capacity and density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception that follows the enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying visual representations from different perspectives, and then progressively compact them with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of the vanilla structure via explicit visual supervision, which facilitates the preservation of crucial information. Experimental results demonstrate the superiority of our method on complex visual tasks with only a few information-condensed tokens, effectively bridging the gap between performance and efficiency.
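Until the official code is released, the enrich-then-compact idea can be illustrated with a toy sketch. This is not the VEN-VL implementation: the function names (`enrich`, `compact`, `progressive_compact`), the linear scoring router, the fixed keep ratio, and the top-k selection are all assumptions made for illustration, standing in for the adaptive routers and visual experts described in the abstract.

```python
import math

# Toy illustration only: every name and design choice here is hypothetical
# and is NOT taken from the VEN-VL code base.

def enrich(views):
    """'Enrich' step: pool the token lists produced by several visual encoders."""
    return [token for view in views for token in view]

def router_scores(tokens, weights):
    """Assumed linear router: one scalar score per token (dot product)."""
    return [sum(t * w for t, w in zip(token, weights)) for token in tokens]

def compact(tokens, weights, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    scores = router_scores(tokens, weights)
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:keep])]

def progressive_compact(tokens, stage_weights, ratio=0.5):
    """'Compact' step: apply one router per stage, shrinking the token set each time."""
    for weights in stage_weights:
        keep = max(1, math.ceil(len(tokens) * ratio))
        tokens = compact(tokens, weights, keep)
    return tokens

if __name__ == "__main__":
    # Two hypothetical encoders, each emitting 8 tokens of dimension 4.
    view_a = [[float(i + j) for j in range(4)] for i in range(8)]
    view_b = [[float(i * j) for j in range(4)] for i in range(8)]
    tokens = enrich([view_a, view_b])
    routers = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    compacted = progressive_compact(tokens, routers, ratio=0.5)
    print(len(tokens), "->", len(compacted))  # 16 -> 4
```

In this sketch the routers are fixed weight vectors. In a trained model they would be learned, and the reconstruction supervision mentioned in the abstract is not modeled here at all.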

📗 Framework

❤️ Acknowledgement

The code in this repository is built upon several public repositories. Thanks to the wonderful work of LLaVA!

⭐️ BibTeX

If you find this work helpful, please cite:

@article{wu2026ven-vl,
  title={VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding},
  author={Wu, Yinghao and Luo, Zhuoyan and Yu, Yiyao and Yu, Zhaojian and Yang, Yujiu and Zhang, Xiao-Ping},
  journal={arXiv preprint arXiv:2605.25952},
  year={2026}
}
