LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

LVOmniBench is a new benchmark for evaluating audio-visual understanding on long-form audio-video inputs. 🌟

🔥 News

2026.03.19 🌟 We are very proud to launch LVOmniBench, the pioneering comprehensive benchmark for evaluating OmniLLMs on long audio-video understanding!

✨ LVOmniBench Introduction

Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video.

We curated a diverse collection of long videos, with durations ranging from 10 to 90 minutes and an average duration of 2,069 s (about 34 minutes). This is more than six times the temporal scale of existing benchmarks for audio-visual understanding.
We manually constructed 1,014 high-quality multiple-choice questions, which are explicitly designed to require joint reasoning across the audio and visual modalities, thereby facilitating a more comprehensive evaluation of OmniLLMs.
Each QA pair is ranked by difficulty level. Long audio-video understanding poses significant challenges for both current proprietary and open-source models!

🌰 Dataset Examples

🔮 Evaluation

We provide OmniEval, a plug-and-play evaluation framework for benchmarking OmniLLMs on LVOmniBench. It supports multi-GPU distributed inference and makes it easy to adapt new models: you only need to inherit a single base class. To evaluate your own model, follow the Adapting Your Own Model guide. See the OmniEval README for full documentation.

📍 Prompt:

The common prompt used in our evaluation follows this format:

prompt_text = (
    f"Question: {question}\n"
    f"Options:\n{options_str}\n\n"
    "Select the best answer from the options above. "
    "Directly provide the letter representing your choice (A/B/C/D) and nothing else. "
    "Do not include the full text of the option, do not provide any explanation."
)
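The snippet above leaves `question` and `options_str` undefined, so the following is a minimal, self-contained sketch of how the prompt could be filled in and how a reply could be scored. It is not taken from OmniEval: the helper names `build_prompt`, `extract_choice`, and `accuracy`, the `"A. text"` option layout, and the regex-based answer parsing are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence

LETTERS = "ABCD"  # the prompt asks for one of A/B/C/D


def build_prompt(question: str, options: Sequence[str]) -> str:
    """Fill the evaluation prompt; the 'A. text' option layout is an assumption."""
    options_str = "\n".join(
        f"{letter}. {text}" for letter, text in zip(LETTERS, options)
    )
    prompt_text = (
        f"Question: {question}\n"
        f"Options:\n{options_str}\n\n"
        "Select the best answer from the options above. "
        "Directly provide the letter representing your choice (A/B/C/D) and nothing else. "
        "Do not include the full text of the option, do not provide any explanation."
    )
    return prompt_text


def extract_choice(response: str) -> Optional[str]:
    """Heuristic (hypothetical) parser: first standalone uppercase A/B/C/D, else None."""
    match = re.search(r"\b([ABCD])\b", response)
    return match.group(1) if match else None


def accuracy(predictions: Sequence[Optional[str]], answers: Sequence[str]) -> float:
    """Fraction of questions answered correctly; unparsed replies count as wrong."""
    if not answers:
        return 0.0
    correct = sum(pred == gold for pred, gold in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    prompt = build_prompt(
        "What sound accompanies the final scene?",
        ["Rain", "Music", "Speech", "Silence"],
    )
    print(prompt)
    print(extract_choice("(C)."))
```

The parser is deliberately simple: a reply that begins with the English article "A" would be misread as choice A, so a stricter rule may be preferable if a model tends to ignore the "letter only" instruction.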

📍 Leaderboard:

If you want to add your results to our LVOmniBench leaderboard, please contact us at taokeda@westlake.edu.cn.

🏆 Experimental Results

Evaluation results of different OmniLLMs.

Evaluation results across different task types.

🌍 Citation

If you find our work helpful for your research, please consider citing it.

@article{tao2026lvomnibench,
  title={LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs},
  author={Tao, Keda and Zheng, Yuhua and Xu, Jia and Du, Wenjie and Shao, Kele and Wang, Hesong and Chen, Xueyi and Jin, Xin and Zhu, Junhan and Yu, Bohan and others},
  journal={arXiv preprint arXiv:2603.19217},
  year={2026}
}
