VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning

ICLR 2026

The Hong Kong Polytechnic University · Show Lab, National University of Singapore

Abstract

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in text-based reasoning with large language models, multi-modal reasoning -- especially for videos -- remains limited.

In this work, we fill this gap by introducing VideoMind, a novel video-language agent for temporal-grounded video reasoning. Our method involves two key innovations: (1) We identify four essential capabilities for grounded video reasoning and propose a role-based agentic workflow, comprising a planner to coordinate roles, a grounder for temporal event localization, a verifier to assess event candidates, and an answerer for question answering. (2) To efficiently integrate these roles during inference, we propose a novel Chain-of-LoRA mechanism, where a unified base model with multiple LoRA adapters is leveraged to enable seamless role switching, balancing efficiency and flexibility.
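
To make the Chain-of-LoRA idea concrete, below is a minimal sketch of multi-adapter role switching using Hugging Face PEFT. It is illustrative only, not the released VideoMind implementation: the base checkpoint shown, the videomind/*-lora adapter paths, and the run_role helper are hypothetical placeholders.

# Minimal sketch of Chain-of-LoRA role switching with Hugging Face PEFT
# (illustrative; not the official VideoMind code). A single frozen backbone
# hosts one LoRA adapter per role -- planner, grounder, verifier, answerer --
# and switching roles only changes which adapter is active, so no extra
# full-size models need to be kept in memory.
import torch
from transformers import Qwen2VLForConditionalGeneration  # example multimodal backbone
from peft import PeftModel

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

# Attach the first role adapter, then register the remaining ones by name.
# The adapter repositories below are placeholders, not real checkpoints.
agent = PeftModel.from_pretrained(base, "videomind/planner-lora", adapter_name="planner")
for role in ("grounder", "verifier", "answerer"):
    agent.load_adapter(f"videomind/{role}-lora", adapter_name=role)

def run_role(role: str, inputs: dict):
    """Activate the LoRA weights for `role` and run one generation step."""
    agent.set_adapter(role)  # the "role switch": only adapter weights change
    return agent.generate(**inputs, max_new_tokens=256)

In a full pipeline, the planner is queried first to decide which roles to invoke, the grounder proposes candidate moments, the verifier scores those candidates, and the answerer answers over the selected segment; since all roles share one backbone, each hand-off only swaps lightweight adapter weights.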

Extensive experiments on 14 benchmarks spanning three tasks -- Grounded VideoQA, Video Temporal Grounding, and General VideoQA -- demonstrate the effectiveness of the proposed scheme in advancing video agents, test-time scaling, and long-form video reasoning.

What is VideoMind?

(Teaser figure)

Model Overview

Role-Specific Designs

Timestamp Decoder
Planner
Verifier

Visualization

Citation

Please cite our paper if you find this project helpful.
@article{liu2025videomind,
title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2503.13444},
year={2025}
}