Jiapeng Shi, Junke Wang, Zuyao You, Bo He, Zuxuan Wu✉
[📜 Paper] [📥 Model] [🤗 Dataset]
This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.
- Jan. 23, 2026: Our evaluation code is released.
- Jan. 13, 2026: Our paper and checkpoints are released.
- Install packages
conda create -n vloom python=3.10 -y
conda activate vloom
pip install --upgrade pip
pip install -r requirements.txt
- Download pretrained models
Download the following pretrained models and place them in the ./pretrained directory:
./ # project root
pretrained/
├── sam2_hiera_large.pt
├── InternVL2_5-4B
└── InternVL3-8B
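Before running anything, it can help to confirm that the three entries above are in place. The following is a minimal sketch, not part of the VideoLoom codebase: `missing_pretrained` and `EXPECTED_PRETRAINED` are hypothetical names, and the check only tests that each path exists, not that the download is complete.

```python
from pathlib import Path

# Entry names copied from the directory tree above; the checker itself is a
# hypothetical helper, not part of the VideoLoom codebase.
EXPECTED_PRETRAINED = ("sam2_hiera_large.pt", "InternVL2_5-4B", "InternVL3-8B")


def missing_pretrained(root="./pretrained"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_PRETRAINED if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_pretrained()
    if missing:
        print("Missing under ./pretrained:", ", ".join(missing))
    else:
        print("All pretrained models found.")
```

Run it from the project root; an empty result means all three entries were found.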
We provide the following models:
| Model Name | Base MLLM | Checkpoints |
|---|---|---|
| VideoLoom-4B | InternVL2.5-4B | 🤗 link |
| VideoLoom-8B | InternVL3-8B | 🤗 link |
Download the spatial datasets and VQA datasets from Sa2VA and place them in the data directory. The download link is here.
The final data structure should be like:
./ # project root
data/
├── video_datas/
| ├── mevis/
| ├── revos/
| └── rvos/
├── ref_seg/
| ├── refcoco/
| ├── refcoco+/
| └── refcocog/
├── glamm_data/
└── llava_data/
Download the temporal datasets, including Charades-STA, YouCook2 and QVHighlights, and place them in the data_time directory.
The final data structure should be like:
./ # project root
data_time/
├── TimeIT/
├── YouCook2_asr_denseCap/
| └── youcook2_6fps_224/
├── Charades/
| └── videos/
└── QVhighlights/
    └── videos/
        ├── train/
        └── val/
We provide evaluation scripts for all benchmarks evaluated in our paper. Please make sure the model and data paths are correct before running the code.
- Temporal video grounding (TVG): scripts/tvg_eval.sh
- Dense video captioning (DVC): scripts/dvc_eval.sh
- Video highlight detection (VHD): scripts/vhd_eval.sh
- Referring video object segmentation (ref-VOS): scripts/spatial_eval.sh
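To launch the four scripts from Python, a small dispatcher may be convenient. This is only a sketch: the script paths are the ones listed above, while the task keys, `eval_command` and `run_eval` are hypothetical. It also assumes each script is run with `bash` from the project root and takes no arguments, which the scripts themselves may not match.

```python
import subprocess

# Script paths are the ones listed above; the task keys are hypothetical labels.
EVAL_SCRIPTS = {
    "tvg": "scripts/tvg_eval.sh",
    "dvc": "scripts/dvc_eval.sh",
    "vhd": "scripts/vhd_eval.sh",
    "ref-vos": "scripts/spatial_eval.sh",
}


def eval_command(task):
    """Build the command for one task; raises KeyError for an unknown task."""
    return ["bash", EVAL_SCRIPTS[task]]


def run_eval(task):
    """Run one evaluation script from the project root and return its exit code."""
    # Assumption: the script needs no command-line arguments.
    return subprocess.run(eval_command(task), check=False).returncode
```

For example, `run_eval("tvg")` would execute `bash scripts/tvg_eval.sh` and return its exit status.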
- Release our checkpoints
- Release our evaluation code
- Release LoomData
- Release our training code
- Release LoomBench
If you find our work helpful, please consider giving the repository a star ⭐ and citing our paper 📝:
@article{shi2026videoloom,
title={VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding},
author={Shi, Jiapeng and Wang, Junke and You, Zuyao and He, Bo and Wu, Zuxuan},
journal={arXiv preprint arXiv:2601.07290},
year={2026}
}
Feel free to contact us if you have any questions or suggestions:
- Email (Jiapeng Shi): jpshi1212@gmail.com
We built our codebase on Sa2VA, TimeChat and LITA. Thanks to their authors for these wonderful projects.
