VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

Jiapeng Shi, Junke Wang, Zuyao You, Bo He, Zuxuan Wu ✉

🔎 Overview

This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this dataset, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding.

🔥 News

Jan. 23, 2026 Our evaluation code is released.
Jan. 13, 2026 Our paper and checkpoints are released.

🛠️ Setup

Install packages

conda create -n vloom python=3.10 -y
conda activate vloom
pip install --upgrade pip
pip install -r requirements.txt

Download pretrained models

Download the following pretrained models and place them in the ./pretrained directory:

./ # project root
pretrained/
├── sam2_hiera_large.pt
├── InternVL2_5-4B
└── InternVL3-8B
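
One possible way to populate this directory is sketched below. It assumes the two InternVL models are fetched from the Hugging Face Hub under the repository IDs `OpenGVLab/InternVL2_5-4B` and `OpenGVLab/InternVL3-8B` and that `huggingface_hub` is installed; neither assumption is stated in this README. The helper names are hypothetical, and `sam2_hiera_large.pt` still has to be obtained separately from the SAM 2 release.

```python
from pathlib import Path

# Assumed Hugging Face repository IDs; adjust if the project uses other sources.
ASSUMED_REPOS = {
    "InternVL2_5-4B": "OpenGVLab/InternVL2_5-4B",
    "InternVL3-8B": "OpenGVLab/InternVL3-8B",
}


def planned_downloads(root="pretrained"):
    """Map each assumed repo ID to the local directory it should land in."""
    return {repo: Path(root) / name for name, repo in ASSUMED_REPOS.items()}


def download_all(root="pretrained"):
    """Download every planned repo, skipping directories that already have content."""
    # Imported lazily so planned_downloads() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo, target in planned_downloads(root).items():
        if target.is_dir() and any(target.iterdir()):
            continue  # non-empty directory: assume it was downloaded earlier
        snapshot_download(repo_id=repo, local_dir=str(target))
```

Calling `download_all()` from the project root would place each model under `./pretrained/` with the directory names shown above.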

📦 Model Zoo

We provide the following models:

Model Name	Base MLLM	Checkpoints
VideoLoom-4B	InternVL2.5-4B	🤗 link
VideoLoom-8B	InternVL3-8B	🤗 link

🧩 Data

Download the spatial datasets and VQA datasets from Sa2VA and place them in the data directory. The download link is here.

The final directory structure should look like this:

./ # project root
data/
├── video_datas/
|   ├── mevis/
|   ├── revos/
|   └── rvos/
├── ref_seg/
|   ├── refcoco/
|   ├── refcoco+/
|   └── refcocog/
├── glamm_data/
└── llava_data/

Download the temporal datasets (Charades-STA, YouCook2, and QVHighlights) and place them in the data_time directory.

The final directory structure should look like this:

./ # project root
data_time/
├── TimeIT/
├── YouCook2_asr_denseCap/
|   └── youcook2_6fps_224/
├── Charades/
|   └── videos/
└── QVhighlights/
    └── videos/
        ├── train/
        └── val/
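
Because the evaluation scripts depend on these paths, it can help to check the layout before running anything. The following is a minimal sketch that only tests for the leaf directories listed in the two trees above; the names `EXPECTED_DIRS` and `missing_dirs` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Leaf directories from the data/ and data_time/ trees, relative to the project root.
EXPECTED_DIRS = [
    "data/video_datas/mevis",
    "data/video_datas/revos",
    "data/video_datas/rvos",
    "data/ref_seg/refcoco",
    "data/ref_seg/refcoco+",
    "data/ref_seg/refcocog",
    "data/glamm_data",
    "data/llava_data",
    "data_time/TimeIT",
    "data_time/YouCook2_asr_denseCap/youcook2_6fps_224",
    "data_time/Charades/videos",
    "data_time/QVhighlights/videos/train",
    "data_time/QVhighlights/videos/val",
]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Running `missing_dirs()` from the project root returns an empty list when every directory is in place, and otherwise lists the paths that still need to be created or downloaded.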

📈 Evaluation

We provide evaluation scripts for all benchmarks evaluated in our paper. Please make sure the model and data paths are correct before running the code.

Temporal video grounding (TVG): scripts/tvg_eval.sh.
Dense video captioning (DVC): scripts/dvc_eval.sh.
Video highlight detection (VHD): scripts/vhd_eval.sh.
Referring video object segmentation (ref-VOS): scripts/spatial_eval.sh.
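
For temporal grounding, a score such as R1@0.7 is conventionally the percentage of queries whose top-1 predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.7. The sketch below illustrates that convention only; it is not the code invoked by scripts/tvg_eval.sh, and the function names are hypothetical.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold=0.7):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)
```

For example, `recall_at_1([(0, 10), (0, 10)], [(0, 10), (5, 15)])` evaluates to 50.0: the first prediction matches exactly, while the second has an IoU of 1/3 and falls below the threshold.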

✅ Todo List

📝 Citation

If you find our work helpful, please consider giving us a star ⭐ and citing our paper 📝

@article{shi2026videoloom,
      title={VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding}, 
      author={Shi, Jiapeng and Wang, Junke and You, Zuyao and He, Bo and Wu, Zuxuan},
      journal={arXiv preprint arXiv:2601.07290},
      year={2026}
}

📧 Contact

Feel free to contact us if you have any questions or suggestions.

Email (Jiapeng Shi): jpshi1212@gmail.com

🤝 Acknowledgements

Our codebase builds on Sa2VA, TimeChat, and LITA. We thank the authors for their wonderful projects.
