JPShi12/VideoLoom

VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding

Jiapeng Shi, Junke Wang, Zuyao You, Bo He, Zuxuan Wu

[📜 Paper] [📥 Model] [🤗 Dataset]

🔎 Overview

This paper presents VideoLoom, a unified Video Large Language Model (Video LLM) for joint spatial-temporal understanding. To facilitate the development of fine-grained spatial and temporal localization capabilities, we curate LoomData-8.7k, a human-centric video dataset with temporally grounded and spatially localized captions. With this, VideoLoom achieves state-of-the-art or highly competitive performance across a variety of spatial and temporal benchmarks (e.g., 63.1 J&F on ReVOS for referring video object segmentation, and 48.3 R1@0.7 on Charades-STA for temporal grounding). In addition, we introduce LoomBench, a novel benchmark consisting of temporal, spatial, and compositional video-question pairs, enabling a comprehensive evaluation of Video LLMs from diverse aspects. Collectively, these contributions offer a universal and effective suite for joint spatial-temporal video understanding, setting a new standard in multimodal intelligence.

🔥 News

  • Jan. 23, 2026 Our evaluation code is released.
  • Jan. 13, 2026 Our paper and checkpoints are released.

🛠️ Setup

  1. Install packages
conda create -n vloom python=3.10 -y
conda activate vloom
pip install --upgrade pip
pip install -r requirements.txt
  2. Download pretrained models

Download the following pretrained models and place them in the ./pretrained directory:

./ # project root
pretrained/
├── sam2_hiera_large.pt
├── InternVL2_5-4B
└── InternVL3-8B
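
Before running anything, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_pretrained` and the constant `EXPECTED_PRETRAINED` are hypothetical, and the check only tests that each entry exists, not that the download is complete or uncorrupted.

```python
from pathlib import Path

# Entries listed in the tree above under ./pretrained.
EXPECTED_PRETRAINED = ["sam2_hiera_large.pt", "InternVL2_5-4B", "InternVL3-8B"]


def missing_pretrained(root="."):
    """Return the expected entries that are absent from <root>/pretrained.

    Hypothetical helper for illustration; existence is the only thing checked.
    """
    pretrained = Path(root) / "pretrained"
    return [name for name in EXPECTED_PRETRAINED if not (pretrained / name).exists()]


if __name__ == "__main__":
    missing = missing_pretrained(".")
    if missing:
        print("Missing from ./pretrained:", ", ".join(missing))
    else:
        print("All pretrained files are in place.")
```

Run it from the project root; an empty result means all three entries were found.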

📦 Model Zoo

We provide the following models:

| Model Name | Base MLLM | Checkpoints |
| --- | --- | --- |
| VideoLoom-4B | InternVL2.5-4B | 🤗 link |
| VideoLoom-8B | InternVL3-8B | 🤗 link |

🧩 Data

Download the spatial datasets and VQA datasets from Sa2VA and place them in the data directory. The download link is here.

The final data structure should look like this:

./ # project root
data/
├── video_datas/
│   ├── mevis/
│   ├── revos/
│   └── rvos/
├── ref_seg/
│   ├── refcoco/
│   ├── refcoco+/
│   └── refcocog/
├── glamm_data/
└── llava_data/

Download the temporal datasets, including Charades-STA, YouCook2, and QVHighlights, and place them in the data_time directory.

The final data structure should look like this:

./ # project root
data_time/
├── TimeIT/
├── YouCook2_asr_denseCap/
│   └── youcook2_6fps_224/
├── Charades/
│   └── videos/
└── QVhighlights/
    └── videos/
        ├── train/
        └── val/
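
The two directory trees in this section can be checked programmatically. The sketch below is illustrative and not part of the repository: `LAYOUT`, `expected_dirs`, and `missing_dirs` are hypothetical names, and the nested dictionary simply transcribes the folders shown above. It verifies that the directories exist, not that they contain the right files.

```python
from pathlib import Path

# Directory trees transcribed from this section; nested dicts mirror nested folders.
LAYOUT = {
    "data": {
        "video_datas": {"mevis": {}, "revos": {}, "rvos": {}},
        "ref_seg": {"refcoco": {}, "refcoco+": {}, "refcocog": {}},
        "glamm_data": {},
        "llava_data": {},
    },
    "data_time": {
        "TimeIT": {},
        "YouCook2_asr_denseCap": {"youcook2_6fps_224": {}},
        "Charades": {"videos": {}},
        "QVhighlights": {"videos": {"train": {}, "val": {}}},
    },
}


def expected_dirs(tree, prefix=Path()):
    """Yield every directory path described by the nested dict, parents first."""
    for name, children in tree.items():
        path = prefix / name
        yield path
        yield from expected_dirs(children, path)


def missing_dirs(root="."):
    """Return the expected directories (relative to root) that do not exist."""
    root = Path(root)
    return [p.as_posix() for p in expected_dirs(LAYOUT) if not (root / p).is_dir()]


if __name__ == "__main__":
    for path in missing_dirs("."):
        print("missing:", path)
```

Running the script from the project root prints one line per missing directory and nothing when the layout is complete.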

📈 Evaluation

We provide evaluation scripts for every benchmark reported in our paper. Make sure the model and data paths are correct before running them.
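
For readers unfamiliar with the temporal grounding metric quoted in the overview (R1@0.7), the sketch below shows how it is commonly computed: a query counts as a hit when the top-1 predicted time span overlaps the ground truth with a temporal IoU of at least 0.7. This is an illustration of the usual definition, not the repository's evaluation script; the function names and the `>=` comparison are assumptions.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) time spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold=0.7):
    """Fraction of queries whose top-1 span reaches the IoU threshold.

    Assumes one predicted span and one ground-truth span per query,
    given in the same order.
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (5.0, 9.0)]   # made-up spans for illustration
    gts = [(0.0, 8.0), (4.0, 10.0)]
    print(recall_at_1(preds, gts, threshold=0.7))
```

In the toy example the first pair has an IoU of 0.8 and the second about 0.67, so only one of the two queries passes the 0.7 threshold.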

✅ Todo List

  • Release our checkpoints
  • Release our evaluation code
  • Release LoomData
  • Release our training code
  • Release LoomBench

📝 Citation

If you find our work helpful, please consider giving us a star ⭐ and citing our paper 📝

@article{shi2026videoloom,
      title={VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding}, 
      author={Shi, Jiapeng and Wang, Junke and You, Zuyao and He, Bo and Wu, Zuxuan},
      journal={arXiv preprint arXiv:2601.07290},
      year={2026}
}

📧 Contact

Feel free to contact us if you have any questions or suggestions.

🤝 Acknowledgements

We built our codebase on Sa2VA, TimeChat, and LITA. Thanks to the authors for their wonderful projects.

About

[ICML 2026] VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
