Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
NeurIPS 2025 Poster
Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied AI.
We formalize the Multi-Scale Temporal Prediction (MSTP) task in surgical and general scenes
and propose Incremental Generation and Multi-agent Collaboration to tackle it.
git clone https://github.com/jinlab-imvr/MSTP.git
cd MSTP/LLaMA-Factory
# Create and activate the environment
conda create -n mstp python=3.10 -y
conda activate mstp
# Install core dependencies
pip install wheel
pip install -e ".[torch,metrics]" --no-build-isolation
# (Optional) Choose transformers version by model family
# For Qwen2.5-VL series pretrained models:
pip install transformers==4.51
# For InternVL3 and gemma-3 series pretrained models:
pip install transformers==4.52
# Additional requirements
pip install -r requirements.txt
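As a quick sanity check (a suggestion, not part of the repository's own instructions), you can confirm that PyTorch and the transformers version you chose above are importable:

```bash
# Optional sanity check: print the installed torch and transformers versions.
# Expect transformers 4.51.x for Qwen2.5-VL or 4.52.x for InternVL3 / gemma-3.
python -c "import torch, transformers; print('torch', torch.__version__, '| transformers', transformers.__version__)"
```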
The dataset provided in the paper can be downloaded for verification.
- We use 8 video frames from the GraSP dataset for training and 4 video frames for testing.
- Please run make_augment_all.py to perform data augmentation (see the sketch below).
- If you want to obtain the processed labels for the MSTP task used in the paper:
  - Fill out this form to obtain the download link.
  - After downloading, extract the compressed file to LLaMA-Factory/data/.
If you want to customize the dataset, please refer to the data instructions.
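For reference, here is a minimal shell sketch of the preparation steps above. The arguments to make_augment_all.py and the archive name (mstp_labels.zip) are assumptions; adjust them to match your download and the data instructions.

```bash
# Augment the GraSP frames (script arguments, if any, are not documented here; check the data instructions).
python make_augment_all.py

# Extract the processed MSTP labels (archive name is a placeholder) into LLaMA-Factory/data/.
unzip mstp_labels.zip -d LLaMA-Factory/data/
```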
Download the pretrained visual generation model weights and place them in the pretrained directory:
| Model | HuggingFace |
|---|---|
| SD3.5 Large | ioky/SD3.5_large |
| SD3.5 Medium | ioky/SD3.5_medium |
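One way to fetch these weights is with huggingface-cli; the target subfolders under pretrained/ are assumptions, so use whatever layout your configs expect:

```bash
# Download the SD3.5 visual generation weights into the pretrained directory (subfolder names assumed).
huggingface-cli download ioky/SD3.5_large --local-dir pretrained/SD3.5_large
huggingface-cli download ioky/SD3.5_medium --local-dir pretrained/SD3.5_medium
```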
Download the LoRA weights of the pretrained decision-making models and place them in the LoRA directory:
| Model | HuggingFace |
|---|---|
| Qwen2.5-VL-7B-Instruct | ioky/Qwen2.5-VL-7B-Instruct |
| InternVL3-8B-hf | ioky/InternVL3-8B-hf |
| gemma-3-4b-it | ioky/gemma-3-4b-it |
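Likewise for the decision-making LoRA weights; the subfolders under LoRA/ are assumptions:

```bash
# Download the LoRA weights of the decision-making models into the LoRA directory (subfolder names assumed).
huggingface-cli download ioky/Qwen2.5-VL-7B-Instruct --local-dir LoRA/Qwen2.5-VL-7B-Instruct
huggingface-cli download ioky/InternVL3-8B-hf --local-dir LoRA/InternVL3-8B-hf
huggingface-cli download ioky/gemma-3-4b-it --local-dir LoRA/gemma-3-4b-it
```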
Temporal Prediction via Incremental Generation
| Decision-making Model | Command |
|---|---|
| Qwen2.5-VL-7B-Instruct | python TP_IG.py --cir 5 --time 1 --start 0 --end 200 --data_dir dir_to_dataset --sd_model large --mode test --model_name Qwen2.5-VL-7B-Instruct |
| gemma-3-4b-it | python TP_IG.py --cir 5 --time 1 --start 0 --end 200 --data_dir dir_to_dataset --sd_model large --mode test --model_name gemma-3-4b-it |
| InternVL3-8B-hf | python TP_IG.py --cir 5 --time 1 --start 0 --end 200 --data_dir dir_to_dataset --sd_model large --mode test --model_name InternVL3-8B-hf |
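To run all three decision-making models back to back, the commands in the table can be wrapped in a simple loop (dir_to_dataset remains a placeholder for your dataset path):

```bash
# Run incremental-generation temporal prediction with each decision-making model in turn.
for MODEL in Qwen2.5-VL-7B-Instruct gemma-3-4b-it InternVL3-8B-hf; do
  python TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
    --data_dir dir_to_dataset --sd_model large --mode test \
    --model_name "$MODEL"
done
```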
To fine-tune the SD3.5-based visual generation model, please refer to the official Stable Diffusion 3.5 fine-tuning guide.
This project uses LoRA to train the decision-making models.
Step 1 — Train:
llamafactory-cli train examples/train_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
Step 2 — Export (merge LoRA):
llamafactory-cli export examples/merge_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
Step 3 — Predict (generate decision-making model results in batches):
llamafactory-cli train examples/predict/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
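The three steps are meant to run in order: train the LoRA adapter, merge it into the base model, then generate predictions in batches. Below is a minimal sketch chaining the example configs above; the paths are the Qwen2.5-VL configs at the chain1_1s setting, and other model families or time scales presumably have analogous YAML files under examples/.

```bash
# Full LoRA pipeline for Qwen2.5-VL at the chain1_1s setting; stop on the first failure.
set -e
llamafactory-cli train  examples/train_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
llamafactory-cli export examples/merge_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
llamafactory-cli train  examples/predict/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
```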
If you find this code useful for your research, please cite:
@misc{zeng2025multiscaletemporalpredictionincremental,
title = {Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration},
author = {Zhitao Zeng and Guojian Yuan and Junyuan Mao and Yuxuan Wang and Xiaoshuang Jia and Yueming Jin},
year = {2025},
eprint = {2509.17429},
  archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2509.17429},
}