Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
NeurIPS 2025 Poster
Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied AI.
We formalize the Multi-Scale Temporal Prediction (MSTP) task in surgical and general scenes
and propose Incremental Generation and Multi-agent Collaboration to tackle it.
git clone https://github.com/jinlab-imvr/MSTP.git
cd MSTP/LLaMA-Factory
# Create and activate the environment
conda create -n mstp python=3.10 -y
conda activate mstp
# Install core dependencies
pip install wheel
pip install -e ".[torch,metrics]" --no-build-isolation
# (Optional) Choose transformers version by model family
# For Qwen2.5-VL series pretrained models:
pip install transformers==4.51
# For InternVL3 and gemma-3 series pretrained models:
pip install transformers==4.52
# Additional requirements
pip install -r requirements.txt
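As a quick sanity check (a suggestion, not part of the repository's own instructions), you can confirm that PyTorch and the transformers version you chose above are importable:

```bash
# Optional sanity check: print the installed torch and transformers versions.
# Expect transformers 4.51.x for Qwen2.5-VL or 4.52.x for InternVL3 / gemma-3.
python -c "import torch, transformers; print('torch', torch.__version__, '| transformers', transformers.__version__)"
```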
The dataset provided in the paper can be downloaded for verification.
- We use 8 video frames from the GraSP dataset for training and 4 video frames for testing.
- Please run make_augment_all.py to perform data augmentation (see the sketch below).
- If you want to obtain the processed labels for the MSTP task used in the paper:
  - Fill out this form to obtain the download link.
  - After downloading, extract the compressed file to LLaMA-Factory/data/.
If you want to customize the dataset, please refer to the data instructions.
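For reference, here is a minimal shell sketch of the preparation steps above. The arguments to make_augment_all.py and the archive name (mstp_labels.zip) are assumptions; adjust them to match your download and the data instructions.

```bash
# Augment the GraSP frames (script arguments, if any, are not documented here; check the data instructions).
python make_augment_all.py

# Extract the processed MSTP labels (archive name is a placeholder) into LLaMA-Factory/data/.
unzip mstp_labels.zip -d LLaMA-Factory/data/
```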
Download the pretrained visual generation model weights and place them in the pretrained directory:
| Model | HuggingFace |
|---|---|
| SD3.5 Large | ioky/SD3.5_large |
| SD3.5 Medium | ioky/SD3.5_medium |
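One way to fetch these weights is with huggingface-cli; the target subfolders under pretrained/ are assumptions, so use whatever layout your configs expect:

```bash
# Download the SD3.5 visual generation weights into the pretrained directory (subfolder names assumed).
huggingface-cli download ioky/SD3.5_large --local-dir pretrained/SD3.5_large
huggingface-cli download ioky/SD3.5_medium --local-dir pretrained/SD3.5_medium
```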
Download the LoRA weights of the pretrained decision-making models and place them in the LoRA directory:
| Model | HuggingFace |
|---|---|
| Qwen2.5-VL-7B-Instruct | ioky/Qwen2.5-VL-7B-Instruct |
| InternVL3-8B-hf | ioky/InternVL3-8B-hf |
| gemma-3-4b-it | ioky/gemma-3-4b-it |
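Likewise for the decision-making LoRA weights; the subfolders under LoRA/ are assumptions:

```bash
# Download the LoRA weights of the decision-making models into the LoRA directory (subfolder names assumed).
huggingface-cli download ioky/Qwen2.5-VL-7B-Instruct --local-dir LoRA/Qwen2.5-VL-7B-Instruct
huggingface-cli download ioky/InternVL3-8B-hf --local-dir LoRA/InternVL3-8B-hf
huggingface-cli download ioky/gemma-3-4b-it --local-dir LoRA/gemma-3-4b-it
```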
Temporal Prediction via Incremental Generation
| Decision-making Model | Command |
|---|---|
| Qwen2.5-VL-7B-Instruct | python TP_IG.py --cir 5 --time 1 --start 0 --end 200 --data_dir dir_to_dataset --sd_model large --mode test --model_name Qwen2.5-VL-7B-Instruct |
| gemma-3-4b-it | python TP_IG.py --cir 5 --time 1 --start 0 --end 200 --data_dir dir_to_dataset --sd_model large --mode test --model_name gemma-3-4b-it |
| InternVL3-8B-hf | python TP_IG.py --cir 5 --time 1 --start 0 --end 200 --data_dir dir_to_dataset --sd_model large --mode test --model_name InternVL3-8B-hf |
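To run all three decision-making models back to back, the commands in the table can be wrapped in a simple loop (dir_to_dataset remains a placeholder for your dataset path):

```bash
# Run incremental-generation temporal prediction with each decision-making model in turn.
for MODEL in Qwen2.5-VL-7B-Instruct gemma-3-4b-it InternVL3-8B-hf; do
  python TP_IG.py --cir 5 --time 1 --start 0 --end 200 \
    --data_dir dir_to_dataset --sd_model large --mode test \
    --model_name "$MODEL"
done
```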
To fine-tune the SD3.5-based visual generation model, please refer to the official Stable Diffusion 3.5 fine-tuning guide.
This project uses LoRA to train the decision-making models.
Step 1 — Train:
llamafactory-cli train examples/train_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
Step 2 — Export (merge LoRA):
llamafactory-cli export examples/merge_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
Step 3 — Predict (generate decision-making model results in batches):
llamafactory-cli train examples/predict/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
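The three steps are meant to run in order: train the LoRA adapter, merge it into the base model, then generate predictions in batches. Below is a minimal sketch chaining the example configs above; the paths are the Qwen2.5-VL configs at the chain1_1s setting, and other model families or time scales presumably have analogous YAML files under examples/.

```bash
# Full LoRA pipeline for Qwen2.5-VL at the chain1_1s setting; stop on the first failure.
set -e
llamafactory-cli train  examples/train_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
llamafactory-cli export examples/merge_lora/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
llamafactory-cli train  examples/predict/Qwen2.5-VL-7B-Instruct/qwen2.5vl_lora_sft_chain1_1s.yaml
```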
If you find this code useful for your research, please cite:
@misc{zeng2025multiscaletemporalpredictionincremental,
title = {Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration},
author = {Zhitao Zeng and Guojian Yuan and Junyuan Mao and Yuxuan Wang and Xiaoshuang Jia and Yueming Jin},
year = {2025},
eprint = {2509.17429},
  archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2509.17429},
}