🔥 Code for SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models.
- [2026/3/28] 🔥🔥🔥 The code, models, and data of SAMA have been open-sourced. If you have any questions, feel free to contact yesun23@m.fudan.edu.cn. We also welcome potential collaborations. Finally, please consider citing SAMA if you find it helpful for your research. 👍
- [2025/9/21] SAMA is accepted to NeurIPS 2025! 🔥 See you in San Diego! 😉
If you find SAMA useful for your work, please kindly cite it using the BibTeX below 🙏:

@inproceedings{sun2025sama,
  title={SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models},
  author={Sun, Ye and Zhang, Hao and Ding, Henghui and Zhang, Tiehua and Ma, Xingjun and Jiang, Yu-Gang},
  booktitle={NeurIPS},
  year={2025}
}

- Introduction
- Installation
- Model Weights
- Training Data preparation
- Training
- Evaluation & Benchmark
- Experiments
- Data annotation
- Acknowledgments
Introduction

Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). Addressing this challenge requires mastering two core capabilities: video referring understanding, which captures the semantics of video regions, and video grounding, which segments object regions based on natural language descriptions. However, most existing approaches tackle these tasks in isolation, limiting progress toward unified, referentially grounded video interaction. We identify a key bottleneck in the lack of high-quality, unified video instruction data and a comprehensive benchmark for evaluating referentially grounded video chat. To address these challenges, we contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically curated to enable joint learning of video referring understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities. Finally, we establish SAMA-Bench, a meticulously designed benchmark consisting of 5,067 questions from 522 videos, to comprehensively evaluate the integrated capabilities of Video LMMs in multi-turn, spatio-temporal referring understanding and grounded dialogue. Extensive experiments and benchmarking results show that SAMA not only achieves strong performance on SAMA-Bench but also sets a new state-of-the-art on general grounding benchmarks, while maintaining highly competitive performance on standard visual understanding benchmarks.
Installation
- Please install Python and PyTorch first:
> conda create -n vlm python=3.10
> conda activate vlm
> conda install pytorch==2.3.1 torchvision==0.18.1 pytorch-cuda=12.1 cuda -c pytorch -c "nvidia/label/cuda-12.1.0" -c "nvidia/label/cuda-12.1.1"
- Install mmcv:
> pip install mmcv==2.2.0 -f https://download.openmmlab.com/mmcv/dist/cu121/torch2.3/index.html
- Install the other dependencies:
> pip install -r requirements.txt

Model Weights

We provide the following models:
| Model Name | HF Link |
|---|---|
| SAMA-1B | 🤗 link |
| SAMA-4B | 🤗 link |
| SAMA-8B | 🤗 link |
Data Preparation

- Please first download the Sa2VA training datasets and place them in the `data` directory. The download link is here.
- To support the training of SAMA-239K, please first download the LVVIS dataset and the Ref-YouTube-VOS dataset into the `sama239k_data` folder.
- Create symbolic links in the `sama239k_data` folder for the MeViS dataset and the sav_train dataset (sam_v_full). Both can be obtained from the Sa2VA training data.
- For the VidSTG dataset, frame extraction is required. Please download the dataset first and extract frames using the provided `/tools/vidstg_process.py`. The expected VidSTG data organization is:
VidSTG_VIDEO_DIR/
├── video1.mp4
├── video2.mp4
└── video3.mp4
VidSTG_JSON_DIR/
├── video1.json
├── video2.json
└── video3.json
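The `frame_0.jpg` / `frame_4.jpg` naming shown in the data layouts below suggests frames are sampled at a fixed stride of 4. As a minimal sketch of the naming this would produce (the stride and filename pattern are inferred from the listings, not necessarily exactly what `/tools/vidstg_process.py` does):

```python
# Sketch: which filenames a stride-4 frame extraction would produce.
# The stride of 4 and the frame_{i}.jpg pattern are assumptions inferred
# from the frame_0.jpg / frame_4.jpg entries in the directory listings.
def strided_frame_names(total_frames: int, stride: int = 4) -> list[str]:
    """Return the JPEG filenames for every `stride`-th frame of a video."""
    return [f"frame_{i}.jpg" for i in range(0, total_frames, stride)]

print(strided_frame_names(10))  # frames 0, 4, 8
```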
- Download our JSON files here and put them into the `sama239k_data` folder.
The final data structure should be like:
data/
├── sama239k_data
| ├── mevis
| | └── train
| ├── lvvis
| | └── train
| ├── ref_youtube_vos
| | └── train
| ├── sav_train
| | └── sav_000
| | └── .....
| ├── VidSTG
| | └── train
| | └── 2399224635
| | └── frame_0.jpg
| | └── frame_4.jpg
| | └── .....
| ├── sama239k_train_final.json # sama 239k json file
| ├── mevis_train_mask_dict.json # reorganized mask annotation files
| ├── lvvis_train_mask_dict.json
| ├── ref_youtube_train_mask_dict.pkl
| ├── SAV_mask_dict_train.json
| ├── VidSTG_mask_dict_train_updated.json
├── video_datas
| ├── revos
| ├── mevis
| ├── davis17
| ├── chat_univi
| ├── sam_v_full # [!important] please download this from sam-2 directly.
| └── Ref-SAV.json
├── ref_seg
| ├── refclef
| ├── refcoco
| ├── refcoco+
| └── refcocog
├── glamm_data
| ├── images
| └── annotations
├── osprey-724k
| ├── Osprey-724K
| └── coco
└── llava_data
  ├── llava_images
  ├── LLaVA-Instruct-150K
  └── LLaVA-Pretrain
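Before launching training, it can save time to sanity-check that the layout above is in place. A small sketch (the `REQUIRED_PATHS` list below covers only a few representative entries from the tree; extend it as needed):

```python
import os

# Sketch: sanity-check a few representative paths from the data layout
# above before launching training. Extend REQUIRED_PATHS as needed.
REQUIRED_PATHS = [
    "sama239k_data/mevis/train",
    "sama239k_data/lvvis/train",
    "sama239k_data/ref_youtube_vos/train",
    "sama239k_data/sama239k_train_final.json",
    "video_datas/sam_v_full",
    "llava_data/llava_images",
]

def check_data_layout(root: str = "data") -> list[str]:
    """Return the required paths that are missing under `root`."""
    return [p for p in REQUIRED_PATHS if not os.path.exists(os.path.join(root, p))]

missing = check_data_layout()
if missing:
    print("Missing paths:", *missing, sep="\n  ")
```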
Training

To train SAMA, please first prepare the Sa2VA model weights and make sure the dataset paths, model paths, and other settings in the config file are correct. Training is based on 8× A100 (80G) GPUs.
> bash scripts/train/run_train_1b.sh
> bash scripts/train/run_train_4b.sh
> bash scripts/train/run_train_8b.sh

After training completes, use the script below to convert and obtain the final model weights.
> bash scripts/model_convert_st.sh

Evaluation & Benchmark

- Image Segmentation: example evaluation scripts for datasets such as RefCOCO/+/g.
> bash scripts/inference/eval_refcoco.sh

- Video Segmentation: example evaluation scripts for datasets such as MeViS/Ref-DAVIS/Ref-YouTube-VOS/ReVOS.
> bash scripts/inference/eval_mevis.sh
> bash tools/eval_video_seg/scripts/eval_mevis_metrics.sh

- SAMA-Bench-G Evaluation:
Download our SAMA-Bench JSON files here. The SAMA-Bench test set is drawn from the validation splits of four open-source video datasets: MeViS, LVVIS, Ref-YouTube-VOS, and VidSTG. The videos in the VidSTG validation split must likewise be processed into extracted frames using the provided `/tools/vidstg_process.py`. We have reorganized or re-annotated the mask annotation files of these datasets. In addition, because evaluation is slow, we split the SAMA-Bench JSON file into multiple subsets so that you can run inference on multiple nodes simultaneously. Running inference on SAMA-Bench-G requires at least an A100-80G GPU. Since the videos in VidSTG are relatively long, total inference time on 8× A100-80G GPUs is approximately 4 hours. During evaluation, we primarily use the first-frame prompt of the query object as input to the model.
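Since evaluation uses the first-frame prompt of the query object, a prompt must be derived from the frame where the object first appears. A sketch of turning per-frame binary masks into a first-frame box prompt (the mask format here, a list of per-frame 0/1 grids, is an illustrative assumption; the released `*_mask_dict_val*.json` files may encode masks differently, e.g. as RLE):

```python
# Sketch: derive a first-frame box prompt from per-frame binary masks.
# The per-frame 0/1 grid format is an illustrative assumption, not the
# exact encoding used in the released mask dict files.
def first_frame_box(frame_masks):
    """Return (frame_idx, (x_min, y_min, x_max, y_max)) for the first
    frame where the object is visible, or None if it never appears."""
    for idx, mask in enumerate(frame_masks):
        ys = [r for r, row in enumerate(mask) if any(row)]
        if not ys:
            continue  # object not visible in this frame
        xs = [c for row in mask for c, v in enumerate(row) if v]
        return idx, (min(xs), min(ys), max(xs), max(ys))
    return None

masks = [
    [[0, 0], [0, 0]],  # frame 0: object absent
    [[0, 1], [1, 1]],  # frame 1: object present
]
print(first_frame_box(masks))  # (1, (0, 0, 1, 1))
```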
The expected data directory structure is as follows:
sama_bench
├── mevis # Video Image files
| └── val_u
| └── JPEGImages
├── lvvis
| └── val
| └── JPEGImages
├── ref_youtube_vos
| └── valid
| └── JPEGImages
├── VidSTG
| └── val
| └── 2400171624
| └── frame_0.jpg
| └── frame_4.jpg
| └── .....
├── Mevis_mask_dict_val.json # mask annotation files
├── LVVIS_mask_dict_val.json
├── RefYoutube_mask_dict_val.json
├── VidSTG_mask_dict_val_updated.json
├── lvvis_0.json # SAMA-Bench JSON files
├── lvvis_1.json
├── mevis.json
├── ref_youtube_vos_0.json
├── ref_youtube_vos_1.json
├── VidSTG_0.json
├── VidSTG_1.json
├── VidSTG_2.json
├── VidSTG_3.json
├── VidSTG_4.json
├── VidSTG_5.json
├── VidSTG_6.json
├── VidSTG_7.json
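Because the benchmark JSON is split into the subsets listed above, inference can be parallelized by assigning subset files to GPUs or nodes. A minimal round-robin sketch (the actual launch scripts may distribute work differently):

```python
# Sketch: round-robin assignment of the SAMA-Bench JSON subsets listed
# above to GPUs for parallel inference. The real launch scripts may
# distribute work differently.
SUBSETS = [
    "lvvis_0.json", "lvvis_1.json", "mevis.json",
    "ref_youtube_vos_0.json", "ref_youtube_vos_1.json",
] + [f"VidSTG_{i}.json" for i in range(8)]

def assign_subsets(subsets, num_gpus=8):
    """Map GPU id -> list of subset files to evaluate on that GPU."""
    plan = {gpu: [] for gpu in range(num_gpus)}
    for i, name in enumerate(subsets):
        plan[i % num_gpus].append(name)
    return plan

for gpu, files in assign_subsets(SUBSETS).items():
    print(f"GPU {gpu}: {files}")
```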
Example evaluation scripts for SAMA-Bench-G:
> bash scripts/inference/eval_sama_bench_g.sh
> python scripts/inference/compute_sama_bench_g_final.py

- SAMA-Bench-C Evaluation:
> bash scripts/inference/eval_sama_bench_c.sh
> python scripts/inference/compute_sama_bench_c_final.py

- General Benchmark Evaluation: for general benchmarks such as MME and VideoMME, we primarily use VLMEvalKit.
Experiments

1. Comparison with the capabilities of the image-level model Ferret
2. Video Referential Grounded Chat
3. Video Grounded Description
Data Annotation

We hope that open-sourcing the annotation prompts and code will encourage the community to develop more datasets focused on fine-grained understanding and perception. These resources can be found in the annotation directory.
It is worth noting that the annotations were originally generated with Gemini 1.5 Pro, which is no longer served by the API. We therefore recommend adapting the provided code to the current Gemini API.
Acknowledgments

The code of SAMA is built upon Sa2VA, and the evaluation code is adapted from VideoGLaMM and VLMEvalKit. We sincerely thank them for their excellent contributions.



