Multimodal deep-research MLLM and benchmark. The first long-horizon multimodal deep-research MLLM, extending the number of reasoning turns to dozens and the number of search-engine interactions to hundreds.


Vision-DeepResearch & Vision-DeepResearch Benchmark (VDR-Bench)

The official repo for "Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models" and "Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models".

Project Page

🤗 Cold-start Dataset (demo)   |   🤗 RL Dataset (demo)   |   🤗 VDR-Bench (full)   |   🤗 VDR-Bench (testmini)  

🤗 Vision-DeepResearch-30B-A3B (SFT+RL, coming soon)   |   🤗 Vision-DeepResearch-8B (SFT-only)  

📑 Vision-DeepResearch Paper   |   📑 VDR-Bench Paper  

The datasets, code, and weights will be released. Stay tuned!

Timeline

Demo (click to watch on YouTube)

Comparison with Other Methods


More Cases


Performance

| Model | VDR | FVQA | MMSearch+ | MMSearch | LiveVQA | BC-VL | Avg. |
|---|---|---|---|---|---|---|---|
| **Direct Answer** | | | | | | | |
| GPT-5 | 9.8 | 57.3 | 19.1 | 33.3 | 57.5 | 47.2 | 37.4 |
| Gemini-2.5 Pro | 8.0 | 60.7 | 14.5 | 39.8 | 60.3 | 43.1 | 37.7 |
| Gemini-2.5 Flash | 6.2 | 47.7 | 8.1 | 30.4 | 51.0 | 37.1 | 30.1 |
| Claude-4-Sonnet | 2.0 | 35.3 | 4.0 | 18.7 | 38.5 | 29.3 | 21.3 |
| Claude-3.7-Sonnet | 4.6 | 36.7 | 4.0 | 21.1 | 38.0 | 32.3 | 22.8 |
| Qwen3-VL-8B-Instruct | 2.8 | 28.0 | 3.2 | 15.2 | 41.0 | 25.1 | 19.2 |
| Qwen3-VL-8B-Thinking | 5.6 | 24.0 | 2.7 | 15.8 | 43.3 | 25.1 | 19.4 |
| Qwen3-VL-30B-A3B-Instruct | 3.8 | 34.7 | 3.2 | 18.7 | 42.7 | 29.6 | 22.1 |
| Qwen3-VL-30B-A3B-Thinking | 4.4 | 32.7 | 4.5 | 19.3 | 49.0 | 34.6 | 24.1 |
| **RAG Workflow** | | | | | | | |
| Gemini-2.5 Flash | -- | -- | -- | 43.9 | 41.3 | 12.1 | -- |
| Claude-3.7-Sonnet | -- | -- | -- | 32.7 | 30.3 | 10.0 | -- |
| Qwen-2.5-VL-72B | -- | -- | -- | 29.2 | 35.7 | 10.2 | -- |
| **Agent Workflow** | | | | | | | |
| GPT-5 | 20.4 | 69.0 | 17.2 | 63.7 | 73.3 | 46.1 | 48.3 |
| Gemini-2.5 Pro | 18.8 | 68.3 | 22.2 | 69.0 | 76.0 | 49.9 | 50.7 |
| Gemini-2.5 Flash | 16.3 | 68.0 | 19.9 | 64.0 | 73.0 | 44.6 | 47.6 |
| Claude-4-Sonnet | 13.6 | 69.0 | 23.1 | 67.2 | 69.7 | 48.6 | 48.5 |
| Claude-3.7-Sonnet | 27.2 | 67.3 | 17.2 | 63.7 | 72.0 | 50.4 | 49.6 |
| Qwen3-VL-8B-Thinking | 17.6 | 51.3 | 12.2 | 45.6 | 56.3 | 37.1 | 36.7 |
| Qwen3-VL-30B-A3B-Thinking | 23.2 | 63.0 | 13.6 | 53.2 | 62.0 | 44.1 | 43.2 |
| **Multimodal DeepResearch MLLM** | | | | | | | |
| MMSearch-R1-7B | -- | 58.4 | -- | 53.8 | 48.4 | -- | -- |
| WebWatcher-7B | -- | -- | -- | 49.1 | 51.2 | 20.3 | -- |
| WebWatcher-32B | -- | -- | -- | 55.3 | 58.7 | 26.7 | -- |
| **Ours** | | | | | | | |
| Qwen3-VL-8B-Instruct (Agentic) | 17.0 | 58.7 | 11.3 | 52.0 | 63.0 | 38.6 | 40.1 |
| Vision-DeepResearch-8B (Ours) | 29.2 (+12.2) | 64.7 (+6.0) | 20.4 (+9.1) | 69.6 (+17.6) | 76.7 (+13.7) | 42.6 (+4.0) | 50.5 (+10.4) |
| Qwen3-VL-30B-A3B-Instruct (Agentic) | 20.2 | 57.7 | 10.0 | 55.0 | 60.0 | 42.6 | 40.9 |
| Vision-DeepResearch-30B-A3B (Ours) | 37.8 (+17.6) | 74.2 (+16.5) | 28.5 (+18.5) | 69.6 (+14.6) | 77.6 (+17.6) | 53.7 (+11.1) | 56.9 (+16.0) |

Teaser

Vision-DeepResearch


VDR-Bench


Data Pipeline

Vision-DeepResearch


VDR-Bench


Quickstart

Environment Setup

# 1. Clone the repository
git clone https://github.com/Osilly/Vision-DeepResearch.git
cd Vision-DeepResearch

# 2. Install verl
cd rllm/verl
pip install -e .

# 3. Install Megatron-LM
cd ../../Megatron-LM
pip install -e .

# 4. Install mbridge
cd ../mbridge
pip install -e .

# 5. Install rllm
cd ../rllm
pip install -e .

# 6. Install additional dependencies
pip install requests==2.32.3
pip install oss2

# 7. Return to the project root directory
cd ..

Data Preparation

SFT Data

Download the Cold-start dataset (Demo 1K).
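
If the dataset is hosted on Hugging Face (as the links above suggest), one way to fetch it is with huggingface-cli. The repo id below is a placeholder, so substitute the actual id from the Cold-start Dataset link:

# Placeholder dataset id -- replace with the repo id behind the Cold-start Dataset link above.
huggingface-cli download <cold-start-dataset-id> \
  --repo-type dataset \
  --local-dir data/cold_start_demo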

You need to convert the data in Parquet format into the JSONL training format supported by ms-swift. We provide a conversion script for this purpose: ms-swift/run/data_prepare/convert_parquet2jsonl.sh.

You need to provide an --image_dir; images stored as bytes in the Parquet file will be decoded into .png/.jpg files and saved under that directory.
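
A typical invocation looks like the sketch below. Apart from --image_dir, the argument names and paths are assumptions, so check the script for its actual interface:

# Sketch only -- flag names other than --image_dir are assumed; see the script for the real ones.
bash ms-swift/run/data_prepare/convert_parquet2jsonl.sh \
  --input_parquet data/cold_start_demo/train.parquet \
  --output_jsonl data/cold_start_demo/train.jsonl \
  --image_dir data/cold_start_demo/images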

RL Data

Download the RL dataset (Demo 1K).

First, you need to convert the data in Parquet format into the JSONL format. We provide a conversion script for this purpose: rllm/vision_deepresearch_async_workflow/data_prepare/convert_parquet2jsonl.sh.

Then, you need to run rllm/vision_deepresearch_async_workflow/data_prepare/register_rl_dataset.sh to register the RL dataset.
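
A rough end-to-end sketch of these two steps is shown below. The paths and any flags beyond the script names are assumptions, so follow the scripts' own usage notes:

# Sketch only -- adjust paths/flags to match the scripts' actual interfaces.
bash rllm/vision_deepresearch_async_workflow/data_prepare/convert_parquet2jsonl.sh \
  --input_parquet data/rl_demo/train.parquet \
  --output_jsonl data/rl_demo/train.jsonl
bash rllm/vision_deepresearch_async_workflow/data_prepare/register_rl_dataset.sh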

SFT Train

cd ms-swift
bash run/vision_deepresearch_SFT_30B_A3B_megatron_lr2e5_2ep.sh

RL Train

First, deploy the Extract model (used to summarize web page contents) and the Judge model:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve \
  Qwen/Qwen3-VL-30B-A3B-Instruct \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.8 \
  --served-model-name "Qwen3-VL-30B-A3B-Instruct" \
  --max-model-len 160000 \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching

Then, set the vLLM service endpoint URLs for JUDGE_MODEL and EXTRACT_MODEL in rllm/.env, and enter your SERP_API_KEY, JINA_API_KEY, and OSS configuration.
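
As a rough sketch, the entries in rllm/.env would look something like the following; the exact variable names for the endpoint URLs and the OSS credentials are assumptions, so follow the template shipped in the repository:

# Sketch of rllm/.env -- URL and OSS variable names are assumed; use the repo's own template.
JUDGE_MODEL=Qwen3-VL-30B-A3B-Instruct
JUDGE_MODEL_URL=http://<judge-host>:8001/v1
EXTRACT_MODEL=Qwen3-VL-30B-A3B-Instruct
EXTRACT_MODEL_URL=http://<extract-host>:8001/v1
SERP_API_KEY=<your_serpapi_key>
JINA_API_KEY=<your_jina_api_key>
# OSS configuration (endpoint, bucket, access keys) as required by the template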

Then run RL training:

cd rllm
bash vision_deepresearch_async_workflow/run/vision_deepresearch_30B_A3B_grpo_plus_bfloat16_sglang_megatron_128batch_128mini_8n.sh

Eval

Run the commands below to start the OpenAI-compatible API services:

Vision-DeepResearch model:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve \
  Osilly/Vision-DeepResearch-8B \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.8 \
  --served-model-name "Vision-DeepResearch-8B" \
  --max-model-len 160000 \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching

Extract model (used to summarize web page contents) and Judge model:

# If this runs on the same machine as the Vision-DeepResearch service above,
# use a different --port (e.g. 8002) and a non-overlapping GPU set.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve \
  Qwen/Qwen3-VL-30B-A3B-Instruct \
  --host 0.0.0.0 \
  --port 8001 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.8 \
  --served-model-name "Qwen3-VL-30B-A3B-Instruct" \
  --max-model-len 160000 \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching

As in RL training, set the vLLM service endpoint URLs for JUDGE_MODEL and EXTRACT_MODEL in rllm/.env, and enter your SERP_API_KEY, JINA_API_KEY, and OSS configuration (see the sketch in the RL Train section above).

Modify the base-url and model (the Vision-DeepResearch vLLM service endpoint and model name) in rllm/eval/run_eval.sh. For the data format of test.parquet, refer to rllm/eval/README.md.
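
As a sketch, those two values should point at the service started above; the names below are assumptions, so edit whatever rllm/eval/run_eval.sh actually defines:

# Sketch only -- setting names assumed; edit the corresponding entries in rllm/eval/run_eval.sh.
base_url=http://<vision-deepresearch-host>:8001/v1
model=Vision-DeepResearch-8B   # must match --served-model-name above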

Run rllm/eval/run_eval.sh to start inference.

bash rllm/eval/run_eval.sh


Citation

@article{huang2026vision,
  title={Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models},
  author={Huang, Wenxuan and Zeng, Yu and Wang, Qiuchen and Fang, Zhen and Cao, Shaosheng and Chu, Zheng and Yin, Qingyu and Chen, Shuang and Yin, Zhenfei and Chen, Lin and others},
  journal={arXiv preprint arXiv:2601.22060},
  year={2026}
}

@article{zeng2026vision,
  title={Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models},
  author={Zeng, Yu and Huang, Wenxuan and Fang, Zhen and Chen, Shuang and Shen, Yufan and Cai, Yishuo and Wang, Xiaoman and Yin, Zhenfei and Chen, Lin and others},
  journal={arXiv preprint arXiv:2602.02185},
  year={2026}
}
