
MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning


Suhao Yu*, Haojin Wang*, Juncheng Wu*, Cihang Xie, Yuyin Zhou


📢 Breaking News

  • [📄💥 May 22, 2025] Our arXiv paper is released.
  • [💾 May 22, 2025] Full dataset released.

Star 🌟 us if you find it helpful!


⚡Introduction


MedFrameQA introduces multi-image, clinically grounded questions that require comprehensive reasoning across all images. Unlike prior benchmarks such as SLAKE and MedXpertQA, it emphasizes diagnostic complexity, expert-level knowledge, and explicit reasoning chains.

  • We develop a scalable pipeline that automatically constructs multi-image, clinically grounded VQA questions from medical education videos.
  • We benchmark ten state-of-the-art MLLMs on MedFrameQA and find that their accuracies mostly fall below 50%, with substantial performance variation across different body systems, organs, and modalities.

We open-source our data and code in this repository.

🚀 Dataset construction pipeline


The MedFrameQA generation pipeline consists of four stages:

  1. Medical Video Collection: Collecting 3,420 medical videos via clinical search queries;
  2. Frame-Caption Pairing: Extracting keyframes and aligning with transcribed captions;
  3. Multi-Frame Merging: Merging clinically related frame-caption pairs into multi-frame clips;
  4. Question-Answer Generation: Generating multi-image VQA from the multi-frame clips.
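The frame-caption pairing in stage 2 can be sketched as a timestamp alignment: each keyframe is matched to the transcribed caption whose time span contains it. The function name, data shapes, and containment rule below are illustrative assumptions, not the repository's actual implementation.

```python
# Illustrative sketch of stage 2: align extracted keyframes with
# transcript captions by timestamp. Data shapes and the containment
# rule are assumptions, not the pipeline's actual logic.

def pair_frames_with_captions(frames, captions):
    """Pair each keyframe with the caption whose time span contains it.

    frames:   list of (frame_id, timestamp_seconds)
    captions: list of (start_seconds, end_seconds, text)
    """
    pairs = []
    for frame_id, t in frames:
        for start, end, text in captions:
            if start <= t < end:
                pairs.append((frame_id, text))
                break
    return pairs

frames = [("f1", 3.0), ("f2", 12.5)]
captions = [(0.0, 10.0, "The axial CT shows..."),
            (10.0, 20.0, "Note the lesion...")]
print(pair_frames_with_captions(frames, captions))
```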

📚 Statistical overview of MedFrameQA


In figure (a), we show the distribution across body systems; (b) presents the distribution across organs; (c) shows the distribution across imaging modalities; (d) provides a word cloud of keywords in MedFrameQA; and (e) reports the distribution of frame counts per question.

🤗 Dataset Download

| Dataset    | 🤗 Huggingface Hub      |
| ---------- | ----------------------- |
| MedFrameQA | SuhaoYu1020/MedFrameQA  |

🏆 Results

Accuracy by Human Body System on MedFrameQA


Accuracy by Modality and Frame Count on MedFrameQA



💬 Quick Start

⏬ Install

On a Linux system:

1. Clone this repository and navigate to the folder

   ```shell
   git clone https://github.com/haojinw0027/MedFrameQA.git
   cd MedFrameQA
   ```

2. Install packages

   ```shell
   conda create -n medframeqa python=3.10 -y
   conda activate medframeqa
   pip install -r requirements.txt
   cd src
   ```

🎬 Generate VQA pairs from Video

Download video and audio

```shell
python process.py --process_stage download_process --csv_file ../data/30_disease_video_id.csv

# Specify the number of videos to download (-1 for all)
python process.py --process_stage download_process --csv_file ../data/30_disease_video_id.csv --num_ids <number>
```

Extract frames from video and generate transcripts from audio

```shell
python process.py --process_stage video_process --csv_file ../data/30_disease_video_id.csv
```

Frame-caption pairing

```shell
python process.py --process_stage pair_process --csv_file ../data/30_disease_video_id.csv

# Specify the time interval for selecting video frames
python process.py --process_stage pair_process --csv_file ../data/30_disease_video_id.csv --bias_time 20
```
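As we read it, `--bias_time` widens the time window around a caption when selecting candidate frames. The sketch below is a hypothetical illustration of such a windowed selection; the function name and window rule are our assumptions, not the script's actual logic.

```python
# Hypothetical illustration of widening a caption's time window by a
# bias margin when selecting frames; not the script's actual logic.

def frames_in_window(frame_times, start, end, bias_time=0.0):
    """Return frame timestamps inside [start - bias_time, end + bias_time]."""
    return [t for t in frame_times if start - bias_time <= t <= end + bias_time]

# With a 20 s margin, frames just outside the caption span still qualify.
print(frames_in_window([5.0, 25.0, 45.0], start=10.0, end=30.0, bias_time=20))
# -> [5.0, 25.0, 45.0]
```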

Multi-frame merging and question-answer generation

```shell
python process.py --process_stage vqa_process --csv_file ../data/30_disease_video_id.csv

# Specify the maximum number of frames per question
python process.py --process_stage vqa_process --csv_file ../data/30_disease_video_id.csv --max_frame_num 5
```
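`--max_frame_num` caps how many frames one question may draw on. A minimal sketch of such a cap, with a hypothetical function name and data shapes (the real merging step also uses clinical relatedness, which is omitted here):

```python
# Illustrative sketch of capping a merged clip at --max_frame_num frames
# by splitting it into fixed-size chunks. Names and shapes are hypothetical.

def cap_clip(frame_caption_pairs, max_frame_num=5):
    """Split a merged clip into chunks of at most max_frame_num pairs."""
    return [
        frame_caption_pairs[i : i + max_frame_num]
        for i in range(0, len(frame_caption_pairs), max_frame_num)
    ]

clip = [(f"frame{i}", f"caption {i}") for i in range(7)]
chunks = cap_clip(clip, max_frame_num=5)
print([len(c) for c in chunks])  # -> [5, 2]
```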

🧐 Evaluate on MLLMs

```shell
python eval_process.py --input_file "your vqa pairs file path" --output_dir ../eval --model_name "your model"

# Specify the number of questions to evaluate (-1 for all)
python eval_process.py --input_file "your vqa pairs file path" --output_dir ../eval --model_name "your model" --num_q <number>
```
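The accuracies reported above are standard multiple-choice accuracy: the fraction of questions where the model's chosen option matches the gold answer. A minimal sketch of that scoring (the record field names are hypothetical, not the actual schema written by `eval_process.py`):

```python
# Illustrative sketch of scoring model predictions against gold answers
# for multiple-choice VQA. Field names are hypothetical.

def accuracy(records):
    """records: list of dicts with 'prediction' and 'answer' option letters."""
    if not records:
        return 0.0
    correct = sum(r["prediction"] == r["answer"] for r in records)
    return correct / len(records)

records = [
    {"prediction": "A", "answer": "A"},
    {"prediction": "C", "answer": "B"},
]
print(accuracy(records))  # -> 0.5
```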

You can download our dataset for evaluation from SuhaoYu1020/MedFrameQA.


📜 Citation

If you find MedFrameQA useful for your research and applications, please cite using this BibTeX:

@misc{yu2025medframeqamultiimagemedicalvqa,
      title={MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning}, 
      author={Suhao Yu and Haojin Wang and Juncheng Wu and Cihang Xie and Yuyin Zhou},
      year={2025},
      eprint={2505.16964},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.16964}, 
}

🙏 Acknowledgement

  • We thank the Microsoft Accelerate Foundation Models Research Program for supporting our computing needs.
