Authors: David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
This repository contains the code and data for the paper "Multimodal Fact-Level Attribution for Verifiable Reasoning".
- Python: 3.12+
- Package Manager: uv (recommended)
- System Dependencies: CUDA, FFmpeg (for video processing)
For most models (excluding Qwen-VL and Qwen-Omni), you can simply install the dependencies from the requirements file:
uv pip install -r requirements.txt
Due to conflicting dependencies, we recommend maintaining separate virtual environments for Qwen-VL and Qwen-Omni.
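For example, you can create and activate a dedicated environment per model family with uv (the environment name below is illustrative):

uv venv .venv-qwen-vl
source .venv-qwen-vl/bin/activate
uv pip install vllm --torch-backend=auto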
For Qwen-VL:
Install vllm via the official package:
uv pip install vllm --torch-backend=auto
For Qwen-Omni:
Install the pinned version of vllm and the vllm-omni fork:
uv pip install vllm==0.15.0 --torch-backend=auto
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .
API Keys
If you are using cloud-based models (e.g., Gemini), set the necessary environment variables:
export GEMINI_API_KEY="your-gemini-api-key"
# Add other keys as needed (e.g., OPENAI_API_KEY)

Dataset Paths
Update src/util.py to point to your local dataset directories:
DATASET_CONFIGS = {
    "videommmu": {
        "video_dir": "/path/to/VideoMMMU/videos/",
        "hf_path": "/path/to/data/VideoMMMU_sample",
    },
    # Add other datasets here
}
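As a sketch, an additional entry follows the same shape as the one above; the dataset name and paths below are placeholders, not values the code is known to expect:

    # Hypothetical additional dataset; keys mirror the videommmu entry
    "my_dataset": {
        "video_dir": "/path/to/MyDataset/videos/",
        "hf_path": "/path/to/data/MyDataset_sample",
    },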
.
├── README.md                         # This file
├── src/                              # Core source code
│   ├── models.py                     # Multimodal model implementations
│   ├── util.py                       # Shared utilities (text processing, data loading)
│   ├── run_baseline.py               # Generate baseline responses
│   ├── run_baseline_with_citation.py # Generate responses with citations
│   ├── run_metric.py                 # Evaluation pipeline
│   ├── run_generation_program.py     # Generation with iterative feedback
│   └── generation_program_util.py    # Utilities for the generation program
│
├── prompts/                          # Prompt templates
│   ├── base.txt                      # Base reasoning prompt
│   ├── base_with_citations.txt       # Reasoning with citation extraction
│   ├── decontextualization.txt       # Remove document context
│   ├── atomic_decomposition.txt      # Break into atomic facts
│   ├── coverage_prompt.txt           # Fact verification
│   ├── entailment_prompt.txt         # Entailment checking
│   └── ...                           # Additional task-specific prompts
│
└── requirements.txt                  # Python dependencies
Datasets and model generations are available via Google Drive: [Link to Drive]
Generate model responses without citations.
python src/run_baseline.py <dataset_name> <model_name>
Example:
python src/run_baseline.py videommmu gemini-2.5-flash

Arguments:
- dataset_name: Must be defined in DATASET_CONFIGS.
- model_name: Model identifier from the supported models list.
Generate responses using the base_with_citations.txt prompt to encourage explicit citations.
python src/run_baseline_with_citation.py <dataset_name> <model_name>
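For example, reusing the dataset and model from the baseline step:

python src/run_baseline_with_citation.py videommmu gemini-2.5-flash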
This pipeline decontextualizes responses, decomposes them into atomic facts, extracts citations, and computes verification scores.
python src/run_metric.py --input-file <path-to-generations.json> --output-dir <output-directory>
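For example (both paths are placeholders; point --input-file at the JSON produced by one of the generation scripts):

python src/run_metric.py --input-file outputs/videommmu_gemini-2.5-flash.json --output-dir results/videommmu/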
Generate responses with iterative refinement and feedback.
python src/run_generation_program.py <dataset_name> <model_name>
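For example, with the same arguments as the baseline script:

python src/run_generation_program.py videommmu gemini-2.5-flash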
You can configure runtime behavior and stability using the following environment variables:
export OPENAI_API_BASE="your-openai-endpoint" # For custom deployments
export VLLM_USE_V1=0 # Fall back to the vLLM V0 engine
export DECORD_EOF_RETRY_MAX=20480 # Video processing stability
export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" # Memory management
If you use this framework in your research, please cite:
@misc{wan2026multimodalfactlevelattributionverifiable,
title={Multimodal Fact-Level Attribution for Verifiable Reasoning},
author={David Wan and Han Wang and Ziyang Wang and Elias Stengel-Eskin and Hyunji Lee and Mohit Bansal},
year={2026},
eprint={2602.11509},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.11509},
}