This repository holds the code and data of DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning.
- Update on Jun 4, 2025: released code and paper.
- Update on Jun 9, 2025: DreamPRM (o4-mini) has been added to the top of the MathVista Leaderboard (testmini) with 85.2% accuracy!
- Update on Jun 10, 2025: added instructions for extending DreamPRM to o4-mini.
DreamPRM — Domain-Reweighted Process Reward Model for Multimodal Reasoning
DreamPRM tackles the dataset quality imbalance and distribution shift that plague multimodal reasoning via domain reweighting.
It jointly learns (i) a high-fidelity Process Reward Model (PRM) and (ii) optimal domain weights through a bi-level optimisation (BLO) loop, delivering a consistent +4 pp average gain on five public benchmarks.
- Example
- Method Overview
- Quick Start
- Customize Your Datasets
- Extend DreamPRM to o4-mini (new)
- Acknowledgement
- License
- Citation
DreamPRM improves multimodal reasoning by mitigating the dataset quality imbalance problem.
Left: On five benchmarks, DreamPRM outperforms the base model (InternVL-2.5-8B-MPO) by an average of +4.0%. DreamPRM also consistently surpasses a vanilla PRM trained without data selection.
Right: Easy AI2D questions (weight 0.55) vs. hard M3CoT questions (weight 1.49) show how DreamPRM prioritizes data that demands deeper reasoning: samples requiring knowledge from both textual and visual modalities for step-by-step logical deduction.
DreamPRM significantly outperforms o4-mini in pass@1 accuracy (with temperature fixed at 1.0, following OpenAI API defaults), achieving a 4.6% absolute improvement. It also surpasses the widely used self-consistency (Consensus) method based on majority voting for reasoning chain selection.
General flow of training a PRM and using it for inference.
Training phase: train the PRM with Monte Carlo signals from intermediate steps of Chains of Thought (CoTs).
Inference phase: use the trained PRM to verify CoTs step by step and select the best one. Conventional PRM training generalizes poorly due to the distribution shift between the training and test sets.
The proposed bi-level optimization based domain-reweighting method. Lower-level optimization: In this stage, PRM’s parameters are updated on multiple datasets with domain weights, allowing the PRM to prioritize domains with better quality. Upper-level optimization: In this stage, the PRM is evaluated on a separate meta dataset to compute an aggregation function loss and optimize the domain weights.
| Component | Purpose | Highlight |
|---|---|---|
| Domain-Reweighted Fine-Tuning | Re-weights K training domains via parameters αₖ | Gives harder, higher-quality datasets greater gradient influence |
| Bi-Level Optimisation (BLO) | Lower level updates PRM weights ϕ; upper level updates α | Learns both model and data weights in one run |
| Aggregation Function Loss | Meta-level loss that mirrors inference-time scoring | Aligns training with real PRM usage |
DreamPRM’s learned domain weights span 0.55–1.49, down-weighting noisy sets like AI2D and up-weighting challenging ones like M3CoT. This correlation with dataset difficulty underpins its performance gains.
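As a rough, hypothetical sketch of the domain-reweighted BLO loop described above (the repository's actual training is driven by `main.py` and the Betty library; names such as `prm`, `log_domain_weights`, the loss choices, and the single unrolled inner step below are illustrative assumptions, not DreamPRM's real implementation):

```python
# Illustrative one-step-unrolled sketch of domain-reweighted bi-level optimization.
# Assumptions (not the repository's actual API): `prm` maps a batch of inputs to scalar
# step scores, `log_domain_weights` holds one learnable log-weight per training domain,
# and `outer_opt` optimizes only `log_domain_weights`.
import torch
import torch.nn.functional as F
from torch.func import functional_call

def blo_step(prm, log_domain_weights, train_batch, meta_batch, inner_lr, outer_opt):
    inputs, targets, domain_ids = train_batch    # targets: Monte Carlo step labels in [0, 1]
    meta_inputs, meta_labels = meta_batch        # meta_labels: correct / incorrect CoTs
    params = dict(prm.named_parameters())

    # Lower level: domain-weighted training loss on the current PRM parameters.
    weights = log_domain_weights.exp()           # keep domain weights positive
    step_scores = functional_call(prm, params, (inputs,))
    per_sample = F.binary_cross_entropy_with_logits(step_scores, targets, reduction="none")
    train_loss = (weights[domain_ids] * per_sample).mean()

    # One differentiable SGD step on the PRM parameters (unrolled inner update).
    grads = torch.autograd.grad(train_loss, list(params.values()), create_graph=True)
    new_params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}

    # Upper level: the meta loss of the updated PRM drives the domain-weight update.
    meta_scores = functional_call(prm, new_params, (meta_inputs,))
    meta_loss = F.binary_cross_entropy_with_logits(meta_scores, meta_labels)
    outer_opt.zero_grad()
    meta_loss.backward()                         # hypergradient w.r.t. log_domain_weights
    outer_opt.step()
    prm.zero_grad(set_to_none=True)              # discard grads accumulated on PRM leaves

    # Commit the inner update to the PRM, detached from the unrolled graph.
    with torch.no_grad():
        for name, p in prm.named_parameters():
            p.copy_(new_params[name].detach())
    return train_loss.item(), meta_loss.item()
```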
All commands below are illustrative—rename scripts / paths to match your repo.
Clone our repository, create a Python environment, and activate it via the following commands:

```bash
git clone https://github.com/coder-qicao/DreamPRM.git
cd DreamPRM

# (a) create conda env
conda create -n dreamprm python=3.10 -y
conda activate dreamprm

# (b) install requirements
pip install -r requirements.txt   # torch, betty, transformers, accelerate, ...
```

Verify that torch and torchvision are installed correctly by running `python -c "import torchvision; print(torchvision.__version__)"`. If it prints the version number without any warnings or errors, you are good to go. If it prints warnings or errors, uninstall torch via `conda uninstall pytorch torchvision torchaudio cudatoolkit` and reinstall it following here; pick the command that matches the CUDA version your GPU driver supports (check `nvidia-smi`).
The current version of DreamPRM is built on Qwen2-VL-2B-Instruct. Please download Qwen2-VL weights from https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct.
Domain-reweighting for DreamPRM fine-tuning:
```bash
python main.py \
  --train_json_file "data/train.json" \
  --meta_json_file "data/meta.json" \
  --weights_path "weights"
```

You need at least 80 GB of GPU memory for training.
In addition, you may want to adjust the number of epochs and other hyperparameters, such as `iteration_num`, `unroll_steps`, `gradiant_accumulation`, `lr`, `scheduler_step_size`, etc. (see the tables below).
| Argument | Type | Default | Description |
|---|---|---|---|
| `--train_json_file` | str | None | Path to training dataset JSON file |
| `--meta_json_file` | str | None | Path to meta dataset JSON file |
| `--weights_path` | str | None | Directory to save/load model weights |
| `--reward_model` | str | "Qwen/Qwen2-VL-2B-Instruct" | Pretrained reward model identifier |
| Argument | Type | Default | Description |
|---|---|---|---|
| `--iteration_num` | int | 10000 | Total training iterations |
| `--batch_size` | int | 1 | Training batch size |
| `--max_epoch` | int | 120 | Maximum training epochs |
| `--device` | str | "cuda" | Compute device ("cuda" or "cpu") |
| `--precision` | str | "bf16" | Floating point precision (bf16/fp16/fp32) |
| `--strategy` | str | "default" | Training strategy (default) |
| `--seed` | int | 1 | Random seed for reproducibility |
| `--local_rank` | int | 0 | Local rank for distributed training |
| Argument | Type | Default | Description |
|---|---|---|---|
| `--lr` | float | 5e-7 | Main optimizer learning rate |
| `--meta_lr` | float | 0.01 | Meta-optimizer learning rate |
| `--weight_decay` | float | 1e-3 | Weight decay (L2 penalty) |
| `--meta_weight_decay` | float | 0.0 | Meta-optimizer weight decay |
| `--scheduler_step_size` | int | 5000 | Steps between LR adjustments |
| `--scheduler_gamma` | float | 0.5 | LR decay multiplier |
| Argument | Type | Default | Description |
|---|---|---|---|
| `--unroll_steps` | int | 5 | Unrolled optimization steps |
| `--gradiant_accumulation` | int | 1 | Gradient accumulation steps |
| Argument | Type | Default | Description |
|---|---|---|---|
| `--save_every_iterations` | int | 1000 | Checkpoint save interval |
```bash
python main.py \
  --train_json_file data/train.json \
  --meta_json_file data/meta.json \
  --weights_path models/dreamprm \
  --iteration_num 20000 \
  --lr 1e-6 \
  --meta_lr 0.05 \
  --precision bf16 \
  --reward_model "Qwen/Qwen2-VL-7B-Instruct" \
  --unroll_steps 8 \
  --save_every_iterations 500
```

We provide demo datasets with 10 domains (10k training samples) and 500 meta samples in our repository:
```
data/
├── meta.json
└── train.json
```

Each sample in the training dataset should follow this format:
```
{
    "id": 1128,                    # Unique question identifier
    "sid": 1,                      # Step number identifier
    "input": "Your task is...",    # Full question prompt
    "add": "Step 1: Restate...",   # Model's partial response
    "ground_truth": "1.78947",     # Correct final answer
    "image_path": "dataset/...",   # Path to input image
    "dataset": "chartqa",          # Domain name
    "score": 7,                    # Monte Carlo score
    "times": 11,                   # Monte Carlo iterations
    "accuracy": 0.6363             # Estimated accuracy (0-1)
}
```

Only the following fields are required:

```
{
    "input": "...",                # Question prompt (required)
    "add": "Step 1: ...",          # Model's partial response (required)
    "image_path": "xxx.png",       # Input image path (required)
    "dataset": "...",              # Domain name (required)
    "accuracy": 0.6363             # Estimated accuracy (0-1, required)
}
```
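The estimated accuracy appears to be the Monte Carlo success rate, i.e. roughly `score / times` (7 / 11 ≈ 0.636 in the example above). A hypothetical helper (not repository code) for assembling a record under that assumption:

```python
# Hypothetical helper for building one training record; field names follow the format
# above, and `accuracy` is taken as the Monte Carlo success rate score / times.
def make_train_sample(qid, sid, prompt, partial_response, ground_truth,
                      image_path, dataset, score, times):
    return {
        "id": qid,
        "sid": sid,
        "input": prompt,
        "add": partial_response,
        "ground_truth": ground_truth,
        "image_path": image_path,
        "dataset": dataset,
        "score": score,
        "times": times,
        "accuracy": score / times,   # e.g. 7 / 11 ≈ 0.636
    }
```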
Each sample in the meta dataset should follow this format:

```
{
    "id": 2,                       # Unique question identifier
    "true_false": True,            # Ground truth label
    "input": "Question: The...",   # Full question + model response
    "image_path": "dataset/..."    # Path to input image
}
```

Only the following fields are required:

```
{
    "true_false": True,            # Boolean ground truth (required)
    "input": "Question: ...",      # Full question + model response (required)
    "image_path": "xxx.png"        # Input image path (required)
}
```

DreamPRM can be extended to stronger models by leveraging a customized meta-training set. In this section, we demonstrate how to apply DreamPRM to o4-mini.
Generate multiple Chains of Thought (CoTs) using o4-mini. We highly recommend enabling high reasoning effort mode to produce richer and more reliable reasoning paths.

```python
from openai import OpenAI

client = OpenAI(api_key=api_key)
response = client.responses.create(
    model="o4-mini",
    reasoning={"effort": "high"},
    input=prompt,
)
```

We suggest generating 8 diverse CoTs per question to enable best-of-N selection.
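For instance, collecting the candidates could look like the following loop (a sketch that reuses the `client` and `prompt` from the snippet above; `output_text` is the OpenAI SDK's convenience accessor for the response text):

```python
# Sketch: collect 8 candidate CoTs for one question (reuses `client` and `prompt` above).
candidate_cots = []
for _ in range(8):                      # 8 diverse CoTs per question for best-of-N selection
    response = client.responses.create(
        model="o4-mini",
        reasoning={"effort": "high"},
        input=prompt,
    )
    candidate_cots.append(response.output_text)
```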
To improve response structure and clarity, we recommend using a structured thinking prompt that clearly outlines each reasoning step:
```python
# Structured prompting for o4-mini
prompt = """
You have been given a question that involves both an image and a text.
Your task is to analyze the question by following exactly five steps:
Step 1: Restate the question.
- Clearly rephrase or clarify the question in your own words.
Step 2: Gather evidence from the image.
- Describe relevant visual details (e.g., objects, people, locations, interactions) that may help answer the question.
Step 3: Identify any necessary background knowledge.
- List any general facts or assumptions required to answer the question.
Step 4: Reason using the available evidence.
- Integrate the image, text, and background knowledge to form a coherent reasoning path.
Step 5: Summarize and conclude.
- Provide a concise answer, supported by the reasoning in previous steps.
Finally, report your answer in the following format:
Final answer: ...
Question: ... (Insert question here)
"""
```

DreamPRM's upper-level optimization provides a realistic simulation of inference-time reasoning. To maximize performance with o4-mini, we recommend creating a custom meta-training set using CoTs and responses generated by o4-mini, with careful domain selection tailored to the model's strengths.
A sample format for meta-training data:

```
{
    "true_false": True,                                          # Ground truth label (required)
    "input": "Question: ... + (Insert o4-mini response here)",   # Full input prompt + model response (required)
    "image_path": "xxx.png"                                      # Path to the input image (required)
}
```

Use the re-trained PRM to select the most promising CoT among the candidates. Try different aggregation functions, such as mean, log-mean, or other variants, to evaluate and aggregate step-level scores effectively.
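As a minimal illustration of such aggregation (hypothetical names, not the repository's API; `step_scores_per_cot` is assumed to hold the PRM's per-step scores, each in (0, 1), for every candidate):

```python
import math

# Example aggregation functions over a CoT's step-level PRM scores (illustrative only).
def mean_agg(step_scores):
    return sum(step_scores) / len(step_scores)

def log_mean_agg(step_scores):
    # Average log-score, i.e. ranking candidates by the geometric mean of step scores.
    return sum(math.log(s) for s in step_scores) / len(step_scores)

def select_best_cot(candidate_cots, step_scores_per_cot, aggregate=mean_agg):
    # Pick the candidate whose aggregated step scores are highest.
    scored = zip(candidate_cots, step_scores_per_cot)
    return max(scored, key=lambda pair: aggregate(pair[1]))[0]
```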
This repository is under Apache License 2.0.
If you use this work in your research, please cite:
```bibtex
@misc{cao2025dreamprmdomainreweightedprocessreward,
      title={DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning},
      author={Qi Cao and Ruiyi Wang and Ruiyi Zhang and Sai Ashish Somayajula and Pengtao Xie},
      year={2025},
      eprint={2505.20241},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.20241},
}
```

