Wonjun Kang*1,5,
Kevin Galim*1,
Seunghyuk Oh*1,
Minjae Lee1,
Yuchen Zeng2,3,
Shuibai Zhang2,
Coleman Hooper4,
Yuezhou Hu4,
Hyung Il Koo1,
Nam Ik Cho5,
Kangwook Lee2,6
1FuriosaAI, 2UW-Madison, 3Microsoft Research, 4UC Berkeley, 5Seoul National University, 6KRAFTON AI
- Jan 25, 2026 Paper accepted at ICLR 2026! 🎉
- Oct 6, 2025 ParallelBench release!
We are currently working to support new models and implement advanced unmasking methods. If you are conducting dLLM research and would like to contribute new models or methods, please open an issue.
New Models
Advanced Unmasking Strategies
- WINO
- DUS
- APD
- SlowFast Sampling
- EB-Sampler
- KLASS
- Uncode (formerly PC-Sampler)
Diffusion LLMs (dLLMs) promise faster generation via parallel decoding. However, this speedup often comes at the cost of quality, because decoding multiple tokens in parallel ignores the dependencies between them, an issue that existing benchmarks do not sufficiently capture. To address this, we introduce ParallelBench, the first benchmark designed to rigorously test this trade-off through realistic tasks that humans and autoregressive (AR) LLMs solve easily, but that cause dLLMs to collapse as parallelism grows. We release ParallelBench to drive research towards truly efficient dLLMs that can overcome this challenge.
- Information-Theoretic Analysis: We derive error bounds on parallel decoding for tasks with inter-token dependencies; even an optimal model sees accuracy degrade as parallelism grows.
- Quantitative Case Studies: Synthetic list operations (Copy, Replace, Shuffle) with closed-form accuracy formulas pin down exactly where and how parallel decoding breaks (a toy illustration follows this list).
- Realistic Benchmark Tasks: 17 tasks across three categories (Waiting Line, Text Writing, Puzzles) that humans and AR LLMs solve easily, but that expose clear quality drops in dLLMs under parallel decoding.
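To build intuition for the failure mode these contributions quantify, here is a toy illustration (not taken from the paper, and deliberately simplified). Consider a Shuffle-style task whose correct output is a permutation of the input list: sampling every output position at once from its marginal distribution ignores the "no repeats" dependency and almost never yields a valid permutation, while sequential sampling that conditions on already-decoded tokens always does.

```python
import random

ITEMS = list("abcdefgh")

def parallel_decode(items):
    # All positions decoded at once, each sampled independently from the
    # (uniform) marginal over items -- dependencies between positions are ignored.
    return [random.choice(items) for _ in items]

def sequential_decode(items):
    # One position at a time, conditioning on previous picks
    # (i.e., sampling without replacement).
    return random.sample(items, len(items))

def is_valid_shuffle(output, items):
    return sorted(output) == sorted(items)

trials = 10_000
par_ok = sum(is_valid_shuffle(parallel_decode(ITEMS), ITEMS) for _ in range(trials))
seq_ok = sum(is_valid_shuffle(sequential_decode(ITEMS), ITEMS) for _ in range(trials))
print(f"parallel:   {par_ok / trials:.2%} valid shuffles")   # about 8!/8^8, roughly 0.24%
print(f"sequential: {seq_ok / trials:.2%} valid shuffles")   # 100%
```

The same intuition underlies the benchmark's harder tasks: the more tokens a model commits to in one step, the more inter-token constraints it can violate.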
These steps will guide you through setting up the necessary environment and dependencies.
- Conda: For managing the environment.
- NVIDIA GPU: CUDA >= 11.8.
- Java Development Kit (JDK): Required only for grammar-based evaluation metrics.
First, create and activate the conda environment. We use Python 3.10.
conda create -n parallelbench python=3.10 -y
conda activate parallelbench

We use uv for faster package installation. The following commands install PyTorch, all other required packages from requirements.txt, and (optionally) vLLM for the LLM baselines.
# Install uv, a fast package installer
pip install uv
# Install core dependencies
uv pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu118
uv pip install -r requirements.txt
uv pip install vllm # optional for LLM evaluation

If you need to run the grammar-based evaluations, install the JDK via conda:
conda install -c conda-forge openjdk=17
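Optionally, you can run a quick sanity check (not part of the repository) to confirm that the CUDA build of PyTorch was installed and a GPU is visible:

```python
import torch

# Verify the CUDA-enabled PyTorch build and GPU visibility.
print("torch:", torch.__version__)            # expected: 2.6.0+cu118
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```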
Here's a simple example of how to load a model and run it on a ParallelBench task. For a more in-depth example, see the demo.py script.

import torch
from transformers import AutoModel, AutoTokenizer
from dataset.parallel_bench import ParallelBench
# 1. Load the model and tokenizer
model = AutoModel.from_pretrained(
"Dream-org/Dream-v0-Instruct-7B",
trust_remote_code=True,
torch_dtype=torch.bfloat16
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
"Dream-org/Dream-v0-Instruct-7B",
trust_remote_code=True
)
# 2. Load a benchmark task and get a sample
task_name = "waiting_line/copy"
dataset = ParallelBench(task_name)
sample = dataset[0] # Get the first sample from the task
# 3. Prepare input from the benchmark sample
messages = sample["input"]["messages"]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
# 4. Generate the model's output
generated_ids = model.diffusion_generate(input_ids, max_tokens=32)
response = tokenizer.decode(generated_ids[0][len(input_ids[0]):], skip_special_tokens=True)
# 5. Compare the model's output with the reference label
print(f"Task: {task_name}")
print(f"Prompt: {messages[-1]['content']}")
print(f"Reference Label: {sample['label']}")
print(f"Model Output: {response}")
# To get the final score, run compute_metrics
metrics = dataset.compute_metrics([response], [sample["label"]])
print(f"Metrics: {metrics}")- Waiting Line
- Waiting Line
  - waiting_line/copy
  - waiting_line/insert_index
  - waiting_line/insert_random
  - waiting_line/remove_index
  - waiting_line/remove_random
  - waiting_line/replace_index
  - waiting_line/replace_random
  - waiting_line/reverse
  - waiting_line/shuffle
  - waiting_line/sort
- Text Writing
  - paraphrase_summarize/chatgpt-paraphrases
  - paraphrase_summarize/samsum
  - words_to_sentence/easy
  - words_to_sentence/medium
  - words_to_sentence/hard
- Puzzle
  - puzzle/latin_square_n4
  - puzzle/sudoku_n4_12
For additional models and unmasking methods, please refer to the Roadmap section.
- LLaDA Family (LLaDA 1.x)
- Dream Family (Dream, DiffuCoder)
- SDAR Family (SDAR, TraDo)
- Top-k methods (selection criteria illustrated in the sketch after this list):
- Random
- Confidence
- Entropy
- Margin
- Advanced methods: see the Advanced Unmasking Strategies listed in the Roadmap above.
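For intuition, the snippet below sketches how these top-k selection criteria are typically computed from the model's per-position token distributions: confidence is the maximum probability, entropy measures uncertainty (lower is more certain), and margin is the gap between the top two probabilities. This is an illustrative toy, not the repository's implementation.

```python
import torch

def select_positions(logits: torch.Tensor, k: int, criterion: str = "confidence") -> torch.Tensor:
    """Toy top-k unmasking: pick the k masked positions to reveal this step.

    logits: (num_masked_positions, vocab_size) predictions at masked positions.
    Returns indices of the k positions judged most certain (or k random ones).
    """
    probs = torch.softmax(logits, dim=-1)
    if criterion == "random":
        return torch.randperm(probs.size(0))[:k]
    if criterion == "confidence":              # highest max-probability first
        score = probs.max(dim=-1).values
    elif criterion == "entropy":               # lowest entropy first
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        score = -entropy
    elif criterion == "margin":                # largest top-1 vs. top-2 gap first
        top2 = probs.topk(2, dim=-1).values
        score = top2[:, 0] - top2[:, 1]
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return score.topk(k).indices

# Example: reveal the 2 most certain of 5 masked positions by margin.
print(select_positions(torch.randn(5, 100), k=2, criterion="margin"))
```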
You can easily generate custom tasks from YAML configuration files. For example, to create new copy and reverse tasks:
PYTHONPATH=. python dataset/parallel_bench/data/task.py --task test/copy_reverse/all

This command uses the configurations specified in dataset/parallel_bench/data/task_configs/.
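Before writing your own config, it can help to inspect the bundled ones. The snippet below only lists the config files and prints their top-level keys; it assumes PyYAML is available and that the configs use a .yaml/.yml extension, and it is a convenience sketch rather than part of the repository.

```python
import glob
import yaml  # PyYAML (assumed to be available in the environment)

# Print each bundled task config and its top-level structure.
for path in sorted(glob.glob("dataset/parallel_bench/data/task_configs/*.y*ml")):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    summary = list(cfg) if isinstance(cfg, dict) else type(cfg).__name__
    print(path, "->", summary)
```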
Before running the evaluations, you must export the necessary API keys as environment variables.
# For logging results
export WANDB_API_KEY="your_weights_and_biases_key"
# For commercial model APIs
export ANTHROPIC_API_KEY="your_anthropic_key" # For Haiku
export INCEPTION_API_KEY="your_mercury_model_key" # For Mercury

All experiments are launched using the run_all.py script. The general command structure is:

python run_all.py eval.py --device <gpu_ids> --cfg <path_to_config_file>

This section covers the commands to reproduce the main benchmark results from our paper. The following commands run evaluation on two GPUs.
- LLaDA 1.5:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/llada_1_5_all_tasks_list.yaml
- Dream:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/dream_all_tasks_list.yaml
- DiffuCoder:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/diffucoder_all_tasks_list.yaml
- LLaDA 1.0:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/llada_1_0_all_tasks_list.yaml
This section includes the commands for the comparative analysis between our models and other strong LLM baselines.
- LLaDA 1.5:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llada_1_5_all_tasks_list.yaml
- Dream:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/dream_all_tasks_list.yaml
- DiffuCoder:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/diffucoder_all_tasks_list.yaml
- LLaDA 1.0:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llada_1_0_all_tasks_list.yaml
- Mercury (requires single GPU):
python run_all.py eval.py --device 0 --cfg cfg/paper/dllm_vs_llm/mercury_all_tasks_list.yaml
- Haiku (requires single GPU):
python run_all.py eval.py --device 0 --cfg cfg/paper/dllm_vs_llm/haiku_all_tasks_list.yaml
- LLM Baselines (via vLLM):
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llm_all_tasks_list.yaml
All evaluation metrics and generated outputs are logged to Weights & Biases (wandb). Please ensure you have configured your API key and project settings.
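If you want to post-process results programmatically, the public wandb API can pull the logged metrics back into Python. The entity/project path below is a placeholder; replace it with wherever your runs were actually logged.

```python
import wandb

# Fetch logged evaluation runs for offline analysis.
api = wandb.Api()
for run in api.runs("your-entity/parallelbench"):   # placeholder entity/project
    # run.summary holds the final logged metrics of each run.
    print(run.name, run.summary._json_dict)
```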
This project builds upon the work of several fantastic open-source repositories. We extend our sincere thanks to the original authors for their contributions to the community.
@article{kang2025parallelbench,
title={ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs},
author={Kang, Wonjun and Galim, Kevin and Oh, Seunghyuk and Lee, Minjae and Zeng, Yuchen and Zhang, Shuibai and Hooper, Coleman and Hu, Yuezhou and Koo, Hyung Il and Cho, Nam Ik and others},
journal={arXiv preprint arXiv:2510.04767},
year={2025}
}
