Wonjun Kang*1,5,
Kevin Galim*1,
Seunghyuk Oh*1,
Minjae Lee1,
Yuchen Zeng2,3,
Shuibai Zhang2,
Coleman Hooper4,
Yuezhou Hu4,
Hyung Il Koo1,
Nam Ik Cho5,
Kangwook Lee2,6
1FuriosaAI, 2UW-Madison, 3Microsoft Research, 4UC Berkeley, 5Seoul National University, 6KRAFTON AI
- Jan 25, 2026 Paper accepted at ICLR 2026! 🎉
- Oct 6, 2025 ParallelBench release!
We are currently working to support new models and implement advanced unmasking methods. If you are conducting dLLM research and would like to contribute new models or methods, please open an issue.
New Models
Advanced Unmasking Strategies
- WINO
- DUS
- APD
- SlowFast Sampling
- EB-Sampler
- KLASS
- Uncode (formerly PC-Sampler)
Diffusion LLMs (dLLMs) promise faster generation via parallel decoding. However, this speedup often comes at the cost of quality, because decoding multiple tokens in parallel ignores the dependencies between them, an issue that existing benchmarks do not sufficiently capture. To address this, we introduce ParallelBench, the first benchmark designed to rigorously test this trade-off through realistic tasks that humans and autoregressive (AR) LLMs solve easily, but that cause dLLMs to collapse as parallelism grows. We release ParallelBench to drive research towards truly efficient dLLMs that can overcome this challenge.
- Information-Theoretic Analysis: We derive error bounds on parallel decoding for tasks with inter-token dependencies; even an optimal model sees accuracy degrade as parallelism grows.
- Quantitative Case Studies: Synthetic list operations (Copy, Replace, Shuffle) with closed-form accuracy formulas pin down exactly where and how parallel decoding breaks (a toy illustration follows this list).
- Realistic Benchmark Tasks: 17 tasks across three categories (Waiting Line, Text Writing, Puzzles) that humans and AR LLMs solve easily, but that expose clear quality drops in dLLMs under parallel decoding.
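To build intuition for the failure mode these contributions quantify, here is a toy illustration (not taken from the paper, and deliberately simplified). Consider a Shuffle-style task whose correct output is a permutation of the input list: sampling every output position at once from its marginal distribution ignores the "no repeats" dependency and almost never yields a valid permutation, while sequential sampling that conditions on already-decoded tokens always does.

```python
import random

ITEMS = list("abcdefgh")

def parallel_decode(items):
    # All positions decoded at once, each sampled independently from the
    # (uniform) marginal over items -- dependencies between positions are ignored.
    return [random.choice(items) for _ in items]

def sequential_decode(items):
    # One position at a time, conditioning on previous picks
    # (i.e., sampling without replacement).
    return random.sample(items, len(items))

def is_valid_shuffle(output, items):
    return sorted(output) == sorted(items)

trials = 10_000
par_ok = sum(is_valid_shuffle(parallel_decode(ITEMS), ITEMS) for _ in range(trials))
seq_ok = sum(is_valid_shuffle(sequential_decode(ITEMS), ITEMS) for _ in range(trials))
print(f"parallel:   {par_ok / trials:.2%} valid shuffles")   # about 8!/8^8, roughly 0.24%
print(f"sequential: {seq_ok / trials:.2%} valid shuffles")   # 100%
```

The same intuition underlies the benchmark's harder tasks: the more tokens a model commits to in one step, the more inter-token constraints it can violate.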
These steps will guide you through setting up the necessary environment and dependencies.
- Conda: For managing the environment.
- NVIDIA GPU: CUDA >= 11.8.
- Java Development Kit (JDK): Required only for grammar-based evaluation metrics.
First, create and activate the conda environment. We use Python 3.10.
conda create -n parallelbench python=3.10 -y
conda activate parallelbench

We use uv for faster package installation. The following commands install PyTorch, all other required packages from requirements.txt, and (optionally) vLLM for the LLM baselines.
# Install uv, a fast package installer
pip install uv
# Install core dependencies
uv pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu118
uv pip install -r requirements.txt
uv pip install vllm # optional for LLM evaluation

If you need to run the grammar-based evaluations, install the JDK via conda:
conda install -c conda-forge openjdk=17
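Optionally, you can run a quick sanity check (not part of the repository) to confirm that the CUDA build of PyTorch was installed and a GPU is visible:

```python
import torch

# Verify the CUDA-enabled PyTorch build and GPU visibility.
print("torch:", torch.__version__)            # expected: 2.6.0+cu118
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```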
Here's a simple example of how to load a model and run it on a ParallelBench task. For a more in-depth example, see the demo.py script.

import torch
from transformers import AutoModel, AutoTokenizer
from dataset.parallel_bench import ParallelBench
# 1. Load the model and tokenizer
model = AutoModel.from_pretrained(
"Dream-org/Dream-v0-Instruct-7B",
trust_remote_code=True,
torch_dtype=torch.bfloat16
).cuda()
tokenizer = AutoTokenizer.from_pretrained(
"Dream-org/Dream-v0-Instruct-7B",
trust_remote_code=True
)
# 2. Load a benchmark task and get a sample
task_name = "waiting_line/copy"
dataset = ParallelBench(task_name)
sample = dataset[0] # Get the first sample from the task
# 3. Prepare input from the benchmark sample
messages = sample["input"]["messages"]
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
# 4. Generate the model's output
generated_ids = model.diffusion_generate(input_ids, max_tokens=32)
response = tokenizer.decode(generated_ids[0][len(input_ids[0]):], skip_special_tokens=True)
# 5. Compare the model's output with the reference label
print(f"Task: {task_name}")
print(f"Prompt: {messages[-1]['content']}")
print(f"Reference Label: {sample['label']}")
print(f"Model Output: {response}")
# To get the final score, run compute_metrics
metrics = dataset.compute_metrics([response], [sample["label"]])
print(f"Metrics: {metrics}")- Waiting Line
- Waiting Line
  - waiting_line/copy
  - waiting_line/insert_index
  - waiting_line/insert_random
  - waiting_line/remove_index
  - waiting_line/remove_random
  - waiting_line/replace_index
  - waiting_line/replace_random
  - waiting_line/reverse
  - waiting_line/shuffle
  - waiting_line/sort
- Text Writing
  - paraphrase_summarize/chatgpt-paraphrases
  - paraphrase_summarize/samsum
  - words_to_sentence/easy
  - words_to_sentence/medium
  - words_to_sentence/hard
- Puzzle
  - puzzle/latin_square_n4
  - puzzle/sudoku_n4_12
For additional models and unmasking methods, please refer to the Roadmap section.
- LLaDA Family (LLaDA 1.x)
- Dream Family (Dream, DiffuCoder)
- SDAR Family (SDAR, TraDo)
- Top-k methods (selection criteria illustrated in the sketch after this list):
- Random
- Confidence
- Entropy
- Margin
- Advanced methods: see the Advanced Unmasking Strategies listed in the Roadmap above.
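For intuition, the snippet below sketches how these top-k selection criteria are typically computed from the model's per-position token distributions: confidence is the maximum probability, entropy measures uncertainty (lower is more certain), and margin is the gap between the top two probabilities. This is an illustrative toy, not the repository's implementation.

```python
import torch

def select_positions(logits: torch.Tensor, k: int, criterion: str = "confidence") -> torch.Tensor:
    """Toy top-k unmasking: pick the k masked positions to reveal this step.

    logits: (num_masked_positions, vocab_size) predictions at masked positions.
    Returns indices of the k positions judged most certain (or k random ones).
    """
    probs = torch.softmax(logits, dim=-1)
    if criterion == "random":
        return torch.randperm(probs.size(0))[:k]
    if criterion == "confidence":              # highest max-probability first
        score = probs.max(dim=-1).values
    elif criterion == "entropy":               # lowest entropy first
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        score = -entropy
    elif criterion == "margin":                # largest top-1 vs. top-2 gap first
        top2 = probs.topk(2, dim=-1).values
        score = top2[:, 0] - top2[:, 1]
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return score.topk(k).indices

# Example: reveal the 2 most certain of 5 masked positions by margin.
print(select_positions(torch.randn(5, 100), k=2, criterion="margin"))
```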
You can easily generate custom tasks from YAML configuration files. For example, to create new copy and reverse tasks:
PYTHONPATH=. python dataset/parallel_bench/data/task.py --task test/copy_reverse/all

This command uses the configurations specified in dataset/parallel_bench/data/task_configs/.
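Before writing your own config, it can help to inspect the bundled ones. The snippet below only lists the config files and prints their top-level keys; it assumes PyYAML is available and that the configs use a .yaml/.yml extension, and it is a convenience sketch rather than part of the repository.

```python
import glob
import yaml  # PyYAML (assumed to be available in the environment)

# Print each bundled task config and its top-level structure.
for path in sorted(glob.glob("dataset/parallel_bench/data/task_configs/*.y*ml")):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    summary = list(cfg) if isinstance(cfg, dict) else type(cfg).__name__
    print(path, "->", summary)
```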
Before running the evaluations, you must export the necessary API keys as environment variables.
# For logging results
export WANDB_API_KEY="your_weights_and_biases_key"
# For commercial model APIs
export ANTHROPIC_API_KEY="your_anthropic_key" # For Haiku
export INCEPTION_API_KEY="your_mercury_model_key" # For Mercury

All experiments are launched using the run_all.py script. The general command structure is:

python run_all.py eval.py --device <gpu_ids> --cfg <path_to_config_file>

This section covers the commands to reproduce the main benchmark results from our paper. The following commands run evaluation on two GPUs.
- LLaDA 1.5:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/llada_1_5_all_tasks_list.yaml
- Dream:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/dream_all_tasks_list.yaml
- DiffuCoder:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/diffucoder_all_tasks_list.yaml
- LLaDA 1.0:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/llada_1_0_all_tasks_list.yaml
This section includes the commands for the comparative analysis between our models and other strong LLM baselines.
- LLaDA 1.5:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llada_1_5_all_tasks_list.yaml
- Dream:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/dream_all_tasks_list.yaml
- DiffuCoder:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/diffucoder_all_tasks_list.yaml
- LLaDA 1.0:
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llada_1_0_all_tasks_list.yaml
- Mercury (requires single GPU):
python run_all.py eval.py --device 0 --cfg cfg/paper/dllm_vs_llm/mercury_all_tasks_list.yaml
- Haiku (requires single GPU):
python run_all.py eval.py --device 0 --cfg cfg/paper/dllm_vs_llm/haiku_all_tasks_list.yaml
- LLM Baselines (via vLLM):
python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llm_all_tasks_list.yaml
All evaluation metrics and generated outputs are logged to Weights & Biases (wandb). Please ensure you have configured your API key and project settings.
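If you want to post-process results programmatically, the public wandb API can pull the logged metrics back into Python. The entity/project path below is a placeholder; replace it with wherever your runs were actually logged.

```python
import wandb

# Fetch logged evaluation runs for offline analysis.
api = wandb.Api()
for run in api.runs("your-entity/parallelbench"):   # placeholder entity/project
    # run.summary holds the final logged metrics of each run.
    print(run.name, run.summary._json_dict)
```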
This project builds upon the work of several fantastic open-source repositories. We extend our sincere thanks to the original authors for their contributions to the community.
@article{kang2025parallelbench,
title={ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs},
author={Kang, Wonjun and Galim, Kevin and Oh, Seunghyuk and Lee, Minjae and Zeng, Yuchen and Zhang, Shuibai and Hooper, Coleman and Hu, Yuezhou and Koo, Hyung Il and Cho, Nam Ik and others},
journal={arXiv preprint arXiv:2510.04767},
year={2025}
}
