
ParallelBench: Understanding the Tradeoffs of Parallel Decoding in Diffusion LLMs


Wonjun Kang*1,5, Kevin Galim*1, Seunghyuk Oh*1, Minjae Lee1, Yuchen Zeng2,3, Shuibai Zhang2,
Coleman Hooper4, Yuezhou Hu4, Hyung Il Koo1, Nam Ik Cho5, Kangwook Lee2,6

1FuriosaAI, 2UW-Madison, 3Microsoft Research, 4UC Berkeley, 5Seoul National University, 6KRAFTON AI

Project | arXiv

🔔 Updates

  • Jan 25, 2026 Paper accepted at ICLR 2026! 🎉
  • Oct 6, 2025 ParallelBench release!

🗺️ Roadmap

We are currently working to support new models and implement advanced unmasking methods. If you are conducting dLLM research and would like to contribute new models or methods, please open an issue.

  • New Models
  • Advanced Unmasking Strategies

🔎 Overview


Diffusion LLMs (dLLMs) promise faster generation via parallel decoding. However, this speed often comes at the cost of quality, because decoding multiple tokens in parallel ignores the dependencies between them, a failure mode that existing benchmarks do not sufficiently capture. To address this gap, we introduce ParallelBench, the first benchmark designed to rigorously test this trade-off through realistic tasks that humans and autoregressive (AR) LLMs can easily solve, but which cause dLLMs to collapse as parallelism grows. We release ParallelBench to drive research towards truly efficient dLLMs that can overcome this challenge.

Features

  • Information-Theoretic Analysis: We derive error bounds on parallel decoding for tasks with inter-token dependencies. Even an optimal model sees accuracy degrade as parallelism grows.

  • Quantitative Case Studies: Synthetic list operations (Copy, Replace, Shuffle) with closed-form accuracy formulas pin down exactly where and how parallel decoding breaks (a toy numerical sketch follows this list).

  • Realistic Benchmark Tasks: 17 tasks across three categories (Waiting Line, Text Writing, Puzzles) that humans and AR LLMs solve easily, but expose clear quality drops in dLLMs under parallel decoding.
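
The collapse quantified in the case studies can be illustrated with a back-of-the-envelope calculation. The Python sketch below is our toy illustration, not the paper's derivation: it assumes a Shuffle-style task whose target is a uniformly random permutation of n distinct items, and an idealized model that unmasks k positions per step, sampling each independently from the correct per-position marginal (uniform over the items not yet placed). A parallel step is only consistent if no item is duplicated, so the probability of producing a valid shuffle shrinks as k grows.

import math

def step_consistency(m: int, k: int) -> float:
    """P(k items sampled independently and uniformly from m remaining items are all distinct)."""
    if k > m:
        return 0.0
    return math.perm(m, k) / m ** k  # m!/(m-k)! / m^k

def shuffle_accuracy(n: int, k: int) -> float:
    """P(the idealized parallel decoder emits a valid shuffle of n distinct items
    when it unmasks k positions per step from the remaining-item marginals)."""
    prob, remaining = 1.0, n
    while remaining > 0:
        step = min(k, remaining)
        prob *= step_consistency(remaining, step)
        remaining -= step
    return prob

if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(f"k={k}: P(valid shuffle of 16 items) = {shuffle_accuracy(16, k):.3f}")

With k=1 (fully sequential) the probability is 1.0; it drops sharply once several positions are filled per step, mirroring the trend the benchmark measures on real dLLMs.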

⚙️ Setup

These steps will guide you through setting up the necessary environment and dependencies.

1. Prerequisites

  • Conda: For managing the environment.
  • NVIDIA GPU: CUDA >= 11.8.
  • Java Development Kit (JDK): Required only for grammar-based evaluation metrics.

2. Create Conda Environment

First, create and activate the conda environment. We use Python 3.10.

conda create -n parallelbench python=3.10 -y
conda activate parallelbench

3. Install Python Dependencies

We use uv for faster package installation. The following commands install PyTorch, the packages listed in requirements.txt, and (optionally) vLLM for the autoregressive LLM baselines.

# Install uv, a fast package installer
pip install uv

# Install core dependencies
uv pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu118
uv pip install -r requirements.txt
uv pip install vllm  # optional for LLM evaluation

4. Install Java (Optional)

If you need to run the grammar-based evaluations, install the JDK via conda:

conda install -c conda-forge openjdk=17

⚡ Quickstart

Here's a simple example of how to load a model and run it on a ParallelBench task. For a more in-depth example, see the demo.py script.

import torch
from transformers import AutoModel, AutoTokenizer
from dataset.parallel_bench import ParallelBench

# 1. Load the model and tokenizer
model = AutoModel.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).cuda()

tokenizer = AutoTokenizer.from_pretrained(
    "Dream-org/Dream-v0-Instruct-7B",
    trust_remote_code=True
)

# 2. Load a benchmark task and get a sample
task_name = "waiting_line/copy"
dataset = ParallelBench(task_name)
sample = dataset[0] # Get the first sample from the task

# 3. Prepare input from the benchmark sample
messages = sample["input"]["messages"]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# 4. Generate the model's output
generated_ids = model.diffusion_generate(input_ids, max_tokens=32)
response = tokenizer.decode(generated_ids[0][len(input_ids[0]):], skip_special_tokens=True)

# 5. Compare the model's output with the reference label
print(f"Task: {task_name}")
print(f"Prompt: {messages[-1]['content']}")
print(f"Reference Label: {sample['label']}")
print(f"Model Output:    {response}")

# To get the final score, run compute_metrics
metrics = dataset.compute_metrics([response], [sample["label"]])
print(f"Metrics: {metrics}")

🎯 Evaluation Coverage

Tasks

  • Waiting Line
    • waiting_line/copy
    • waiting_line/insert_index
    • waiting_line/insert_random
    • waiting_line/remove_index
    • waiting_line/remove_random
    • waiting_line/replace_index
    • waiting_line/replace_random
    • waiting_line/reverse
    • waiting_line/shuffle
    • waiting_line/sort
  • Text Writing
    • paraphrase_summarize/chatgpt-paraphrases
    • paraphrase_summarize/samsum
    • words_to_sentence/easy
    • words_to_sentence/medium
    • words_to_sentence/hard
  • Puzzle
    • puzzle/latin_square_n4
    • puzzle/sudoku_n4_12

Models

  • LLaDA 1.0 / LLaDA 1.5
  • Dream
  • Diffucoder
  • Mercury (API)
  • Haiku (API)

For additional models and unmasking methods, please refer to the Roadmap section.

Unmasking Methods

  • Top-k methods (see the scoring sketch after this list):
    • Random
    • Confidence
    • Entropy
    • Margin
  • Advanced methods:
    • Threshold-based
    • Factor-based
    • RCR
    • ReMDM
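
For intuition, the top-k criteria above differ only in how each still-masked position is scored before the k best-scoring positions are unmasked. The snippet below is a simplified, self-contained sketch of those scoring rules on toy logits; it is not the repository's implementation, and the tensor shapes are assumptions for illustration.

import torch

def topk_unmask_positions(logits: torch.Tensor, k: int, method: str = "confidence") -> torch.Tensor:
    """Pick k masked positions to unmask from per-position logits of shape [num_masked, vocab].

    Simplified sketch of the four top-k scoring rules; not the repository's implementation.
    """
    probs = logits.softmax(dim=-1)
    if method == "random":                      # every masked position is equally likely
        scores = torch.rand(logits.size(0))
    elif method == "confidence":                # highest top-1 probability first
        scores = probs.max(dim=-1).values
    elif method == "entropy":                   # lowest predictive entropy first
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
        scores = -entropy
    elif method == "margin":                    # largest top-1 vs. top-2 gap first
        top2 = probs.topk(2, dim=-1).values
        scores = top2[:, 0] - top2[:, 1]
    else:
        raise ValueError(f"unknown method: {method}")
    return scores.topk(min(k, logits.size(0))).indices

positions = topk_unmask_positions(torch.randn(10, 32000), k=3, method="entropy")
print(positions)  # indices of the 3 masked positions that would be unmasked this step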

🛠️ Create Your Own Tasks

You can easily generate custom tasks from YAML configuration files. For example, to create new copy and reverse tasks:

PYTHONPATH=. python dataset/parallel_bench/data/task.py --task test/copy_reverse/all

This command uses the configurations specified in dataset/parallel_bench/data/task_configs/.
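
Once generated, a custom task can be inspected with the same ParallelBench class used in the quickstart. In the sketch below, the task name test/copy_reverse/copy is hypothetical, chosen to match the example command above; substitute whatever name your YAML configuration actually produces.

from dataset.parallel_bench import ParallelBench

# Hypothetical task name; replace with the name generated from your config.
dataset = ParallelBench("test/copy_reverse/copy")
sample = dataset[0]
print(sample["input"]["messages"][-1]["content"])  # prompt shown to the model
print(sample["label"])                             # reference label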

🚀 Running Evaluations

Configuration

Before running the evaluations, you must export the necessary API keys as environment variables.

# For logging results
export WANDB_API_KEY="your_weights_and_biases_key"

# For commercial model APIs
export ANTHROPIC_API_KEY="your_anthropic_key"      # For Haiku
export INCEPTION_API_KEY="your_mercury_model_key"  # For Mercury

All experiments are launched using the run_all.py script. The general command structure is:

python run_all.py eval.py --device <gpu_ids> --cfg <path_to_config_file>

Main Benchmark Reproduction

This section covers the commands to reproduce the main benchmark results from our paper. The following commands run evaluation on two GPUs.

  • LLaDA 1.5:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/llada_1_5_all_tasks_list.yaml
  • Dream:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/dream_all_tasks_list.yaml
  • Diffucoder:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/diffucoder_all_tasks_list.yaml
  • LLaDA 1.0:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/benchmark/llada_1_0_all_tasks_list.yaml

dLLM vs. Autoregressive LLM Comparison

This section includes the commands for the comparative analysis between dLLMs and strong autoregressive LLM baselines.

  • LLaDA 1.5:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llada_1_5_all_tasks_list.yaml
  • Dream:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/dream_all_tasks_list.yaml
  • Diffucoder:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/diffucoder_all_tasks_list.yaml
  • LLaDA 1.0:
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llada_1_0_all_tasks_list.yaml
  • Mercury (requires single GPU):
    python run_all.py eval.py --device 0 --cfg cfg/paper/dllm_vs_llm/mercury_all_tasks_list.yaml
  • Haiku (requires single GPU):
    python run_all.py eval.py --device 0 --cfg cfg/paper/dllm_vs_llm/haiku_all_tasks_list.yaml
  • LLM Baselines (via vLLM):
    python run_all.py eval.py --device 0 1 --cfg cfg/paper/dllm_vs_llm/llm_all_tasks_list.yaml

Results

All evaluation metrics and generated outputs are logged to Weights & Biases (wandb). Please ensure you have configured your API key and project settings.
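
If you want to pull the logged metrics back programmatically, the public wandb API can be used as sketched below; the entity and project names are placeholders, not values defined by this repository.

import wandb

# Placeholder entity/project; replace with the W&B project your runs log to.
api = wandb.Api()
for run in api.runs("your-entity/parallelbench"):
    print(run.name, run.summary._json_dict)  # per-run evaluation metrics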

🙏 Acknowledgements

This project builds upon the work of several fantastic open-source repositories. We extend our sincere thanks to the original authors for their contributions to the community.

📖 Citation

@article{kang2025parallelbench,
  title={ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs},
  author={Kang, Wonjun and Galim, Kevin and Oh, Seunghyuk and Lee, Minjae and Zeng, Yuchen and Zhang, Shuibai and Hooper, Coleman and Hu, Yuezhou and Koo, Hyung Il and Cho, Nam Ik and others},
  journal={arXiv preprint arXiv:2510.04767},
  year={2025}
}
