Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

Zhuoyang Zhang*, Luke J. Huang*, Chengyue Wu, Shang Yang, Yao Lu, Song Han
MIT, NVIDIA
(* indicates equal contribution)

Abstract

We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction, but have achieved only limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the number of generation steps from 256 to 20 (256×256 resolution) and from 1024 to 48 (512×512 resolution) without compromising quality on ImageNet class-conditional generation, while achieving at least 3.4× lower latency than previous parallelized autoregressive models.

Method

  1. Flexible Parallelized Autoregressive Modeling. The core idea is to decouple context representation from token generation by using separate tokens for each role. Previously generated tokens are encoded to provide context, while generation is driven by learnable position query tokens corresponding to the desired target positions. By feeding in these position-specific queries directly, the model can generate tokens at arbitrary target positions in parallel. This design lets the model exploit positional information in both the context and generation pathways, enabling arbitrary generation order.
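As a minimal illustration of the decoupled design, one parallel decoding step can be described by an attention mask in which the context tokens remain causal, while the position query tokens attend to the full context and to each other. This is a sketch under our own naming (`build_parallel_decode_mask` is not from the paper):

```python
def build_parallel_decode_mask(num_context, num_queries):
    """Attention mask for one parallel decoding step (True = may attend).

    Illustrative sketch: the already-generated context tokens keep a
    causal mask among themselves, while the position query tokens see
    all context tokens and each other, giving concurrently generated
    tokens mutual visibility for consistent parallel decoding.
    """
    total = num_context + num_queries
    mask = [[False] * total for _ in range(total)]
    # causal mask over the already-generated context
    for i in range(num_context):
        for j in range(i + 1):
            mask[i][j] = True
    # position query tokens: full context plus mutual visibility
    for q in range(num_context, total):
        for j in range(num_context):
            mask[q][j] = True       # attend to every context token
        for j in range(num_context, total):
            mask[q][j] = True       # attend to the other queries in this step
    return mask
```

With 3 context tokens and 2 queries, the two queries can attend to each other and to all three context tokens, while the context itself stays causal.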
  2. Locality-aware Generation Ordering. To fully exploit the flexible parallelized autoregressive modeling architecture, we introduce a locality-aware generation order schedule guided by two key principles. (1) High proximity to previously generated tokens: target positions should be spatially close to existing context to ensure strong conditioning. (2) Low proximity among concurrently generated tokens: tokens predicted in the same parallel step should be spatially distant to reduce mutual dependency.
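The two principles can be sketched as a greedy scheduler over a latent grid: each group prefers positions near already-decoded tokens while enforcing a minimum spacing within the group. The function name, scoring rule, and random tie-breaking below are our assumptions for illustration, not the paper's exact schedule:

```python
import math
import random

def locality_aware_order(h, w, group_sizes, min_intra_dist=2.0, seed=0):
    """Greedy sketch of a locality-aware generation schedule.

    Each parallel group prefers positions close to already-generated
    tokens (strong contextual support) while keeping positions within
    the same group at least `min_intra_dist` apart (weak intra-group
    dependency). Illustrative only, not the paper's exact algorithm.
    """
    rng = random.Random(seed)
    coords = [(i, j) for i in range(h) for j in range(w)]
    remaining = set(range(h * w))
    generated = []            # positions decoded in earlier steps
    schedule = []
    for size in group_sizes:
        group = []
        for _ in range(size):
            best, best_score = None, math.inf
            for idx in remaining:
                # spacing constraint against tokens chosen for this group
                if any(math.dist(coords[idx], coords[g]) < min_intra_dist
                       for g in group):
                    continue
                # distance to the nearest already-generated token
                d_ctx = min((math.dist(coords[idx], coords[g])
                             for g in generated), default=0.0)
                score = d_ctx + 1e-6 * rng.random()   # random tie-break
                if score < best_score:
                    best, best_score = idx, score
            if best is None:  # no position satisfies the spacing constraint
                break
            group.append(best)
            remaining.discard(best)
        generated.extend(group)
        schedule.append(group)
    return schedule
```

For example, on a 6×6 grid with group sizes (1, 2, 3), later groups grow while each member sits near already-decoded context yet stays at least `min_intra_dist` away from its group mates.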

Results

  1. Compared with raster order
  2. Compared with previous parallelized autoregressive models


Citation

@article{zhang2025locality,
 title={Locality-aware parallel decoding for efficient autoregressive image generation},
 author={Zhang, Zhuoyang and Huang, Luke J and Wu, Chengyue and Yang, Shang and Peng, Kelly and Lu, Yao and Han, Song},
 journal={arXiv preprint arXiv:2507.01957},
 year={2025}
}


Acknowledgment

We thank MIT-IBM Watson AI Lab, National Science Foundation, Hyundai, and Amazon for supporting this research.
