Skip to content

NVlabs/PixelDiT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Image

PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu1,2   Wei Xiong1†   Weili Nie1   Yichen Sheng1   Shiqiu Liu1   Jiebo Luo2

1NVIDIA   2University of Rochester
Project Lead and Main Advising

Image   Image   Image   Image   Image
Image  

Image

PixelDiT is a single-stage, end-to-end pixel-space diffusion transformer that eliminates the VAE autoencoder entirely. It uses a dual-level architecture — patch-level DiT for global semantics + pixel-level DiT for texture details — to generate images directly in pixel space.

  • 1.61 FID on ImageNet 256×256
  • 0.74 GenEval / 83.5 DPG-Bench on text-to-image at 1024×1024
  • No VAE, no latent space

🔥 News

  • [2026/06] Added a post-modulation option for the PiT (pixel-level) blocks that mitigates the training loss spikes (#6). See c2i/README.md.
  • [2026/06] PixelDiT is selected as a CVPR 2026 Best Paper Finalist.
  • [2026/04] Training & inference code, and pre-trained models are released.
  • [2026/02] PixelDiT is accepted to CVPR 2026 Oral.
  • [2025/11] arxiv is released.

Performance

ImageNet 256×256 (PixelDiT-XL, 797M params)

All evaluations use FlowDPMSolver with 100 steps. 50K samples. Metrics follow ADM evaluation protocol.

Epoch gFID↓ CFG Scale Steps Sampler Time Shift CFG Interval
80 2.36 3.25 100 FlowDPMSolver 1.0 [0.1, 1.0]
160 1.97 3.25 100 FlowDPMSolver 1.0 [0.1, 1.0]
320 1.61 2.75 100 FlowDPMSolver 1.0 [0.1, 0.9]

ImageNet 512×512 (PixelDiT-XL, 797M params)

Resolution gFID↓ CFG Scale Steps Sampler Time Shift CFG Interval
512×512 1.81 3.5 100 FlowDPMSolver 2.0 [0.1, 1.0]

Text-to-Image (PixelDiT-T2I, 1.3B params)

Resolution GenEval↑ DPG-Bench↑
512×512 0.78 83.7
1024×1024 0.74 83.5

Getting Started

Docker image (recommended): nvcr.io/nvidia/pytorch:24.09-py3

pip install -r requirements.txt

Tasks

Note: Our models are resumed every 4 hours, using the timestamp as the random seed each time. As a result, the final training outcome may have a slight gap compared to a continuous run without intermediate resumes.

Class-to-Image Generation (ImageNet)

Training and evaluation instructions for class-conditioned generation on ImageNet 256×256 and 512×512.

c2i/README.md

Text-to-Image Generation

Multi-stage training (512px → 1024px) and inference for text-to-image generation.

t2i/README.md

Repository Structure

├── pixdit_core/      # Shared PixelDiT model definitions (c2i & t2i)
├── tools/            # Shared utilities (checkpoint download, GFLOPs computation)
├── c2i/              # Class-to-image
└── t2i/              # Text-to-image

Compute GFLOPs

Measure single-forward-pass GFLOPs for any PixelDiT model (run from project root):

# C2I (ImageNet 256x256, default resolution)
python tools/compute_flops.py --config c2i/configs/pix256_xl.yaml
# T2I at 1024x1024
python tools/compute_flops.py --config t2i/configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml --height 1024 --width 1024

Acknowledgements

We would like to thank the authors of PixNerd and SANA for sharing their code. We also thank the SANA team for sharing their text-to-image training data.

Citation

If you find this work useful, please cite:

@inproceedings{yu2026pixeldit,
      title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
      author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2026},
}

About

[CVPR 2026 Best Paper Finalist] Pixel Diffusion Transformers for Image Generation

Resources

License

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors