GitHub - NVlabs/PixelDiT: [CVPR 2026 Best Paper Finalist] Pixel Diffusion Transformers for Image Generation

PixelDiT: Pixel Diffusion Transformers for Image Generation

Yongsheng Yu¹,² Wei Xiong¹† Weili Nie¹ Yichen Sheng¹ Shiqiu Liu¹ Jiebo Luo²

¹NVIDIA ²University of Rochester
†Project Lead and Main Advising

PixelDiT is a single-stage, end-to-end pixel-space diffusion transformer that eliminates the VAE entirely. It uses a dual-level architecture (a patch-level DiT for global semantics plus a pixel-level DiT for texture details) to generate images directly in pixel space.

1.61 FID on ImageNet 256×256
0.74 GenEval / 83.5 DPG-Bench on text-to-image at 1024×1024
No VAE, no latent space
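
To make the two levels concrete, here is a minimal sketch of how one image can be viewed at patch granularity and at pixel granularity. It is an illustration only: the helper names `patchify` and `token_counts`, the list-of-lists image, and the patch size of 16 are assumptions, not values or code taken from this repository.

```python
# Illustrative only: the patch size and the list-of-lists image are assumptions,
# not values or data structures taken from the PixelDiT code.

def patchify(image, patch_size):
    """Split an H x W grid into non-overlapping patch_size x patch_size patches.

    Returns the patches in row-major order; each patch is a flat list of its
    pixels, also in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


def token_counts(height, width, patch_size):
    """Return (number of patches, pixels per patch) for one image."""
    num_patches = (height // patch_size) * (width // patch_size)
    return num_patches, patch_size * patch_size


if __name__ == "__main__":
    # Hypothetical patch size of 16 on a 256x256 image.
    print(token_counts(256, 256, 16))
```

Under these assumptions, a patch-level model would attend over one token per patch, while a pixel-level model would work on the individual pixels inside each patch.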

🔥 News

[2026/06] Added a post-modulation option for the PiT (pixel-level) blocks that mitigates the training loss spikes (#6). See c2i/README.md.
[2026/06] PixelDiT is selected as a CVPR 2026 Best Paper Finalist.
[2026/04] Training & inference code, and pre-trained models are released.
[2026/02] PixelDiT is accepted to CVPR 2026 as an Oral.
[2025/11] The arXiv preprint is released.

Performance

ImageNet 256×256 (PixelDiT-XL, 797M params)

All evaluations use FlowDPMSolver with 100 steps and 50K samples. Metrics follow the ADM evaluation protocol.

Epoch	gFID↓	CFG Scale	Steps	Sampler	Time Shift	CFG Interval
80	2.36	3.25	100	FlowDPMSolver	1.0	[0.1, 1.0]
160	1.97	3.25	100	FlowDPMSolver	1.0	[0.1, 1.0]
320	1.61	2.75	100	FlowDPMSolver	1.0	[0.1, 0.9]
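
The CFG Scale and CFG Interval columns can be read as "apply classifier-free guidance only while the sampling time lies inside the interval". The sketch below shows that reading. The function name `guided_prediction`, the inclusive bounds, and the choice to fall back to the conditional prediction outside the interval are assumptions; the sampler in this repository may use a different convention.

```python
def guided_prediction(cond, uncond, t, cfg_scale, interval):
    """Classifier-free guidance applied only while t lies inside `interval`.

    cond and uncond are the conditional and unconditional model outputs,
    given as flat lists of floats. Outside the interval the conditional
    output is returned unchanged, which equals a guidance scale of 1.
    """
    low, high = interval  # assumed inclusive on both ends
    if not (low <= t <= high):
        return list(cond)
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # Settings from the 320-epoch row: scale 2.75, interval [0.1, 0.9].
    print(guided_prediction([1.0], [0.0], 0.5, 2.75, (0.1, 0.9)))
    print(guided_prediction([1.0], [0.0], 0.95, 2.75, (0.1, 0.9)))
```

With the 320-epoch settings, guidance would be skipped for times above 0.9 under this reading.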

ImageNet 512×512 (PixelDiT-XL, 797M params)

Resolution	gFID↓	CFG Scale	Steps	Sampler	Time Shift	CFG Interval
512×512	1.81	3.5	100	FlowDPMSolver	2.0	[0.1, 1.0]
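
This README does not define the Time Shift column. One remapping commonly used by flow-matching samplers is t' = s·t / (1 + (s − 1)·t), where s is the shift. The sketch below assumes that form, so treat it as a guess at the meaning rather than a description of this repository's sampler. The function name `shift_time` is hypothetical.

```python
def shift_time(t, shift):
    """Remap a time t in [0, 1] with the assumed shift formula.

    A shift of 1.0 leaves t unchanged; larger shifts push intermediate
    times toward 1 while keeping the endpoints 0 and 1 fixed.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return shift * t / (1.0 + (shift - 1.0) * t)


if __name__ == "__main__":
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        # Shift 1.0 matches the 256x256 rows, shift 2.0 the 512x512 row.
        print(t, shift_time(t, 1.0), shift_time(t, 2.0))
```

Under this assumption, the shift of 1.0 used at 256×256 is the identity, and the shift of 2.0 used at 512×512 maps the midpoint 0.5 to about 0.667.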

Text-to-Image (PixelDiT-T2I, 1.3B params)

Resolution	GenEval↑	DPG-Bench↑
512×512	0.78	83.7
1024×1024	0.74	83.5

Getting Started

Docker image (recommended): nvcr.io/nvidia/pytorch:24.09-py3

pip install -r requirements.txt

Tasks

Note: Our training jobs were resumed every 4 hours, each time using the current timestamp as the random seed. As a result, the final results may differ slightly from those of a continuous run without intermediate resumes.
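
The effect of timestamp seeding can be seen in a small standalone sketch. It uses only Python's `random` and `time` modules. The helpers `resume_seed` and `first_draws` are hypothetical and are not the seeding code used in this repository.

```python
import random
import time


def resume_seed(now=None):
    """Derive a 32-bit seed from the wall-clock time, as a resumed job might."""
    timestamp = time.time() if now is None else now
    return int(timestamp) % (2 ** 32)


def first_draws(seed, n=3):
    """Return the first n random numbers produced from a given seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    start = 1_700_000_000  # arbitrary example timestamp
    seed_a = resume_seed(start)
    seed_b = resume_seed(start + 4 * 3600)  # a resume four hours later
    print(seed_a, first_draws(seed_a))
    print(seed_b, first_draws(seed_b))
```

Each resume starts a different random stream, so noise and data-order sampling after a resume do not match what an uninterrupted run would have drawn. Fixing the seed makes a single segment reproducible, but not the stitched-together run.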

Class-to-Image Generation (ImageNet)

Training and evaluation instructions for class-conditioned generation on ImageNet 256×256 and 512×512.

→ c2i/README.md

Text-to-Image Generation

Multi-stage training (512px → 1024px) and inference for text-to-image generation.

→ t2i/README.md

Repository Structure

├── pixdit_core/      # Shared PixelDiT model definitions (c2i & t2i)
├── tools/            # Shared utilities (checkpoint download, GFLOPs computation)
├── c2i/              # Class-to-image
└── t2i/              # Text-to-image

Compute GFLOPs

Measure single-forward-pass GFLOPs for any PixelDiT model (run from the project root):

# C2I (ImageNet 256x256, default resolution)
python tools/compute_flops.py --config c2i/configs/pix256_xl.yaml

# T2I at 1024x1024
python tools/compute_flops.py --config t2i/configs/PixelDiT_1024px_pixel_diffusion_stage3.yaml --height 1024 --width 1024
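
For a rough sanity check on the reported numbers, the cost of a single dense layer can be estimated by hand. The sketch below counts 2 FLOPs per multiply-add, which is an assumption: `tools/compute_flops.py` may use a different convention (for example, counting multiply-adds once). The layer sizes are hypothetical and are not taken from any config in this repository.

```python
def linear_flops(num_tokens, in_features, out_features):
    """FLOPs for one dense layer applied to every token.

    Counts 2 FLOPs (one multiply, one add) per weight per token and
    ignores the bias term.
    """
    return 2 * num_tokens * in_features * out_features


def to_gflops(flops):
    """Convert a raw FLOP count to GFLOPs (1 GFLOP = 1e9 FLOPs)."""
    return flops / 1e9


if __name__ == "__main__":
    # Hypothetical sizes: 256 tokens, 1024 -> 4096 features.
    flops = linear_flops(256, 1024, 4096)
    print(flops, to_gflops(flops))
```

Because the cost is linear in the token count, this estimate grows with the number of tokens the layer processes, which is the main reason the `--height` and `--width` arguments change the measured GFLOPs.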

Acknowledgements

We would like to thank the authors of PixNerd and SANA for sharing their code. We also thank the SANA team for sharing their text-to-image training data.

Citation

If you find this work useful, please cite:

@inproceedings{yu2026pixeldit,
      title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
      author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2026},
}
