Keshigeyan Chandrasegaran*1,2,
Michael Poli*1,2,
Daniel Y. Fu3,4,
Dongjun Kim1,
Lea M. Hadzic1,
Manling Li1,5,
Agrim Gupta6,
Stefano Massaroli2,7,
Azalia Mirhoseini1,
Juan Carlos Niebles†1,8,
Stefano Ermon†1,
Li Fei-Fei†1
1 Stanford University
2 Liquid AI
3 Together AI
4 UC San Diego
5 Northwestern University
6 Google DeepMind
7 RIKEN
8 Salesforce Research
* Equal contribution, † Equal senior authorship
NeurIPS 2025 Oral
🌎 Website | 🤗 Grafted Models | 📄 arXiv | ✍️ Blog
- [2026-01-07]: Training/evaluation code released
- [2025-06-10]: Grafting codebase released
Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38-2.64 vs. 2.27 for DiT-XL/2) using <2% pretraining compute. We then graft a text-to-image model (PixArt-Sigma), achieving a 1.43x speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2x and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring.
The Grafting codebase is written in PyTorch and provides a simple implementation for grafting Diffusion Transformers (DiTs).
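To make the core idea concrete, below is a minimal PyTorch sketch of an operator-level graft: the self-attention module in selected pretrained blocks is swapped for a drop-in replacement with the same `(B, N, C) → (B, N, C)` interface, which can then be trained via activation distillation and lightweight fine-tuning. The `blocks`/`attn`/`qkv` attribute names follow the DiT/timm layout and the `SlidingWindowAttention` class is purely illustrative; neither is the repository's actual implementation.

```python
import torch
import torch.nn as nn


class SlidingWindowAttention(nn.Module):
    """Illustrative replacement operator (not the repo's implementation).

    Any module with the same (B, N, C) -> (B, N, C) interface as the original
    MHA operator can be grafted in its place.
    """

    def __init__(self, dim: int, num_heads: int = 16, window: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.shape[1]
        # Boolean band mask: True = blocked, restricting attention to a local window.
        idx = torch.arange(n, device=x.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        return out


def graft_attention(model: nn.Module, layer_ids, make_replacement):
    """Swap the attention operator of the selected blocks in place."""
    for i, block in enumerate(model.blocks):      # assumes a `blocks` ModuleList
        if i in layer_ids:
            dim = block.attn.qkv.in_features      # assumes a timm-style Attention
            block.attn = make_replacement(dim)
    return model
```

For example, `graft_attention(dit, layer_ids={6, 16, 27}, make_replacement=SlidingWindowAttention)` would edit three blocks in place; the grafted operators provided in this repo (Hyena-X/SE/Y, SWA, Mamba-2, and MLP variants) plug in analogously.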
We provide 22 grafted models for ImageNet-1K 256×256 generation.
| Operator | Replacement Operator | Grafting Ratio | FID | Download Link |
|---|---|---|---|---|
| MLP | MLP (Self-grafting, r=4) | 100% | 2.54 | Link |
| MLP | MLP (r=3) | 50% | 2.53 | Link |
| MLP | MLP (r=3) | 75% | 2.61 | Link |
| MLP | MLP (r=6) | 50% | 2.38 | Link |
| MLP | MLP (r=6) | 75% | 2.37 | Link |
| MLP | Hyena-X (r=2) | 50% | 2.64 | Link |
| MLP | Hyena-X (r=2) | 75% | 3.26 | Link |
| MHA | MHA (Self-grafting) | 100% | 2.49 | Link |
| MHA | Hyena-SE | 50% | 2.73 | Link |
| MHA | Hyena-SE | 50% | 2.61 | Link |
| MHA | Hyena-SE | 75% | 3.62 | Link |
| MHA | Hyena-X | 50% | 2.74 | Link |
| MHA | Hyena-X | 50% | 2.58 | Link |
| MHA | Hyena-X | 75% | 3.69 | Link |
| MHA | Hyena-Y | 50% | 2.72 | Link |
| MHA | Hyena-Y | 50% | 2.61 | Link |
| MHA | Hyena-Y | 75% | 3.66 | Link |
| MHA | SWA | 50% | 2.67 | Link |
| MHA | SWA | 50% | 2.62 | Link |
| MHA | SWA | 75% | 3.09 | Link |
| MHA | Mamba-2 | 50% | 2.65 | Link |
| MHA | Mamba-2 | 75% | 3.02 | Link |
Start generating samples using our grafted models (see `demo_notebooks/grafting_demo.ipynb`).
This guide describes the complete training pipeline for grafting on the ImageNet-1K dataset. The pipeline is modular and can be adapted to different operators, layers, and resolutions as needed. All results reported in the paper can be reproduced using this codebase. All experiments are specified via YAML config files, and we provide Dockerfiles. For reference, we provide a step-by-step demo for replacing 3 Multi-Head Attention (MHA) operators in DiT-XL/2 with Hyena-Y operators:
- Data preparation & feature extraction
- Stage 1: Activation distillation
- Stage 2: Lightweight fine-tuning
- Sampling + FID evaluation
- Build the Docker image:

  ```bash
  docker build -t grafting .
  ```

- (Optional) Create a persistent cache volume for downloading Hugging Face models:

  ```bash
  docker volume create huggingface_cache
  ```

- Run the container (an example is shown below):

  ```bash
  docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ~/keshik/workspace/projects/grafting:/workspace \
    -v huggingface_cache:/home/user/.cache/huggingface \
    -v ~/keshik/data:/data \
    -it grafting /bin/bash
  ```
- Download the ImageNet-1K dataset from here.

- Extract SD-VAE features for the ImageNet-1K dataset at 256×256 (a PyTorch sketch of this encoding step appears after this section):

  ```bash
  bash bash_scripts/imagenet_1k/extract_vae_fts.sh
  ```

  Expected output directory:

  ```
  /data/vae_features/imagenet_256/train/
  ```

- Generate a stratified 128k ImageNet-1K subset (the 10% subset used in the paper) and save the image paths + a SHA hash so the exact subset can be reused across different experiments. The subset size can be increased up to the full ImageNet-1K size if required:

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/generate_dataset_hash.sh
  ```
⚡Recommended 1× H100
Note: Extracted SD-VAE features for ImageNet-1K dataset can be downloaded from Hugging Face: sd_vae_features_imagenet_1k_256x256.
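For reference, the feature-extraction step above amounts to encoding each 256×256 image into a 32×32×4 latent with the SD-VAE, scaled by 0.18215 as in the original DiT codebase. A minimal sketch using the `diffusers` `AutoencoderKL`; the dataloader and output paths are placeholders, not the script's actual layout:

```python
import numpy as np
import torch
from diffusers.models import AutoencoderKL

device = "cuda"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(device).eval()


@torch.no_grad()
def encode_batch(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, 256, 256) in [-1, 1] -> latents: (B, 4, 32, 32)."""
    posterior = vae.encode(images.to(device)).latent_dist
    return posterior.sample() * 0.18215  # latent scaling used by DiT

# Placeholder loop; the real script iterates an ImageNet dataloader:
# for images, labels, names in dataloader:
#     latents = encode_batch(images)
#     for latent, name in zip(latents, names):
#         np.save(f"/data/vae_features/imagenet_256/train/{name}.npy",
#                 latent.cpu().numpy())
```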
- Stage 1 requires intermediate DiT-XL/2 activations (a hook-based sketch of this step appears after this section):

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/extract_mha_scion_fts.sh
  ```

  Inside the script, users must manually set:

  - `SPLIT=train` for the training set
  - `SPLIT=val` for the validation set

  This produces:

  - `/data/scion_fts_mha/train/`
  - `/data/scion_fts_mha/val/`

⚡Recommended 1× H100
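Conceptually, this step records the inputs and outputs of the MHA operators that will later be replaced. A minimal sketch using standard PyTorch forward hooks; the `blocks`/`attn` attribute names and layer indices are illustrative assumptions, not the script's actual implementation:

```python
import torch


def register_activation_hooks(model, layer_ids, cache):
    """Record (input, output) pairs for the attention operators to be grafted."""
    handles = []
    for i, block in enumerate(model.blocks):        # assumes a `blocks` ModuleList
        if i not in layer_ids:
            continue

        def hook(module, inputs, output, layer=i):  # default arg pins the layer id
            cache.setdefault(layer, []).append(
                (inputs[0].detach().cpu(), output.detach().cpu())
            )

        handles.append(block.attn.register_forward_hook(hook))
    return handles                                  # call h.remove() on each when done


# Usage sketch:
# cache = {}
# handles = register_activation_hooks(dit, layer_ids={6, 16, 27}, cache=cache)
# with torch.no_grad():
#     dit(x_t, t, y)  # forward pass on noised latents with timesteps and labels
# for h in handles:
#     h.remove()
```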
- Train replacement attention/MLP operators by distilling the extracted activations (a sketch of the distillation loop appears after this section):

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/train_stage1.sh
  ```

- Stage-1 trained operator checkpoints are saved under:

  ```
  ./results/
  ```

- Optional post-Stage-1 sampling:

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/sample_stage1.sh
  ```

⚡Recommended 1× H100 (you can run this in parallel for different layers)
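At a high level, Stage 1 fits each replacement operator to regress the cached activations of the operator it replaces, leaving the rest of the network untouched. A hedged sketch of that regression loop; the MSE objective, optimizer, and hyperparameters below are assumptions, see the YAML configs for the actual settings:

```python
import torch
import torch.nn.functional as F


def distill_operator(replacement, activation_pairs, steps=5000, lr=1e-4, device="cuda"):
    """Fit `replacement` so its outputs match the cached teacher outputs.

    activation_pairs: iterable of (x, y_teacher) tensors of shape (B, N, C),
    e.g. the MHA inputs/outputs cached during activation extraction.
    """
    replacement = replacement.to(device).train()
    opt = torch.optim.AdamW(replacement.parameters(), lr=lr)
    batches = iter(activation_pairs)
    for _ in range(steps):
        try:
            x, y_teacher = next(batches)
        except StopIteration:
            batches = iter(activation_pairs)
            x, y_teacher = next(batches)
        x, y_teacher = x.to(device), y_teacher.to(device)
        loss = F.mse_loss(replacement(x), y_teacher)  # regression objective (assumption)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return replacement
```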
- Perform end-to-end fine-tuning after activation distillation (a sketch of the fine-tuning loop appears after this section):

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/train_stage2.sh
  ```

- Stage-2 fine-tuned checkpoints are saved under:

  ```
  ./results/
  ```

⚡Recommended 8× H100
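Stage 2 then fine-tunes the edited model end-to-end on the small data subset with the standard diffusion training objective, so the grafted operators adapt in context. A rough sketch assuming the `diffusion` package from the referenced DiT codebase is importable; the dataloader, optimizer, and schedule details are assumptions, not the script's actual settings:

```python
import torch
from diffusion import create_diffusion  # diffusion utilities from the referenced DiT codebase


def finetune_grafted(model, dataloader, steps=8000, lr=1e-4, device="cuda"):
    """End-to-end fine-tuning of the grafted DiT with the standard diffusion loss."""
    model = model.to(device).train()
    diffusion = create_diffusion(timestep_respacing="")  # default 1000-step training schedule
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)

    batches = iter(dataloader)
    for _ in range(steps):
        try:
            x, y = next(batches)            # x: cached VAE latents, y: class labels
        except StopIteration:
            batches = iter(dataloader)
            x, y = next(batches)
        x, y = x.to(device), y.to(device)
        t = torch.randint(0, diffusion.num_timesteps, (x.shape[0],), device=device)
        loss = diffusion.training_losses(model, x, t, dict(y=y))["loss"].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```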
- Generate samples from the fine-tuned model and save them as `.npz`:

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/sample_stage2.sh
  ```

- Then compute FID using OpenAI's reference batch. First, install dependencies using the official `requirements.txt`, or use the Dockerfile at `assets/tf_Dockerfile/Dockerfile` for evaluation. Then run the following:

  ```bash
  cd ./external/guided_diffusion/evaluations/ && \
  wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz && \
  python evaluator.py VIRTUAL_imagenet256_labeled.npz ./samples/demo/hyena_y_6_16_27.npz
  ```

⚡Recommended 8× H100
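As a sanity check before running the evaluator: OpenAI's evaluation suite reads the image array from the `.npz` under the default `arr_0` key, as a uint8 array of shape `(N, 256, 256, 3)`. A quick check of a sample file (the filename follows the demo above and is only an example):

```python
import numpy as np

# Load the generated sample batch and verify it matches the evaluator's expected format.
samples = np.load("./samples/demo/hyena_y_6_16_27.npz")["arr_0"]
assert samples.dtype == np.uint8
assert samples.ndim == 4 and samples.shape[1:] == (256, 256, 3)
print(f"{samples.shape[0]} samples, value range [{samples.min()}, {samples.max()}]")
```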
- Keshigeyan Chandrasegaran: keshik@stanford.edu
- Michael Poli: poli@stanford.edu
For issues, feedback, or contributions, please open an issue or submit a pull request.
We acknowledge the following works and libraries:
- Scalable Diffusion Models with Transformers (DiT): https://github.com/facebookresearch/DiT
- fast-DiT: https://github.com/chuanyangjin/fast-DiT
- Convolutions for Sequence Modeling: https://github.com/HazyResearch/safari
- Mamba SSM architecture: https://github.com/state-spaces/mamba
- Causal depthwise conv1d in CUDA, with a PyTorch interface: https://github.com/Dao-AILab/causal-conv1d
- Experiment tracking with Weights and Biases: https://www.wandb.com/
@inproceedings{chandrasegaran2025grafting,
  title     = {Exploring Diffusion Transformer Designs via Grafting},
  author    = {Chandrasegaran, Keshigeyan and Poli, Michael and Fu, Daniel Y. and Kim, Dongjun and
               Hadzic, Lea M. and Li, Manling and Gupta, Agrim and Massaroli, Stefano and
               Mirhoseini, Azalia and Niebles, Juan Carlos and Ermon, Stefano and Li, Fei-Fei},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {38},
  year      = {2025},
  url       = {https://arxiv.org/abs/2506.05340},
}