
Exploring Diffusion Transformer Designs via Grafting

Keshigeyan Chandrasegaran*1,2, Michael Poli*1,2, Daniel Y. Fu3,4, Dongjun Kim1,
Lea M. Hadzic1, Manling Li1,5, Agrim Gupta6, Stefano Massaroli2,7,
Azalia Mirhoseini1, Juan Carlos Niebles†1,8, Stefano Ermon†1, Li Fei-Fei†1
1 Stanford University   2 Liquid AI   3 Together AI   4 UC San Diego
5 Northwestern University   6 Google DeepMind   7 RIKEN   8 Salesforce Research
* Equal contribution, † Equal senior authorship
NeurIPS 2025 Oral
🌎Website | 🤗 Grafted Models | 📄 arXiv | ✍️ Blog

(Teaser figure)

📣 News

  • [2026-01-07]: Training/evaluation code released
  • [2025-06-10]: Grafting codebase released

Abstract

Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38-2.64 vs. 2.27 for DiT-XL/2) using <2% pretraining compute. We then graft a text-to-image model (PixArt-Sigma), achieving a 1.43x speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2x and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring.

About this code

The Grafting codebase is written in PyTorch and provides a simple implementation for grafting Diffusion Transformers (DiTs).

Grafted models

We provide 22 grafted models for ImageNet-1K 256×256 generation.

| Operator | Replacement Operator | Grafting Ratio | FID | Download Link |
|---|---|---|---|---|
| MLP | MLP (Self-grafting, r=4) | 100% | 2.54 | Link |
| MLP | MLP (r=3) | 50% | 2.53 | Link |
| MLP | MLP (r=3) | 75% | 2.61 | Link |
| MLP | MLP (r=6) | 50% | 2.38 | Link |
| MLP | MLP (r=6) | 75% | 2.37 | Link |
| MLP | Hyena-X (r=2) | 50% | 2.64 | Link |
| MLP | Hyena-X (r=2) | 75% | 3.26 | Link |
| MHA | MHA (Self-grafting) | 100% | 2.49 | Link |
| MHA | Hyena-SE | 50% | 2.73 | Link |
| MHA | Hyena-SE | 50% | 2.61 | Link |
| MHA | Hyena-SE | 75% | 3.62 | Link |
| MHA | Hyena-X | 50% | 2.74 | Link |
| MHA | Hyena-X | 50% | 2.58 | Link |
| MHA | Hyena-X | 75% | 3.69 | Link |
| MHA | Hyena-Y | 50% | 2.72 | Link |
| MHA | Hyena-Y | 50% | 2.61 | Link |
| MHA | Hyena-Y | 75% | 3.66 | Link |
| MHA | SWA | 50% | 2.67 | Link |
| MHA | SWA | 50% | 2.62 | Link |
| MHA | SWA | 75% | 3.09 | Link |
| MHA | Mamba-2 | 50% | 2.65 | Link |
| MHA | Mamba-2 | 75% | 3.02 | Link |

Getting Started

Start generating samples using our grafted models (See demo_notebooks/grafting_demo.ipynb)
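For reference, the snippet below is a minimal, unofficial sketch of pulling a grafted checkpoint from the Hugging Face Hub. The repo id and filename are placeholders, and constructing the grafted DiT variant itself is left to the demo notebook, which is the supported loading path.

```python
# Minimal sketch (not the official API): download a grafted checkpoint from the
# Hugging Face Hub and load its weights. `repo_id` and `filename` below are
# hypothetical placeholders -- use the 🤗 Grafted Models links above, and see
# demo_notebooks/grafting_demo.ipynb for the supported loading path.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<hf-org>/<grafted-model-repo>",   # placeholder repo id
    filename="<checkpoint>.pt",                # placeholder checkpoint name
)
state_dict = torch.load(ckpt_path, map_location="cpu")

# The grafted DiT-XL/2 variant is constructed by the codebase (see the demo
# notebook); once built, load the weights as usual:
# model.load_state_dict(state_dict)
```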

Training Pipeline for Grafting Diffusion Transformers

This guide describes the complete training pipeline for grafting on the ImageNet-1K dataset. The pipeline is modular and can be adapted to different operators, layers, and resolutions as needed; all results reported in the paper can be reproduced with this codebase. All experiments are specified via YAML config files, and we provide Dockerfiles for the environment. As a reference, we walk through a step-by-step demo that replaces 3 Multi-Head Attention (MHA) operators in DiT-XL/2 with the Hyena-Y operator:

  1. Data preparation & feature extraction
  2. Stage 1: Activation distillation
  3. Stage 2: Lightweight fine-tuning
  4. Sampling + FID evaluation

1) Data Preparation & Feature Extraction

1.1 Setup Environment

  • Build Docker image: docker build -t grafting .

  • (Optional) Create a persistent cache volume for downloading Hugging Face models: docker volume create huggingface_cache

  • Run the container (an example is shown below):

    docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v ~/keshik/workspace/projects/grafting:/workspace -v huggingface_cache:/home/user/.cache/huggingface -v ~/keshik/data:/data -it grafting /bin/bash


1.2 Extract VAE Latents (Full ImageNet-1K)

  • Download the ImageNet-1K dataset from here.

  • Extract SD-VAE features for the ImageNet-1K dataset at 256×256 (a minimal extraction sketch is included at the end of this section):

    bash bash_scripts/imagenet_1k/extract_vae_fts.sh

  • Expected output directory created: /data/vae_features/imagenet_256/train/

  • Generate a stratified 128k ImageNet-1K subset (~10% of the training set, as used in the paper) and save the image paths + SHA hash so that the exact same subset can be reused across experiments. The subset size can be increased up to the full ImageNet-1K size if required.

    bash bash_scripts/dit_imagenet_1k_256x256/generate_dataset_hash.sh

⚡Recommended 1× H100

Note: Pre-extracted SD-VAE features for the ImageNet-1K dataset can be downloaded from Hugging Face: sd_vae_features_imagenet_1k_256x256.
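For readers who want to script the latent extraction directly, the following is a minimal sketch of the standard DiT-style recipe (SD VAE encoder, 0.18215 latent scaling). It is not the official extract_vae_fts.sh script; cropping/augmentation details and the output format may differ.

```python
# Minimal latent-extraction sketch, assuming the standard DiT recipe: encode
# 256x256 images with the SD VAE (stabilityai/sd-vae-ft-ema) and scale latents
# by 0.18215. The official script may preprocess and store features differently.
import torch
from torchvision import transforms
from PIL import Image
from diffusers.models import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),   # map pixels to [-1, 1]
])

@torch.no_grad()
def encode_image(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    latent = vae.encode(x).latent_dist.sample() * 0.18215   # shape [1, 4, 32, 32]
    return latent.cpu()

# torch.save(encode_image("/data/imagenet/train/<class>/<image>.JPEG"), "latent.pt")
```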


1.3 Extract DiT Block Activations (for Activation Distillation)

  • Stage-1 requires intermediate DiT-XL/2 activations (a hook-based sketch is included at the end of this section):

    bash bash_scripts/dit_imagenet_1k_256x256/extract_mha_scion_fts.sh

  • Inside the script, users must manually set:

    • SPLIT=train for the training set

    • SPLIT=val for the validation set

  • This produces:

    • /data/scion_fts_mha/train/

    • /data/scion_fts_mha/val/

⚡Recommended 1× H100
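For intuition, the sketch below shows how intermediate activations can be captured with PyTorch forward hooks. The official script determines exactly which tensors are stored and in what format; the module path blocks[i].attn follows the public DiT-XL/2 implementation and is an assumption here.

```python
# Illustrative only: cache (input, output) pairs of selected attention modules
# with forward hooks. Module names are assumed from the public DiT-XL/2 code;
# the grafting codebase may organize and serialize these activations differently.
import torch

def register_io_hooks(model, layer_ids):
    """Cache (input, output) tensors for the attention module of selected blocks."""
    cache = {i: [] for i in layer_ids}

    def make_hook(i):
        def hook(module, inputs, output):
            cache[i].append((inputs[0].detach().cpu(), output.detach().cpu()))
        return hook

    handles = [model.blocks[i].attn.register_forward_hook(make_hook(i)) for i in layer_ids]
    return cache, handles

# cache, handles = register_io_hooks(dit_xl2, layer_ids=[6, 16, 27])  # e.g., the 3 layers in the demo
# ...run forward passes on VAE latents, save `cache` to /data/scion_fts_mha/...
# for h in handles: h.remove()
```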


2) Grafting Stage 1: Activation Distillation

  • Train replacement attention/MLP operators by distilling the extracted activations (a conceptual training-loop sketch follows this section):

    bash bash_scripts/dit_imagenet_1k_256x256/train_stage1.sh

  • Stage-1 trained operator checkpoints are saved under: ./results/

  • Optional post Stage-1 sampling:

    bash bash_scripts/dit_imagenet_1k_256x256/sample_stage1.sh

⚡Recommended 1× H100 (You can run this in parallel for different layers)
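Conceptually, Stage 1 trains each replacement operator to regress the cached activations of the operator it replaces. The sketch below illustrates that loop under assumptions of ours: a loader yielding (input, output) activation pairs and an MSE regression objective. The actual objective, optimizer settings, and data format are defined by the YAML configs behind train_stage1.sh.

```python
# Conceptual sketch of Stage-1 activation distillation: a replacement operator
# (e.g., Hyena-Y) is trained to match the cached (input, output) activations of
# the MHA block it replaces. Loader format and loss are assumptions here.
import itertools
import torch
import torch.nn.functional as F

def distill_operator(new_op, pair_loader, steps=10_000, lr=1e-4, device="cuda"):
    new_op = new_op.to(device).train()
    opt = torch.optim.AdamW(new_op.parameters(), lr=lr)
    data = itertools.cycle(pair_loader)
    for _ in range(steps):
        x_in, y_teacher = next(data)                  # cached MHA input / output
        x_in, y_teacher = x_in.to(device), y_teacher.to(device)
        loss = F.mse_loss(new_op(x_in), y_teacher)    # regress to teacher activations
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_op
```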


3) Grafting Stage 2: Lightweight Fine-Tuning

  • Perform end-to-end fine-tuning after activation distillation (a high-level sketch follows this section):

    bash bash_scripts/dit_imagenet_1k_256x256/train_stage2.sh

  • Stage-2 fine-tuned checkpoints are saved under: ./results/

⚡Recommended 8× H100
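At a high level, Stage 2 fine-tunes the full grafted model end to end on VAE latents with the usual DiT diffusion objective. The sketch below assumes a guided_diffusion-style GaussianDiffusion object (as in the DiT codebase); the real schedules, batch sizes, and trainable-parameter choices are set in the YAML configs behind train_stage2.sh.

```python
# High-level sketch of Stage-2 lightweight fine-tuning: with grafted operators
# swapped in, fine-tune the whole model with the standard diffusion objective.
# `diffusion` is assumed to expose training_losses(...) as in guided_diffusion.
import itertools
import torch

def finetune(grafted_model, diffusion, latent_loader, steps=8_000, lr=1e-4, device="cuda"):
    model = grafted_model.to(device).train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
    data = itertools.cycle(latent_loader)
    for _ in range(steps):
        latents, labels = next(data)                  # precomputed SD-VAE latents + class labels
        latents, labels = latents.to(device), labels.to(device)
        t = torch.randint(0, diffusion.num_timesteps, (latents.size(0),), device=device)
        losses = diffusion.training_losses(model, latents, t, model_kwargs=dict(y=labels))
        loss = losses["loss"].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```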


4) Sampling & FID Evaluation

  • Generate samples from the fine-tuned model and save as .npz:

    bash bash_scripts/dit_imagenet_1k_256x256/sample_stage2.sh

  • Then compute FID using OpenAI's reference batch. First, install the dependencies from the official requirements.txt, or use the Dockerfile at assets/tf_Dockerfile/Dockerfile for evaluation. Then run the following (a sketch of the expected .npz sample format follows this section):

    cd ./external/guided_diffusion/evaluations/ && wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz && python evaluator.py VIRTUAL_imagenet256_labeled.npz ./samples/demo/hyena_y_6_16_27.npz

⚡Recommended 8× H100
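For reference, the evaluator consumes a single .npz batch. Based on OpenAI's guided-diffusion evaluation code, the samples are expected under the arr_0 key as uint8 images of shape [N, 256, 256, 3]; treat this as an assumption to verify against evaluator.py. sample_stage2.sh writes this file for you; the snippet only shows how such a batch could be assembled.

```python
# Sketch of the sample-batch format assumed by the ADM/guided-diffusion
# evaluator: a single .npz whose "arr_0" array holds N generated images as
# uint8 with shape [N, 256, 256, 3].
import numpy as np

def save_sample_batch(images_uint8, out_path="./samples/demo/hyena_y_6_16_27.npz"):
    """images_uint8: numpy array of shape [N, 256, 256, 3], dtype uint8."""
    assert images_uint8.dtype == np.uint8 and images_uint8.shape[1:] == (256, 256, 3)
    np.savez(out_path, arr_0=images_uint8)

# Example: save_sample_batch(np.stack(decoded_images, axis=0))
```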


Contact

For issues, feedback, or contributions, please open an issue or submit a pull request.

Acknowledgements

We acknowledge the following works and libraries:

Citation

@inproceedings{chandrasegaran2025grafting,
      title={Exploring Diffusion Transformer Designs via Grafting},
      author={Chandrasegaran, Keshigeyan and Poli, Michael and Fu, Daniel Y. and Kim, Dongjun and
      Hadzic, Lea M. and Li, Manling and Gupta, Agrim and Massaroli, Stefano and
      Mirhoseini, Azalia and Niebles, Juan Carlos and Ermon, Stefano and Li, Fei-Fei},
      booktitle={Advances in Neural Information Processing Systems},
      volume={38},
      year={2025},
      url={https://arxiv.org/abs/2506.05340},
}