Keshigeyan Chandrasegaran*1,2,
Michael Poli*1,2,
Daniel Y. Fu3,4,
Dongjun Kim1,
Lea M. Hadzic1,
Manling Li1,5,
Agrim Gupta6,
Stefano Massaroli2,7,
Azalia Mirhoseini1,
Juan Carlos Niebles†1,8,
Stefano Ermon†1,
Li Fei-Fei†1
1 Stanford University
2 Liquid AI
3 Together AI
4 UC San Diego
5 Northwestern University
6 Google DeepMind
7 RIKEN
8 Salesforce Research
* Equal contribution, † Equal senior authorship
NeurIPS 2025 Oral
🌎 Website | 🤗 Grafted Models | 📄 arXiv | ✍️ Blog
- [2026-01-07]: Training/evaluation code released
- [2025-06-10]: Grafting codebase released
Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation. Inspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present grafting, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38-2.64 vs. 2.27 for DiT-XL/2) using <2% pretraining compute. We then graft a text-to-image model (PixArt-Sigma), achieving a 1.43x speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2x and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring.
The Grafting codebase is written in PyTorch and provides a simple implementation for grafting Diffusion Transformers (DiTs).
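To make the core idea concrete, below is a minimal PyTorch sketch of an operator-level graft: the self-attention module in selected pretrained blocks is swapped for a drop-in replacement with the same `(B, N, C) → (B, N, C)` interface, which can then be trained via activation distillation and lightweight fine-tuning. The `blocks`/`attn`/`qkv` attribute names follow the DiT/timm layout and the `SlidingWindowAttention` class is purely illustrative; neither is the repository's actual implementation.

```python
import torch
import torch.nn as nn


class SlidingWindowAttention(nn.Module):
    """Illustrative replacement operator (not the repo's implementation).

    Any module with the same (B, N, C) -> (B, N, C) interface as the original
    MHA operator can be grafted in its place.
    """

    def __init__(self, dim: int, num_heads: int = 16, window: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n = x.shape[1]
        # Boolean band mask: True = blocked, restricting attention to a local window.
        idx = torch.arange(n, device=x.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window
        out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        return out


def graft_attention(model: nn.Module, layer_ids, make_replacement):
    """Swap the attention operator of the selected blocks in place."""
    for i, block in enumerate(model.blocks):      # assumes a `blocks` ModuleList
        if i in layer_ids:
            dim = block.attn.qkv.in_features      # assumes a timm-style Attention
            block.attn = make_replacement(dim)
    return model
```

For example, `graft_attention(dit, layer_ids={6, 16, 27}, make_replacement=SlidingWindowAttention)` would edit three blocks in place; the grafted operators provided in this repo (Hyena-X/SE/Y, SWA, Mamba-2, and MLP variants) plug in analogously.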
We provide 22 grafted models for ImageNet-1K 256×256 generation.
| Operator | Replacement Operator | Grafting Ratio | FID | Download Link |
|---|---|---|---|---|
| MLP | MLP (Self-grafting, r=4) | 100% | 2.54 | Link |
| MLP | MLP (r=3) | 50% | 2.53 | Link |
| MLP | MLP (r=3) | 75% | 2.61 | Link |
| MLP | MLP (r=6) | 50% | 2.38 | Link |
| MLP | MLP (r=6) | 75% | 2.37 | Link |
| MLP | Hyena-X (r=2) | 50% | 2.64 | Link |
| MLP | Hyena-X (r=2) | 75% | 3.26 | Link |
| MHA | MHA (Self-grafting) | 100% | 2.49 | Link |
| MHA | Hyena-SE | 50% | 2.73 | Link |
| MHA | Hyena-SE | 50% | 2.61 | Link |
| MHA | Hyena-SE | 75% | 3.62 | Link |
| MHA | Hyena-X | 50% | 2.74 | Link |
| MHA | Hyena-X | 50% | 2.58 | Link |
| MHA | Hyena-X | 75% | 3.69 | Link |
| MHA | Hyena-Y | 50% | 2.72 | Link |
| MHA | Hyena-Y | 50% | 2.61 | Link |
| MHA | Hyena-Y | 75% | 3.66 | Link |
| MHA | SWA | 50% | 2.67 | Link |
| MHA | SWA | 50% | 2.62 | Link |
| MHA | SWA | 75% | 3.09 | Link |
| MHA | Mamba-2 | 50% | 2.65 | Link |
| MHA | Mamba-2 | 75% | 3.02 | Link |
Start generating samples using our grafted models (see `demo_notebooks/grafting_demo.ipynb`).
This guide describes the complete training pipeline for grafting on the ImageNet-1K dataset. The pipeline is modular and can be adapted to different operators, layers, and resolutions as needed. All results reported in the paper can be reproduced using this codebase. All experiments are specified via YAML config files, and we provide Dockerfiles. For reference, we provide a step-by-step demo for replacing 3 Multi-Head Attention (MHA) operators in DiT-XL/2 with Hyena-Y operators:
- Data preparation & feature extraction
- Stage 1: Activation distillation
- Stage 2: Lightweight fine-tuning
- Sampling + FID evaluation
- Build the Docker image:

  ```bash
  docker build -t grafting .
  ```

- (Optional) Create a persistent cache volume for downloading Hugging Face models:

  ```bash
  docker volume create huggingface_cache
  ```

- Run the container (an example is shown below):

  ```bash
  docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v ~/keshik/workspace/projects/grafting:/workspace \
    -v huggingface_cache:/home/user/.cache/huggingface \
    -v ~/keshik/data:/data \
    -it grafting /bin/bash
  ```
- Download the ImageNet-1K dataset from here.

- Extract SD-VAE features for the ImageNet-1K dataset at 256×256 (a PyTorch sketch of this encoding step appears after this section):

  ```bash
  bash bash_scripts/imagenet_1k/extract_vae_fts.sh
  ```

  Expected output directory:

  ```
  /data/vae_features/imagenet_256/train/
  ```

- Generate a stratified 128k ImageNet-1K subset (the 10% subset used in the paper) and save the image paths + a SHA hash so the exact subset can be reused across different experiments. The subset size can be increased up to the full ImageNet-1K size if required:

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/generate_dataset_hash.sh
  ```
⚡Recommended 1× H100
Note: Extracted SD-VAE features for ImageNet-1K dataset can be downloaded from Hugging Face: sd_vae_features_imagenet_1k_256x256.
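For reference, the feature-extraction step above amounts to encoding each 256×256 image into a 32×32×4 latent with the SD-VAE, scaled by 0.18215 as in the original DiT codebase. A minimal sketch using the `diffusers` `AutoencoderKL`; the dataloader and output paths are placeholders, not the script's actual layout:

```python
import numpy as np
import torch
from diffusers.models import AutoencoderKL

device = "cuda"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(device).eval()


@torch.no_grad()
def encode_batch(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, 256, 256) in [-1, 1] -> latents: (B, 4, 32, 32)."""
    posterior = vae.encode(images.to(device)).latent_dist
    return posterior.sample() * 0.18215  # latent scaling used by DiT

# Placeholder loop; the real script iterates an ImageNet dataloader:
# for images, labels, names in dataloader:
#     latents = encode_batch(images)
#     for latent, name in zip(latents, names):
#         np.save(f"/data/vae_features/imagenet_256/train/{name}.npy",
#                 latent.cpu().numpy())
```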
- Stage 1 requires intermediate DiT-XL/2 activations (a hook-based sketch of this step appears after this section):

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/extract_mha_scion_fts.sh
  ```

  Inside the script, users must manually set:

  - `SPLIT=train` for the training set
  - `SPLIT=val` for the validation set

  This produces:

  - `/data/scion_fts_mha/train/`
  - `/data/scion_fts_mha/val/`

⚡Recommended 1× H100
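Conceptually, this step records the inputs and outputs of the MHA operators that will later be replaced. A minimal sketch using standard PyTorch forward hooks; the `blocks`/`attn` attribute names and layer indices are illustrative assumptions, not the script's actual implementation:

```python
import torch


def register_activation_hooks(model, layer_ids, cache):
    """Record (input, output) pairs for the attention operators to be grafted."""
    handles = []
    for i, block in enumerate(model.blocks):        # assumes a `blocks` ModuleList
        if i not in layer_ids:
            continue

        def hook(module, inputs, output, layer=i):  # default arg pins the layer id
            cache.setdefault(layer, []).append(
                (inputs[0].detach().cpu(), output.detach().cpu())
            )

        handles.append(block.attn.register_forward_hook(hook))
    return handles                                  # call h.remove() on each when done


# Usage sketch:
# cache = {}
# handles = register_activation_hooks(dit, layer_ids={6, 16, 27}, cache=cache)
# with torch.no_grad():
#     dit(x_t, t, y)  # forward pass on noised latents with timesteps and labels
# for h in handles:
#     h.remove()
```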
- Train replacement attention/MLP operators by distilling the extracted activations (a sketch of the distillation loop appears after this section):

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/train_stage1.sh
  ```

- Stage-1 trained operator checkpoints are saved under:

  ```
  ./results/
  ```

- Optional post-Stage-1 sampling:

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/sample_stage1.sh
  ```

⚡Recommended 1× H100 (you can run this in parallel for different layers)
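At a high level, Stage 1 fits each replacement operator to regress the cached activations of the operator it replaces, leaving the rest of the network untouched. A hedged sketch of that regression loop; the MSE objective, optimizer, and hyperparameters below are assumptions, see the YAML configs for the actual settings:

```python
import torch
import torch.nn.functional as F


def distill_operator(replacement, activation_pairs, steps=5000, lr=1e-4, device="cuda"):
    """Fit `replacement` so its outputs match the cached teacher outputs.

    activation_pairs: iterable of (x, y_teacher) tensors of shape (B, N, C),
    e.g. the MHA inputs/outputs cached during activation extraction.
    """
    replacement = replacement.to(device).train()
    opt = torch.optim.AdamW(replacement.parameters(), lr=lr)
    batches = iter(activation_pairs)
    for _ in range(steps):
        try:
            x, y_teacher = next(batches)
        except StopIteration:
            batches = iter(activation_pairs)
            x, y_teacher = next(batches)
        x, y_teacher = x.to(device), y_teacher.to(device)
        loss = F.mse_loss(replacement(x), y_teacher)  # regression objective (assumption)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return replacement
```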
- Perform end-to-end fine-tuning after activation distillation (a sketch of the fine-tuning loop appears after this section):

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/train_stage2.sh
  ```

- Stage-2 fine-tuned checkpoints are saved under:

  ```
  ./results/
  ```

⚡Recommended 8× H100
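Stage 2 then fine-tunes the edited model end-to-end on the small data subset with the standard diffusion training objective, so the grafted operators adapt in context. A rough sketch assuming the `diffusion` package from the referenced DiT codebase is importable; the dataloader, optimizer, and schedule details are assumptions, not the script's actual settings:

```python
import torch
from diffusion import create_diffusion  # diffusion utilities from the referenced DiT codebase


def finetune_grafted(model, dataloader, steps=8000, lr=1e-4, device="cuda"):
    """End-to-end fine-tuning of the grafted DiT with the standard diffusion loss."""
    model = model.to(device).train()
    diffusion = create_diffusion(timestep_respacing="")  # default 1000-step training schedule
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)

    batches = iter(dataloader)
    for _ in range(steps):
        try:
            x, y = next(batches)            # x: cached VAE latents, y: class labels
        except StopIteration:
            batches = iter(dataloader)
            x, y = next(batches)
        x, y = x.to(device), y.to(device)
        t = torch.randint(0, diffusion.num_timesteps, (x.shape[0],), device=device)
        loss = diffusion.training_losses(model, x, t, dict(y=y))["loss"].mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```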
- Generate samples from the fine-tuned model and save them as `.npz`:

  ```bash
  bash bash_scripts/dit_imagenet_1k_256x256/sample_stage2.sh
  ```

- Then compute FID using OpenAI's reference batch. First, install dependencies using the official `requirements.txt`, or use the Dockerfile at `assets/tf_Dockerfile/Dockerfile` for evaluation. Then run the following:

  ```bash
  cd ./external/guided_diffusion/evaluations/ && \
  wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz && \
  python evaluator.py VIRTUAL_imagenet256_labeled.npz ./samples/demo/hyena_y_6_16_27.npz
  ```

⚡Recommended 8× H100
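As a sanity check before running the evaluator: OpenAI's evaluation suite reads the image array from the `.npz` under the default `arr_0` key, as a uint8 array of shape `(N, 256, 256, 3)`. A quick check of a sample file (the filename follows the demo above and is only an example):

```python
import numpy as np

# Load the generated sample batch and verify it matches the evaluator's expected format.
samples = np.load("./samples/demo/hyena_y_6_16_27.npz")["arr_0"]
assert samples.dtype == np.uint8
assert samples.ndim == 4 and samples.shape[1:] == (256, 256, 3)
print(f"{samples.shape[0]} samples, value range [{samples.min()}, {samples.max()}]")
```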
- Keshigeyan Chandrasegaran: keshik@stanford.edu
- Michael Poli: poli@stanford.edu
For issues, feedback, or contributions, please open an issue or submit a pull request.
We acknowledge the following works and libraries:
- Scalable Diffusion Models with Transformers (DiT): https://github.com/facebookresearch/DiT
- fast-DiT: https://github.com/chuanyangjin/fast-DiT
- Convolutions for Sequence Modeling: https://github.com/HazyResearch/safari
- Mamba SSM architecture: https://github.com/state-spaces/mamba
- Causal depthwise conv1d in CUDA, with a PyTorch interface: https://github.com/Dao-AILab/causal-conv1d
- Experiment tracking with Weights and Biases: https://www.wandb.com/
@inproceedings{chandrasegaran2025grafting,
  title     = {Exploring Diffusion Transformer Designs via Grafting},
  author    = {Chandrasegaran, Keshigeyan and Poli, Michael and Fu, Daniel Y. and Kim, Dongjun and
               Hadzic, Lea M. and Li, Manling and Gupta, Agrim and Massaroli, Stefano and
               Mirhoseini, Azalia and Niebles, Juan Carlos and Ermon, Stefano and Li, Fei-Fei},
  booktitle = {Advances in Neural Information Processing Systems},
  volume    = {38},
  year      = {2025},
  url       = {https://arxiv.org/abs/2506.05340},
}