Figure 1: Results of our method (UNet) learned with 30 examples on unseen spatial conditions. The proposed control adapter guides the pre-trained T2I models in a versatile and data-efficient manner.
Kiet T. Nguyen, Chanhyuk Lee, Donggyun Kim, Dong Hoon Lee, Seunghoon Hong
KAIST
This repository contains the official implementation of Universal Few-Shot Spatial Control for Diffusion Models (UFC).
UFC is a versatile few-shot control adapter capable of generalizing to novel spatial conditions, thereby enabling fine-grained control over the structure of generated images. Our method is applicable to both UNet and DiT diffusion backbones.
Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves performance competitive with fully supervised baselines on various control tasks. We also show that UFC is backbone-agnostic and demonstrate its effectiveness on both UNet and DiT architectures.
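As a rough intuition for the matching step (a toy sketch only, not UFC's actual architecture — all names and shapes below are illustrative): the query condition's features attend to the few support conditions, and the resulting weights aggregate the supports' paired control features into a task-specific control feature for the query.

```python
import numpy as np

def match_control_features(query_feat, support_feats, support_ctrl):
    """Toy attention-style matching between a query condition and a few
    support conditions (illustrative only; not the actual UFC module).

    query_feat:    (d,)   features of the query condition
    support_feats: (k, d) features of the k support conditions
    support_ctrl:  (k, m) control features paired with each support
    Returns an (m,) task-specific control feature for the query.
    """
    d = query_feat.shape[0]
    logits = support_feats @ query_feat / np.sqrt(d)  # similarity scores
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                          # softmax over supports
    return weights @ support_ctrl                     # weighted aggregation
```

With a single support pair the output is simply that support's control feature, which matches the intuition that few-shot control copies structure cues from the closest support examples.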
- Release code
- Release evaluation data
- Release checkpoints
- Provide support data for generation
- This codebase is developed on PyTorch 2.6.0, CUDA 11.8, and Python 3.11.11.
- Install other dependencies via:

```shell
pip install -r requirements.txt
```
If you want to prepare the spatial conditions for your own dataset, please refer to the following files:
- `annotate_data.py`: extracts conditions for tasks other than densepose
- `extract_densepose.py`: extracts the densepose condition
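For intuition on what a spatial condition looks like, here is a minimal gradient-magnitude edge map in plain NumPy — a crude stand-in for a Canny-style condition (the repo's `annotate_data.py` uses proper detectors; the threshold below is an arbitrary illustrative choice):

```python
import numpy as np

def edge_condition(gray, thresh=0.2):
    """Binary edge map from central differences on a grayscale image.
    gray: 2-D float array in [0, 1]. Returns uint8 map with 255 at edges.
    (Simplified illustration, not the repo's condition extractor.)"""
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]  # horizontal gradient
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]  # vertical gradient
    mag = np.hypot(gx, gy)                    # gradient magnitude
    return (mag > thresh).astype(np.uint8) * 255
```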
We release the evaluation data at this link. After downloading the zip file, extract it and place it in the `datasets` directory.
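A small helper for the extraction step (the archive filename below is a placeholder — use whatever name the release provides):

```python
import pathlib
import zipfile

def extract_eval_data(zip_path, dest="datasets"):
    """Unpack the downloaded evaluation archive into the datasets directory.
    zip_path is whatever the released archive is named (placeholder here)."""
    dest_dir = pathlib.Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    return dest_dir

# Example: extract_eval_data("eval_data.zip")  # filename is an assumption
```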
| Few-shot Task | Few-shot (30-shot) Fine-tuned Model | Base Meta-trained Model | Description |
|---|---|---|---|
| Canny | UNet_canny | UNet_taskgr23 | The base model is trained with 4 tasks: [Depth, Normal, Pose, Densepose] |
| Hed | UNet_hed | UNet_taskgr23 | The base model is trained with 4 tasks: [Depth, Normal, Pose, Densepose] |
| Depth | UNet_depth | UNet_taskgr13 | The base model is trained with 4 tasks: [Canny, HED, Pose, Densepose] |
| Normal | UNet_normal | UNet_taskgr13 | The base model is trained with 4 tasks: [Canny, HED, Pose, Densepose] |
| Pose | UNet_pose | UNet_taskgr12 | The base model is trained with 4 tasks: [Canny, HED, Depth, Normal] |
| Densepose | UNet_densepose | UNet_taskgr12 | The base model is trained with 4 tasks: [Canny, HED, Depth, Normal] |
| Few-shot Task | Few-shot (30-shot) Fine-tuned Model | Base Meta-trained Model | Description |
|---|---|---|---|
| Canny | DiT_canny | DiT_taskgr23 | The base model is trained with 4 tasks: [Depth, Normal, Pose, Densepose] |
| Hed | DiT_hed | DiT_taskgr23 | The base model is trained with 4 tasks: [Depth, Normal, Pose, Densepose] |
| Depth | DiT_depth | DiT_taskgr13 | The base model is trained with 4 tasks: [Canny, HED, Pose, Densepose] |
| Normal | DiT_normal | DiT_taskgr13 | The base model is trained with 4 tasks: [Canny, HED, Pose, Densepose] |
| Pose | DiT_pose | DiT_taskgr12 | The base model is trained with 4 tasks: [Canny, HED, Depth, Normal] |
| Densepose | DiT_densepose | DiT_taskgr12 | The base model is trained with 4 tasks: [Canny, HED, Depth, Normal] |
Training UFC with UNet (Stable Diffusion v1.5) backbone:
```shell
accelerate launch -m src.train15.train \
    --config </path/to/config> \
    --exp_name <exp_name>
```
- We train UFC (UNet) on 8 NVIDIA RTX 3090 GPUs.
Training UFC with DiT (Stable Diffusion v3.5-medium) backbone:
```shell
accelerate launch -m src.train3.train \
    --config </path/to/config> \
    --exp_name <exp_name>
```
- We train UFC (DiT) on 8 NVIDIA A6000 GPUs.
After the meta-training process finishes, the model can be fine-tuned on unseen tasks with a handful of support examples.
Script for UFC with UNet backbone:
```shell
python -m src.train15.fewshot_finetune \
    --config </path/to/config> \
    --ckpt_path </path/to/meta_train_checkpoint> \
    --task <task> \
    --shots <number of fine-tune data> \
    --exp_name <exp_name>
```
`<task>` is one of ["canny", "hed", "depth", "normal", "pose", "densepose"] and must be a task unseen during meta-training.
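The unseen-task constraint can be made concrete with a tiny validation helper (illustrative only; the names below are not part of the repo):

```python
# All spatial control tasks supported by the released models.
ALL_TASKS = ["canny", "hed", "depth", "normal", "pose", "densepose"]

def check_fewshot_task(task, meta_train_tasks):
    """Validate a few-shot task choice: it must be a known task that was
    NOT part of meta-training. (Hypothetical helper, not in the repo.)"""
    if task not in ALL_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if task in meta_train_tasks:
        raise ValueError(f"{task!r} was already seen during meta-training")
    return task
```

For example, fine-tuning on "canny" is valid against the `taskgr23` base model (trained on depth, normal, pose, densepose), but fine-tuning on "depth" against that same base would be rejected.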
The script for the DiT backbone is the same, but with `train15` replaced by `train3`.
Script for UFC with UNet backbone:
```shell
PYTHONPATH=. python eval/UNet_generation.py \
    --config </path/to/config> \
    --ckpt_path </path/to/meta_train_checkpoint> \
    --task_ckpt_path </path/to/finetune_checkpoint> \
    --task <task> --shots 5 --batch_size 8
```
The script for the DiT backbone is the same, but with `UNet_generation.py` replaced by `DiT_generation.py`.
We evaluate UFC using both quantitative and qualitative metrics to assess its performance and controllability under various spatial conditions.
To compute the Fréchet Inception Distance (FID) between generated and reference images, run:
```shell
python -m pytorch_fid </path/to/generated_images> </path/to/reference_images>
```
- For tasks ["canny", "hed", "depth", "normal"], use 5,000 images from the validation split of COCO2017 as reference images.
- For tasks ["pose", "densepose"], use images containing humans from the validation split of COCO2017 as reference images. Please check the `pose_imgs` and `densepose_imgs` directories in `coco2017/val2017`.
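The `pytorch_fid` command handles the Inception feature statistics; underneath, FID is the Fréchet distance between two Gaussians (mu1, sigma1) and (mu2, sigma2): ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2)). A minimal sketch of that formula (for intuition only — use the CLI above for actual evaluation):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians. FID applies this to the
    Inception-feature mean/covariance of generated vs. reference images."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2).real  # matrix square root
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical statistics give a distance of 0; the score grows as the generated distribution drifts from the reference distribution.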
- For tasks other than "densepose":

  ```shell
  python eval/extract_condition.py --task <task> --path </path/to/generated_images>
  ```

- For the "densepose" task, first install the DensePose dependencies:

  ```shell
  git clone https://github.com/facebookresearch/detectron2.git
  python -m pip install -e detectron2
  pip install git+https://github.com/facebookresearch/detectron2@main#subdirectory=projects/DensePose
  ```

  Then, extract the human body segmentation mask (refer to `scripts/densepose_label.sh`).
- For tasks other than "densepose":

  ```shell
  python eval/metric_calculation.py \
      --task <task> \
      --gen_path </path/to/generation_dir> \
      --gt_path datasets/coco2017/val2017
  ```

- For the "densepose" task:

  ```shell
  python eval/densepose_mIoU.py \
      --predict_path </path/to/extracted_segmentation> \
      --gt_path datasets/coco2017/val2017/densepose/dumpt.pt
  ```
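For reference, mean IoU averages per-class intersection-over-union between predicted and ground-truth segmentation masks. A simplified sketch of the metric (`eval/densepose_mIoU.py` implements the repo's actual version):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes that appear in either mask.
    pred, gt: integer label maps of the same shape.
    (Simplified illustration of the metric, not the repo's script.)"""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```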
We develop our method based on the diffusers library and the official code of OminiControl, VTM, and ControlNet. We gratefully acknowledge the authors for making their code publicly available.
This work was in part supported by the National Research Foundation of Korea (RS-2024-00351212 and RS-2024-00436165) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) (RS-2022-II220926, RS-2024-00509279, RS-2021-II212068, RS-2022-II220959, and RS-2019-II190075) funded by the Korea government (MSIT).
If you find this work useful, please consider citing:

```bibtex
@misc{nguyen2025universalfewshotspatialcontrol,
      title={Universal Few-Shot Spatial Control for Diffusion Models},
      author={Kiet T. Nguyen and Chanhyuk Lee and Donggyun Kim and Dong Hoon Lee and Seunghoon Hong},
      year={2025},
      eprint={2509.07530},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.07530},
}
```
