Figure 1: Results of our method (UNet) learned with 30 examples on unseen spatial conditions. The proposed control adapter guides the pre-trained T2I models in a versatile and data-efficient manner.
Kiet T. Nguyen, Chanhyuk Lee, Donggyun Kim, Dong Hoon Lee, Seunghoon Hong
KAIST
This repository contains the official implementation of Universal Few-Shot Spatial Control for Diffusion Models (UFC).
UFC is a versatile few-shot control adapter capable of generalizing to novel spatial conditions, thereby enabling fine-grained control over the structure of generated images. Our method is applicable to both UNet and DiT diffusion backbones.
Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves performance competitive with fully supervised baselines on various control tasks. We also show that UFC is backbone-agnostic and demonstrate its effectiveness on both UNet and DiT architectures.
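As a rough intuition for the matching step (a toy sketch only, not UFC's actual architecture — all names and shapes below are illustrative): the query condition's features attend to the few support conditions, and the resulting weights aggregate the supports' paired control features into a task-specific control feature for the query.

```python
import numpy as np

def match_control_features(query_feat, support_feats, support_ctrl):
    """Toy attention-style matching between a query condition and a few
    support conditions (illustrative only; not the actual UFC module).

    query_feat:    (d,)   features of the query condition
    support_feats: (k, d) features of the k support conditions
    support_ctrl:  (k, m) control features paired with each support
    Returns an (m,) task-specific control feature for the query.
    """
    d = query_feat.shape[0]
    logits = support_feats @ query_feat / np.sqrt(d)  # similarity scores
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                          # softmax over supports
    return weights @ support_ctrl                     # weighted aggregation
```

With a single support pair the output is simply that support's control feature, which matches the intuition that few-shot control copies structure cues from the closest support examples.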
- Release code
- Release evaluation data
- Release checkpoints
- Provide support data for generation
- This codebase is developed on PyTorch 2.6.0, CUDA 11.8, and Python 3.11.11.
- Install other dependencies via:

```shell
pip install -r requirements.txt
```
If you want to prepare the spatial conditions for your own dataset, please refer to the following files:
- `annotate_data.py`: extracts conditions for tasks other than densepose
- `extract_densepose.py`: extracts the densepose condition
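For intuition on what a spatial condition looks like, here is a minimal gradient-magnitude edge map in plain NumPy — a crude stand-in for a Canny-style condition (the repo's `annotate_data.py` uses proper detectors; the threshold below is an arbitrary illustrative choice):

```python
import numpy as np

def edge_condition(gray, thresh=0.2):
    """Binary edge map from central differences on a grayscale image.
    gray: 2-D float array in [0, 1]. Returns uint8 map with 255 at edges.
    (Simplified illustration, not the repo's condition extractor.)"""
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]  # horizontal gradient
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]  # vertical gradient
    mag = np.hypot(gx, gy)                    # gradient magnitude
    return (mag > thresh).astype(np.uint8) * 255
```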
We release the evaluation data at this link. After downloading the zip file, extract it and place it in the `datasets` directory.
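A small helper for the extraction step (the archive filename below is a placeholder — use whatever name the release provides):

```python
import pathlib
import zipfile

def extract_eval_data(zip_path, dest="datasets"):
    """Unpack the downloaded evaluation archive into the datasets directory.
    zip_path is whatever the released archive is named (placeholder here)."""
    dest_dir = pathlib.Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    return dest_dir

# Example: extract_eval_data("eval_data.zip")  # filename is an assumption
```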
| Few-shot Task | Few-shot (30-shot) Fine-tuned Model | Base Meta-trained Model | Description |
|---|---|---|---|
| Canny | UNet_canny | UNet_taskgr23 | The base model is trained with 4 tasks: [Depth, Normal, Pose, Densepose] |
| Hed | UNet_hed | UNet_taskgr23 | The base model is trained with 4 tasks: [Depth, Normal, Pose, Densepose] |
| Depth | UNet_depth | UNet_taskgr13 | The base model is trained with 4 tasks: [Canny, HED, Pose, Densepose] |
| Normal | UNet_normal | UNet_taskgr13 | The base model is trained with 4 tasks: [Canny, HED, Pose, Densepose] |
| Pose | UNet_pose | UNet_taskgr12 | The base model is trained with 4 tasks: [Canny, HED, Depth, Normal] |
| Densepose | UNet_densepose | UNet_taskgr12 | The base model is trained with 4 tasks: [Canny, HED, Depth, Normal] |
| Few-shot Task | Few-shot (30-shot) Fine-tuned Model | Base Meta-trained Model | Description |
|---|---|---|---|
| Canny | DiT_canny | DiT_taskgr23 | The base model is trained with 4 tasks: [Depth, Normal, Pose, Densepose] |
| Hed | DiT_hed | DiT_taskgr23 | The base model is trained with 4 tasks: [Depth, Normal, Pose, Densepose] |
| Depth | DiT_depth | DiT_taskgr13 | The base model is trained with 4 tasks: [Canny, HED, Pose, Densepose] |
| Normal | DiT_normal | DiT_taskgr13 | The base model is trained with 4 tasks: [Canny, HED, Pose, Densepose] |
| Pose | DiT_pose | DiT_taskgr12 | The base model is trained with 4 tasks: [Canny, HED, Depth, Normal] |
| Densepose | DiT_densepose | DiT_taskgr12 | The base model is trained with 4 tasks: [Canny, HED, Depth, Normal] |
Training UFC with UNet (Stable Diffusion v1.5) backbone:
```shell
accelerate launch -m src.train15.train \
    --config </path/to/config> \
    --exp_name <exp_name>
```
- We train UFC (UNet) on 8 NVIDIA RTX 3090 GPUs.
Training UFC with DiT (Stable Diffusion v3.5-medium) backbone:
```shell
accelerate launch -m src.train3.train \
    --config </path/to/config> \
    --exp_name <exp_name>
```
- We train UFC (DiT) on 8 NVIDIA A6000 GPUs.
After the meta-training process finishes, the model can be fine-tuned on unseen tasks with a handful of support examples.
Script for UFC with UNet backbone:
```shell
python -m src.train15.fewshot_finetune \
    --config </path/to/config> \
    --ckpt_path </path/to/meta_train_checkpoint> \
    --task <task> \
    --shots <number of fine-tune data> \
    --exp_name <exp_name>
```
`<task>` is one of ["canny", "hed", "depth", "normal", "pose", "densepose"] and must be a task unseen during meta-training.
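The unseen-task constraint can be made concrete with a tiny validation helper (illustrative only; the names below are not part of the repo):

```python
# All spatial control tasks supported by the released models.
ALL_TASKS = ["canny", "hed", "depth", "normal", "pose", "densepose"]

def check_fewshot_task(task, meta_train_tasks):
    """Validate a few-shot task choice: it must be a known task that was
    NOT part of meta-training. (Hypothetical helper, not in the repo.)"""
    if task not in ALL_TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if task in meta_train_tasks:
        raise ValueError(f"{task!r} was already seen during meta-training")
    return task
```

For example, fine-tuning on "canny" is valid against the `taskgr23` base model (trained on depth, normal, pose, densepose), but fine-tuning on "depth" against that same base would be rejected.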
The script for the DiT backbone is the same, but with `train15` replaced by `train3`.
Script for UFC with UNet backbone:
```shell
PYTHONPATH=. python eval/UNet_generation.py \
    --config </path/to/config> \
    --ckpt_path </path/to/meta_train_checkpoint> \
    --task_ckpt_path </path/to/finetune_checkpoint> \
    --task <task> --shots 5 --batch_size 8
```
The script for the DiT backbone is the same, but with `UNet_generation.py` replaced by `DiT_generation.py`.
We evaluate UFC using both quantitative and qualitative metrics to assess its performance and controllability under various spatial conditions.
To compute the Fréchet Inception Distance (FID) between generated and reference images, run:
```shell
python -m pytorch_fid </path/to/generated_images> </path/to/reference_images>
```
- For tasks ["canny", "hed", "depth", "normal"], use 5,000 images from the validation split of COCO2017 as reference images.
- For tasks ["pose", "densepose"], use images containing humans from the validation split of COCO2017 as reference images. Please check the `pose_imgs` and `densepose_imgs` directories in `coco2017/val2017`.
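The `pytorch_fid` command handles the Inception feature statistics; underneath, FID is the Fréchet distance between two Gaussians (mu1, sigma1) and (mu2, sigma2): ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2)). A minimal sketch of that formula (for intuition only — use the CLI above for actual evaluation):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians. FID applies this to the
    Inception-feature mean/covariance of generated vs. reference images."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2).real  # matrix square root
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical statistics give a distance of 0; the score grows as the generated distribution drifts from the reference distribution.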
- For tasks other than "densepose":

  ```shell
  python eval/extract_condition.py --task <task> --path </path/to/generated_images>
  ```

- For the "densepose" task, first install the DensePose dependencies:

  ```shell
  git clone https://github.com/facebookresearch/detectron2.git
  python -m pip install -e detectron2
  pip install git+https://github.com/facebookresearch/detectron2@main#subdirectory=projects/DensePose
  ```

  Then, extract the human body segmentation mask (refer to `scripts/densepose_label.sh`).
- For tasks other than "densepose":

  ```shell
  python eval/metric_calculation.py \
      --task <task> \
      --gen_path </path/to/generation_dir> \
      --gt_path datasets/coco2017/val2017
  ```

- For the "densepose" task:

  ```shell
  python eval/densepose_mIoU.py \
      --predict_path </path/to/extracted_segmentation> \
      --gt_path datasets/coco2017/val2017/densepose/dumpt.pt
  ```
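For reference, mean IoU averages per-class intersection-over-union between predicted and ground-truth segmentation masks. A simplified sketch of the metric (`eval/densepose_mIoU.py` implements the repo's actual version):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes that appear in either mask.
    pred, gt: integer label maps of the same shape.
    (Simplified illustration of the metric, not the repo's script.)"""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))
```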
We develop our method based on the diffusers library and the official code of OminiControl, VTM, and ControlNet. We gratefully acknowledge the authors for making their code publicly available.
This work was in part supported by the National Research Foundation of Korea (RS-2024-00351212 and RS-2024-00436165) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) (RS-2022-II220926, RS-2024-00509279, RS-2021-II212068, RS-2022-II220959, and RS-2019-II190075) funded by the Korea government (MSIT).
If you find this work useful, please consider citing:

```bibtex
@misc{nguyen2025universalfewshotspatialcontrol,
      title={Universal Few-Shot Spatial Control for Diffusion Models},
      author={Kiet T. Nguyen and Chanhyuk Lee and Donggyun Kim and Dong Hoon Lee and Seunghoon Hong},
      year={2025},
      eprint={2509.07530},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.07530},
}
```
