Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance
Kuan Heng Lin1*    Sicheng Mo1*    Ben Klingher1    Fangzhou Mu2    Bolei Zhou1
1UCLA    2NVIDIA
*Equal contribution
NeurIPS 2024
Overview
We present Ctrl-X, a simple training-free and guidance-free framework for text-to-image (T2I) generation with structure and appearance control. Given user-provided structure and appearance images, Ctrl-X applies feed-forward structure control to align the output with the structure image and semantic-aware appearance transfer to carry over the look of the appearance image. Ctrl-X supports structure control with arbitrary condition images of any modality, is significantly faster than prior training-free appearance transfer methods, and plugs instantly into any T2I or text-to-video (T2V) diffusion model.
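To make the plug-and-play claim concrete, here is a minimal sketch of how guidance-free control can be hooked into a pretrained pipeline. CtrlXAttnProcessor is a hypothetical stand-in built on Hugging Face diffusers, not our released implementation:

import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.models.attention_processor import AttnProcessor2_0

class CtrlXAttnProcessor(AttnProcessor2_0):
    """Illustrative stand-in: a Ctrl-X-style processor would perform
    structure feature injection and appearance statistics transfer here
    (see the sketch in the next section)."""
    def __call__(self, attn, hidden_states, **kwargs):
        # ... structure injection / appearance transfer would go here ...
        return super().__call__(attn, hidden_states, **kwargs)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# "Instant plug-and-play": no fine-tuned weights to load, no guidance terms,
# just swapping the attention processors of a pretrained U-Net.
pipe.unet.set_attn_processor(CtrlXAttnProcessor())
image = pipe("a photo of a cat").images[0]

Because control happens purely in the forward pass, there is no gradient-based guidance and no per-condition training.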
How does it work?   Given clean structure and appearance latents, we first obtain noised structure and appearance latents via the diffusion forward process, then extract their U-Net features from a pretrained T2I diffusion model. When denoising the output latent, we inject convolution and self-attention features from the structure latent and leverage self-attention correspondence to transfer spatially-aware appearance statistics from the appearance latent, achieving structure and appearance control. We name our method "Ctrl-X" because we reformulate the controllable generation problem by 'cutting' (and 'pasting') structure preservation and semantic-aware stylization together.
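The two operations can be sketched as follows. This is a minimal sketch under stated assumptions: the statistics transfer is written as an attention-weighted, AdaIN-style normalization, and all names are illustrative rather than the released code:

import torch
import torch.nn.functional as F

def forward_diffuse(x0: torch.Tensor, t: torch.Tensor, scheduler) -> torch.Tensor:
    """Diffusion forward process q(x_t | x_0): noise a clean latent to step t."""
    return scheduler.add_noise(x0, torch.randn_like(x0), t)

def inject_structure(output_feats: dict, structure_feats: dict, layers: set) -> dict:
    """Feed-forward structure control: at the chosen layers, overwrite the
    output branch's convolution / self-attention features with those
    extracted from the noised structure latent."""
    return {k: (structure_feats[k] if k in layers else v)
            for k, v in output_feats.items()}

def transfer_appearance(f_out: torch.Tensor, f_app: torch.Tensor) -> torch.Tensor:
    """Semantic-aware appearance transfer (sketch): build a self-attention
    correspondence between output and appearance features, then pull
    spatially weighted mean/std statistics from the appearance features.
    Shapes: (batch, tokens, channels)."""
    d = f_out.shape[-1]
    attn = F.softmax(f_out @ f_app.transpose(-1, -2) / d ** 0.5, dim=-1)
    mean = attn @ f_app                                        # per-token appearance mean
    std = (attn @ f_app.pow(2) - mean.pow(2)).clamp_min(1e-6).sqrt()
    # Normalize output features over tokens, then re-style them with the
    # transferred, spatially-aware appearance statistics.
    f_norm = (f_out - f_out.mean(-2, keepdim=True)) / (f_out.std(-2, keepdim=True) + 1e-6)
    return f_norm * std + mean

In this reading, structure control is a feature swap and appearance control is a correspondence-weighted renormalization; both run feed-forward, with no optimization at inference time.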
Results: Structure and appearance control
Results of training-free and guidance-free T2I generation with structure and appearance control. Ctrl-X supports a wide variety of structure images, including natural images, ControlNet-supported conditions (e.g., canny edge maps, normal maps), and in-the-wild conditions (e.g., wireframes, 3D meshes). The base model here is Stable Diffusion XL v1.0.
Results: Multi-subject structure and appearance control
Ctrl-X is capable of multi-subject generation with semantic correspondence between the appearance and structure images across both subjects and backgrounds. In comparison, ControlNet + IP-Adapter often fails to transfer all subject and background appearances.
Results: Prompt-driven conditional generation
Ctrl-X also supports prompt-driven conditional generation: it generates an output image that complies with the given text prompt while aligning with the structure of the structure image. Any structure image/condition type remains supported here as well. The base model here is Stable Diffusion XL v1.0.
Results: Extension to video generation
We can directly apply Ctrl-X to text-to-video (T2V) models. Here we show results with AnimateDiff v1.5.3 (base model: Realistic Vision v5.1).
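A sketch of what this could look like with diffusers' AnimateDiff pipeline, reusing the hypothetical CtrlXAttnProcessor stand-in from above (the hub repository names follow the setup stated here):

import torch
from diffusers import AnimateDiffPipeline, MotionAdapter
from diffusers.models.attention_processor import AttnProcessor2_0

class CtrlXAttnProcessor(AttnProcessor2_0):  # same illustrative stand-in as above
    def __call__(self, attn, hidden_states, **kwargs):
        # ... structure injection / appearance transfer would go here ...
        return super().__call__(attn, hidden_states, **kwargs)

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-3")
pipe = AnimateDiffPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

# Registered on every attention layer for brevity; in practice one would
# target only the spatial self-attention blocks and leave the temporal
# (motion) attention untouched.
pipe.unet.set_attn_processor(CtrlXAttnProcessor())
frames = pipe("a dog running on the beach", num_frames=16).frames[0]

The video U-Net keeps the same spatial self-attention and convolution layers as the T2I model, which is why the same hooks carry over.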
BibTeX
@inproceedings{lin2024ctrlx,
    author = {Lin, {Kuan Heng} and Mo, Sicheng and Klingher, Ben and Mu, Fangzhou and Zhou, Bolei},
    booktitle = {Advances in Neural Information Processing Systems},
    title = {Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance},
    year = {2024}
}
Related Work
Sicheng Mo, Fangzhou Mu, Kuan Heng Lin, Yanli Liu, Bochen Guan, Yin Li, Bolei Zhou. FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition. CVPR 2024.
Comment: Training-free conditional generation by guidance in diffusion U-Net subspaces for structure control and appearance regularization.
Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or. Cross-Image Attention for Zero-Shot Appearance Transfer. SIGGRAPH 2024.
Comment: Guidance-free appearance transfer to natural images with self-attention key + value swaps via cross-image correspondence.