Multi-LoRA Composition Without Training

Multi-LoRA Composition | Multi-LoRA Composition Without Training

This article primarily introduces methods for switching and combining any number of LoRA without training. In simple terms, it means that the characteristics of each LoRA can be synthesized into a single image. For instance, three LoRAs representing characters, clothing, and objects can be combined into one image without training, maintaining accuracy.

Plus: [Research Lovers] Communication: For convenience, send: 666

Introduction

01 Highlights Introduction
02 Sneak Peek
03 Test Results

Methodology

01 Comparison of Three LoRA Combination Methods
02 Principles of LoRA Switching and Combining
03 GPT-4V Based Evaluator
04 Types of LoRA Models

Content Overview
Practical Tutorial
Application Scenarios
[Research Lovers] Communication

Introduction

This project explores new methods for text-to-image generation, focusing on integrating multiple Low-Rank Adaptations (LoRA) to create highly customized and detailed images. The introduction of LoRA Switch and LoRA Composite aims to surpass traditional techniques in terms of accuracy and image quality, particularly in complex compositions.

01 Highlights Introduction

Switching or combining any number of LoRA models without training

LoRA Switch and LoRA Composite dynamically and accurately integrate multiple LoRAs without fine-tuning.
Unlike methods that merge LoRA weights, this approach focuses on the decoding process, keeping all LoRA weights intact.

02 Sneak Peek

03 Test Results

Methodology

01 Comparison of Three LoRA Combination Methods

LoRA Merge: Merging, combining multiple into one

A popular method for integrating multiple elements uniformly in an image.
It synthesizes a unified LoRA by linearly combining multiple LoRAs and then inserting it into the text-to-image model.
LoRA Merge completely ignores the interaction with the diffusion model during the generation process, leading to distortions of elements like hamburgers and fingers in the image.

LoRA Switch (LoRA-S): Switching, alternating LoRA during training

To explore activating a single LoRA at each denoising step, we propose LoRA Switch.
This method introduces a dynamic adaptation mechanism in the diffusion model by sequentially activating individual LoRAs at specified time intervals throughout the decoding process.
As shown, each LoRA is represented by a unique color corresponding to a specific element, with only one LoRA used at each denoising step.

LoRA Composite (LoRA-C): Combining multiple LoRA elements into one image

To explore merging all LoRAs at each time step without merging weight matrices, we propose LoRA Composite.
This involves separately calculating unconditional and conditional score estimates for each LoRA at each step.
By aggregating these scores, this technique ensures balanced guidance throughout the image generation process, facilitating the close integration of all elements represented by different LoRAs.

02 Principles of LoRA Switching and Combining

These two methods, LoRA SWITCH and LoRA COMPOSITE, are applied during the decoding process and do not require training or adjustment of LoRA weights. Here is a detailed introduction to both methods:

LoRA SWITCH (LoRA-S):

The LoRA SWITCH method achieves dynamic adaptation by selectively activating individual LoRAs at each denoising step. During generation, the model rotates between different LoRAs, ensuring each element is rendered accurately.
For example, in a virtual fitting scene, LoRA SWITCH might alternate between activating character LoRA and clothing LoRA in consecutive denoising steps to ensure each element is clearly presented.
This method activates LoRAs in a pre-arranged sequence, with each LoRA activated at specific denoising steps, rotating in order, allowing each element to contribute multiple times during the image generation process.

LoRA COMPOSITE (LoRA-C):

The LoRA COMPOSITE method is inspired by classifier-free guidance, which calculates unconditional and conditional score estimates for each LoRA at each denoising step and averages these scores to provide balanced guidance for image generation.
This method ensures that all LoRAs can effectively contribute during the image generation process, addressing stability and detail retention issues that may arise when merging LoRAs.
Specifically, LoRA COMPOSITE aggregates scores from each LoRA based on text conditions c at each denoising step to form a comprehensive guiding score, thereby integrating all elements in a balanced manner throughout the generation process.

Both methods avoid direct manipulation of the LoRA weight matrices, achieving multi-LoRA combinations by influencing the diffusion process. This approach allows for the flexible combination of any number of LoRAs without sacrificing image quality, overcoming the limitation of existing research that typically only merges two LoRAs.

03 GPT-4V Based Evaluator

The proposed methods consistently outperform LoRA Merge across all configurations and two dimensions, with superiority increasing as the number of LoRAs grows.
LoRA Switch demonstrates exceptional performance in synthesis quality, while LoRA Composite excels in image quality.
Generating composite images remains a challenging task, especially as the number of elements to be synthesized increases.

04 Types of LoRA Models

ComposLoRA features 22 LoRAs and 480 different combination sets, allowing for the generation of images with any combination of 2-5 LoRAs, including at least one character LoRA.

Content Overview

Paper Title: “Multi-LoRA Composition for Image Generation”

Abstract:

The paper introduces the widely used Low-Rank Adaptation (LoRA) technique in text-to-image models for accurately rendering specific elements in generated images, such as unique characters or styles.
Existing methods face challenges in effectively combining multiple LoRAs, especially as the number of integrated LoRAs increases, limiting the creation of complex images.
The paper studies multi-LoRA combinations from the perspective of decoding and proposes two training-free methods: LoRA SWITCH and LoRA COMPOSITE.
LoRA SWITCH alternates the activation of different LoRAs at each denoising step, while LoRA COMPOSITE simultaneously combines all LoRAs to guide more coherent image synthesis.
To evaluate these methods, the researchers established a new comprehensive testing platform, ComposLoRA, containing 480 combination sets.
Using a GPT-4V-based evaluation framework, the results indicate that these methods significantly outperform existing baselines, especially as the number of LoRAs in combination increases.

Introduction:

The paper discusses the application of LoRA in image generation and how it achieves personalized and realistic image representations.
It emphasizes the importance of composability in controllable image generation and proposes strategies for achieving advanced customization by combining multiple LoRAs focused on different elements.

Methods:

The paper details the basic concepts of LoRA, including diffusion models, classifier-free guidance, and LoRA merging.
It presents two new multi-LoRA combination methods: LoRA SWITCH and LoRA COMPOSITE, both of which avoid manipulating LoRA weight matrices and instead directly influence the diffusion process.

Experiments:

The ComposLoRA testing platform is introduced, containing various LoRA categories and 480 combination sets.
Using GPT-4V as an evaluator, image quality and combination effects are assessed.
Experimental results show that the proposed LoRA SWITCH and LoRA COMPOSITE methods outperform the LoRA Merge method in performance.

Analysis:

The paper analyzes the influence of different image styles (realistic and anime styles) on method performance.
It explores the impact of LoRA activation order and step size on LoRA SWITCH performance.
The potential bias of GPT-4V as an evaluator is analyzed.

Related Work:

The paper reviews relevant research on composable text-to-image generation and studies based on LoRA.

Conclusion:

The paper presents the first attempt to explore multi-LoRA combinations from the perspective of decoding and introduces the LoRA-S and LoRA-C methods, which surpass the current weight-based operational limitations.
By establishing a dedicated testing platform, ComposLoRA, scalable automated evaluation metrics are introduced using GPT-4V for assessments.
The research not only highlights the superior quality achieved by these methods but also provides new standards for evaluating LoRA-based composable image generation.

Appendix:

Detailed descriptions of each LoRA in ComposLoRA and the complete evaluation prompts and results used for GPT-4V comparative assessments are provided.

Practical Tutorial

from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    'SG161222/Realistic_Vision_V5.1_noVAE',
    custom_pipeline="MingZhong/StableDiffusionPipeline-with-LoRA-C",
    use_safetensors=True
).to("cuda")

# Load LoRAs
lora_path = 'models/lora/reality'
pipeline.load_lora_weights(lora_path, weight_name="character_2.safetensors", adapter_name="character")
pipeline.load_lora_weights(lora_path, weight_name="clothing_2.safetensors", adapter_name="clothing")

# List of LoRAs to be composed
cur_loras = ["character", "clothing"]

# Set the prompts for image generation
prompt = "RAW photo, subject, 8k uhd, dslr, high quality, Fujifilm XT3, half-length portrait from knees up, scarlett, short red hair, blue eyes, school uniform, white shirt, red tie, blue pleated microskirt"
negative_prompt = "extra heads, nsfw, deformed iris, deformed pupils, semi-realistic, cgi, 3d, render, sketch, cartoon, drawing, anime, text, cropped, out of frame, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck"

# Generate and save the image
generator = torch.manual_seed(11)
image = pipeline(
    prompt=prompt, 
    negative_prompt=negative_prompt,
    height=1024,
    width=768,
    num_inference_steps=100,
    guidance_scale=7,
    generator=generator,
    cross_attention_kwargs={"scale": 0.8},
    callback_on_step_end=switch_callback,
    lora_composite=True if method == "composite" else False
).images[0]
image.save('example.png')

Application Scenarios

The proposed multi-LoRA combination methods (LoRA SWITCH and LoRA COMPOSITE) can be applied in various image generation scenarios, especially where precise control and combination of multiple visual elements are required. Here are some specific application scenarios:

Personalized Image Generation:

Users can customize images according to their preferences, such as generating personalized images with specific characters, clothing, or styles.

Virtual Try-On:

In e-commerce, users can upload their photos and try on different clothing or accessories to see the matching effect.

Anime Style Transformation:

Converting realistic style photos into anime style, or achieving personalized designs of specific characters in anime creation.

Game Character Design:

In game development, designers can quickly generate or modify the appearance of characters, including clothing, hairstyles, and accessories.

Film and Animation Production:

In film and animation production, these techniques can be used to create complex scenes and characters, improving production efficiency.

Art Creation Assistance:

Artists can use these tools to assist in creation, such as generating sketches or concept art, and then refining based on that.

Social Media Content Creation:

Social media users can use these tools to generate unique image content for personal expression or attracting attention.

Education and Training:

In education, customized images can be created to assist teaching, such as recreating historical scenes or visualizing scientific concepts.

Advertising and Marketing:

Advertisers can use these techniques to quickly generate appealing advertising images to cater to different markets and customer groups.

Data Augmentation:

In machine learning, these methods can be used to generate new training data, especially when the dataset is limited.

These application scenarios demonstrate the broad potential of multi-LoRA combination methods in the field of image generation, especially in situations requiring high customization and creative expression. Through these methods, users can achieve finer control over images, creating visual content that meets specific needs.

[Research Lovers] Communication

For convenience, send: 666

Multi-LoRA Composition Without Training

Table of Contents