Overview

Relevant source files

Purpose and Scope

LightX2V is an advanced lightweight inference framework for image and video generation models. It provides a unified platform for running state-of-the-art generative models with extreme performance optimization and memory efficiency. The framework supports diverse generation tasks including text-to-video (T2V), image-to-video (I2V), speech-to-video (S2V), text-to-image (T2I), and image-to-image editing (I2I).

X2V denotes the transformation of arbitrary input modalities (X) into vision outputs (Vision), encompassing text, images, audio, and their combinations.

This page provides a high-level overview of the LightX2V framework architecture, supported models, and key capabilities. For detailed information on specific topics:

Installation and setup: see Getting Started
User interfaces and APIs: see User Interfaces
Performance optimization techniques: see Performance Optimization
Specific model runners: see Model Runners and Tasks
Hardware platform support: see Hardware Platform Support

Sources: README.md1-19 README_zh.md1-19

System Architecture

LightX2V is organized into five distinct architectural layers, each serving a specific purpose in the inference pipeline:

Layer 1: Entry Points - Five different interfaces provide access to the framework, from direct Python API usage to production-ready distributed servers.

Layer 2: Core Framework - The LightX2VPipeline class orchestrates the entire generation workflow, managing configuration, runner selection, and optimization application. The RUNNER_REGISTER implements a factory pattern for model-specific runner instantiation.

Layer 3: Model Runners - Task-specific runners inherit from BaseRunner and DefaultRunner to implement model-specific inference logic. Each runner handles its model family's unique requirements while maintaining a consistent interface.

Layer 4: Model Components - Reusable components including transformer models, input encoders (text, image, audio), VAE encoder/decoders, and diffusion schedulers. These components are shared across different runners where applicable.

Layer 5: Optimization & Hardware - Cross-cutting optimization features (quantization, offloading, caching, parallelism) and hardware abstraction through lightx2v_platform enable deployment across diverse hardware backends.

Sources: README.md1-340 High-level Diagram 1

Supported Model Families and Tasks

LightX2V integrates multiple state-of-the-art model families, each optimized for specific generation tasks:

Model Family	Primary Tasks	Key Features	Runner Classes
WAN 2.1/2.2	T2V, I2V, S2V	Audio-driven video, MoE architecture, distilled variants	`WanRunner`, `WanAudioRunner`, `Wan22AudioRunner`, `WanDistillRunner`
Qwen-Image	T2I, I2I	Vision-language models, layered generation, image editing	`QwenImageRunner`
Z-Image	T2I, I2I	Fast turbo variants, Qwen3 text encoder	`ZImageRunner`
HunyuanVideo 1.5	T2V, I2V	720p support, 4-step distillation, LightTAE support	`HunyuanVideo15Runner`
LTX-2	T2AV, I2AV	Simultaneous audio+video generation, multi-stage pipeline	`LTX2Runner`
LongCat	T2I, I2I	High-resolution image generation	`LongCatImageRunner`
CausVid	T2V	Autoregressive generation with KV cache	`WanCausVidRunner`

Task Type Definitions

T2V: Text-to-Video - Generate video sequences from text descriptions
I2V: Image-to-Video - Animate static images into video sequences
S2V: Speech-to-Video - Generate videos driven by audio/speech inputs
T2I: Text-to-Image - Generate images from text descriptions
I2I: Image-to-Image - Edit or transform existing images
T2AV: Text-to-Audio+Video - Generate synchronized audio and video from text
I2AV: Image-to-Audio+Video - Generate audio+video from image conditioning

The WAN family is the most extensively featured, particularly the audio-to-video (S2V) capability which is unique among open-source models. The framework also provides distilled and quantized variants across multiple model families, enabling 4-step inference instead of the standard 40-50 steps.

Sources: README.md207-234 README_zh.md206-233 High-level Diagram 2

Core Framework Components

LightX2VPipeline Class

The LightX2VPipeline serves as the primary user-facing API. It provides a fluent interface for configuring and executing inference:

Initialization Methods:

__init__(model_path, model_cls, task) - Initialize with model location and type
create_generator(config_json=None, **kwargs) - Configure generation parameters

Optimization Methods:

enable_offload(cpu_offload, offload_granularity) - Configure CPU/disk offloading
enable_quantize(dit_quantized, text_encoder_quantized) - Enable weight quantization
enable_lightvae(use_tae, use_lightvae) - Use lightweight VAE variants
enable_lora(lora_path, lora_alpha) - Apply LoRA adapters

Execution Methods:

generate(prompt, seed, image_path, audio_path, **kwargs) - Execute generation
switch_lora(lora_path, lora_alpha) - Dynamically switch LoRA at runtime

Runner System

The runner system uses a registry pattern (RUNNER_REGISTER) to map model types to their corresponding runner implementations. Each runner implements:

init_modules() - Load and initialize model components
run_pipeline() - Execute the complete inference pipeline
run_input_encoder() - Process inputs (text, image, audio)
run_dit() - Execute diffusion transformer denoising
run_vae_decoder() - Decode latents to pixels

Sources: README.md137-197 High-level Diagrams 1, 3, 4

Key Optimization Features

LightX2V achieves approximately 20-25x speedup through a comprehensive suite of optimizations that can be applied independently or combined:

Performance Benchmarks

Cross-Framework Comparison (H100, Wan2.1-I2V-14B-480P):

Framework	Single GPU	8 GPUs	Speedup
Diffusers (baseline)	9.77s/it	-	1.0x
xDiT	8.93s/it	2.70s/it	1.1x
FastVideo	7.35s/it	2.94s/it	1.3x
SGL-Diffusion	6.13s/it	1.19s/it	1.6-2.5x
LightX2V	5.18s/it	0.75s/it	1.9-3.9x
LightX2V + FP8	-	0.35s/it	Up to 28x

Combined with 4-step distillation, the total speedup reaches approximately ~200x compared to baseline 50-step inference.

Optimization Categories

1. Step Distillation

Reduces 40-50 diffusion steps to 4 steps
Eliminates CFG requirement (Classifier-Free Guidance)
Available for WAN, Qwen-Image, HunyuanVideo families
Achieves ~10-12x speedup from step reduction alone

2. Quantization

w8a8-int8: INT8 weights and activations
w8a8-fp8: FP8 precision (NVIDIA H100+)
w4a4-nvfp4: 4-bit NVIDIA FP4 format
MXFP4/6/8: Microscaling formats
Conversion tools in tools/convert/converter.py

3. Memory Offloading

Disk → CPU → GPU three-tier architecture
Granularities: model-level, block-level, phase-level
WeightAsyncStreamManager with double-buffering
Enables 14B models on 8GB VRAM + 16GB RAM

4. Attention Operators

Sage Attention (sage_attn2)
Flash Attention 2/3 (flash_attn2, flash_attn3)
Radial Attention for sparse patterns
Neighborhood attention for local focus

5. Feature Caching

Tea Cache: cache early denoising steps
Mag Cache: cache magnitude features
Taylor Cache: Taylor expansion approximation
Ada Cache: adaptive caching strategies

6. Distributed Processing

CFG Parallelism: separate GPUs for conditional/unconditional
Tensor Parallelism: split model weights across GPUs
Sequence Parallelism: Ulysses and Ring attention
Pipeline Parallelism: stage-based execution

Sources: README.md72-114 README.md254-272 High-level Diagram 5

Hardware Platform Support

The lightx2v_platform module provides hardware abstraction, enabling deployment across diverse compute backends:

Supported Hardware Platforms

Platform	Device Types	Status	Docker Images
NVIDIA CUDA	RTX 30/40/50, A100, H100	Primary target	cu124, cu128
Cambricon	MLU590	Supported	Available
MetaX	C500	Supported	Available
Hygon	DCU	Supported	Available
Ascend	910B	Supported	Available
AMD	ROCm MI series	Supported	Available
MThreads	MUSA	Supported	Available
Enflame	S60 GCU	Supported	Available

The platform abstraction layer allows model developers to write hardware-agnostic code while the lightx2v_platform module handles backend-specific implementations. This architecture is particularly important for Chinese domestic chip adoption where multiple vendors provide AI accelerators.

Docker environments for each platform are available in dockerfiles/platforms/, and usage scripts are provided in scripts/platforms/.

Sources: lightx2v_platform/README.md1-19 lightx2v_platform/README_zh.md1-20 README.md43-66 High-level Diagram 6

Deployment Interfaces

LightX2V provides five distinct interfaces for different deployment scenarios:

1. Python API (LightX2VPipeline)

Direct Python integration for programmatic usage and custom workflows. Provides full control over all configuration parameters.

Example initialization:

2. Command Line Interface

Shell script execution via lightx2v.infer command, accepting JSON configuration files or command-line arguments for batch processing and automation.

3. Gradio Web Interface

Interactive web UI defined in app/gradio_demo.py. Provides user-friendly controls for model selection, parameter tuning, and result visualization. Configurable via run_gradio.sh/bat scripts.

4. ComfyUI Integration

Node-based workflow integration for complex multi-stage generation pipelines. Enables visual programming of generation workflows with LightX2V models as nodes.

5. Distributed Server (TorchrunInferenceWorker)

Production-ready HTTP server for multi-worker distributed inference. Supports task queuing, load balancing, and dynamic LoRA serving from directories.

Each interface shares the same underlying LightX2VPipeline infrastructure while providing different levels of abstraction and control appropriate to the use case.

Sources: README.md238-252 High-level Diagram 1

Inference Pipeline Flow

All model runners follow a standardized three-stage inference pipeline:

Stage 1: Input Processing

Configuration loading and validation via set_config()
Runner instantiation from RUNNER_REGISTER
Module loading: transformer, text/image/audio encoders, VAE
Input encoding to latent representations

Stage 2: Diffusion Loop

The core denoising process managed by scheduler classes:

prepare(): Initialize latent tensors with noise
step_pre(): Calculate timestep embeddings
model.infer(): Transformer predicts noise
step_post(): Update latents using predicted noise

For distilled models, this loop executes only 4 iterations. For standard models, 40-50 iterations.

Stage 3: Decoding

run_vae_decoder(): Convert latents back to pixel space
Post-processing and file I/O
Optional frame interpolation for higher FPS

Optimizations (quantization, offloading, caching, parallelism) are transparently applied throughout this pipeline without changing the fundamental flow.

Sources: High-level Diagram 4, README.md137-197

Model Distribution and Storage

Models are distributed through multiple channels:

Distribution Channel	Region	Model Types	Purpose
HuggingFace Hub	Global	All official models	Primary distribution
ModelScope	China	All official models	China mirror
Quark Cloud	China	Windows packages	One-click installers
lightx2v HuggingFace	Global	Distilled, quantized, LoRA	Optimized variants

Official models include base transformer weights, text encoders (T5, CLIP, Qwen), image encoders, VAEs, and schedulers. The lightx2v HuggingFace organization hosts performance-optimized variants:

4-step distilled models (Wan2.1/2.2, HunyuanVideo)
Quantized models (INT8, FP8, NVFP4)
LoRA adapters for distillation and style control
Lightweight VAE variants (LightTAE)

Models should be stored on SSD for optimal performance when using offloading features. The framework supports both local path specification and automatic download from model hubs.

Sources: README.md207-237 scripts/hunyuan_video_15/README.md16-18

Overview

Relevant source files

Purpose and Scope

X2V denotes the transformation of arbitrary input modalities (X) into vision outputs (Vision), encompassing text, images, audio, and their combinations.

This page provides a high-level overview of the LightX2V framework architecture, supported models, and key capabilities. For detailed information on specific topics:

Installation and setup: see Getting Started
User interfaces and APIs: see User Interfaces
Performance optimization techniques: see Performance Optimization
Specific model runners: see Model Runners and Tasks
Hardware platform support: see Hardware Platform Support

Sources: README.md1-19 README_zh.md1-19

System Architecture

LightX2V is organized into five distinct architectural layers, each serving a specific purpose in the inference pipeline:

Layer 1: Entry Points - Five different interfaces provide access to the framework, from direct Python API usage to production-ready distributed servers.

Sources: README.md1-340 High-level Diagram 1

Supported Model Families and Tasks

LightX2V integrates multiple state-of-the-art model families, each optimized for specific generation tasks:

Model Family	Primary Tasks	Key Features	Runner Classes
WAN 2.1/2.2	T2V, I2V, S2V	Audio-driven video, MoE architecture, distilled variants	`WanRunner`, `WanAudioRunner`, `Wan22AudioRunner`, `WanDistillRunner`
Qwen-Image	T2I, I2I	Vision-language models, layered generation, image editing	`QwenImageRunner`
Z-Image	T2I, I2I	Fast turbo variants, Qwen3 text encoder	`ZImageRunner`
HunyuanVideo 1.5	T2V, I2V	720p support, 4-step distillation, LightTAE support	`HunyuanVideo15Runner`
LTX-2	T2AV, I2AV	Simultaneous audio+video generation, multi-stage pipeline	`LTX2Runner`
LongCat	T2I, I2I	High-resolution image generation	`LongCatImageRunner`
CausVid	T2V	Autoregressive generation with KV cache	`WanCausVidRunner`

Task Type Definitions

T2V: Text-to-Video - Generate video sequences from text descriptions
I2V: Image-to-Video - Animate static images into video sequences
S2V: Speech-to-Video - Generate videos driven by audio/speech inputs
T2I: Text-to-Image - Generate images from text descriptions
I2I: Image-to-Image - Edit or transform existing images
T2AV: Text-to-Audio+Video - Generate synchronized audio and video from text
I2AV: Image-to-Audio+Video - Generate audio+video from image conditioning

Sources: README.md207-234 README_zh.md206-233 High-level Diagram 2

Core Framework Components

LightX2VPipeline Class

The LightX2VPipeline serves as the primary user-facing API. It provides a fluent interface for configuring and executing inference:

Initialization Methods:

__init__(model_path, model_cls, task) - Initialize with model location and type
create_generator(config_json=None, **kwargs) - Configure generation parameters

Optimization Methods:

enable_offload(cpu_offload, offload_granularity) - Configure CPU/disk offloading
enable_quantize(dit_quantized, text_encoder_quantized) - Enable weight quantization
enable_lightvae(use_tae, use_lightvae) - Use lightweight VAE variants
enable_lora(lora_path, lora_alpha) - Apply LoRA adapters

Execution Methods:

generate(prompt, seed, image_path, audio_path, **kwargs) - Execute generation
switch_lora(lora_path, lora_alpha) - Dynamically switch LoRA at runtime

Runner System

The runner system uses a registry pattern (RUNNER_REGISTER) to map model types to their corresponding runner implementations. Each runner implements:

init_modules() - Load and initialize model components
run_pipeline() - Execute the complete inference pipeline
run_input_encoder() - Process inputs (text, image, audio)
run_dit() - Execute diffusion transformer denoising
run_vae_decoder() - Decode latents to pixels

Sources: README.md137-197 High-level Diagrams 1, 3, 4

Key Optimization Features

LightX2V achieves approximately 20-25x speedup through a comprehensive suite of optimizations that can be applied independently or combined:

Performance Benchmarks

Cross-Framework Comparison (H100, Wan2.1-I2V-14B-480P):

Framework	Single GPU	8 GPUs	Speedup
Diffusers (baseline)	9.77s/it	-	1.0x
xDiT	8.93s/it	2.70s/it	1.1x
FastVideo	7.35s/it	2.94s/it	1.3x
SGL-Diffusion	6.13s/it	1.19s/it	1.6-2.5x
LightX2V	5.18s/it	0.75s/it	1.9-3.9x
LightX2V + FP8	-	0.35s/it	Up to 28x

Combined with 4-step distillation, the total speedup reaches approximately ~200x compared to baseline 50-step inference.

Optimization Categories

1. Step Distillation

Reduces 40-50 diffusion steps to 4 steps
Eliminates CFG requirement (Classifier-Free Guidance)
Available for WAN, Qwen-Image, HunyuanVideo families
Achieves ~10-12x speedup from step reduction alone

2. Quantization

w8a8-int8: INT8 weights and activations
w8a8-fp8: FP8 precision (NVIDIA H100+)
w4a4-nvfp4: 4-bit NVIDIA FP4 format
MXFP4/6/8: Microscaling formats
Conversion tools in tools/convert/converter.py

3. Memory Offloading

Disk → CPU → GPU three-tier architecture
Granularities: model-level, block-level, phase-level
WeightAsyncStreamManager with double-buffering
Enables 14B models on 8GB VRAM + 16GB RAM

4. Attention Operators

Sage Attention (sage_attn2)
Flash Attention 2/3 (flash_attn2, flash_attn3)
Radial Attention for sparse patterns
Neighborhood attention for local focus

5. Feature Caching

Tea Cache: cache early denoising steps
Mag Cache: cache magnitude features
Taylor Cache: Taylor expansion approximation
Ada Cache: adaptive caching strategies

6. Distributed Processing

CFG Parallelism: separate GPUs for conditional/unconditional
Tensor Parallelism: split model weights across GPUs
Sequence Parallelism: Ulysses and Ring attention
Pipeline Parallelism: stage-based execution

Sources: README.md72-114 README.md254-272 High-level Diagram 5

Hardware Platform Support

The lightx2v_platform module provides hardware abstraction, enabling deployment across diverse compute backends:

Supported Hardware Platforms

Platform	Device Types	Status	Docker Images
NVIDIA CUDA	RTX 30/40/50, A100, H100	Primary target	cu124, cu128
Cambricon	MLU590	Supported	Available
MetaX	C500	Supported	Available
Hygon	DCU	Supported	Available
Ascend	910B	Supported	Available
AMD	ROCm MI series	Supported	Available
MThreads	MUSA	Supported	Available
Enflame	S60 GCU	Supported	Available

Docker environments for each platform are available in dockerfiles/platforms/, and usage scripts are provided in scripts/platforms/.

Sources: lightx2v_platform/README.md1-19 lightx2v_platform/README_zh.md1-20 README.md43-66 High-level Diagram 6

Deployment Interfaces

LightX2V provides five distinct interfaces for different deployment scenarios:

1. Python API (LightX2VPipeline)

Direct Python integration for programmatic usage and custom workflows. Provides full control over all configuration parameters.

Example initialization:

2. Command Line Interface

Shell script execution via lightx2v.infer command, accepting JSON configuration files or command-line arguments for batch processing and automation.

3. Gradio Web Interface

Interactive web UI defined in app/gradio_demo.py. Provides user-friendly controls for model selection, parameter tuning, and result visualization. Configurable via run_gradio.sh/bat scripts.

4. ComfyUI Integration

Node-based workflow integration for complex multi-stage generation pipelines. Enables visual programming of generation workflows with LightX2V models as nodes.

5. Distributed Server (TorchrunInferenceWorker)

Production-ready HTTP server for multi-worker distributed inference. Supports task queuing, load balancing, and dynamic LoRA serving from directories.

Each interface shares the same underlying LightX2VPipeline infrastructure while providing different levels of abstraction and control appropriate to the use case.

Sources: README.md238-252 High-level Diagram 1

Inference Pipeline Flow

All model runners follow a standardized three-stage inference pipeline:

Stage 1: Input Processing

Configuration loading and validation via set_config()
Runner instantiation from RUNNER_REGISTER
Module loading: transformer, text/image/audio encoders, VAE
Input encoding to latent representations

Stage 2: Diffusion Loop

The core denoising process managed by scheduler classes:

prepare(): Initialize latent tensors with noise
step_pre(): Calculate timestep embeddings
model.infer(): Transformer predicts noise
step_post(): Update latents using predicted noise

For distilled models, this loop executes only 4 iterations. For standard models, 40-50 iterations.

Stage 3: Decoding

run_vae_decoder(): Convert latents back to pixel space
Post-processing and file I/O
Optional frame interpolation for higher FPS

Optimizations (quantization, offloading, caching, parallelism) are transparently applied throughout this pipeline without changing the fundamental flow.

Sources: High-level Diagram 4, README.md137-197

Model Distribution and Storage

Models are distributed through multiple channels:

Distribution Channel	Region	Model Types	Purpose
HuggingFace Hub	Global	All official models	Primary distribution
ModelScope	China	All official models	China mirror
Quark Cloud	China	Windows packages	One-click installers
lightx2v HuggingFace	Global	Distilled, quantized, LoRA	Optimized variants

Official models include base transformer weights, text encoders (T5, CLIP, Qwen), image encoders, VAEs, and schedulers. The lightx2v HuggingFace organization hosts performance-optimized variants:

4-step distilled models (Wan2.1/2.2, HunyuanVideo)
Quantized models (INT8, FP8, NVFP4)
LoRA adapters for distillation and style control
Lightweight VAE variants (LightTAE)

Models should be stored on SSD for optimal performance when using offloading features. The framework supports both local path specification and automatic download from model hubs.

Sources: README.md207-237 scripts/hunyuan_video_15/README.md16-18

Overview

Purpose and Scope

System Architecture

Supported Model Families and Tasks

Task Type Definitions

Core Framework Components

LightX2VPipeline Class

Runner System

Key Optimization Features

Performance Benchmarks

Optimization Categories

Hardware Platform Support

Supported Hardware Platforms

Deployment Interfaces

1. Python API (LightX2VPipeline)

2. Command Line Interface

3. Gradio Web Interface

4. ComfyUI Integration

5. Distributed Server (TorchrunInferenceWorker)

Inference Pipeline Flow

Stage 1: Input Processing

Stage 2: Diffusion Loop

Stage 3: Decoding

Model Distribution and Storage

On this page

Overview

Purpose and Scope

System Architecture

Supported Model Families and Tasks

Task Type Definitions

Core Framework Components

LightX2VPipeline Class

Runner System

Key Optimization Features

Performance Benchmarks

Optimization Categories

Hardware Platform Support

Supported Hardware Platforms

Deployment Interfaces

1. Python API (LightX2VPipeline)

2. Command Line Interface

3. Gradio Web Interface

4. ComfyUI Integration

5. Distributed Server (TorchrunInferenceWorker)

Inference Pipeline Flow

Stage 1: Input Processing

Stage 2: Diffusion Loop

Stage 3: Decoding

Model Distribution and Storage

On this page