LightX2V is an advanced lightweight inference framework for image and video generation models. It provides a unified platform for running state-of-the-art generative models with extreme performance optimization and memory efficiency. The framework supports diverse generation tasks including text-to-video (T2V), image-to-video (I2V), speech-to-video (S2V), text-to-image (T2I), and image-to-image editing (I2I).
X2V denotes the transformation of arbitrary input modalities (X) into vision outputs (V), encompassing text, images, audio, and their combinations.
This page provides a high-level overview of the LightX2V framework architecture, supported models, and key capabilities. For detailed information on specific topics, refer to the dedicated pages of this wiki.
Sources: README.md1-19 README_zh.md1-19
LightX2V is organized into five distinct architectural layers, each serving a specific purpose in the inference pipeline:
Layer 1: Entry Points - Five different interfaces provide access to the framework, from direct Python API usage to production-ready distributed servers.
Layer 2: Core Framework - The LightX2VPipeline class orchestrates the entire generation workflow, managing configuration, runner selection, and optimization application. The RUNNER_REGISTER implements a factory pattern for model-specific runner instantiation.
Layer 3: Model Runners - Task-specific runners inherit from BaseRunner and DefaultRunner to implement model-specific inference logic. Each runner handles its model family's unique requirements while maintaining a consistent interface.
Layer 4: Model Components - Reusable components including transformer models, input encoders (text, image, audio), VAE encoder/decoders, and diffusion schedulers. These components are shared across different runners where applicable.
Layer 5: Optimization & Hardware - Cross-cutting optimization features (quantization, offloading, caching, parallelism) and hardware abstraction through lightx2v_platform enable deployment across diverse hardware backends.
Sources: README.md1-340 High-level Diagram 1
LightX2V integrates multiple state-of-the-art model families, each optimized for specific generation tasks:
| Model Family | Primary Tasks | Key Features | Runner Classes |
|---|---|---|---|
| WAN 2.1/2.2 | T2V, I2V, S2V | Audio-driven video, MoE architecture, distilled variants | WanRunner, WanAudioRunner, Wan22AudioRunner, WanDistillRunner |
| Qwen-Image | T2I, I2I | Vision-language models, layered generation, image editing | QwenImageRunner |
| Z-Image | T2I, I2I | Fast turbo variants, Qwen3 text encoder | ZImageRunner |
| HunyuanVideo 1.5 | T2V, I2V | 720p support, 4-step distillation, LightTAE support | HunyuanVideo15Runner |
| LTX-2 | T2AV, I2AV | Simultaneous audio+video generation, multi-stage pipeline | LTX2Runner |
| LongCat | T2I, I2I | High-resolution image generation | LongCatImageRunner |
| CausVid | T2V | Autoregressive generation with KV cache | WanCausVidRunner |
The WAN family is the most extensively featured, particularly its speech-to-video (S2V) capability, which is unique among open-source models. The framework also provides distilled and quantized variants across multiple model families, enabling 4-step inference instead of the standard 40-50 steps.
Sources: README.md207-234 README_zh.md206-233 High-level Diagram 2
The LightX2VPipeline serves as the primary user-facing API. It provides a fluent interface for configuring and executing inference:
Initialization Methods:
- __init__(model_path, model_cls, task) - Initialize with model location and type
- create_generator(config_json=None, **kwargs) - Configure generation parameters

Optimization Methods:
- enable_offload(cpu_offload, offload_granularity) - Configure CPU/disk offloading
- enable_quantize(dit_quantized, text_encoder_quantized) - Enable weight quantization
- enable_lightvae(use_tae, use_lightvae) - Use lightweight VAE variants
- enable_lora(lora_path, lora_alpha) - Apply LoRA adapters

Execution Methods:
- generate(prompt, seed, image_path, audio_path, **kwargs) - Execute generation
- switch_lora(lora_path, lora_alpha) - Dynamically switch LoRA at runtime

The runner system uses a registry pattern (RUNNER_REGISTER) to map model types to their corresponding runner implementations. Each runner implements:
- init_modules() - Load and initialize model components
- run_pipeline() - Execute the complete inference pipeline
- run_input_encoder() - Process inputs (text, image, audio)
- run_dit() - Execute diffusion transformer denoising
- run_vae_decoder() - Decode latents to pixels

Sources: README.md137-197 High-level Diagrams 1, 3, 4
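The registry-plus-factory pattern described above can be sketched as follows. This is a minimal illustration: apart from the RUNNER_REGISTER, BaseRunner, and WanRunner names taken from this page, all details (decorator, keys, method bodies) are assumptions, not the actual LightX2V implementation.

```python
# Minimal sketch of a runner registry (illustrative, not the real code).
# RUNNER_REGISTER maps a model-class key to the runner type that knows
# how to drive that model family.
RUNNER_REGISTER = {}

def register_runner(model_cls):
    """Class decorator that records a runner under a model-class key."""
    def wrapper(runner_cls):
        RUNNER_REGISTER[model_cls] = runner_cls
        return runner_cls
    return wrapper

class BaseRunner:
    def __init__(self, config):
        self.config = config

    def run_pipeline(self):
        raise NotImplementedError

@register_runner("wan2.1")
class WanRunner(BaseRunner):
    def run_pipeline(self):
        return "wan pipeline executed"

def create_runner(model_cls, config):
    """Factory: look up and instantiate the runner for a model type."""
    return RUNNER_REGISTER[model_cls](config)

print(create_runner("wan2.1", config={}).run_pipeline())
```

The benefit of this pattern is that adding a new model family only requires defining and registering a runner class; the pipeline's dispatch code never changes.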
LightX2V achieves approximately 20-25x speedup through a comprehensive suite of optimizations that can be applied independently or combined:
Cross-Framework Comparison (H100, Wan2.1-I2V-14B-480P):
| Framework | Single GPU | 8 GPUs | Speedup |
|---|---|---|---|
| Diffusers (baseline) | 9.77s/it | - | 1.0x |
| xDiT | 8.93s/it | 2.70s/it | 1.1x |
| FastVideo | 7.35s/it | 2.94s/it | 1.3x |
| SGL-Diffusion | 6.13s/it | 1.19s/it | 1.6-2.5x |
| LightX2V | 5.18s/it | 0.75s/it | 1.9-3.9x |
| LightX2V + FP8 | - | 0.35s/it | Up to 28x |
Combined with 4-step distillation, the total speedup reaches approximately 200x compared to baseline 50-step inference.
1. Step Distillation
2. Quantization
   - w8a8-int8: INT8 weights and activations
   - w8a8-fp8: FP8 precision (NVIDIA H100+)
   - w4a4-nvfp4: 4-bit NVIDIA FP4 format
   - MXFP4/6/8: Microscaling formats
   - Conversion tooling: tools/convert/converter.py
3. Memory Offloading
   - WeightAsyncStreamManager with double-buffering
4. Attention Operators
5. Feature Caching
6. Distributed Processing
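To make the w8a8-int8 idea concrete, here is a sketch of symmetric per-tensor INT8 weight quantization in pure Python. This is illustrative only; the framework's actual quantization uses fused GPU kernels and per-channel scales via tools/convert/converter.py.

```python
# Illustrative sketch of symmetric per-tensor INT8 quantization, the
# idea behind the w8a8-int8 mode (pure Python for clarity; real
# implementations operate on tensors with fused kernels).
def quantize_int8(weights):
    """Map float weights to int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]

w = [0.52, -1.27, 0.03, 0.98]
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
# Reconstruction error is bounded by half a quantization step (scale / 2)
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```

Storing int8 values plus a scale halves memory versus FP16 and enables INT8 matrix-multiply hardware; the same scheme extends to activations (the "a8" half of w8a8).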
Sources: README.md72-114 README.md254-272 High-level Diagram 5
The lightx2v_platform module provides hardware abstraction, enabling deployment across diverse compute backends:
| Platform | Device Types | Status | Docker Images |
|---|---|---|---|
| NVIDIA CUDA | RTX 30/40/50, A100, H100 | Primary target | cu124, cu128 |
| Cambricon | MLU590 | Supported | Available |
| MetaX | C500 | Supported | Available |
| Hygon | DCU | Supported | Available |
| Ascend | 910B | Supported | Available |
| AMD | ROCm MI series | Supported | Available |
| MThreads | MUSA | Supported | Available |
| Enflame | S60 GCU | Supported | Available |
The platform abstraction layer allows model developers to write hardware-agnostic code while the lightx2v_platform module handles backend-specific implementations. This architecture is particularly important for Chinese domestic chip adoption where multiple vendors provide AI accelerators.
Docker environments for each platform are available in dockerfiles/platforms/, and usage scripts are provided in scripts/platforms/.
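One common way such a hardware abstraction layer selects a backend is by probing for vendor-specific device modules. The sketch below is a conceptual illustration, not the actual lightx2v_platform internals; the probe module names are assumptions based on each vendor's typical PyTorch plugin.

```python
# Conceptual sketch of backend auto-detection in a hardware abstraction
# layer (illustrative; actual lightx2v_platform code may differ). Model
# code asks for the active backend instead of hard-coding CUDA.
import importlib

BACKEND_PROBES = [
    # (backend name, module whose importability suggests the backend)
    ("cuda", "torch.cuda"),
    ("mlu", "torch_mlu"),    # Cambricon
    ("npu", "torch_npu"),    # Ascend
    ("musa", "torch_musa"),  # MThreads
]

def detect_backend(default="cpu"):
    """Return the first available backend, falling back to CPU."""
    for name, module in BACKEND_PROBES:
        try:
            importlib.import_module(module)
            return name
        except ImportError:
            continue
    return default
```

Keeping vendor branching behind one function like this is what lets runner and model code stay hardware-agnostic.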
Sources: lightx2v_platform/README.md1-19 lightx2v_platform/README_zh.md1-20 README.md43-66 High-level Diagram 6
LightX2V provides five distinct interfaces for different deployment scenarios:
Direct Python integration for programmatic usage and custom workflows. Provides full control over all configuration parameters.
Example initialization:
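A minimal sketch, assuming the package exposes LightX2VPipeline at the top level; the import path, model path, and all argument values are illustrative placeholders, while the method names follow the API summary earlier on this page. Running it requires downloaded model weights.

```python
# Illustrative initialization sketch: import path and argument values
# are assumptions, not verified defaults of the framework.
from lightx2v import LightX2VPipeline

pipe = LightX2VPipeline(
    model_path="/path/to/Wan2.1-I2V-14B",  # local checkpoint directory
    model_cls="wan2.1",                    # selects the runner via RUNNER_REGISTER
    task="i2v",                            # image-to-video
)
pipe.enable_offload(cpu_offload=True, offload_granularity="block")
pipe.create_generator()
pipe.generate(
    prompt="A sailboat gliding across a calm bay at sunset",
    seed=42,
    image_path="input.jpg",
)
```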
Shell script execution via lightx2v.infer command, accepting JSON configuration files or command-line arguments for batch processing and automation.
Interactive web UI defined in app/gradio_demo.py. Provides user-friendly controls for model selection, parameter tuning, and result visualization. Configurable via run_gradio.sh/bat scripts.
Node-based workflow integration for complex multi-stage generation pipelines. Enables visual programming of generation workflows with LightX2V models as nodes.
Production-ready HTTP server for multi-worker distributed inference. Supports task queuing, load balancing, and dynamic LoRA serving from directories.
Each interface shares the same underlying LightX2VPipeline infrastructure while providing different levels of abstraction and control appropriate to the use case.
Sources: README.md238-252 High-level Diagram 1
All model runners follow a standardized three-stage inference pipeline:
Stage 1: Setup - Configuration is loaded via set_config() and the appropriate runner is selected through RUNNER_REGISTER.

Stage 2: Denoising - The core denoising process is managed by scheduler classes:
- prepare(): Initialize latent tensors with noise
- step_pre(): Calculate timestep embeddings
- model.infer(): Transformer predicts noise
- step_post(): Update latents using predicted noise

For distilled models, this loop executes only 4 iterations; for standard models, 40-50 iterations.

Stage 3: Decoding - run_vae_decoder(): Convert latents back to pixel space.

Optimizations (quantization, offloading, caching, parallelism) are transparently applied throughout this pipeline without changing the fundamental flow.
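The control flow of the scheduler-driven denoising stage can be sketched as below. The method names follow this section, but the tensor math is faked with plain floats so the loop structure is visible; this is not the framework's actual scheduler.

```python
# Conceptual sketch of the scheduler loop (illustrative only; the
# numeric updates stand in for real latent-tensor operations).
class ToySchedulerLoop:
    def __init__(self, num_steps):
        self.num_steps = num_steps  # 4 for distilled, 40-50 for standard
        self.latent = None

    def prepare(self):
        self.latent = 1.0  # stands in for noise-initialized latents

    def step_pre(self, step):
        return step / self.num_steps  # stands in for timestep embedding

    def model_infer(self, latent, t):
        return 0.5 * latent  # stands in for transformer noise prediction

    def step_post(self, noise_pred):
        self.latent -= noise_pred / self.num_steps  # latent update

    def run(self):
        self.prepare()
        for step in range(self.num_steps):
            t = self.step_pre(step)
            noise_pred = self.model_infer(self.latent, t)
            self.step_post(noise_pred)
        return self.latent  # would then go to run_vae_decoder()

print(ToySchedulerLoop(num_steps=4).run())
```

Distillation changes only num_steps, which is why the same loop serves both 4-step and 50-step models.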
Sources: High-level Diagram 4, README.md137-197
Models are distributed through multiple channels:
| Distribution Channel | Region | Model Types | Purpose |
|---|---|---|---|
| HuggingFace Hub | Global | All official models | Primary distribution |
| ModelScope | China | All official models | China mirror |
| Quark Cloud | China | Windows packages | One-click installers |
| lightx2v HuggingFace | Global | Distilled, quantized, LoRA | Optimized variants |
Official models include base transformer weights, text encoders (T5, CLIP, Qwen), image encoders, VAEs, and schedulers. The lightx2v HuggingFace organization hosts performance-optimized variants:
Models should be stored on SSD for optimal performance when using offloading features. The framework supports both local path specification and automatic download from model hubs.
Sources: README.md207-237 scripts/hunyuan_video_15/README.md16-18