<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by xdit-project on Medium]]></title>
        <description><![CDATA[Stories by xdit-project on Medium]]></description>
        <link>https://medium.com/@xditproject?source=rss-5751c991f249------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*CiOuReT3_s94i6RXHJjKrA.jpeg</url>
            <title>Stories by xdit-project on Medium</title>
            <link>https://medium.com/@xditproject?source=rss-5751c991f249------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 11:30:20 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@xditproject/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Enhancing Parallelism and Speedup for xDiT in Serving the Mochi-1 Video Generation Model]]></title>
            <link>https://medium.com/@xditproject/exploring-the-power-of-xdit-in-speeding-up-mochis-video-generation-19f71d4f8f97?source=rss-5751c991f249------2</link>
            <guid isPermaLink="false">https://medium.com/p/19f71d4f8f97</guid>
            <category><![CDATA[aigc]]></category>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[text-to-video]]></category>
            <category><![CDATA[parallel-computing]]></category>
            <category><![CDATA[video-generation]]></category>
            <dc:creator><![CDATA[xdit-project]]></dc:creator>
            <pubDate>Mon, 11 Nov 2024 07:51:33 GMT</pubDate>
            <atom:updated>2024-11-12T05:11:42.249Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2CWZvPJu3hlurPk5cOVg2w.png" /><figcaption>Mochi 1 preview: <a href="https://github.com/genmoai/models">https://github.com/genmoai/models</a></figcaption></figure><p>The year 2024 has witnessed a surge in interest surrounding AI-generated text-to-video content, captivating the AIGC community. From Sora’s mesmerizing stroll through the streets of Tokyo earlier this year to the emergence of innovative products like Meta Movie Gen, PixelDance, and CogVideo, the industry is abuzz with creativity.</p><p>Last month, the Genmo team made a groundbreaking move by open-sourcing Mochi, a cutting-edge text-to-video model that has revolutionized the field. With an unprecedented 10 billion parameters, Mochi has elevated its ability to understand prompts, excelling in capturing intricate details of physical laws and delivering smoother motion in video generation.</p><p>Despite its prowess, the creators of Mochi revealed that the current inference process demands four H100 GPUs to process a single video, posing challenges for both individual users and enterprise deployments. By meticulously optimizing Mochi’s memory usage and harnessing the capabilities of xDiT for multidimensional parallel inference, we have successfully revamped the inference process to operate efficiently on a single L40 card without compromising video quality. Moreover, through the implementation of parallel inference techniques, we have achieved a remarkable 3.54x reduction in Mochi’s inference latency.</p><h3><strong>The xDiT Framework</strong></h3><p>At the core of Mochi’s text-to-video model lies DiT (Diffusion Transformer), which utilizes a Transformer to predict noise in videos and progressively enhance input quality, transforming user prompts into high-quality videos. 
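</p><p>In outline, the denoising loop behind DiT can be sketched as below. This is a toy illustration only: the real model predicts noise with a 10-billion-parameter Transformer over many carefully scheduled steps, while predict_noise here is a made-up stand-in.</p>

```python
# Toy sketch of iterative denoising: a "model" repeatedly predicts the noise
# still present in the sample, and the loop subtracts it step by step.
def predict_noise(x, step):
    # hypothetical stand-in for the Transformer noise predictor:
    # pretend it estimates half of the remaining signal as noise
    return [v * 0.5 for v in x]

def denoise(x, num_steps=10):
    for step in range(num_steps):
        eps = predict_noise(x, step)
        x = [v - e for v, e in zip(x, eps)]  # remove the predicted noise
    return x

sample = denoise([1.0, -2.0, 4.0])  # values shrink toward zero over the steps
```

<p>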
xDiT aims to parallelize the inference process of DiT across multiple GPUs, catering to real-time video generation demands. Notably, xDiT not only enables end-to-end parallel inference for mainstream generative models like Stable Diffusion 3, Flux, CogVideoX, and others but also streamlines acceleration and parallelization of each DiT module through a standardized API interface. For users developing new models, converting torch API calls to xDiT API calls offers a cost-effective path to rapid parallelization of the inference process, effectively reducing latency.</p><p>Project link: <a href="https://github.com/xdit-project/mochi-xdit">https://github.com/xdit-project/mochi-xdit</a></p><h3><strong>Optimizing Memory Consumption for Mochi</strong></h3><p>To address significant memory consumption issues within the Mochi model, we undertook a detailed analysis and implemented targeted solutions:</p><p>1. We optimized the storage of parameters by introducing variable precision types, allowing users to select the appropriate precision level based on GPU specifications.</p><p>2. By refining the VAE decoder to decode videos in smaller patches based on absolute positions, we mitigated the memory overhead of decoding entire videos while ensuring accurate generation across various resolutions.</p><p>3. 
Through code enhancements, we minimized unnecessary GPU cache variables, further optimizing memory consumption.</p><p>These optimizations have notably reduced the peak memory requirement for single-card Mochi inference to approximately 40 GB, enabling seamless inference on a single L40 card.</p><h3><strong>Parallelization Methods within xDiT</strong></h3><p>xDiT stands at the forefront of parallel computing, offering a versatile array of parallelization methods such as sequence parallelism, pipeline parallelism, data parallelism, and CFG parallelism, alongside multi-dimensional mixed parallelization.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/936/1*MjRESwsqT4UqEYAPM-Z-xA.png" /><figcaption>xDiT Framework</figcaption></figure><p>In the context of optimizing Mochi, we have harnessed the power of sequence parallelism and CFG parallelism to enhance inference efficiency.</p><p>xDiT not only integrates popular sequence parallelism techniques like DeepSpeed-Ulysses and Ring Attention but also introduces the innovative USP method. This unique approach allows a customizable blend of these methods at varying degrees of parallelism, catering to a diverse range of GPU configurations. By segmenting input sequences across different GPUs, USP effectively reduces the computational burden on each GPU’s attention layer as the number of GPUs scales, and ensures precision through inter-GPU communication.</p><p>Mochi also leverages Classifier-Free Guidance (CFG) to elevate video quality. By feeding the user prompt and blank text separately into the video model network and merging them at each iteration’s conclusion, CFG parallelism accelerates video generation by concurrently running these processes on two GPUs.</p><p>The communication volume of a single GPU per iteration in DeepSpeed-Ulysses, Ring Attention, and CFG parallelism is (4/#GPUs)O(N×hs)L, 2O(N×hs)L, and 2O(N×hs), respectively, where N is the sequence length, hs the hidden dimension, and L the number of network layers. 
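</p><p>To make these formulas concrete, they can be evaluated with a small calculator. This is an illustrative sketch only: the constants come directly from the expressions above, the shapes in the example are made-up values rather than measured Mochi dimensions, and big-O factors and data-type width are ignored.</p>

```python
# Per-GPU communication volume (in elements) for the three parallel methods,
# following the formulas quoted in the text: N x hs is the sequence-by-hidden
# size, L the number of layers, n_gpus the parallel degree.
def comm_volume(method, n_gpus, seq_len, hidden, layers):
    base = seq_len * hidden  # N x hs
    if method == "ulysses":
        return (4 / n_gpus) * base * layers
    if method == "ring":
        return 2 * base * layers
    if method == "cfg":
        return 2 * base  # independent of the layer count
    raise ValueError(f"unknown method: {method}")

# Illustrative comparison at degree 4 (example shapes, not measured values)
for m in ("ulysses", "ring", "cfg"):
    print(m, comm_volume(m, n_gpus=4, seq_len=17550, hidden=3072, layers=48))
```

<p>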
We show the standard layout of eight L40 cards on a single machine below, containing both high-bandwidth PCIe and low-bandwidth QPI/UPI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/1*JdF4cFduwwgKOUxkauEVIg.png" /><figcaption>8xL40 Architecture</figcaption></figure><p>xDiT’s multi-dimensional parallelism strategy tailors the optimal parallel approach to diverse hardware setups, for instance, performing USP parallelism within GPUs 0–3 and GPUs 4–7 while deploying CFG parallelism between the 0/4, 1/5, 2/6, and 3/7 GPU pairs.</p><h3><strong>Mochi Parallelization</strong></h3><p>Before applying xDiT to parallelize Mochi video generation, initialization of xDiT is required, determining the number of GPUs involved, the rank of each GPU card, and the degree of parallelism for the various parallelization methods. Since each GPU card in Mochi’s inference process is bound to a process, managed by MultiGPUContext, the following code is added to the __init__() function of this class:</p><pre>init_distributed_environment(rank=cp_rank, world_size=cp_size)<br>initialize_model_parallel(<br>    sequence_parallel_degree=ulysses_degree * ring_degree,<br>    ring_degree=ring_degree,<br>    ulysses_degree=ulysses_degree,<br>    classifier_free_guidance_degree=2,<br>)</pre><p>The init_distributed_environment function is used to determine the number of GPUs and the rank of each GPU. The initialize_model_parallel function establishes the degree of parallelism for the various strategies in multi-dimensional mixed parallelization. After calling this function, xDiT internally initializes all parallel groups and determines the rank of each GPU card within its corresponding group.</p><p>To achieve sequence parallelism, USP requires splitting the sequence before it enters the Transformer Block. By utilizing xDiT’s get_sequence_parallel_world_size and get_sequence_parallel_rank functions, each GPU card queries the size of its USP parallel group and its own rank. 
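</p><p>To illustrate how such hybrid degrees partition the GPUs, the sketch below maps a global rank to a CFG branch and a position inside a USP group. The layout chosen here (CFG as the outer dimension) is an assumption for illustration; xDiT’s actual group ordering may differ.</p>

```python
# Hypothetical rank layout for sequence parallelism (degree 4) combined with
# CFG parallelism (degree 2) on 8 GPUs: ranks 0-3 form one USP group serving
# one CFG branch, ranks 4-7 the other.
def group_assignment(global_rank, sp_degree, cfg_degree):
    assert 0 <= global_rank < sp_degree * cfg_degree
    cfg_rank = global_rank // sp_degree  # which CFG branch this GPU computes
    sp_rank = global_rank % sp_degree    # position inside its USP group
    return cfg_rank, sp_rank

for rank in range(8):
    print(rank, group_assignment(rank, sp_degree=4, cfg_degree=2))
```

<p>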
With this information, sequence splitting can be implemented by adding the following code.</p><pre>M = N // get_sequence_parallel_world_size()  # M is the sequence length per GPU after partition<br>x = x.narrow(1, get_sequence_parallel_rank() * M, M)  # each GPU gets a patch of x</pre><p>Subsequently, the F.scaled_dot_product_attention function used for attention computation is replaced with xDiT’s xFuserLongContextAttention function to perform parallelized attention operations. All communication operations are encapsulated within the function, allowing users to leverage xDiT’s parallel capabilities without needing to understand the specific parallel implementation and communication details.</p><p>For CFG computation, the original sequential Mochi inference executes two inference processes and combines their network outputs at the end. Here, cond_text represents the user prompt, and cond_null denotes empty text.</p><pre>out_cond = dit(z, sigma, **cond_text)<br>out_uncond = dit(z, sigma, **cond_null)<br>return out_uncond + cfg_scale * (out_cond - out_uncond)</pre><p>Similarly, each GPU card can determine the CFG parallelism count and its rank within the CFG parallel group by utilizing the return values of xDiT’s get_classifier_free_guidance_world_size and get_classifier_free_guidance_rank functions. Additionally, get_cfg_group() can be used to directly obtain the CFG parallel group to facilitate communication operations. 
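</p><p>As a sanity check of this scheme, the toy single-process simulation below stubs out dit() and verifies that running the two CFG branches on separate “ranks” and gathering the results reproduces the serial merge. Everything except the merge formula from the text is a made-up stand-in.</p>

```python
# Verify serial CFG == "parallel" CFG on a toy example. dit() is a stub for
# the denoising network; the merge formula matches the one in the text.
def dit(z, cond):
    # hypothetical stand-in for the real network
    return [v * (2.0 if cond else 0.5) for v in z]

def serial_cfg(z, cfg_scale):
    out_cond = dit(z, True)
    out_uncond = dit(z, False)
    return [u + cfg_scale * (c - u) for c, u in zip(out_cond, out_uncond)]

def parallel_cfg(z, cfg_scale):
    # each "rank" computes one branch; the all_gather is emulated locally
    gathered = [dit(z, rank == 0) for rank in range(2)]
    out_cond, out_uncond = gathered
    return [u + cfg_scale * (c - u) for c, u in zip(out_cond, out_uncond)]

z = [0.1, -0.4, 0.25]
assert serial_cfg(z, cfg_scale=4.5) == parallel_cfg(z, cfg_scale=4.5)
```

<p>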
By modifying the sequential CFG process as described, the CFG parallel computation using xDiT’s capabilities can be achieved.</p><pre>if get_classifier_free_guidance_rank() == 0:<br>    out = dit(z, sigma, **cond_text)<br>else:<br>    out = dit(z, sigma, **cond_null)<br>out_cond, out_uncond = get_cfg_group().all_gather(<br>    out, separate_tensors=True<br>)</pre><h3><strong>Other Optimizations</strong></h3><p>As the original Mochi inference cannot fully load the model onto a single GPU card, FSDP (Fully Sharded Data Parallelism) is employed to partition the model. FSDP, a method provided by PyTorch, sacrifices time for space by storing only part of the model on each GPU card and communicating during inference to compute accurately. Given that inference memory requirements have been reduced to fit within the memory of a single card, FSDP has been disabled to reduce communication overhead during inference, resulting in improved performance.</p><h3><strong>Performance</strong></h3><p>Comparisons were made between the baseline Mochi inference (referred to as Baseline), which only supports the parallel capabilities of DeepSpeed-Ulysses, and a hybrid parallel version after the xDiT transformation, regarding the inference latency when generating an 848x480 px, 49-frame video.</p><p>As depicted in the following figure:</p><ul><li>On a single card, Baseline performance matches that of xDiT.</li><li>Because it disables FSDP, xDiT outperforms Baseline even when both leverage a parallelism degree of 2 using Ulysses parallelism.</li><li>xDiT’s broader range of supported parallel strategies allows superior acceleration using CFG parallelism on two cards compared to Ulysses and Ring.</li><li>In a three-card scenario, due to increased communication from sequence parallelism and FSDP, Baseline’s three-card acceleration is inferior to that of two cards, while xDiT achieves lower inference latency on three cards than on two.</li><li>Baseline cannot utilize a parallelism 
degree of 4 with Ulysses to generate videos because the video sequence length at Mochi’s default resolution is not divisible by 4. Conversely, xDiT supports various mixed parallelization methods, enabling operation on four cards for further acceleration.</li><li>With larger communication loads and lower QPI/UPI bandwidth, the six-card parallel Baseline performs worse than a single card. However, xDiT’s ability to deploy mixed parallelization methods based on bandwidth differences (conducting CFG parallel communication on QPI/UPI and sequence parallel communication on PCIe) allows a 3.51x acceleration compared to a single card and 3.88x faster inference than the six-card Baseline.</li><li>Overall, Baseline exhibits its shortest inference latency on two cards at 319 seconds, while xDiT achieves its shortest inference latency on six cards at 125 seconds. After the xDiT transformation, Mochi is 2.55x faster than Baseline.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/936/1*k7Y7EXnoX_HqFmBDIt-yag.png" /></figure><h3><strong>Conclusion</strong></h3><p>Mochi stands as a top-tier video generation model, yet the official inference process is cost-intensive. As a multi-card parallel inference framework for DiT models, xDiT integrates acceleration and parallelization of DiT operations into a standardized API interface. Once new models like Mochi are released, users can swiftly parallelize model inference by calling xDiT’s encapsulated interfaces, unleashing multi-card computing power to reduce inference latency. This innovative framework provides a feasible solution for the rapid deployment of high-quality video generation models, offering users a more convenient and efficient inference experience.</p><h3>Contact Us!</h3><p>xDiT is committed to providing real-time inference solutions for DiT diffusion models in the field of image and video generation. 
If you are interested in this project or would like to collaborate, please contact us via email at xditproject@tencent.com.</p>]]></content:encoded>
        </item>
        <item>
<title><![CDATA[Optimizing Mochi with xDiT: Single-GPU L40 Inference and a 3.54x Multi-GPU Speedup!]]></title>
            <link>https://medium.com/@xditproject/%E5%88%A9%E7%94%A8xdit%E4%BC%98%E5%8C%96mochi-%E5%8D%95%E5%8D%A1l40%E5%AE%8C%E6%88%90%E6%8E%A8%E7%90%86-%E5%A4%9A%E5%8D%A1%E6%80%A7%E8%83%BD%E6%8F%90%E5%8D%873-54%E5%80%8D-fe6068e9a1c1?source=rss-5751c991f249------2</link>
            <guid isPermaLink="false">https://medium.com/p/fe6068e9a1c1</guid>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[parallel-computing]]></category>
            <category><![CDATA[aigc]]></category>
            <category><![CDATA[text-to-video]]></category>
            <category><![CDATA[video-generation]]></category>
            <dc:creator><![CDATA[xdit-project]]></dc:creator>
            <pubDate>Mon, 11 Nov 2024 07:15:25 GMT</pubDate>
            <atom:updated>2024-11-12T05:12:05.037Z</atom:updated>
<content:encoded><![CDATA[<h3>Optimizing Mochi with xDiT: Single-GPU L40 Inference and a 2.55x Multi-GPU Speedup!</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/936/1*-wFI91znns6p1sDbLmhupQ.png" /><figcaption>Mochi 1 preview: <a href="https://github.com/genmoai/models">https://github.com/genmoai/models</a></figcaption></figure><p>Text-to-video has been one of the hottest topics in AIGC in 2024. Since Sora’s world-famous stroll through the streets of Tokyo at the beginning of the year, the industry has produced many excellent products such as Vidu, Kling, and PixelDance, while open-source models such as Latte and CogVideoX have also attracted wide attention and become research hotspots. Just as people were marveling at CogVideoX-5B’s large parameter count and excellent video quality, the Genmo team, an industry disruptor, open-sourced Mochi overnight, currently the most advanced text-to-video model, reshaping the direction of the whole field. As the open-source model with the most parameters (10B), Mochi shows a clear improvement in prompt understanding, reproduces the details of physical laws more faithfully, and renders motion more smoothly.</p><p>However, as the model’s authors note, Mochi currently needs four H100 GPUs to run inference for a single video. This high inference cost not only deters individual users but also challenges enterprise deployment. We deeply optimized Mochi’s memory consumption and used xDiT to give Mochi multi-dimensional parallel inference capabilities. Our redesigned inference pipeline runs on a single L40 card without loss of video quality, and through parallel inference we reduced Mochi’s inference latency by 3.54x.</p><h3>The xDiT Framework</h3><p>DiT (Diffusion Transformer) is the core network architecture of Mochi’s text-to-video model: it uses a Transformer to predict the noise contained in a video and, through an iterative denoising process, gradually turns meaningless input into a high-quality video. Our work xDiT aims to parallelize DiT inference across multiple GPUs to meet the demands of real-time video generation. xDiT not only supports end-to-end parallel inference for mainstream generative models (such as Stable Diffusion 3, Flux, CogVideoX, Latte, Pixart, HunyuanDiT, and others), but also consolidates the acceleration and parallelization of each DiT module behind a standardized API. For a new model, users can convert their original torch API calls into xDiT API calls at low cost, quickly parallelizing the inference process and cutting latency.</p><p>This post describes our optimization of Mochi inference and how the generation process is parallelized through the xDiT API.</p><p>Project link: <a href="https://github.com/xdit-project/mochi-xdit">https://github.com/xdit-project/mochi-xdit</a></p><h3>Optimization 1: Reducing Memory Consumption</h3><p>The Mochi model has 10B parameters, the same order of magnitude as existing models, so single-GPU inference is theoretically feasible. Analyzing the Mochi code, we found three issues that significantly increase memory consumption:</p><ol><li>All parameters of the text encoder, transformer, and VAE decoder are stored on the GPU as 32-bit floats, whereas mainstream inference today uses 16-bit floats.</li><li>The VAE decoder decodes the whole video at once by default; because the video is high-resolution and long, the intermediate results of decoding are enormous.</li><li>The code keeps unnecessary cached variables on the GPU.</li></ol><p>We addressed each of these issues:</p><ol><li>We made the precision type a configurable parameter and added it to the model’s class members, so users can choose a suitable precision for their GPU type and accuracy requirements.</li><li>The original code offers a tiled VAE decoding path, but not as the default, because it splits the video into tiles by relative position. Since relative positions in pixel space and latent space no longer match after rounding, the size of the generated video often disagrees with the input, and an odd number of output pixels can even make ffmpeg fail. We improved tiled VAE decoding to split the video into smaller patches by absolute position, reducing memory requirements while guaranteeing correct generation at all resolutions.</li><li>Through code optimization, we removed unnecessary GPU cache overhead.</li></ol><p>After these optimizations, the peak memory requirement of single-GPU Mochi inference drops to roughly 40 GB, so inference completes on a single L40.</p><h3>Parallelism in xDiT</h3><p>xDiT currently supports several parallelization methods, including sequence parallelism, pipeline parallelism, data parallelism, and CFG parallelism, as well as multi-dimensional hybrid combinations of them. For Mochi, we adopt sequence parallelism and CFG parallelism to optimize inference performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/936/1*MjRESwsqT4UqEYAPM-Z-xA.png" /><figcaption>The xDiT Framework</figcaption></figure><p>The mainstream sequence-parallel methods today are DeepSpeed-Ulysses and Ring Attention. xDiT implements both and additionally proposes USP, which can mix the two at arbitrary degrees of parallelism to support a wider range of GPU configurations. USP splits the input sequence and distributes it across GPUs, so the attention-layer computation on each GPU card shrinks as the number of GPUs grows. Because attention involves the full sequence, USP uses inter-GPU communication to keep the results exact.</p><p>Mochi uses Classifier-Free Guidance (CFG) to improve the quality of generated videos: the user prompt and an empty prompt are fed into the video network separately, and the two outputs are fused at the end of each iteration. CFG parallelism deploys these two inference passes on two GPUs simultaneously to obtain a speedup.</p><p>In DeepSpeed-Ulysses, Ring Attention, and CFG parallelism, the communication volume per GPU card per iteration is (4/#GPUs)O(N×hs)L, 2O(N×hs)L, and 2O(N×hs), respectively, where N is the sequence length, hs the hidden dimension, and L the number of network layers. The figure below shows a typical single-machine layout of eight L40 cards.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/1*JdF4cFduwwgKOUxkauEVIg.png" /><figcaption>8x L40 Architecture</figcaption></figure><p>Here PCIe offers high communication bandwidth while QPI/UPI bandwidth is low. xDiT’s multi-dimensional parallelism can devise the best parallel strategy for different hardware configurations. For example, we can treat GPUs 0–3 as one group and GPUs 4–7 as another and run USP within each group, while running CFG parallelism between the GPU pairs 0/4, 1/5, 2/6, and 3/7.</p><h3>Optimization 2: Parallelizing Mochi Video Generation with xDiT</h3><p>Before using xDiT to parallelize Mochi video generation, xDiT must be initialized to determine the number of participating GPUs, the rank of each GPU card, and the degree of parallelism of each parallelization method. During Mochi inference each GPU card is bound to one process, and each process is managed by MultiGPUContext, so we add the following code to that class’s __init__() function:</p><pre>init_distributed_environment(rank=cp_rank, world_size=cp_size)<br>initialize_model_parallel(<br>    sequence_parallel_degree=ulysses_degree * ring_degree,<br>    ring_degree=ring_degree,<br>    ulysses_degree=ulysses_degree,<br>    classifier_free_guidance_degree=2,<br>)</pre><p>Here init_distributed_environment determines the number of GPUs and each GPU’s rank, while initialize_model_parallel sets the degree of parallelism of each strategy in the multi-dimensional hybrid scheme. After this call, xDiT internally initializes all parallel groups and determines each GPU card’s rank within its group.</p><p>To implement sequence parallelism, USP must split the video sequence before it enters the Transformer blocks. Through xDiT’s get_sequence_parallel_world_size and get_sequence_parallel_rank functions, each GPU card queries the size of its USP parallel group and its own rank. With this information, we can add the following code to implement the sequence split.</p><pre>M = N // get_sequence_parallel_world_size()  # M is the sequence length per GPU after partition<br>x = x.narrow(1, get_sequence_parallel_rank() * M, M)  # each GPU gets a patch of x</pre><p>We then directly replace F.scaled_dot_product_attention, the function used for attention computation, with xDiT’s xFuserLongContextAttention to perform the parallel version of the attention operation. All communication operations are wrapped inside the function, so users can exploit xDiT’s parallel capabilities without knowing the specific parallel implementation or communication details.</p><p>For the CFG computation, the original Mochi inference runs the two inference passes serially and combines their network outputs at the end, where cond_text is the user prompt and cond_null is the empty prompt.</p><pre>out_cond = dit(z, sigma, **cond_text)<br>out_uncond = dit(z, sigma, **cond_null)<br>return out_uncond + cfg_scale * (out_cond - out_uncond)</pre><p>Likewise, each GPU card can obtain the CFG parallel degree and its rank within the CFG parallel group from the return values of xDiT’s get_classifier_free_guidance_world_size and get_classifier_free_guidance_rank functions. In addition, get_cfg_group() directly returns the CFG parallel group for communication operations. Rewriting the serial CFG flow above into the following parallel CFG flow is therefore all that is needed to use xDiT’s CFG parallelism.</p><pre>if get_classifier_free_guidance_rank() == 0:<br>    out = dit(z, sigma, **cond_text)<br>else:<br>    out = dit(z, sigma, **cond_null)<br>out_cond, out_uncond = get_cfg_group().all_gather(<br>    out, separate_tensors=True<br>)</pre><h3>Other Optimizations</h3><p>Because the original Mochi inference cannot load the full model onto a single GPU card, it uses FSDP to shard the model across cards. FSDP, provided by torch, trades time for space: each GPU card stores only part of the model and communicates during inference to compute correctly. Since we have already reduced the inference memory footprint to fit on a single card, we disabled FSDP to cut communication overhead during inference and obtain better performance.</p><h3>Performance</h3><p>The original Mochi inference (hereafter Baseline) only supports the parallel capabilities provided by DeepSpeed-Ulysses, while xDiT supports arbitrary mixes of the two sequence-parallel methods and CFG parallelism.</p><p>We compared the inference latency of the Baseline parallel method and the xDiT hybrid parallel version when generating an 848x480 px, 49-frame video.</p><p>As the figure shows:</p><ul><li>On a single card, Baseline and xDiT perform comparably;</li><li>Because xDiT disables FSDP, it outperforms Baseline even when both use Ulysses parallelism with degree 2;</li><li>xDiT supports a wider range of parallel strategies, and on two cards CFG parallelism achieves a higher speedup than Ulysses or Ring;</li><li>On three cards, sequence parallelism and FSDP introduce more communication, so Baseline’s three-card speedup is worse than its two-card speedup, whereas xDiT achieves lower inference latency on three cards than on two;</li><li>The video sequence length at Mochi’s default resolution is not divisible by 4, so Baseline cannot generate videos with Ulysses at degree 4, while xDiT supports hybrid combinations of parallel methods and can therefore run on four cards for further speedup;</li><li>With heavier communication and the low bandwidth of QPI/UPI, six-card parallel Baseline performs worse than a single card; however, because xDiT can map its hybrid parallelism onto the bandwidth differences (CFG-parallel communication over QPI/UPI, sequence-parallel communication over PCIe), six-card parallelism achieves a 3.51x speedup over a single card, 3.88x faster than the six-card Baseline.</li><li>Overall, Baseline’s shortest parallel inference latency is 319 seconds on two cards, and xDiT’s shortest is 125 seconds on six cards. After the xDiT transformation, Mochi is 2.55x faster than Baseline.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WHpIZfjWYJ1VfdDMicFzuA.png" /></figure><h3>Conclusion</h3><p>Mochi is currently the highest-quality video generation model, but the official inference pipeline is extremely expensive. As a multi-GPU parallel inference framework for DiT models, xDiT consolidates the acceleration and parallelization of DiT modules behind a standardized API. As soon as a new model like Mochi is released, users can call xDiT’s encapsulated interfaces to quickly parallelize its inference and unlock multi-GPU compute to cut inference latency. This framework makes rapid deployment of high-quality video generation models feasible and gives users a more convenient and efficient inference experience.</p><h3>Contact Us!</h3><p>xDiT is committed to providing real-time inference solutions for DiT diffusion models in image and video generation. If you are interested in this project or would like to collaborate, please contact us via email at xditproject@tencent.com.</p>]]></content:encoded>
        </item>
        <item>
<title><![CDATA[Running the CogVideoX Text-to-Video Pipeline in Parallel Across Multiple GPUs with xDiT]]></title>
            <link>https://medium.com/@xditproject/aigc%E6%8E%A8%E7%90%86%E5%8A%A0%E9%80%9F-%E5%88%A9%E7%94%A8xdit%E5%B9%B6%E8%A1%8Ccogvideox%E6%96%87%E7%94%9F%E8%A7%86%E9%A2%91%E6%B5%81%E7%A8%8B-86255f9979a9?source=rss-5751c991f249------2</link>
            <guid isPermaLink="false">https://medium.com/p/86255f9979a9</guid>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[dit]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[aigc]]></category>
            <category><![CDATA[text-to-video]]></category>
            <dc:creator><![CDATA[xdit-project]]></dc:creator>
            <pubDate>Mon, 14 Oct 2024 09:31:20 GMT</pubDate>
            <atom:updated>2024-11-12T05:09:48.550Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UbAxaWqwsTaoWD6nJupKpA.gif" /></figure><p>The recent releases of <strong>PixelDance</strong> and <strong>Meta Movie Gen</strong> have drawn wide attention, putting text-to-video applications in the spotlight. The impressive video quality of these tools has led many commentators to marvel at how fast the technology is advancing. Like the recently released ViDu, Kling, and Tongyi Wanxiang text-to-image and text-to-video tools, they have not only seen enthusiastic adoption by users but also achieved commercial success, and we can expect more new models to join this race. <strong>CogVideoX</strong>, currently the largest open-source pre-trained model in the text-to-video field, is attracting more and more attention.</p><p>English Version: <a href="https://medium.com/@xditproject/boosting-aigc-inference-leveraging-xdit-to-parallelize-the-cogvideox-text-to-video-workflow-8128e45b36e9">here</a></p><h3>1. Introduction</h3><p><strong>DiT (Diffusion Transformer)</strong> is the core network architecture of many text-to-image and text-to-video models, including CogVideoX. Thanks to its excellent generation quality, DiT has gradually replaced U-Net in recent years to become the standard model for content generation. DiT is a diffusion model: starting from an initial image or video (usually Gaussian noise), a Transformer network predicts the noise and removes it, producing an image or video after many iterations. However, large parameter counts, large video sizes, and the demand for longer videos make high inference latency an unavoidable problem for generative models such as CogVideoX, limiting their use in real-time scenarios.</p><p>Our recent work <strong>xDiT</strong> aims to <strong>parallelize DiT inference across multiple GPUs</strong>, shortening inference latency to meet real-time video generation needs. xDiT already supports many models, including <strong>Stable Diffusion 3, Flux, Pixart, HunyuanDiT, and Latte</strong>, making it the framework of choice for high-performance video and image generation. It not only efficiently implements general parallel strategies for DiT computation but also develops dedicated parallel schemes for each model’s network architecture.</p><h3>2. Understanding CogVideoX</h3><p>The Zhipu open platform recently released the <strong>CogVideoX</strong> video generation model, which adopts an innovative Transformer architecture that fuses the text, temporal, and spatial dimensions simultaneously, aligning and exchanging information across modalities. When generating a video with CogVideoX, the inputs are a video and a text prompt. The video is first encoded from pixel space into latent space by a VAE module and returns to pixel space after the model’s processing; the text prompt is mapped into a continuous space so it can be processed together with the video data. In latent space, the CogVideoX Transformer predicts and removes the noise in the video, and the denoised video is fed back into the Transformer for further iterations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BuqkFAtthO6hv_cw7Cbfuw.png" /><figcaption>CogVideoX inference pipeline</figcaption></figure><p>The detailed flow of the CogVideoX Transformer is shown in the figure. A standard Transformer can only process one-dimensional data, while video data has three dimensions: frames, height, and width. CogVideoX therefore first extracts features from the video with a convolutional layer (Conv) to capture its spatial information, then flattens the convolution output into a one-dimensional video sequence to fit the Transformer’s input requirements. Finally, positional embeddings are added to preserve each sequence element’s position in the original video.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*agahuMeuwfQSXyHp5DmgtA.png" /><figcaption>CogVideoX Transformer architecture</figcaption></figure><p>On the right of the figure, a CogVideoX block receives the text and video sequences together and fuses them with an attention mechanism, where the symbol ⨁ denotes addition and ⨂ denotes multiplication. The normalized text and video sequences are first merged for the attention computation; the attention activations are then split apart again and passed through LayerNorm and FeedForward layers. The timestep passes through Linear layers to produce separate scale, shift, and gate values for the text and video sequences: scale and shift feed the input LayerNorm, while gate is multiplied with the attention activations. After 30 identical CogVideoX blocks, the video sequence goes through LayerNorm and a Linear-and-Reshape step to yield the predicted noise.</p><p>Because successive DiT iterations are data-dependent, we currently focus on exploiting parallelism within a single iteration to improve the efficiency of each round of noise prediction. Based on the xDiT framework, we parallelized CogVideoX’s generation process, implementing two inference optimizations: sequence parallelism and CFG parallelism.</p><h3>3. Sequence Parallelism</h3><p>Sequence parallelism uses multiple GPUs, each holding the full model parameters, but scatters the video sequence into shards; each shard is sent to a GPU for noise prediction, and the shards are finally gathered back into the complete video, as shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nek8G8dXs2k4lLQnSMYXew.png" /><figcaption>CogVideoX inference pipeline (sequence-parallel version)</figcaption></figure><p>When the input changes from a full video to a video shard, some components of the CogVideoX Transformer, such as the convolutional and attention layers, need parallel adaptations because they operate across multiple sequence elements. In contrast, the linear, normalization, and FeedForward layers perform only element-wise operations and can be applied in the parallel setting directly.</p><p>The xDiT framework has accumulated best practices for parallelizing common network modules. For the convolution module, for example, xDiT provides a multi-GPU parallel convolution whose inputs and outputs stay aligned with the original module, handling the communication details internally and using a caching strategy to reduce the overhead of memory reallocation. Users can therefore directly replace the original network’s convolution layers with xDiT’s parallel version and obtain efficient parallel convolution at minimal cost, without needing to understand inter-GPU communication.</p><p>As demand for long-sequence generation grows, sequence parallelism for the attention mechanism also becomes crucial. Existing parallel methods include Ulysses Attention and Ring Attention: the former exchanges data with All2All primitives before the computation, while the latter transfers video shards during the computation via peer-to-peer communication. Both attention parallelization strategies are integrated into xDiT. xDiT also implements USP (Unified Sequence Parallelism), which combines the two strategies and supports flexible combinations of parallel modes.</p><p>However, these methods perform attention over a single input sequence, whereas CogVideoX’s full attention operates on the combination of the text and video sequences, which poses a new challenge for parallelization. Analyzing the weight dimensions of each CogVideoX layer, we found that the text sequence length is fixed at 226, while the video sequence length depends on the number of generated frames and is usually very long (for example, 17550). Given this huge length difference, we assign each GPU one video shard but store the entire text sequence on every GPU. In this way we can fully reuse xDiT’s video-splitting logic and only slightly adjust the existing parallel attention operations to support full attention between text and video: during transfers, since every GPU stores the full text sequence, the text portion is omitted from the All2All and peer-to-peer transfers; during computation, the GPU that handles the front portion of the video sequence also handles the attention over the text.</p><h3>4. CFG Parallelism</h3><p>CogVideoX also uses Classifier-Free Guidance (CFG) to improve the quality and detail of the generated content.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tjdIszI4vlVBjxvQNq8dvQ.png" /><figcaption>CogVideoX inference pipeline (sequence-parallel + CFG-parallel version)</figcaption></figure><p>In each iteration, the model not only feeds the video and text into the CogVideoX Transformer but also generates an empty prompt that is passed, together with the same video, to a second copy of the Transformer. At the end of each iteration, CogVideoX fuses the outputs of the two Transformers to compute the noise. CFG parallelism requires no further splitting of the video sequence and has low communication overhead, so it is relatively simple to implement. In the xDiT framework, we deploy the network for each video shard on two GPUs. In each iteration the two GPUs compute independently and then exchange their outputs to obtain the final noise. This design lets CFG parallelism exploit multi-GPU resources effectively and improve generation efficiency without adding complexity.</p><h3>5. Performance Analysis</h3><p>Both parallel strategies above, sequence parallelism and Classifier-Free Guidance (CFG) parallelism, have been integrated into the xDiT project. On a system equipped with L40 (PCIe) GPUs, we compared single-GPU CogVideoX inference using the diffusers library against our parallel version when generating 720x480 videos.</p><p>As the figure shows, on the base CogVideoX-2b model, inference latency is reduced whether Ulysses Attention, Ring Attention, or CFG parallelism is used. CFG parallelism, with its lower communication volume, performs better than the other two techniques. Combining sequence parallelism with CFG parallelism improves inference efficiency further, and latency keeps dropping as the degree of parallelism increases. In the best configuration, xDiT achieves a 3.53x speedup over single-GPU inference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EnYBnYlYl6cA_BIFvnDCHg.png" /></figure><p>The more complex CogVideoX-5b model has more parameters to improve video quality and visual fidelity, at a markedly higher computational cost. Even so, all methods maintain performance trends on this model similar to those on CogVideoX-2b, and the speedups of the parallel versions improve further: compared with the single-GPU version, xDiT delivers up to a 3.91x inference speedup.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iplgJyHNt8GGLk0R0kfM0g.png" /></figure><h3>6. Conclusion</h3><p>CogVideoX’s excellent video generation has earned broad industry acclaim, further demonstrating the great potential of DiT models in text-to-image and text-to-video generation. As a multi-GPU parallel inference framework for DiT models, xDiT successfully adapts and optimizes existing image and video generation pipelines while parallelizing many model components. As new models are released, they can easily be integrated into xDiT to achieve significant performance gains.</p><h3>Contact Us!</h3><p>xDiT is committed to providing real-time inference solutions for DiT diffusion models in image and video generation. If you are interested in this project or would like to collaborate, please contact us via email at <a href="mailto:jiaruifang@tencent.com">jiaruifang@tencent.com</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Leveraging xDiT to Parallelize the Open-Sourced Video Generation Model CogVideoX]]></title>
            <link>https://medium.com/@xditproject/boosting-aigc-inference-leveraging-xdit-to-parallelize-the-cogvideox-text-to-video-workflow-8128e45b36e9?source=rss-5751c991f249------2</link>
            <guid isPermaLink="false">https://medium.com/p/8128e45b36e9</guid>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[aigc]]></category>
            <category><![CDATA[dit]]></category>
            <category><![CDATA[text-to-video]]></category>
            <dc:creator><![CDATA[xdit-project]]></dc:creator>
            <pubDate>Mon, 14 Oct 2024 09:21:54 GMT</pubDate>
            <atom:updated>2024-11-12T05:08:21.499Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UbAxaWqwsTaoWD6nJupKpA.gif" /></figure><p>In recent months, text-to-video applications have surged in popularity with the release of tools like <strong>Meta Movie Gen </strong>and <strong>OpenAI Sora</strong>. These tools have made waves with their impressive video generation capabilities, leading many to believe we’re entering a new era of digital content creation. Among the most exciting developments is <strong>CogVideoX</strong>: the largest open-source pre-trained model for text-to-video generation. It’s gaining considerable attention for its ability to transform text prompts into high-quality video content.</p><p><a href="https://medium.com/@xditproject/aigc推理加速-利用xdit并行cogvideox文生视频流程-86255f9979a9">Chinese Version</a></p><h3>Introduction</h3><p>At the heart of CogVideoX is <strong>DiT (Diffusion Transformer)</strong>, a cutting-edge neural network architecture also used in many text-to-image models. DiT has largely replaced older models like U-Net due to its ability to generate high-quality media content. DiT operates by transforming noisy initial inputs (often Gaussian noise) into coherent images or videos, iterating through multiple stages of refinement.</p><p>However, as powerful as DiT is, it faces a major challenge: <strong>inference latency</strong>. The large number of parameters, the size of video data, and the increased demand for generating longer videos result in slow inference times, hindering real-time applications.</p><p>This is where our work <strong>xDiT</strong> comes in. We’ve developed a framework to parallelize DiT’s inference process across multiple GPUs, aiming to reduce latency and enable <strong>real-time video generation</strong>. 
xDiT supports a wide range of models, including <strong>Stable Diffusion 3</strong>, <strong>Flux</strong>, <strong>Pixart</strong>, <strong>HunyuanDiT</strong>, and <strong>Latte</strong>, making it a versatile and high-performance option for both video and image generation. It not only implements general parallel strategies for DiT models but also provides specialized solutions for unique network architectures.</p><p><a href="https://github.com/xdit-project/xDiT">GitHub - xdit-project/xDiT: xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters</a></p><h3><strong>Understanding CogVideoX</strong></h3><p><strong>CogVideoX</strong> is an innovative video generation model developed by the Zhipu Open Platform. It integrates data along three dimensions: <strong>text</strong>, <strong>time</strong>, and <strong>space</strong>. This allows it to align and interact with different types of information in a sophisticated manner.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BuqkFAtthO6hv_cw7Cbfuw.png" /><figcaption>CogVideoX Text-to-Video Workflow</figcaption></figure><p>Here’s how it works:</p><ul><li><strong>Text prompts</strong> and <strong>video data</strong> are the key inputs.</li><li>The video is first encoded from pixel space into a latent space through a <strong>VAE module</strong>, which compresses the data.</li><li>Simultaneously, the text is mapped into continuous space to be processed alongside the video.</li><li>In the latent space, the <strong>CogVideoX Transformer (shown below) </strong>predicts and removes noise from the video. 
This denoising step is repeated over multiple iterations to refine the output.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*agahuMeuwfQSXyHp5DmgtA.png" /><figcaption>CogVideoX Transformer</figcaption></figure><p>A key challenge in this process is that videos are inherently three-dimensional (frames, height, and width), while Transformers typically operate on one-dimensional data. To address this, CogVideoX first uses <strong>convolutional layers</strong> to extract spatial features from the video. The video data is then <strong>flattened</strong> into a one-dimensional sequence, which the Transformer can process. Throughout this process, <strong>positional embedding</strong> is applied to retain important spatial and temporal information.</p><p>In each processing block of CogVideoX, both text and video sequences are merged and processed using an <strong>attention mechanism</strong>. This mechanism allows the model to “focus” on important elements of the inputs, making it possible to generate coherent and contextually accurate video content from text prompts.</p><h3><strong>Sequence Parallelism in xDiT</strong></h3><p><strong>Sequence parallelism</strong> is one of the core methods we use to accelerate video generation in CogVideoX, as shown in the figure below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Nek8G8dXs2k4lLQnSMYXew.png" /><figcaption>CogVideoX Text-to-Video Workflow with Sequence Parallelism</figcaption></figure><p>Here’s how it works:</p><ul><li>We use multiple GPUs, with each GPU holding the full model parameters. 
However, instead of processing the entire video on one GPU, we <strong>split the video sequence into smaller shards</strong>.</li><li>Each GPU processes one shard of the video, predicting the noise that needs to be removed.</li><li>Once each GPU finishes its task, the video fragments are combined to create the full video.</li></ul><p>Some layers of the model, such as <strong>convolutional</strong> and <strong>attention</strong> layers, require special handling because they operate across multiple elements in the sequence. Other layers, like <strong>linear</strong> and <strong>normalization</strong> layers, can be applied independently to each element and parallelized without issue. The xDiT framework has been optimized to handle these challenges.</p><ul><li>For <strong>convolution</strong>, we’ve designed <strong>multi-GPU parallel convolution operations</strong>, which ensure that input and output data stay consistent even when processed across different GPUs. This parallelization is done efficiently, minimizing communication overhead between GPUs.</li><li>For <strong>Attention</strong>, xDiT provides several parallel methods, such as <strong>Ulysses Attention</strong> and <strong>Ring Attention</strong>, which handle communication between GPUs and further accelerate processing. We’ve also developed <strong>USP (Unified Sequence Parallelism)</strong>, which combines these strategies for even greater flexibility and performance.</li></ul><p>However, Ulysses Attention and Ring Attention primarily focus on attention computation for a single input sequence, while <strong>CogVideoX’s Full Attention</strong> calculation involves the combination of both <strong>text and video sequences</strong>. This introduces new challenges for parallelization. 
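The scatter-compute-gather pattern described at the start of this section can be sketched in a few lines. This is a single-process, pure-Python illustration with hypothetical helper names; real xDiT shards tensors across GPUs and adds the collective communication (omitted here) that attention and convolution layers need:

```python
# Single-process, pure-Python sketch of sequence parallelism (hypothetical
# helper names; real xDiT shards tensors across GPUs and handles the
# cross-shard communication needed by attention/convolution layers).

def shard_sequence(seq, world_size):
    """Scatter: split the flattened video sequence into contiguous shards."""
    base, rem = divmod(len(seq), world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)  # spread the remainder evenly
        shards.append(seq[start:start + size])
        start += size
    return shards

def elementwise_layer(shard):
    # Linear/normalization-style layers act per element, so each "GPU"
    # can process its own shard with no cross-device communication.
    return [2.0 * x + 1.0 for x in shard]

seq = [float(i) for i in range(10)]               # stand-in latent sequence
shards = shard_sequence(seq, world_size=4)        # scatter
outputs = [elementwise_layer(s) for s in shards]  # compute (parallel on real GPUs)
merged = [x for s in outputs for x in s]          # gather
```

The gathered result matches running the layer on the unsharded sequence, which is the invariant the parallelization must preserve.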
By analyzing the dimensions of weights in the CogVideoX model, we found that the <strong>text sequence length is fixed at 226</strong>, whereas the <strong>video sequence length</strong> depends on the number of frames in the generated video — often very long (e.g., <strong>17,550</strong>). Given this significant length disparity, we assign each GPU a video shard but store the <strong>entire text sequence on all GPUs</strong>.</p><p>This approach allows us to fully reuse <strong>xDiT’s video sharding logic</strong>, requiring only minor adjustments to the existing parallel attention computation to support Full Attention between text and video. During data transmission, since each GPU stores the full text sequence, the <strong>text portion</strong> involved in <strong>All2All</strong> and <strong>Peer-to-Peer</strong> transfers is omitted. During computation, when a GPU processes the front part of the video sequence, it simultaneously computes attention with the text sequence.</p><h3><strong>Classifier-Free Guidance (CFG) Parallelism</strong></h3><p>Another method we use to improve CogVideoX’s performance is <strong>Classifier-Free Guidance (CFG) parallelism</strong>. This technique is crucial for enhancing the quality and detail of the generated videos, as illustrated below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tjdIszI4vlVBjxvQNq8dvQ.png" /><figcaption>CogVideoX Text-to-Video Workflow with Sequence Parallelism and CFG Parallelism</figcaption></figure><p>Here’s how it works:</p><ul><li>In each iteration, the model processes both the text prompt and video data, but it also generates an <strong>empty text prompt</strong>. 
This empty prompt is passed through a second copy of the Transformer alongside the same video data.</li><li>After both versions of the Transformer finish processing, their outputs are <strong>fused together</strong> to compute the final noise for the video.</li></ul><p>Since CFG parallelism doesn’t involve splitting the video sequence further, it has a <strong>low communication overhead</strong>, making it simpler to implement. In xDiT, we deploy the model across two GPUs, each processing one version of the input (normal prompt vs. empty prompt). After each iteration, the GPUs exchange their outputs to compute the final result.</p><p>This method allows CFG parallelism to efficiently leverage multiple GPUs, reducing complexity while improving speed.</p><h3><strong>Performance Analysis</strong></h3><p>We’ve integrated both <strong>sequence parallelism</strong> and <strong>CFG parallelism</strong> into xDiT, and we’ve tested the performance improvements on a system equipped with L40 (PCIe) GPUs. Specifically, we compared the performance of single-GPU CogVideoX inference (using the diffusers library) against our parallelized version when generating videos at a resolution of <strong>720x480</strong>.</p><p>With the <strong>CogVideoX-2b</strong> model, inference latency dropped significantly across all methods — whether using Ulysses Attention, Ring Attention, or CFG parallelism. <strong>CFG parallelism</strong> showed the best performance due to its low communication requirements. When combined with sequence parallelism, the efficiency was improved even further. In the best-case scenario, xDiT achieved a <strong>3.53x speedup</strong> compared to single-GPU inference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EnYBnYlYl6cA_BIFvnDCHg.png" /></figure><p>For more complex models like <strong>CogVideoX-5b</strong>, which have more parameters to improve video quality, the computational cost increases. 
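The fusion step in CFG parallelism described earlier typically follows the standard classifier-free guidance formula; a minimal sketch (the `guidance_scale` value below is illustrative, not a CogVideoX default):

```python
# Standard classifier-free guidance fusion, applied per element. In CFG
# parallelism, each branch runs on its own GPU, and only these two small
# noise predictions are exchanged at the end of every denoising step.

def cfg_fuse(eps_uncond, eps_cond, guidance_scale):
    """Push the conditional prediction away from the unconditional one."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]

uncond = [0.1, 0.2, 0.3]  # noise predicted from the empty text prompt
cond = [0.2, 0.1, 0.6]    # noise predicted from the real text prompt
fused = cfg_fuse(uncond, cond, guidance_scale=6.0)
```

With `guidance_scale=1.0` the fusion reduces to the conditional prediction alone; larger scales push the output further toward the prompt, which is why the empty-prompt branch is worth its extra compute.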
However, we still observed significant performance improvements, with xDiT achieving up to a <strong>3.91x speedup</strong> compared to single-GPU inference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iplgJyHNt8GGLk0R0kfM0g.png" /></figure><h3><strong>Conclusion</strong></h3><p><strong>CogVideoX</strong> has already been widely praised for its ability to generate high-quality videos from text prompts, proving that <strong>Diffusion Transformer (DiT)</strong> models have enormous potential in both text-to-image and text-to-video generation.</p><p>With xDiT, we’ve successfully <strong>parallelized</strong> the inference process across multiple GPUs, making it faster and more efficient. Our framework not only adapts to existing video and image generation models, but it also optimizes their performance by parallelizing key components.</p><p>As new models are developed, they can be easily integrated into xDiT, enabling even greater performance improvements and pushing the boundaries of what’s possible in real-time video generation.</p><p>Please feel free to share your thoughts or ask any questions in the comments. We’re excited to see what the future holds for text-to-video technology and the role xDiT will play in it!</p><h3>Contact Us!</h3><p>xDiT is committed to providing real-time inference solutions for DiT diffusion models in the field of image and video generation. If you are interested in this project or would like to collaborate, please contact us via email at jiaruifang@tencent.com.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8128e45b36e9" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Accelerating the ComfyUI Flux.1 Workflow with xDiT Multi-GPU Parallelism!]]></title>
            <link>https://medium.com/@xditproject/%E4%BD%BF%E7%94%A8xdit%E5%AF%B9comfyui-flux-1%E5%B7%A5%E4%BD%9C%E6%B5%81%E5%A4%9A-gpu-%E5%B9%B6%E8%A1%8C%E5%8A%A0%E9%80%9F-c14ad7982bc0?source=rss-5751c991f249------2</link>
            <guid isPermaLink="false">https://medium.com/p/c14ad7982bc0</guid>
            <category><![CDATA[dit]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[comfy-ui]]></category>
            <category><![CDATA[flux]]></category>
            <category><![CDATA[aigc]]></category>
            <dc:creator><![CDATA[xdit-project]]></dc:creator>
            <pubDate>Sat, 12 Oct 2024 08:30:47 GMT</pubDate>
            <atom:updated>2024-10-14T02:52:09.659Z</atom:updated>
<content:encoded><![CDATA[<h3>Supercharge Your AIGC Experience: Leverage xDiT for Multi-GPU Parallelism in the ComfyUI Flux.1 Workflow!</h3><blockquote>As the AIGC field evolves rapidly, users’ demand for high-quality image generation keeps growing, along with higher expectations for ease of use and cost-effectiveness. Workflow-based image generation, exemplified by ComfyUI, has become the mainstream user platform thanks to continuous community development. However, ComfyUI was designed as a native single-GPU platform; although it is popular among personal PC users, its processing capability is limited by the performance of a single GPU. For large models such as Flux.1-dev, the sheer parameter count and computational load make running locally on a single GPU challenging. With xDiT, ComfyUI workflows can use multi-GPU parallel computation, significantly reducing user-perceived latency and delivering a smooth user experience. We demonstrate an example in which Flux.1-dev <strong>single-image generation time drops from 14s to 7s.</strong></blockquote><p>Read the <a href="https://medium.com/@xditproject/supercharge-your-aigc-experience-leverage-xdit-for-multiple-gpu-parallel-in-comfyui-flux-1-54b34e4bca05">English Version</a></p><p>Image and video generation are key areas for large-model multi-modal applications; they use DiT-based diffusion models to process ultra-long input sequences. However, because the computational complexity of a Transformer grows quadratically with sequence length, inference latency is a particularly acute problem when DiTs are deployed in practice. xDiT is an open-source, high-performance inference framework designed specifically for DiTs. It scales DiT inference in parallel to multiple GPUs and even multi-machine clusters, and, combined with compiler optimizations, helps applications meet their real-time requirements.</p><p><strong>Open-source project: </strong><a href="https://github.com/xdit-project/xDiT"><strong>https://github.com/xdit-project/xDiT</strong></a></p><p>By applying xDiT, we successfully deployed multi-GPU parallel processing in the ComfyUI workflow, achieving the following results:</p><h3>Faster &amp; Quality-Preserving</h3><p>Without compromising image quality, xDiT significantly accelerates the FLUX.1 workflow in ComfyUI by leveraging multi-GPU parallelism. Taking the FLUX.1 dev version as an example, at 20 steps, <strong>single-image generation time drops from 13.87s to 6.98s:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zmrcIhY7VJxx-T0hmgCgSg.gif" /></figure><p>While improving generation speed, we also preserve the quality of the generated images. With the same prompt input, the generated images show no noticeable difference in quality, and both clearly follow the instructions:</p><blockquote><strong><em>Prompt:</em></strong><em> </em>A spacious futuristic classroom with digital plants adorning the walls and floating 3D holographic projections of the solar system hanging from the ceiling. On one side, a large touch-sensitive smartboard reads “XDiT” in neon pen. In the center, there are several transformable smart desks, and students are engaged in interactive learning through augmented reality glasses. In the corner, a small robot is providing personalized learning support to the students.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I1D5d_Luvogbu1gFdiZLUw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IBd8tMjjUSgeIk1X-uM-Jg.png" /><figcaption>Baseline (left): 14.13s vs xDiT (right): 7.01s</figcaption></figure><h3>Simple &amp; Seamless</h3><p>Initially, to bring xDiT’s capabilities into ComfyUI, we implemented two custom core nodes, XfuserPipelineLoader and XfuserSampler, at the pipeline granularity and, backed by an HTTP service, achieved end-to-end generation in ComfyUI using xDiT.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tYmC0wo_4cp9_kbsDa49iQ.png" /></figure><p>However, such a highly customized implementation does not integrate well with the vast ComfyUI community ecosystem, and the end-to-end generation logic largely contradicts ComfyUI’s highly modular design philosophy, which aims for greater flexibility. We wanted to add parallel acceleration to ComfyUI with minimal changes. The core task is to rework the diffusion-model-related parts of the workflow. Taking the standard official FLUX.1-dev workflow as an example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3il2iyQ6ue_CbDoN59_o1g.png" /></figure><p>The diffusion model is loaded by the #12 Load Diffusion Model node, processed and encapsulated by the #30 Model Sampling Flux and #22 BasicGuider nodes, and passed to #13 SamplerCustomAdvanced for the multi-step denoising computation. However, since ComfyUI was originally designed for personal computers with a single GPU, distributing the computation of a single node across multiple GPUs without major changes to the existing workflow remains a significant challenge.</p><p>To solve this, we customized and optimized the core #12 Load Diffusion Model loading node and the #13 SamplerCustomAdvanced compute node. By integrating the distributed computing capabilities of the Ray framework, we only need to substitute XDiTUNetLoader and XDiTSamplerCustomAdvanced for the corresponding nodes to automatically distribute the model’s computation across multiple workers. Thanks to xDiT’s excellent parallel performance, we significantly improved processing speed and computational efficiency without sacrificing the stability and flexibility of the existing workflow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5ONyNmXMZk0XxQu4Tugz9Q.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wuCcgm5SalT3zgbrzvqFCw.png" /></figure><h3>Plugin Support</h3><p>For diffusion models, prompts and base models alone are far from sufficient to reach a production-ready level. To that end, the ComfyUI community supports a range of plugins, including Loras, ControlNet, and IP-Adapter, to better achieve stylization and controllability of generated content.</p><p>To support this, xDiT has integrated the most popular of these, the Loras plugin, into ComfyUI. Multiple Loras can now be loaded easily with a single node called XDiTFluxLoraLoader. Moreover, because the Loras module is modularized, switching between different Loras does not require changing the model’s overall weights internally; only the newly added Lora weights are adjusted. As a result, xDiT supports dynamically switching Loras at runtime, completing the switch almost instantly with no waiting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/816/1*Kxr49rHnxJQASTZCqQoomg.png" /></figure><p>We are continuing to integrate and adapt other commonly used community plugins, such as ControlNet and IP-Adapter, which will be rolled out in subsequent versions.</p><h3>Contact Us!</h3><p>This feature is still in the demonstration and experimental phase. If you are interested in the xDiT parallel version of ComfyUI, we welcome you to contact us via email at jiaruifang@tencent.com.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c14ad7982bc0" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Supercharge Your AIGC Experience: Leverage xDiT for Multiple GPU Parallel in ComfyUI Flux.1]]></title>
            <link>https://medium.com/@xditproject/supercharge-your-aigc-experience-leverage-xdit-for-multiple-gpu-parallel-in-comfyui-flux-1-54b34e4bca05?source=rss-5751c991f249------2</link>
            <guid isPermaLink="false">https://medium.com/p/54b34e4bca05</guid>
            <category><![CDATA[flux]]></category>
            <category><![CDATA[comfy-ui]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[aigc]]></category>
            <category><![CDATA[dit]]></category>
            <dc:creator><![CDATA[xdit-project]]></dc:creator>
            <pubDate>Sat, 12 Oct 2024 08:30:42 GMT</pubDate>
            <atom:updated>2024-10-14T02:52:43.150Z</atom:updated>
<content:encoded><![CDATA[<h3>Supercharge Your AIGC Experience: Leverage xDiT for Multi-GPU Parallelism in the ComfyUI Flux.1 Workflow</h3><blockquote>ComfyUI is the most popular web-based diffusion-model interface, optimized for workflows. Yet its native single-GPU design leaves it struggling with the demands of today’s large DiTs such as Flux.1, resulting in unacceptably high latency for users. Leveraging the power of xDiT, we’ve successfully implemented a multi-GPU parallel processing workflow within ComfyUI, effectively addressing Flux.1’s performance challenges.</blockquote><p>Read the <a href="https://medium.com/@xditproject/%E4%BD%BF%E7%94%A8xdit%E5%AF%B9comfyui-flux-1%E5%B7%A5%E4%BD%9C%E6%B5%81%E5%A4%9A-gpu-%E5%B9%B6%E8%A1%8C%E5%8A%A0%E9%80%9F-c14ad7982bc0">Chinese Version of this article</a></p><p>Image and video generation are key areas for large-model multi-modal applications. They use diffusion models based on DiTs to process ultra-long input sequences. However, since the computational complexity of a Transformer grows quadratically with sequence length, inference latency is a particularly acute problem when DiTs are deployed in practice. xDiT is an open-source, high-performance inference framework designed specifically for DiTs. It can scale DiT inference in parallel across multiple GPUs and even multi-machine clusters. 
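The quadratic growth mentioned above is easy to make concrete with a schematic operation count, one attention score per query-key pair (a back-of-envelope model, not a profiler measurement):

```python
# Schematic cost model: self-attention computes one score for every
# query-key pair, so the score matrix for a length-L sequence is L x L.

def attention_score_ops(seq_len):
    return seq_len * seq_len

# Doubling the sequence length quadruples the attention-score work,
# which is why long DiT input sequences dominate inference latency.
ratio = attention_score_ops(8192) // attention_score_ops(4096)
```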
Combined with compiler optimizations, it helps applications meet their real-time requirements.</p><p><strong>Open-source project: </strong><a href="https://github.com/xdit-project/xDiT"><strong>https://github.com/xdit-project/xDiT</strong></a></p><p>By applying xDiT, we successfully deployed multi-GPU parallel processing in the ComfyUI workflow, achieving the following results:</p><h3>Faster &amp; Quality-Preserving</h3><p>Without compromising image quality, xDiT significantly accelerates the FLUX.1 workflow in ComfyUI by leveraging multi-GPU parallelism. Taking the FLUX.1 dev version as an example, at 20 steps, single-image generation time was reduced <strong>from 13.87 seconds to 6.98 seconds.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zmrcIhY7VJxx-T0hmgCgSg.gif" /></figure><p>While increasing generation speed, we have also preserved the quality of the generated images. With the same prompt input, the resulting images show no significant difference in quality, and both clearly follow the instructions.</p><blockquote><strong><em>Prompt:</em></strong><em> </em>A spacious futuristic classroom with digital plants adorning the walls and floating 3D holographic projections of the solar system hanging from the ceiling. On one side, a large touch-sensitive smartboard reads “XDiT” in neon pen. In the center, there are several transformable smart desks, and students are engaged in interactive learning through augmented reality glasses. 
In the corner, a small robot is providing personalized learning support to the students.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I1D5d_Luvogbu1gFdiZLUw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IBd8tMjjUSgeIk1X-uM-Jg.png" /><figcaption>Baseline(left): 14.13s vs XDiT(right): 7.01s</figcaption></figure><h3>Simple &amp; Seamless</h3><p>In the initial stage, to apply XDiT’s capabilities to ComfyUI, we implemented two core nodes, XfuserPipelineLoader and XfuserSampler, at the Pipeline granularity through customization. Based on an HTTP service, we achieved end-to-end generation functionality in ComfyUI utilizing XDiT&#39;s capabilities.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tYmC0wo_4cp9_kbsDa49iQ.png" /></figure><p>However, such a highly customized implementation does not integrate well with the vast ComfyUI community ecosystem; moreover, the end-to-end generation logic largely contradicts ComfyUI’s highly modularized design philosophy aimed at achieving greater flexibility.</p><p>We hope to achieve parallel acceleration support for ComfyUI with minimal modifications. To achieve this goal, the core requirement is to complete the transformation of the diffusion model-related parts throughout the entire workflow. Taking the standard FLUX.1-dev official ComfyUI workflow as an example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/proxy/1*3il2iyQ6ue_CbDoN59_o1g.png" /></figure><p>The diffusion model is loaded through the #12 Load Diffusion Model node, processed and encapsulated by the #30 Model Sampling Flux and #22 BasicGuider nodes, and then passed to #13 SamplerCustomAdvanced to perform multi-step denoising calculations. 
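Those multi-step denoising calculations are the work worth distributing. In xDiT’s ComfyUI integration that dispatch is handled by Ray; the sketch below is a stdlib-only stand-in for the scatter/gather pattern, where threads play the role of Ray workers and `worker_denoise` is a hypothetical placeholder for each worker’s share of the sampler’s computation:

```python
# Stdlib-only stand-in for multi-worker dispatch: threads play the role of
# Ray workers, and `worker_denoise` is a hypothetical placeholder for each
# worker's share of the sampler's denoising computation.
from concurrent.futures import ThreadPoolExecutor

def worker_denoise(rank, shard):
    # Each worker processes its own latent shard; halving the values is
    # just a stand-in for one denoising update.
    return [x * 0.5 for x in shard]

latent_shards = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
with ThreadPoolExecutor(max_workers=4) as pool:        # "4 GPUs"
    results = list(pool.map(worker_denoise, range(4), latent_shards))
merged = [x for shard in results for x in shard]       # gather the result
```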
However, since ComfyUI was initially designed for personal computers with a single GPU, there remains a significant challenge in distributing computational tasks belonging to a single node across multiple GPUs for parallel processing without making major adjustments to the existing workflow.</p><p>To address this issue, we have implemented customized optimizations for the core #12 Load Diffusion Model loading node and the computational node #13 SamplerCustomAdvanced. By integrating the distributed computing capabilities of the Ray framework, we only need to use XDiTUNetLoader and XDiTSamplerCustomAdvanced as replacements to automatically distribute the model&#39;s computational tasks across multiple computing nodes. Thanks to xDiT&#39;s excellent parallel processing performance, we have significantly increased processing speed and effectively optimized computational efficiency without sacrificing the stability and flexibility of the existing workflow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5ONyNmXMZk0XxQu4Tugz9Q.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wuCcgm5SalT3zgbrzvqFCw.png" /></figure><h3>Plugin Support</h3><p>For diffusion models, relying solely on prompts and base models is far from sufficient to reach a production-ready level. To address this, the ComfyUI community has also supported a range of plugins including Loras, ControlNet, and IP-Adapter to better achieve stylization and controllability of generated content.</p><p>To accomplish this, xDiT has integrated the most popular Loras plugin into ComfyUI. Now, multiple Loras can be easily loaded using a single node called XDiTFluxLoraLoader. Moreover, due to the modularization of the Loras module, when switching between different Loras, there&#39;s no need to change the overall weight of the model internally; only the weights of the newly added Loras need to be adjusted. 
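The reason this switching is cheap follows from LoRA’s low-rank structure; a pure-Python sketch with tiny hypothetical matrices (not real Flux weights):

```python
# Why LoRA switching is cheap: the frozen base weight W is never rewritten;
# each LoRA only contributes a low-rank delta A @ B that can be swapped out.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def apply_lora(weight, lora_a, lora_b, scale=1.0):
    delta = matmul(lora_a, lora_b)  # rank-r update, tiny next to W itself
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

base_w = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2 toy)
style_a = [[1.0], [0.0]]            # rank-1 LoRA factors for one "style"
style_b = [[0.0, 0.5]]
w_styled = apply_lora(base_w, style_a, style_b)
# Switching LoRAs never touches base_w, so reverting or swapping styles
# only re-applies a different small delta.
```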
Based on this, xDiT supports dynamic switching of Loras during runtime, completing the switch almost instantly without waiting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/816/1*Kxr49rHnxJQASTZCqQoomg.png" /></figure><p>Currently, we are continuing to integrate and adapt other commonly used community plugins such as ControlNet and IP-Adapter, which will be rolled out in subsequent versions.</p><h3>Contact Us!</h3><p>This feature is still in the demonstration and experimental phase. If you are interested in the xDiT parallel version of ComfyUI, we welcome you to contact us via email at <a href="mailto:jiaruifang@tencent.com">jiaruifang@tencent.com</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=54b34e4bca05" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>