<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by OneFlow on Medium]]></title>
        <description><![CDATA[Stories by OneFlow on Medium]]></description>
        <link>https://medium.com/@oneflow2020?source=rss-af557d74bfb1------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*qy6_uRSyZHlm_RIoqYSGMg.png</url>
            <title>Stories by OneFlow on Medium</title>
            <link>https://medium.com/@oneflow2020?source=rss-af557d74bfb1------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 07 Apr 2026 13:40:26 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@oneflow2020/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Running Stable Video Diffusion 2x Faster with OneDiff DeepCache Node]]></title>
            <link>https://oneflow2020.medium.com/running-stable-video-diffusion-2x-faster-with-onediff-deepcache-node-e3cb053a0d53?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/e3cb053a0d53</guid>
            <category><![CDATA[onediff]]></category>
            <category><![CDATA[stable-video-diffusion]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[ai-art-generator]]></category>
            <category><![CDATA[aigc]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Fri, 22 Dec 2023 09:12:41 GMT</pubDate>
            <atom:updated>2023-12-22T09:12:41.497Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1/0*7636PfCRE2M1Sfva.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qA5YnuXvmloN8Hqj" /></figure><p><a href="https://www.reddit.com/r/StableDiffusion/comments/18lz2ir/accelerating_sdxl_3x_faster_with_deepcache_and/"><strong>The latest post</strong></a> introduced <a href="https://github.com/horseee/DeepCache"><strong>DeepCache</strong></a>, a novel training-free and almost lossless paradigm that accelerates diffusion models. Additionally, OneDiff has provided a new ComfyUI node named <a href="https://github.com/Oneflow-Inc/onediff/tree/main/onediff_comfy_nodes#installation-guide"><strong>ModuleDeepCacheSpeedup</strong></a>(which is a compiled DeepCache Module), enabling SDXL iteration speed <strong>3.5x faster</strong> on RTX 3090 and <strong>3x faster</strong> on A100.</p><p>Today, OneDiff’s ModuleDeepCacheSpeedup also supports <strong>SVD(Stable Video Diffusion) Speedup</strong>, ensuring almost lossless video quality and increasing iteration speed by more than <strong>2x</strong> on A100. Here is the example: <a href="https://github.com/Oneflow-Inc/onediff/pull/438">https://github.com/Oneflow-Inc/onediff/pull/438</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wI-Vgj-_XibdXVUH" /></figure><p>Run</p><p>ComfyUI Node name: <strong>ModuleDeepCacheSpeedup</strong><br>Refer to this URL on using the node：<a href="https://github.com/Oneflow-Inc/onediff/tree/main/onediff_comfy_nodes#installation-guide"><strong>https://github.com/Oneflow-Inc/onediff/tree/main/onediff_comfy_nodes#installation-guide</strong></a></p><p>Example Workflow</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*E4BoBIDa-IdF6JxS" /></figure><p>Depending</p><ol><li>The latest main branch of OneDiff: <a href="https://github.com/Oneflow-Inc/onediff/tree/main">https://github.com/Oneflow-Inc/onediff/tree/main</a></li><li>The latest OneFlow community edition:</li></ol><p>cuda 11.8:</p><pre>python3 -m pip install --pre oneflow -f <br>https://oneflow-pro.oss-cn-beijing.aliyuncs.com/branch/community/cu118</pre><p>cuda12.1:</p><pre>python3 -m pip install --pre oneflow -f <br>https://oneflow-pro.oss-cn-beijing.aliyuncs.com/branch/community/cu121</pre><p>cuda12.2:</p><pre>python3 -m pip install --pre oneflow -f <br>https://oneflow-pro.oss-cn-beijing.aliyuncs.com/branch/community/cu122</pre><p>Thanks to Yizhou Zheng from Stability AI, who inspired us to try our acceleration node (DeepCache with OneDiff compilation) on SVD.</p><p>Welcome to join <a href="https://discord.gg/TZtMCDw2"><strong><em>OneDiff Discord group</em></strong></a><strong><em> </em></strong>to discuss related questions.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e3cb053a0d53" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Accelerating SDXL 3x Faster with DeepCache and OneDiff]]></title>
            <link>https://oneflow2020.medium.com/accelerating-sdxl-3x-faster-with-deepcache-and-onediff-8f6ea162d2cb?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/8f6ea162d2cb</guid>
            <category><![CDATA[onediff]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[image-generation]]></category>
            <category><![CDATA[sdxl]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Wed, 20 Dec 2023 07:40:53 GMT</pubDate>
            <atom:updated>2023-12-20T07:42:06.745Z</atom:updated>
            <content:encoded><![CDATA[<h4>Make SDXL run <strong>3.5x faster</strong> on RTX 3090 and <strong>3x faster</strong> on A100.</h4><p><a href="https://github.com/horseee/DeepCache">DeepCache</a>, launched last week, is a novel training-free and almost lossless paradigm that accelerates diffusion models from the perspective of the model architecture.</p><p>Now OneDiff introduces a new ComfyUI node named <a href="https://github.com/Oneflow-Inc/onediff/tree/main/onediff_comfy_nodes#installation-guide"><strong>ModuleDeepCacheSpeedup</strong></a> (a compiled DeepCache module), making SDXL iteration <strong>3.5x faster</strong> on RTX 3090 and <strong>3x faster</strong> on A100.</p><p>Here is an example: <a href="https://github.com/Oneflow-Inc/onediff/pull/426">https://github.com/Oneflow-Inc/onediff/pull/426</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*d6atmclwoxJoAIn8" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1/0*3UaqcNsDpTYqy9CO.png" /></figure><p>Run</p><p>ComfyUI node name: ModuleDeepCacheSpeedup<br>You can refer to this URL for how to use the node: <a href="https://github.com/Oneflow-Inc/onediff/tree/main/onediff_comfy_nodes#installation-guide"><strong>https://github.com/Oneflow-Inc/onediff/tree/main/onediff_comfy_nodes#installation-guide</strong></a></p><p>Example workflow</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8rLM2zw_WCM-9dSN" /></figure><p>Dependencies</p><ol><li>The latest main branch of OneDiff: <a href="https://github.com/Oneflow-Inc/onediff/tree/main">https://github.com/Oneflow-Inc/onediff/tree/main</a></li><li>The latest OneFlow community edition:</li></ol><p>CUDA 11.8:</p><pre>python3 -m pip install --pre oneflow -f <br>https://oneflow-pro.oss-cn-beijing.aliyuncs.com/branch/community/cu118</pre><p>CUDA 12.1:</p><pre>python3 -m pip install --pre oneflow -f<br>https://oneflow-pro.oss-cn-beijing.aliyuncs.com/branch/community/cu121</pre><p>CUDA 12.2:</p><pre>python3 -m pip install --pre oneflow -f<br>https://oneflow-pro.oss-cn-beijing.aliyuncs.com/branch/community/cu122</pre><p>Welcome to join the <a href="https://discord.gg/TZtMCDw2"><strong><em>OneDiff Discord group</em></strong></a><strong><em> </em></strong>to discuss related questions.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8f6ea162d2cb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OneFlow v0.9.0 Came Out!]]></title>
            <link>https://medium.com/codex/oneflow-v0-9-0-came-out-9e1ed990065c?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/9e1ed990065c</guid>
            <category><![CDATA[oneflow]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[glm]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[pytorch]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Fri, 03 Feb 2023 12:18:33 GMT</pubDate>
            <atom:updated>2023-02-13T11:28:32.956Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/720/0*Xcx_Gp9X_o2-9k3B" /></figure><p>We are thrilled to announce the release of OneFlow v0.9.0. This update contains 640 commits. For the full changelog, please check out: <a href="https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.9.0"><strong>https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.9.0</strong></a>. Come install OneFlow v0.9.0 for a new user experience. Your feedback will be much appreciated!</p><p>Highlights and optimizations in this release:</p><h3>1. PyTorch API compatibility</h3><p>With the addition of 86 new API interfaces and operators aligned with PyTorch and the fix of 104 bugs related to operator compatibility, OneFlow v0.9.0 provides better PyTorch API and model compatibility. In v0.9.0, users can migrate more PyTorch models to OneFlow with one click and gain faster performance.</p><ul><li>Allowing one-click migration of <a href="https://github.com/Oneflow-Inc/diffusers">Stable Diffusion</a>, <a href="https://huggingface.co/BAAI/glm-large">GLM</a>, <a href="https://github.com/Oneflow-Inc/one-yolov5">YOLOv5</a> etc to OneFlow.</li><li>More convenient model migration. Oneflow.load supports loading the torch.save models directly.</li><li>With the newly added oneflow.mock_torch module and mock method（<a href="https://docs.oneflow.org/master/cookies/oneflow_torch.html%EF%BC%89">https://docs.oneflow.org/master/cookies/oneflow_torch.html）</a>, oneflow can migrate complex PyTorch models containing multiple scripts with one click without changing the original PyTorch script.</li></ul><h3>2. Improving the usability of distributed programming</h3><p>Global Tensor has added a series of interfaces and methods that are convenient for distributed programming. And related bugs have been fixed.</p><h3>3. Supporting automatic parallelism</h3><p>The Graph released a new feature of automatic parallelism (version 1), which supports automatic search for the fastest SBP with a specified Placement. When writing distributed models with Global Tensor, users do not need to consider parallelism model.</p><p>For more information, please check out: <a href="https://oneflow.readthedocs.io/en/master/auto_parallel.html"><strong>https://oneflow.readthedocs.io/en/master/auto_parallel.html</strong></a></p><h3>4. 
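<p>As a rough illustration of the mock feature, here is a minimal sketch based on the cookbook linked above (the enable/disable calls follow that document; check it for the exact API in your version):</p><pre>import oneflow.mock_torch as mock<br><br>mock.enable()  # from here on, &quot;import torch&quot; resolves to oneflow<br>import torch<br>x = torch.randn(2, 3)  # executed by OneFlow, with the PyTorch script unchanged<br>mock.disable()  # restore the real PyTorch for subsequent imports</pre>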
<h3>2. Improving the usability of distributed programming</h3><p>Global Tensor has added a series of interfaces and methods that are convenient for distributed programming, and related bugs have been fixed.</p><h3>3. Supporting automatic parallelism</h3><p>Graph adds automatic parallelism (version 1), which automatically searches for the fastest SBP under a specified Placement. When writing distributed models with Global Tensor, users no longer need to decide on a parallelism strategy themselves.</p><p>For more information, please check out: <a href="https://oneflow.readthedocs.io/en/master/auto_parallel.html"><strong>https://oneflow.readthedocs.io/en/master/auto_parallel.html</strong></a></p><h3>4. Better performance</h3><p>Graph improves performance and reduces memory overhead, with a series of optimizations related to memory, execution speed, pipeline overlapping, and compilation speed.</p><p>A series of operator and system optimizations have been added, including Eager instruction scheduling, high-performance CUDA kernels, opening up multiple memory pools, etc.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*710aAHJLqRpDNcva9iReFQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HQsjUmmDNxHG7Auu8YeO9g.png" /></figure><p>After simple tuning, the <a href="https://huggingface.co/BAAI/glm-large">GLM-Large (335M) pre-trained model</a> based on OneFlow v0.9.0 can outperform the original GLM model based on PyTorch, DeepSpeed, and Apex, with up to 3x the performance while saving 1/3 of the memory overhead.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LiXBQPLy0DfZPjPXSgKZbg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ej0zRK2NDAEbg29R8rvEoA.png" /></figure><p>On A100 GPU (SXM 80GB / PCIe 40GB), <a href="https://github.com/Oneflow-Inc/diffusers">the OneFlow Stable Diffusion inference speed</a> is the fastest among the compared deep learning frameworks and compilers.</p><h3>5. Debugging</h3><p>Graph provides a series of functions to aid debugging, including analyzing memory logs, displaying progress during the compilation stage, and showing the computation graph.</p><h3>6. IR</h3><p>OneFlow IR supports additional compilation optimization functions such as JIT compilation of LR code, distributed description of SBP signatures, and the new OKL Dialect.</p><h3>7. OneFlow-ONNX</h3><p><a href="https://github.com/Oneflow-Inc/oneflow_convert">The newly released OneFlow-ONNX v0.6.0</a> enhances the usability of the exchange interface with multiple new features. In addition, it adds support for another 6 models and over 20 Ops, and fixes 6 bugs in the conversion process. You can install it with one command: pip install oneflow-onnx==0.6.0.</p><p>Repository URL: <a href="https://github.com/Oneflow-Inc/oneflow_convert"><strong>https://github.com/Oneflow-Inc/oneflow_convert</strong></a></p><h3>8. Better error messages</h3><p>OneFlow’s error messages are more user-friendly: the error content is highlighted and unnecessary internal details are simplified away, so you can see the location and type of an error at a glance.</p><p>Check out the link below for the full version of the OneFlow v0.9.0 updates: <a href="https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.9.0"><strong>https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.9.0</strong></a></p><p>Many thanks to the following contributors:</p><p>liujuncheng, BBuf, wyg1997, jackalcooper, Flowingsun007, clackhan, daquexian, marigoold, lixinqi, guo-ran, hjchen2, strint, ouyangyu, MARD1NO, small1945, reygu, Ldpe2G, leaves-zwx, Yipeng1994, zhongshsh, lixiang007666, mosout, chengtbf, hhhfccz, doombeaker, howin98, xiacijie, farmerzhang1, shangguanshiyuan, JasonChen9, liufengwei0103, youxiudeshouyeren, laoliu97, EsdeathYZH, rejoicesyc, AsakusaRinne, LijunZhang01, Chenqll, xiezipeng-ML, simonJJJ, ShawnXuan</p><p>Other articles:</p><p><strong><em>1. 
</em></strong><a href="https://oneflow2020.medium.com/using-global-tensor-to-program-on-multi-device-multi-gpu-basic-operations-556e6b78df92"><strong><em>Using Global Tensor to Program on Multi-Device Multi-GPU: Basic Operations</em></strong></a></p><p><strong><em>2.</em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em> LiBai Model Library to Train Large Models More Easily and Efficiently</em></strong></a></p><p><a href="https://oneflow2020.medium.com/text-to-image-in-less-than-1-second-probably-the-fastest-open-source-stable-diffusion-ever-b3792b7a9d89"><strong><em>3. Text to Image in less than 1 Second, Probably the Fastest Open Source Stable Diffusion Ever</em></strong></a></p><p><em>Welcome to visit OneFlow on</em><strong><em> </em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em>GitHub</em></strong></a><strong><em> </em></strong><em>and follow us on</em><strong><em> </em></strong><a href="https://twitter.com/home"><strong><em>Twitter</em></strong></a><strong><em> </em></strong><em>and</em><strong><em> </em></strong><a href="https://www.linkedin.com/authwall?trk=bf&amp;trkInfo=AQGXBC4pruGdygAAAXod79Hw0T9BLNvOQInaaRyy6JZVtaKYCP5E_Z4tdvh9NpIUjtmlg1JfnQplY8qYCR2iz9gpwh7BjldCWYelFgPuBgXy1TCm7YTmhZqEAfyxDtLP1k7-6ew=&amp;originalReferer=&amp;sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Foneflow-inc"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></p><p>Also, welcome to join our <a href="https://discord.gg/4kpjGA5bZY"><strong><em>Discord group</em></strong></a><strong><em> </em></strong>to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9e1ed990065c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/codex/oneflow-v0-9-0-came-out-9e1ed990065c">OneFlow v0.9.0 Came Out!</a> was originally published in <a href="https://medium.com/codex">CodeX</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Text to Image in less than 1 Second, Probably the Fastest Open Source Stable Diffusion Ever]]></title>
            <link>https://oneflow2020.medium.com/text-to-image-in-less-than-1-second-probably-the-fastest-open-source-stable-diffusion-ever-b3792b7a9d89?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/b3792b7a9d89</guid>
            <category><![CDATA[ai-art-generator]]></category>
            <category><![CDATA[oneflow]]></category>
            <category><![CDATA[stable-diffusion]]></category>
            <category><![CDATA[diffusers]]></category>
            <category><![CDATA[open-source]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Thu, 01 Dec 2022 13:29:59 GMT</pubDate>
            <atom:updated>2022-12-02T00:07:52.259Z</atom:updated>
            <content:encoded><![CDATA[<p>Text to Image in less than 1 Second, Probably the Fastest Open Source Stable Diffusion Ever</p><blockquote><a href="https://github.com/Oneflow-Inc/oneflow/"><strong>OneFlow</strong></a> has refreshed the SOTA inference performance of Stable Diffusion. On A100 GPU, whether PCIe 40GB or SXM 80GB, OneFlow Stable Diffusion leads the performance results compared to other deep learning frameworks/compilers.</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*z4Ta5CWpnYmMfO9K" /></figure><p>The first automobile in the world ran at a speed of merely 16 km/h, easily beaten by a normal carriage. That’s why the initial cars were nothing more than “a cool toy” for quite a long time. AI text-to-image generators were born similarly.</p><p>AI art generators started with portrait stylization, an image processing function mostly designed for entertainment. People used it to smooth their skin in photos and generate fun avatars but soon lost interest, as many social media trends go.</p><p>But then came the real game changer: the diffusion models. These models spare painters and designers the trouble of deciding colors and composition before drawing. They can just tell the AI model what they want, and it will generate beautifully crafted images from scratch as required.</p><p>However, like the early cars, if diffusion models could only run at unsatisfactory speeds, they would never go beyond a toy and become a real production tool for humankind.</p><p>At first, AI art generators took days to produce an image, then hours, then minutes. They were getting faster and faster, but the question is: how fast do they have to be before they can be put into the everyday toolkit for professional artists and even the general public?</p><p>No definite answer has revealed itself. But now, it’s safe to say that with the newly released OneFlow Stable Diffusion, the day is on the horizon!</p><p>One landmark event: OneFlow recently brought Stable Diffusion into the era of “generating in one second” for the first time, and the AI community then started a race to speed up the model. Just now, OneFlow refreshed the SOTA record again.</p><ul><li>OneFlow Stable Diffusion URL: <a href="https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion"><strong>https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion</strong></a></li><li>OneFlow URL: <a href="https://github.com/Oneflow-Inc/oneflow/"><strong>https://github.com/Oneflow-Inc/oneflow/</strong></a></li></ul><h3>OneFlow Stable Diffusion: faster than fast</h3><p>On November 7th, OneFlow announced that Stable Diffusion had literally achieved “generating in 1 second”. 
Comparing across various hardware and frameworks, OneFlow pushed the inference performance of Stable Diffusion to a brand-new SOTA.</p><p>The following charts show the inference performance of Stable Diffusion on A100 (SXM 80GB / PCIe 40GB) using 4 deep learning frameworks/compilers (PyTorch, TensorRT, AITemplate, and OneFlow).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZnsEwZGbg6CrP0yV" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GVcvu4Qpjt7c9gBH" /></figure><p>On A100 (SXM 80GB / PCIe 40GB), the OneFlow Stable Diffusion inference speeds are at least 15% faster than the second best.</p><p>Notably, on A100 SXM 80GB, OneFlow Stable Diffusion reaches a groundbreaking inference speed of 50 it/s, which means the 50 rounds of sampling required to generate an image can be done in exactly 1 second.</p><p>A week later, Meta AITemplate improved the performance of Stable Diffusion, and its speed surpassed OneFlow’s.</p><p>There is no end to performance optimization, and OneFlow keeps iterating as well. Two weeks later, OneFlow upgraded the performance of Stable Diffusion further and once again surpassed the results of AITemplate.</p><p><strong>So for now, OneFlow is still the fastest. </strong>It can be seen that on A100 GPU, whether PCIe 40GB or SXM 80GB, <strong>OneFlow improves on its previous performance results by more than 10%</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/856/1*lm-iAuibKniGX5VLhNGGMQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/904/1*_X3RBu9Wxtfa3Rw3xjcI0g.png" /></figure><h3>Showcase</h3><p>With OneFlow Stable Diffusion, you can turn your wildest imagination into stunning artworks. Here are a few examples to show you what it can do:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/0*wa_FjeoXpYvn4VfY" /><figcaption>A shockingly realistic sunshine beach with a coconut tree</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/902/0*wtyrqsQuKgvKCnT7" /><figcaption>A hamster firefighter and a rabbit-eared dog</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/0*J1-2OdBpGMLhX4DX" /><figcaption>An astronaut eating hotpot on Mars</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/797/0*_UaTdGtNE0iTX4kF" /><figcaption>Future another-world AIs</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/0*ctKtoOHno4MvoHkx" /><figcaption>The OneFlow Dragon Balls</figcaption></figure><p>Come try <a href="https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion"><strong>OneFlow Stable Diffusion</strong></a> and make your own masterpiece! If you don’t have any prompt ideas for now, you may find inspiration on Lexica, a gallery of AI-generated paintings with their corresponding prompts.</p><h3>Seamless integration into PyTorch ecosystem to enable easy model transfer</h3><p>Users can convert PyTorch Stable Diffusion from Hugging Face into OneFlow Stable Diffusion by modifying three lines of code: just replace import torch with import oneflow as torch, and StableDiffusionPipeline with OneFlowStableDiffusionPipeline, as follows.</p>
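<p>The original post showed this change as an image; below is a minimal sketch of what the converted script looks like, reconstructed from the description above (the model ID and pipeline arguments are illustrative, not verbatim from the post):</p><pre># import torch<br>import oneflow as torch  # change 1: the only import change<br># from diffusers import StableDiffusionPipeline<br>from diffusers import OneFlowStableDiffusionPipeline  # change 2: OneFlow&#39;s Diffusers fork<br><br>pipe = OneFlowStableDiffusionPipeline.from_pretrained(  # change 3: pipeline class<br>    &quot;CompVis/stable-diffusion-v1-4&quot;,  # PyTorch weights load as-is<br>    revision=&quot;fp16&quot;,<br>    torch_dtype=torch.float16,  # torch.float16 works unchanged<br>).to(&quot;cuda&quot;)<br><br>with torch.autocast(&quot;cuda&quot;):  # torch.autocast works unchanged<br>    image = pipe(&quot;a photo of an astronaut riding a horse on mars&quot;).images[0]<br>image.save(&quot;astronaut.png&quot;)</pre>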
<p>Such effortless model transfer is made possible by two facts about OneFlow Stable Diffusion:</p><ul><li>OneFlowStableDiffusionPipeline.from_pretrained is compatible with PyTorch weights.</li><li>OneFlow APIs are intrinsically aligned with PyTorch, so no changes are needed to expressions like torch.autocast and torch.float16 for them to work after import oneflow as torch.</li></ul><p>From the above you can see how seamlessly OneFlow integrates into the PyTorch ecosystem. This enables easy transfer of not only Stable Diffusion but also many other models to OneFlow. For example, you may transfer most Torchvision models to Flowvision via import oneflow as torch.</p><p>In addition, users can enable the mock torch feature by running eval $(oneflow-mock-torch) from the command line, so that all import torch statements in subsequent Python scripts automatically point to oneflow.</p><h3>How to run OneFlow Stable Diffusion</h3><p>To generate images with OneFlow Stable Diffusion using Docker, all you need to do is execute the following snippet:</p><pre>docker run --rm -it \<br>  --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \<br>  -v ${HF_HOME}:${HF_HOME} \<br>  -v ${PWD}:${PWD} \<br>  -w ${PWD} \<br>  -e HF_HOME=${HF_HOME} \<br>  -e HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN} \<br>  oneflowinc/oneflow-sd:cu112 \<br>  python3 /demos/oneflow-t2i.py # --prompt &quot;a photo of an astronaut riding a horse on mars&quot;</pre><p>For further details, please check: <a href="https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion"><strong>https://github.com/Oneflow-Inc/diffusers/wiki/How-to-Run-OneFlow-Stable-Diffusion</strong></a></p><h3>What next</h3><p>In the coming months, the OneFlow team will work on merging the code in the forked <a href="https://github.com/Oneflow-Inc/diffusers.git"><strong>Diffusers</strong></a> and <a href="https://github.com/Oneflow-Inc/transformers.git"><strong>Transformers</strong></a> repositories back into the corresponding upstream Hugging Face repositories. This is the first time that OneFlow has developed models by contributing to the Transformers/Diffusers backends. Developers are more than welcome to share their input with us on <a href="https://github.com/Oneflow-Inc/oneflow/"><strong>GitHub</strong></a>.</p><p>It is noteworthy that <strong>OneFlow’s compiler has played a pivotal role in accelerating OneFlow Stable Diffusion. The compiler allows models built with the PyTorch frontend to run faster on NVIDIA GPUs. </strong>More technical details will be unveiled in our future posts.</p><p>Other articles:</p><p><strong><em>1. 
</em></strong><a href="https://oneflow2020.medium.com/using-global-tensor-to-program-on-multi-device-multi-gpu-basic-operations-556e6b78df92"><strong><em>Using Global Tensor to Program on Multi-Device Multi-GPU: Basic Operations</em></strong></a></p><p><strong><em>2.</em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em> LiBai Model Library to Train Large Models More Easily and Efficiently</em></strong></a></p><p><em>Welcome to visit OneFlow on</em><strong><em> </em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em>GitHub</em></strong></a><strong><em> </em></strong><em>and follow us on</em><strong><em> </em></strong><a href="https://twitter.com/home"><strong><em>Twitter</em></strong></a><strong><em> </em></strong><em>and</em><strong><em> </em></strong><a href="https://www.linkedin.com/authwall?trk=bf&amp;trkInfo=AQGXBC4pruGdygAAAXod79Hw0T9BLNvOQInaaRyy6JZVtaKYCP5E_Z4tdvh9NpIUjtmlg1JfnQplY8qYCR2iz9gpwh7BjldCWYelFgPuBgXy1TCm7YTmhZqEAfyxDtLP1k7-6ew=&amp;originalReferer=&amp;sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Foneflow-inc"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></p><p>Also, welcome to join our <a href="https://discord.gg/4kpjGA5bZY"><strong><em>Discord group</em></strong></a><strong><em> </em></strong>to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b3792b7a9d89" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using Global Tensor to Program on Multi-Device Multi-GPU: Basic Operations]]></title>
            <link>https://oneflow2020.medium.com/using-global-tensor-to-program-on-multi-device-multi-gpu-basic-operations-556e6b78df92?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/556e6b78df92</guid>
            <category><![CDATA[deep-learning-framework]]></category>
            <category><![CDATA[global-tensor]]></category>
            <category><![CDATA[multi-gpu]]></category>
            <category><![CDATA[oneflow]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Sun, 14 Aug 2022 15:15:49 GMT</pubDate>
            <atom:updated>2022-08-14T15:15:49.069Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Q5kWPdcs96YVrnNml_YPsA.png" /></figure><p>By <a href="https://github.com/doombeaker">YaoChi</a>, <a href="https://github.com/strint">Xu Xiaoyu</a>, <a href="https://github.com/Alive1024">Zuo Yihao</a>, <a href="https://github.com/lmyybh">Guoliang Cheng</a>, <a href="https://github.com/Carly-Shen">Shen Jiali</a></p><p>Global tensor can be executed on multi-device multi-GPU, and it’s an interface to implement the Global View programming.</p><p>Today, most parallel programs adopt the SPMD (Single program, multiple data) programming method, which means the devices will execute the same program but process different parts of the data to realize data parallelism. Take PyTorch’s DDP (Distributed Data Parallel) for example, each process executes the same neural network computing logic, but the difference is that they load different slices of one dataset.</p><p>But, the defect of SPMD programming is that multiple data makes communications more complicated. In a deep learning scenario, SPMD programming needs to insert communication operations into original computing codes, such as AllReduce for data parallelism and AllGather/ReduceScatter for model parallelism. If the parallel mode is much more complicated or a new mode needs to be experimented with, it will be troublesome to develop and maintain after inserting the communication operations.</p><p><strong>Global View programming permits users to program from the SPSD view.</strong> Different from SPMD programming, SPSD programming is a method that data is also single from the programming interface layer.</p><p>When we extend a single-process program to a parallelly executed one, the single-process data will also be extended to the multi-process data, so it’s natural that the data on different processes corresponds to the same logic data on the originally single-process program. And the logic data is called Global Tensor in <a href="https://github.com/Oneflow-Inc/oneflow/"><strong>OneFlow</strong></a>.</p><p>Global Tensor supports users to utilize the SPSD interface to program, which means users can program on a single device and OneFlow framework will automatically convert to physical SPMD/MPMD mode and execute the program in a parallel/distributed way.</p><p>With Global Tensor, a more naturally Global View programming method is available, and users can regard the multi-devices as a single device to implement SPSD programming.</p><h3>Global Tensor</h3><p>In programming languages, “Global” usually refers to in-process global visibility, such as <a href="https://en.wikipedia.org/wiki/Global_variable">Global Variable</a>.</p><p>Instead, the “Global” of the “Global Tensor” means inter-process global visibility. So, it’s more accurate to regard the Global Tensor as a tensor that can be seen on all processes.</p><p>Global Tensor exists on all processes. When the tensor is executed by an operator on all processes, it will be automatically executed on multi-device multi-GPU.</p><p>At present, the commonly-used tensor is only visible on one process and also exists on a single device. OneFlow calls it the Local Tensor, which means it’s a tensor that can be seen on only one process. Local is relative to Global, so Local Tensor can be considered as Local (on one process) Tensor.</p><p>Most of OneFlow’s operators are compatible with the execution of Local Tensors and Global Tensors. 
It’s convenient to convert a Local Tensor to a Global Tensor, so code originally executed on a single device and GPU can be smoothly converted to code that executes on multiple devices and GPUs.</p><p>Global Tensor allows users to easily develop models on multiple devices and GPUs. Compared with writing the communication operators by hand, this greatly reduces the effort of developing parallel models.</p><h3>Creating Global Tensor</h3><p>Let’s try to create a Global Tensor on a machine with two GPUs. Taking the randn operator as an example, create a Python file named test_randn_global.py with the following content:</p><pre>import oneflow as flow<br># Place a global tensor on the cuda devices of ranks (processes) 0 and 1<br>placement = flow.placement(type=&quot;cuda&quot;, ranks=[0, 1])<br># Each rank&#39;s local data is a slice resulting from splitting the global data on dim 0<br>sbp = flow.sbp.split(dim=0)<br># Create a global tensor by randn<br>x = flow.randn(4, 5, placement=placement, sbp=sbp)<br># Print local data<br>print(&quot;Local data of global tensor:\n &quot;, x.to_local().numpy())<br># Print global data<br>print(&quot;Global data of global tensor:\n &quot;, x.numpy())</pre><p>Here are some explanations of the new concepts in the code above:</p><ul><li>placement refers to the physical devices where the Global Tensor is located. The parameter type specifies the type of the physical device; here we use &quot;cuda&quot; to represent the GPU device. The parameter ranks specifies the device IDs. For readers who don’t have 2 GPUs, type can be specified as &quot;cpu&quot; to use the CPU to simulate multiple devices, and the following code still works.</li><li>sbp refers to the way the Global Tensor is distributed. Here, sbp = flow.sbp.split(dim=0) means that the Global Tensor is evenly split along dimension 0.</li><li>The to_local() method acquires the current rank’s Local Tensor from the Global Tensor, because the Global Tensor holds one Local Tensor on each rank as its physically existing local component.</li></ul><p>Next, configure the environment variables required for multi-process launching. Here, the machine has 2 GPUs, which correspond to 2 processes, so open 2 terminals and configure the following environment variables respectively:</p>
<p>Terminal 0:</p><pre>export MASTER_ADDR=127.0.0.1 MASTER_PORT=17789 WORLD_SIZE=2 RANK=0 LOCAL_RANK=0</pre><p>Terminal 1:</p><pre>export MASTER_ADDR=127.0.0.1 MASTER_PORT=17789 WORLD_SIZE=2 RANK=1 LOCAL_RANK=1</pre><p>For a detailed explanation of the environment variables above, and for how to launch distributed jobs with helper tools, please refer to <a href="https://github.com/Oneflow-Inc/oneflow-documentation/blob/2f0d2f57bced91b75e0a84558a9703c70bc07ca6/en/docs/cookies/global_tensor.md#_2">Further reading</a>.</p><p>Finally, launch test_randn_global.py in the two terminals respectively and observe the result of creating the Global Tensor:</p><pre>python3 test_randn_global.py</pre><p>In Terminal 0 (rank 0), we can see:</p><pre>Local data of global tensor:<br>  [[-0.07157125 -0.92717147  1.5102768   1.4611115   1.014263  ]<br> [-0.1511031   1.570759    0.9416077   0.6184639   2.4420679 ]]<br>Global data of global tensor:<br>  [[-0.07157125 -0.92717147  1.5102768   1.4611115   1.014263  ]<br> [-0.1511031   1.570759    0.9416077   0.6184639   2.4420679 ]<br> [-0.38203463  0.453836    0.9136015   2.35773    -0.3279942 ]<br> [-0.8570119  -0.91476554 -0.06646168  0.50022084 -0.4387695 ]]</pre><p>In Terminal 1 (rank 1), we can see:</p><pre>Local data of global tensor:<br>  [[-0.38203463  0.453836    0.9136015   2.35773    -0.3279942 ]<br> [-0.8570119  -0.91476554 -0.06646168  0.50022084 -0.4387695 ]]<br>Global data of global tensor:<br>  [[-0.07157125 -0.92717147  1.5102768   1.4611115   1.014263  ]<br> [-0.1511031   1.570759    0.9416077   0.6184639   2.4420679 ]<br> [-0.38203463  0.453836    0.9136015   2.35773    -0.3279942 ]<br> [-0.8570119  -0.91476554 -0.06646168  0.50022084 -0.4387695 ]]</pre><p>It’s clear that if we concatenate the Local Tensors on rank 0 and rank 1 along dimension 0, we get the complete value of the Global Tensor.</p><h3>Converting Local Tensor to Global Tensor</h3><p>We can first create a Local Tensor and then use the <a href="https://oneflow.readthedocs.io/en/master/tensor.html#oneflow.Tensor.to_global">Tensor.to_global</a> method to convert it to a Global Tensor.</p><p>Create the following program and launch it in the same way as above:</p><pre>import oneflow as flow<br>x = flow.randn(2, 5).cuda()<br>print(x.is_local) # True<br>print(x.is_global) # False<br>placement = flow.placement(type=&quot;cuda&quot;, ranks=[0, 1])<br>sbp = flow.sbp.split(0)<br>x_global = x.to_global(placement=placement, sbp=sbp)<br>print(x_global.shape) # (4, 5)<br>print(x.is_local) # True<br>print(x_global.is_global) # True</pre><p>This program creates a Local Tensor with shape (2, 5) on each of the 2 GPUs, and each newly created tensor is called x.</p><p>Then, we specify the cuda devices on rank 0 and rank 1 as the placement, and split(dim=0) as the SBP. After the to_global call, the original Local Tensor is converted to a Global Tensor named x_global.</p><p>We can see that the shape of x_global has become (4, 5), which is the (global) shape of the Global Tensor.</p><p>The Global Tensor is the whole and the Local Tensor is its component: each Local Tensor is the part of the whole that lives on a certain rank. The specific relationship between the Global Tensor and the Local Tensor is decided by the placement and SBP. 
For example, in the above case, the relationship ties together the tensors on GPU 0 and GPU 1: x_global split along dimension 0 yields each rank’s x.</p><p>Based on this relationship, the to_global method can infer x_global.shape from x.shape: it concatenates the Local Tensors x on the 2 GPUs along dimension 0 to obtain x_global.</p><p>Besides the shape, the Global Tensor also carries data. A Global Tensor has a Local Tensor on each rank as its local component, which is its physical data on that rank; each rank stores only its own part of the data.</p><h3>Converting Global Tensor to Local Tensor</h3><p>You can use the <a href="https://oneflow.readthedocs.io/en/master/tensor.html#oneflow.Tensor.to_local">to_local</a> method to obtain the local component of the Global Tensor, like the following:</p><pre>import oneflow as flow<br>placement = flow.placement(type=&quot;cuda&quot;, ranks=[0, 1])<br>sbp = flow.sbp.split(0)<br>x = flow.randn(4, 5, placement=placement, sbp=sbp)<br>print(x.to_local())</pre><p>When x.to_local() is executed, the two ranks obtain two different Local Tensors, each with shape (2, 5).</p><p>In Terminal 0 (rank 0), we can see:</p><pre>tensor([[-0.2730,  1.8042,  0.0721, -0.5024, -1.2583],<br>    	[-0.3379,  0.9371,  0.7981, -0.5447, -0.5629]],<br>   	   dtype=oneflow.float32)</pre><p>In Terminal 1 (rank 1), we can see:</p><pre>tensor([[ 0.6829,  0.4849,  2.1611,  1.4059,  0.0934], <br>        [-0.0301, -0.6942, -0.8094, -1.3050, -0.1778]], <br>       dtype=oneflow.float32)</pre><p>to_local() takes no parameters, because the Global Tensor has already determined its local component from the placement and SBP, so it can directly return the corresponding Local Tensor.</p><h3>Converting One Global Tensor to Another Global Tensor</h3><p>Usually, distributed computing requires inserting communication operations into the normal computational logic, but OneFlow only requires users to convert the data distribution type of the Global Tensor.</p><p>In terms of type, the biggest difference between a Global Tensor and a plain Local Tensor is that a Global Tensor has a global data distribution type, which specifies how the Global Tensor is distributed across the ranks, namely its placement and SBP.</p><p>The function of placement in the global data distribution type is to specify the device group where the data is distributed:</p><ul><li>The parameter type specifies the physical device type: cuda represents GPU device memory, and cpu refers to CPU device memory.</li><li>The parameter ranks specifies the process ID set. Because each rank corresponds to one physical device, ranks can also be seen as the device ID set. 
In fact, ranks is an nd-array of rank IDs, which supports high-dimensional device arrangements.</li></ul><p>For more details, please refer to <a href="https://oneflow.readthedocs.io/en/master/tensor_attributes.html?highlight=placement#oneflow.placement">oneflow.placement</a>.</p><p>The function of SBP in the global data distribution type is to specify the relationship between global data and local data:</p><ul><li>S, i.e., split(dim), denotes that the relationship between global data and local data is split: the global data is evenly split along dimension dim and distributed across the ranks.</li><li>B, i.e., broadcast, denotes that the relationship between global data and local data is broadcast: the global data is replicated on each rank.</li><li>P, i.e., partial_sum, denotes that the relationship between global data and local data is partial: the value of the global data is the element-wise sum of the local data distributed across the ranks.</li></ul><p>For more details, please refer to <a href="https://oneflow.readthedocs.io/en/master/tensor_attributes.html?highlight=placement#oneflow.sbp.sbp">oneflow.sbp.sbp</a>.</p>
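<p>To make the P type concrete, here is a minimal sketch of our own (not from the original tutorial): a matmul over two split inputs naturally yields a partial_sum result, and a type conversion then sums it up:</p><pre>import oneflow as flow<br><br>placement = flow.placement(type=&quot;cuda&quot;, ranks=[0, 1])<br># Split x by columns and w by rows: each rank computes a full-shape partial product<br>x = flow.randn(4, 6, placement=placement, sbp=flow.sbp.split(dim=1))<br>w = flow.randn(6, 8, placement=placement, sbp=flow.sbp.split(dim=0))<br>y = flow.matmul(x, w)<br>print(y.sbp)  # (flow.sbp.partial_sum,): global value = element-wise sum over ranks<br># Converting P to B triggers an implicit all-reduce that sums the partial results<br>y_b = y.to_global(placement=placement, sbp=flow.sbp.broadcast)<br>print(y_b.to_local().numpy())  # every rank now holds the complete result</pre>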
<p>Data re-distribution is common in parallel computing, i.e., changing the way the data is distributed, such as gathering all data slices. In the MPI programming paradigm (SPMD), data re-distribution requires writing explicit communication operations like AllReduce, AllGather, and ReduceScatter. But in OneFlow’s Global View programming paradigm (SPSD), data re-distribution is achieved by converting the Global Tensor’s global data distribution type.</p><p>The conversion of the global data distribution type is similar to (explicit) type conversion in general programming languages. Users only need to specify the target type when converting, and the implicit operations are executed automatically. For example, when converting from double to int, the system removes the decimal part automatically.</p><p>Similarly, it is only required to specify the new global data distribution type that the Global Tensor should be converted into, and OneFlow completes the implicit communication operations automatically. The interface for converting the global data distribution type is <a href="https://oneflow.readthedocs.io/en/master/tensor.html#oneflow.Tensor.to_global">Tensor.to_global</a>. The to_global method takes two parameters, placement and sbp, which determine the new global data distribution type.</p><p>The main implicit operations in converting the global data distribution type are inferring and executing the communications; they are implemented by OneFlow’s <a href="https://docs.oneflow.org/en/master/parallelism/03_consistent_tensor.html#boxingautomatic-conversion-of-sbp">Boxing</a> mechanism, which re-distributes data automatically.</p><p>The following is a case of converting a split-distributed Global Tensor to a broadcast-distributed one:</p><pre>import oneflow as flow<br>x = flow.randn(2, 5).cuda()<br>placement = flow.placement(type=&quot;cuda&quot;, ranks=[0, 1])<br>sbp = flow.sbp.split(0)<br>x_global = x.to_global(placement=placement, sbp=sbp)<br>print(x_global.shape) # (4, 5)<br>print(x_global.to_local())<br>sbp_b = flow.sbp.broadcast<br>x_global_b = x_global.to_global(placement=placement, sbp=sbp_b)<br>print(x_global_b.shape) # (4, 5)<br>print(x_global_b.to_local())</pre><p>In the conversion from x_global to x_global_b, the parameter sbp changes from flow.sbp.split(0) to flow.sbp.broadcast. Their global shapes remain (4, 5), but the local component turns from a data slice into the complete data; this change can be seen in the printed results of to_local().</p><p>Here, the to_global conversion merged the Local Tensors. Generally, the SPMD programming mode requires users to write an all-gather collective communication to merge Local Tensors, but in OneFlow’s Global View programming, the type conversion is enough to complete the merge.</p><p>Global Tensor’s type conversion infers and executes the communication operations automatically. So, algorithm developers can concentrate on thinking in data distribution rather than in communication operations; what they imagine is what they obtain, which helps them develop distributed programs more efficiently.</p><p>Let us close this part with how numpy() applies to a Global Tensor. For an arbitrary Global Tensor such as x_global, x_global.numpy() is equivalent to x_global.to_global(sbp=flow.sbp.broadcast).to_local().numpy(): it first converts the original Global Tensor to one whose SBP is flow.sbp.broadcast, then performs a to_local operation, and finally invokes numpy() on that Local Tensor. Therefore, x_global.numpy() obtains the complete data.</p><h3>Global Tensor Participating in Computation</h3><p>This section introduces how a Global Tensor participates in actual computation. Taking matrix multiplication as an example, first create the following program:</p><pre>import oneflow as flow<br>placement = flow.placement(type=&quot;cuda&quot;, ranks=[0, 1])<br>x = flow.randn(4, 5, placement=placement, sbp=flow.sbp.split(dim=0))<br>w = flow.randn(5, 8, placement=placement, sbp=flow.sbp.broadcast)<br>y = flow.matmul(x, w)<br>print(y.is_global)  # True<br>print(y.shape)  # (4, 8)<br>print(y.sbp)  # (flow.sbp.split(dim=0))<br>print(y.to_local().numpy())</pre><p>In the program above, we create two Global Tensors, x and w, which participate in the oneflow.matmul computation and produce y.</p><p>Most of OneFlow’s operators support computing on Global Tensors. When flow.matmul executes on Global Tensors, there is nothing special about its interface. 
In fact, most of OneFlow’s operators are polymorphic: they decide how to compute according to their input:</p><ul><li>If the input of the operator is a Local Tensor, the operator computes in the normal single-device single-GPU execution mode.</li><li>If the input of the operator is a Global Tensor, the operator computes in global view (multi-device multi-GPU) mode.</li></ul><p>Operators supporting polymorphic execution make it very convenient for users to turn single-GPU code into distributed code: they only need to convert the (Local) Tensors the operators accept into Global Tensors.</p><p>Just as single-device execution requires the inputs to live on the same device, in the program above, the premise for the operator executing successfully is that x and w have the same placement.</p><p>The result of the matrix multiplication, y, is also a Global Tensor. When flow.matmul computes x and w, it automatically infers the placement and SBP of the output. The principles are:</p><ul><li>Placement: The input data and the output data have the same placement;</li><li>SBP: The inference rule for the output data’s SBP is decided by the operator type, and this rule is built into OneFlow. For more details, please refer to <a href="https://github.com/Oneflow-Inc/oneflow-documentation/blob/2f0d2f57bced91b75e0a84558a9703c70bc07ca6/en/docs/parallelism/02_sbp.md#sbp-signature">SBP Signature</a>.</li></ul><p>Here, the product of flow.sbp.split(0) and flow.sbp.broadcast is inferred as flow.sbp.split(0): x is a data slice on each rank, w is complete data, and y is a data slice. Anyone familiar with common parallel execution approaches will recognize this as a forward computation with data parallelism: x holds a slice of the data, and w holds the complete parameters.</p><h3>Conclusion</h3><p>This article has discussed:</p><ul><li>Global View offers the SPSD programming view;</li><li>Global Tensor is visible on all processes when being executed;</li><li>Global Tensor and Local Tensor are mutually convertible;</li><li>Global Tensor supports converting its global data distribution type to implement distributed communication;</li><li>OneFlow operators are polymorphic and thus able to execute Global Tensors;</li></ul><p>To wrap up, this article first introduced how to create a Global Tensor and then walked through the detailed steps of a data-parallel computation based on Global Tensor.</p><p>More parallelism modes and SBP’s inference logic will be discussed in our later articles.</p><h3>Further Reading</h3><h3>OneFlow’s multi-machine multi-GPU launching and its required environment variables</h3><p>OneFlow’s Global Tensors are executed in Multi-Client mode, which means each device corresponds to one process: n machines with m GPUs each yield n * m processes. Each process has its own rank ID, which corresponds to the ranks in the Global Tensor&#39;s placement parameter.</p><p>Take 2 machines with 2 GPUs each for example: Machine 0 corresponds to GPU 0 and GPU 1, and Machine 1 corresponds to GPU 2 and GPU 3. So, flow.placement(type=&quot;cuda&quot;, ranks=[2]) identifies GPU 0 on Machine 1.</p><p>Generally, in an n-machine m-GPU environment, flow.placement(type=&quot;cuda&quot;, ranks=[k]) identifies GPU k % m on Machine k / m (integer division); e.g., with m = 2, ranks=[3] is GPU 1 on Machine 1.</p><p>Because Multi-Client mode is adopted, we need to launch one process per device. 
In OneFlow, all processes launch the same script; different processes are distinguished by their process IDs and establish communication according to environment variables.</p><p>Notes on the environment variables:</p><ul><li>MASTER_ADDR: the IP of Machine 0 in multi-machine training;</li><li>MASTER_PORT: the listening port on Machine 0 in multi-machine training; it must not conflict with ports already in use;</li><li>WORLD_SIZE: the number of computing devices in the whole cluster. Because configuring a different number of GPUs per machine is not yet supported, WORLD_SIZE equals the number of machines multiplied by the number of GPUs per machine. In the earlier case, we <a href="https://github.com/Oneflow-Inc/oneflow-documentation/blob/2f0d2f57bced91b75e0a84558a9703c70bc07ca6/en/docs/cookies/global_tensor.md#global-tensor_2">created the Global Tensor</a> in a single-machine 2-GPU environment, so WORLD_SIZE=2;</li><li>RANK: the global process ID across all devices in the whole cluster;</li><li>LOCAL_RANK: the process ID within a single machine;</li></ul><p>Differences between RANK and LOCAL_RANK:</p><ul><li>For single-machine training, whether single-GPU or multi-GPU, RANK equals LOCAL_RANK;</li><li>For multi-machine training, LOCAL_RANK ranges over the computing devices on one machine, while RANK ranges over all computing devices in the cluster; both are numbered from 0, so their maximum values are the respective device counts minus 1.</li></ul><p>Take 2 machines with 2 GPUs each for example; the correspondence between LOCAL_RANK and RANK for each GPU is listed as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/982/1*68uLwVnAdb2kxiestBNG_g.png" /></figure><p>Although launching via environment variables is cumbersome, this approach is widely applicable because users can launch the processes in whatever way they like.</p><p>Besides, OneFlow also offers a convenient tool, <a href="https://github.com/Oneflow-Inc/oneflow-documentation/blob/2f0d2f57bced91b75e0a84558a9703c70bc07ca6/en/docs/parallelism/04_launch.md">oneflow.distributed.launch</a>, to help users launch multiple processes in a distributed way and construct the environment variables automatically.</p>
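<p>For instance, the 2-GPU example above can be launched with one command instead of hand-exported variables (a sketch assuming the launcher&#39;s usual torch-style flags; see the linked document for details):</p><pre>python3 -m oneflow.distributed.launch --nproc_per_node 2 test_randn_global.py</pre>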
</em></strong><a href="https://oneflow2020.medium.com/oneembedding-allows-efficient-training-of-large-recommender-models-with-single-gpu-b54599381d41"><strong><em>OneEmbedding Allows Efficient Training of Large Recommender Models with Single GPU</em></strong></a></p><p><strong><em>2.</em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em> LiBai Model Library to Train Large Models More Easily and Efficiently</em></strong></a></p><p><em>Welcome to visit OneFlow on</em><strong><em> </em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em>GitHub</em></strong></a><strong><em> </em></strong><em>and follow us on</em><strong><em> </em></strong><a href="https://twitter.com/home"><strong><em>Twitter</em></strong></a><strong><em> </em></strong><em>and</em><strong><em> </em></strong><a href="https://www.linkedin.com/authwall?trk=bf&amp;trkInfo=AQGXBC4pruGdygAAAXod79Hw0T9BLNvOQInaaRyy6JZVtaKYCP5E_Z4tdvh9NpIUjtmlg1JfnQplY8qYCR2iz9gpwh7BjldCWYelFgPuBgXy1TCm7YTmhZqEAfyxDtLP1k7-6ew=&amp;originalReferer=&amp;sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Foneflow-inc"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></p><p>Also, welcome to join our <a href="https://discord.gg/4kpjGA5bZY"><strong><em>Discord group</em></strong></a><strong><em> </em></strong>to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=556e6b78df92" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OneEmbedding Allows Efficient Training of Large Recommender Models with Single GPU]]></title>
            <link>https://oneflow2020.medium.com/oneembedding-allows-efficient-training-of-large-recommender-models-with-single-gpu-b54599381d41?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/b54599381d41</guid>
            <category><![CDATA[oneflow]]></category>
            <category><![CDATA[gpu]]></category>
            <category><![CDATA[embedding]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[hugectr]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Thu, 11 Aug 2022 09:02:42 GMT</pubDate>
            <atom:updated>2022-08-11T09:02:42.497Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*p7GEqMdnwgl_PBFJgfVoYQ.png" /></figure><p><strong><em>Written by Zheng Zekang, Guo Ran, Liu Juncheng, Yuan Jinhui; Translated by Hu Yanjun, Shen Jiali, Cheng Haoyuan, Jia Chuan</em></strong></p><p>Personalized recommendations have been a main source of information for people today. Unlike in the past, when people could only access what they wanted by searching, nowadays media outlets can suggest relevant items to users after identifying their fields of interest based on recommendation algorithms.</p><p>For users, recommender systems improve their online experience. For businesses, the quick match between people and their needed information has turned users into customers, and thus gives rise to business empires at trillion-dollar market values.</p><p>Recommender systems are supporting short video feeds, search ads, and online shopping, and the driving factor behind recommender systems is deep learning models.</p><p>However, the accumulation of massive data and the increasingly frequent user data iterations pose serious challenges to the extensibility and training speed of these systems. It turns out that the general deep learning frameworks are not enough to meet the needs of industrial recommender systems. We must further customize general deep learning frameworks or even develop specialized recommender systems.</p><p>To tackle these problems, <strong>the OneFlow team has recently released OneEmbedding, an efficient, extensible, and highly flexible recommender system component. It is as easy to use as general deep learning frameworks. But its performance is much better than that of general frameworks, even surpassing HugeCTR, a specialized recommendation framework developed by NVIDIA.</strong></p><p>Specifically, OneEmbedding outperforms HugeCTR by a significant margin in FP32 and Automatic Mixed Precision (AMP) training on both the DCN and DeepFM models, and delivers equal performance compared to HugeCTR on the DLRM model, which is deeply optimized by HugeCTR to the verge of overfitting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*czcuw_WEZqSWplVEmOSTVg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*ZL-GKOIbxdaXHGxkCPG3QQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*A-0LGVUUdKFZt-UDcsY_5w.png" /><figcaption><em>(All experiments are conducted in the same testing environment: CPU Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz * 2；CPU Memory 1920GB；GPU NVIDIA A100-SXM-80GB * 8；SSD Intel SSD D7P5510 Series 3.84TB * 4)</em></figcaption></figure><p>To train a recommender system model with large TB-level table in OneFlow, users only need to configure the Embedding table with the following snippet:</p><pre># self.embedding = nn.Embedding(vocab_size, embedding_vec_size)<br>self.embedding = flow.one_embedding.MultiTableEmbedding(<br>                     &quot;sparse_embedding&quot;,<br>                     embedding_dim=embedding_vec_size,<br>                     dtype=flow.float,<br>                     key_type=flow.int64,<br>                     tables=tables,<br>                     store_options=store_options,<br>                 )</pre><p>Here are some common cases of search ads recommender models that are constructed with OneEmbedding: <a 
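<p>The snippet above leaves tables and store_options undefined. The examples in the repository linked below construct them roughly as follows; this is a sketch only, and the helper names (make_table_options, make_uniform_initializer, make_cached_ssd_store_options) follow the flow.one_embedding API used there, so please check the repository for the exact signatures:</p><pre>import oneflow as flow<br><br>vocab_size = 1_000_000_000  # e.g., roughly the 1e9 feature IDs of Criteo 1TB<br><br># One options entry per Embedding table; the initializer choice is illustrative<br>tables = [<br>    flow.one_embedding.make_table_options(<br>        flow.one_embedding.make_uniform_initializer(low=-0.05, high=0.05)<br>    )<br>]<br># Hierarchical storage: a GPU cache in front of SSD-backed persistent storage<br>store_options = flow.one_embedding.make_cached_ssd_store_options(<br>    cache_budget_mb=8192,  # GPU cache budget (illustrative)<br>    persistent_path=&quot;/ssd/one_embedding&quot;,  # assumed local SSD path<br>    capacity=vocab_size,<br>)</pre>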
href="https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems"><strong>https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems</strong></a></p><h3>Challenges facing large recommender systems</h3><p>Normally speaking, building a recommender system entails sparse features such as gender, age, and behaviors. Using the feature ID, the system finds the corresponding Embedding vector of the feature in the Embedding vocabulary table via the lookup function, and then passes the vector downstream for further use.</p><p>The popular public dataset Criteo 1TB contains around one billion feature IDs. If embedding_dims is set to 128, then it requires 512 GB storage to accommodate the Embedding parameters. Moreover, if using the Adam optimizer, the required storage will spike to 1536 GB because of the need to store the extra status variables:m and v. In actual application scenarios, the data size can be several orders of magnitude larger than that in Criteo, so it requires much larger space in models.</p><p><strong>A major challenge facing large recommender systems is finding a way to support the lookup and update of large-size Embedding more efficiently and economically.</strong> Different tradeoffs among scale, cost, and efficiency result in the following three common solutions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*loePvqWkmGzCbJJs" /></figure><p>The earliest and most frequently adopted solution is to deploy the Embedding on CPUs, taking advantage of the inexpensively large memory capacity of CPUs to accommodate more parameters. The bright side is that there is nearly no limit for the model size. However, the drawback is pretty huge. The CPU is nowhere near the GPU in terms of computation performance and bandwidth, making Embedding a major bottleneck. That means it will take dozens or even hundreds of CPU servers to support an industrial recommender system.</p><p>Since GPU is a perfect tool for dense computation, some people suggest that we should use GPUs to train large Embedding models. But then we are exposed to one problem: GPUs are expensive and have a small memory size. If we use NVIDIA A100 to train 128-dimensional Embedding vectors based on Criteo, we will need at least 13 of them since each A100 only has a memory size of 40 GB. Distributed Embedding entails a technique that is called “model parallelism”. Ideally, to train larger models, we only need to use more GPUs.</p><p>The fact is, the cost of GPU compared to that of CPU is very high, and the main computation amount in recommerder models is relatively small. Model parallelism only allows larger Embedding scales, but it does not bring much training speed gain. In contrast, it slows the training down due to the introduction of communication between multiple devices, so it is usually only suitable for small-scale clusters.</p><p>In order to increase transmission bandwidths between GPUs, interconnect technologies such as NVSwitch and Infiniband network with higher bandwidths than Ethernet were developed. However, on the one hand, adopting these technologies means additional costs. On the other hand, many users don’t have the infrastructure that meet the conditions for such renovation and upgrading.</p><p>Is it possible to have it both ways?</p><p>Our answer is “yes”. OneFlow has designed OneEmbedding to solve the problems above. <strong>OneEmbedding enables single-GPU training of TB-level models through hierarchical storage and imposes no limit on model size through horizontal expansion. 
Built on OneFlow’s automatic pipelining mechanism, kernel optimizations, and quantized compression of communication, OneEmbedding delivers great performance. As easy to use as PyTorch, it outperforms TorchRec by a factor of 3 on the DLRM model. What’s more, with mixed precision enabled, which TorchRec does not support, OneEmbedding outperforms TorchRec by more than 7x.</strong></p><p><em>(TorchRec’s performance results on 8 A100 GPUs: </em><a href="https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm/#preliminary-training-results"><em>https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm/#preliminary-training-results</em></a><em>)</em></p><h3>Core advantages of OneEmbedding</h3><h4>Hierarchical storage: support single-GPU training of TB-level models</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/882/0*bWA2GERFfqlJ5d4V" /></figure><p>Utilizing the spatial and temporal locality of data, a multi-level cache allows us to strike a good balance between performance and cost. OneEmbedding implements a multi-level cache based on this idea, enabling users to train TB-level models even with only one GPU.</p><p>Users can deploy the Embedding to GPU memory, CPU memory, and even SSD. This solution gives full play to the low-cost advantage of CPU memory and SSD: it can not only expand the scale of Embedding parameters but also use GPU memory as a cache device for better performance.</p><p>OneEmbedding dynamically caches frequently accessed items in GPU memory and evicts less frequently accessed items to underlying storage such as CPU memory or SSD. As long as the data follows a power-law distribution, OneEmbedding can keep the GPU cache hit rate at a high level with an effective cache management algorithm.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/481/1*aC8o805e4MpHC8Eevvqh4A.png" /></figure><p>It is worth noting that OneEmbedding only uses CPU memory and SSD as storage devices; all computations are performed on the GPU. Currently, OneEmbedding provides three preset storage configs:</p><ul><li>Use GPU memory to store all model parameters</li><li>Use CPU memory as the Embedding parameter storage device and GPU memory as a cache</li><li>Use SSD as the Embedding parameter storage device and GPU memory as a cache</li></ul><pre># Use SSD as storage device and GPU as cache<br>store_options = flow.one_embedding.make_cached_ssd_store_options(<br>                    cache_budget_mb=cache_memory_budget_mb,<br>                    persistent_path=persistent_path,<br>                    capacity=vocab_size,<br>                )</pre><p>Users can configure OneEmbedding with just a few lines of code according to their actual hardware, balancing scale, efficiency, and cost at once.</p><p>To hide the latency of fetching data from CPU memory and SSD, OneEmbedding introduces techniques such as pipelining and data prefetching, so training with CPU memory or SSD as the storage backend remains nearly as efficient as pure GPU training.</p><p>We test these three storage configs separately. The test case is consistent with the DLRM model of MLPerf, with a parameter size of about 90GB. When using SSD or CPU memory as the storage device, we configure a GPU cache of 12 GB per GPU. This means only a part of the parameters are stored in GPU memory; the rest are stored in CPU memory or SSD and dynamically swapped into the GPU cache as training goes on.</p>
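<p>For reference, the other two presets in the list above can be configured in the same way as the SSD example. The snippet below is only a minimal sketch: the helper names make_device_mem_store_options and make_cached_host_mem_store_options are assumed to mirror the make_cached_ssd_store_options interface shown earlier, so please check the OneEmbedding API documentation for the exact signatures.</p><pre># Sketch (assumed helpers mirroring make_cached_ssd_store_options)<br># Preset 1: store all Embedding parameters in GPU memory<br>store_options = flow.one_embedding.make_device_mem_store_options(<br>                    persistent_path=persistent_path,<br>                    capacity=vocab_size,<br>                )<br># Preset 2: CPU memory as storage device, GPU memory as cache<br>store_options = flow.one_embedding.make_cached_host_mem_store_options(<br>                    cache_budget_mb=cache_memory_budget_mb,<br>                    persistent_path=persistent_path,<br>                    capacity=vocab_size,<br>                )</pre>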
<p>The test results are shown as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/712/1*M-d-23Qa7iMbv-2q0fdcIw.png" /><figcaption><em>(Testing environment: CPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz * 2; CPU Memory 512GB; GPU NVIDIA A100-PCIE-40GB * 4; SSD Intel SSD D7P5510 Series 7.68TB * 4)</em></figcaption></figure><p>We can tell from the test results that:</p><p>(1) The full GPU memory solution delivers the best performance, but the largest model it can train is only 160 GB in theory, because the GPU memory capacity is only 4 x 40GB.</p><p>(2) Compared with the full GPU memory solution, the GPU cache plus CPU memory solution suffers only a small performance loss, but it raises the ceiling of the parameter size to the CPU memory capacity, often hundreds of GB to several TB.</p><p>(3) If you can accept more performance loss, the GPU cache plus SSD solution raises the ceiling of the parameter size to the SSD capacity, and the largest trainable model can reach tens of TB or more.</p><p>If we want to conduct a complete training of the DLRM model on a server with a single NVIDIA A30–24GB GPU, 24GB of memory obviously can’t directly support training a 90GB model. Instead, with the help of hierarchical storage that specifies CPU memory as the storage device and GPU memory as the cache, training 90GB or larger models is no longer a problem.</p><h4>Scale-out: multi-GPU linear acceleration to break the ceiling of model training</h4><p>Hierarchical storage helps OneEmbedding raise the Embedding parameter ceiling on a single GPU, enabling it to train even TB-level models given sufficient memory capacity. If the model size further grows to significantly exceed the CPU memory capacity, users can utilize OneFlow’s parallelism capability to scale out to multiple nodes and GPUs and train even larger models on top of hierarchical storage.</p><p>In a recommender system, the model parameters are much smaller than the Embedding parameters. Therefore, we generally train the Embedding with model parallelism and the rest of the model with data parallelism. In this way, the multi-node multi-GPU setup can further increase the Embedding size.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/422/1*nWv_PYzQN_4pewPbNZoJOg.png" /></figure><p>The detailed implementation process is: 1) each rank is responsible for storing a part of the Embedding, and the feature IDs are fed into each rank. Because there may be repeated IDs, the duplicates are first eliminated (i.e., ID Shuffle); 2) each rank queries the Embedding with its unique IDs and obtains the corresponding local data. 
Then, after the data distributed across all ranks is merged, each rank gets the complete Embedding data (i.e., Embedding Shuffle); 3) each rank completes the entire model training process in data parallelism.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*rW5ZK2x4lFbYJ4Wnerb4iA.png" /></figure><p>The following figure displays the model’s throughput for nodes with different numbers of GPUs when OneEmbedding adopts the pure GPU memory solution to train the DLRM model (the blue columns denote AMP computation, the orange ones FP32).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6i_cypFL5RIxFjzj" /><figcaption><em>(Testing environment: CPU Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30GHz * 2; CPU Memory 1920GB; GPU NVIDIA A100-SXM-80GB * 8; SSD Intel SSD D7P5510 Series 3.84TB * 4)</em></figcaption></figure><p>As the number of GPUs increases, the model’s throughput soars. In the AMP case, throughput reaches 6 million on a single GPU and climbs to almost 40 million on an 8-GPU node.</p><h4>Pipeline mechanism: automatically overlapping computation and data transfer</h4><p>In the DLRM model, the Dense Features go to the Bottom MLP, while the Sparse Features obtain their corresponding vectors by querying the Embedding. Next, the Sparse Features and the Dense Features are crossed in the Interaction layer and finally enter the Top MLP.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/782/0*VMrCwSBaGgTHLqIf" /></figure><p>Embedding-related operations include Embedding Lookup and Embedding Update. Since OneEmbedding implements a tiered storage mechanism, the feature IDs may not hit the cache, which can slow training because pulling the data takes longer.</p><p>To avoid this, OneEmbedding adds an Embedding Prefetch operation to ensure that both Embedding Lookup and Embedding Update are performed on the GPU. Since the data prefetch for the next iteration does not depend on the current one, the Embedding data needed for the next iteration can be prefetched while the current iteration is being computed, so computation and prefetching proceed at the same time.</p><p>While the Embedding data is being queried and exchanged, the Dense Features, which are unrelated to the Embedding operation, can enter the Bottom MLP for computation, overlapping with the former two operations in time. The full timing of the overlapped execution is shown in the figure below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*iyVGuYNADoE69tfT" /></figure><p>Controlling such a complex data pipeline remains a major challenge for traditional deep learning frameworks. Besides, in actual recommendation scenarios, user data changes constantly, which requires the pipeline mechanism to handle dynamic data.</p><p>The Actor mechanism of OneFlow makes all these problems simple. Each Actor implements distributed coordination via its internal state machine and message mechanism. By dispatching multiple storage blocks to each Actor, different Actors can work at the same time, forming a pipeline between Actors by overlapping their respective working times. 
We only need to dispatch the Embedding operations to a single stream, and the system builds the pipeline automatically.</p><h4>Kernel optimization: approaching the GPU’s optimal performance</h4><p>OneFlow has not only deeply optimized general operators but also implemented multiple high-performance CUDA operators tailored to the characteristics of popular recommender system models.</p><p>For the feature crosses in DLRM and DCN, OneFlow has implemented the FusedDotFeatureInteraction and FusedCrossFeatureInteraction operators respectively.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/507/1*tJ-2D30lRtomsYZytUJmxQ.png" /><figcaption><em>(FusedCrossFeatureInteraction operator, picture from “Deep &amp; Cross Network for Ad Click Predictions”)</em></figcaption></figure><p>For the multiple fully connected layers in the model, OneFlow has implemented the FusedMLP operator based on the cublasLt library.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z9iOMvOVtJqpVLDk9vFv4w.png" /></figure><p>For fully connected layers followed by a Dropout operation, OneFlow has deeply customized a ReluDropout operation. Specifically, OneFlow stores the forward mask as a bitmask and fuses the backward operators by setting the parameter alpha to dropout_scale in the cublasLt matrix multiplication for backpropagation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tEYYYgZ_POcfqawGg2Chbw.png" /></figure><h4>Embedding quantization: improving communication efficiency</h4><p>Much effort has recently gone into quantizing and compressing the data communicated during model training to reduce communication volume and improve efficiency. This feature is also available in OneEmbedding.</p><p>In parallel training, each rank needs to exchange its Embedding data via communication. We first convert the data type from floating point to int8, and after the data is exchanged we restore it via dequantization.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/963/1*RsfbTEajYfvecwNmvNH_ag.png" /></figure><p>The following figure shows the DLRM model’s throughput before and after quantization, in FP32 and AMP respectively. Note that the DLRM model is trained on a machine adopting the full GPU memory solution.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/789/1*wXHPT26PHI0xZVIRY1aVCQ.png" /></figure><p>Model precision comparison before and after quantization (AUC):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XdImHgO96bfHwmTa" /><figcaption><em>(Testing environment: CPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz * 2; CPU Memory 512GB; GPU NVIDIA A100-PCIE-40GB * 4; SSD Intel SSD D7P5510 Series 7.68TB * 4)</em></figcaption></figure><p>The test results demonstrate that, without loss of model precision, quantized communication raises throughput by 64% in the FP32 case and by 13% in the AMP case compared with the default communication mode.</p><h4>Easy-to-use: build large-scale recommendation models as easily as in PyTorch</h4><p>OneEmbedding is now built into OneFlow as an internal extension component. 
That means users can enjoy the flexibility of the general OneFlow framework to build their recommendation models while using OneEmbedding’s advanced features.</p><pre>class DLRMModule(nn.Module):<br>    def __init__(self, args):<br>        super(DLRMModule, self).__init__()<br>        self.bottom_mlp = FusedMLP(...)<br>        self.embedding = OneEmbedding(...)<br>        self.interaction = FusedDotInteraction(...)<br>        self.top_mlp = FusedMLP(...)<br>    def forward(self, sparse_feature, dense_feature): <br>        dense_fields = self.bottom_mlp(dense_feature)<br>        embedding = self.embedding(sparse_feature)<br>        features = self.interaction(dense_fields, embedding)<br>        return self.top_mlp(features)</pre><p>Finally, it is worth mentioning that OneEmbedding encodes the feature IDs via a built-in encoding mechanism and supports the dynamic insertion of new data. There is no need to plan the Embedding capacity in advance or specially process the feature IDs in the dataset. This dynamic mechanism is designed to support various incremental training scenarios while reducing the usage burden.</p><p>At the moment, OneFlow’s model library provides a collection of models constructed with OneEmbedding, such as DLRM, DeepFM, xDeepFM, DCN, PNN, and MMoE (<a href="https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems">https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems</a>). In the future, we will add more recommender models to it.</p><h3>Conclusion</h3><p>OneEmbedding is a component designed to train large-scale recommender system models, and its features, spanning flexible tiered storage, a highly optimized data pipeline, and easy scale-out, enable users to easily train TB-level recommender models.</p><p>Currently, OneFlow provides model samples so you can try out OneEmbedding with one click. Later, OneFlow will also launch Flow-Recommender, a recommender system model library covering all mainstream models. 
It supports both distributed training and inference, so stay tuned.</p><ul><li>OneEmbedding repository: <a href="https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems"><strong>https://github.com/Oneflow-Inc/models/tree/main/RecommenderSystems</strong></a></li><li>OneEmbedding documentation: <a href="https://docs.oneflow.org/master/cookies/one_embedding.html"><strong>https://docs.oneflow.org/master/cookies/one_embedding.html</strong></a></li><li>OneEmbedding API documentation: <a href="https://oneflow.readthedocs.io/en/master/one_embedding.html"><strong>https://oneflow.readthedocs.io/en/master/one_embedding.html</strong></a></li></ul><p><em>Welcome to visit OneFlow on</em><strong><em> </em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em>GitHub</em></strong></a><strong><em> </em></strong><em>and follow us on</em><strong><em> </em></strong><a href="https://twitter.com/home"><strong><em>Twitter</em></strong></a><strong><em> </em></strong><em>and</em><strong><em> </em></strong><a href="https://www.linkedin.com/authwall?trk=bf&amp;trkInfo=AQGXBC4pruGdygAAAXod79Hw0T9BLNvOQInaaRyy6JZVtaKYCP5E_Z4tdvh9NpIUjtmlg1JfnQplY8qYCR2iz9gpwh7BjldCWYelFgPuBgXy1TCm7YTmhZqEAfyxDtLP1k7-6ew=&amp;originalReferer=&amp;sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Foneflow-inc"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></p><p>Also, welcome to join our <a href="https://discord.gg/4kpjGA5bZY"><strong><em>Discord group</em></strong></a><strong><em> </em></strong>to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b54599381d41" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LiBai Model Library to Train Large Models More Easily and Efficiently]]></title>
            <link>https://oneflow2020.medium.com/libai-model-library-to-train-large-models-more-easily-and-efficiently-15637c8876eb?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/15637c8876eb</guid>
            <category><![CDATA[megatron]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[transformers]]></category>
            <category><![CDATA[hugging-face]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Tue, 09 Aug 2022 16:40:39 GMT</pubDate>
            <atom:updated>2022-08-09T16:40:39.847Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/856/0*Lbbgdq9NFNyogu3H" /></figure><p><strong><em>Translated by Hu Yanjun, Shen Jiali, Dong Wenwen, Jia Chuan</em></strong></p><p>Starting with BERT in 2018, large models sprang up one after another, including GPT-3 and ViT, whose parameters are counted in billions. Explosive growth in model size happens so frequently that it hardly impresses AI developers anymore. What really troubles the engineers is how to accelerate the training of such large models.</p><p>Larger models come with much higher training costs and pose greater challenges to computation and memory resources. For instance, training GPT-3, a model containing over 100 billion parameters, with a single state-of-the-art NVIDIA A100 GPU would take more than 100 years.</p><p>Larger models demand more GPU memory, but GPU memory capacity is not growing fast enough to keep up. According to a report by OpenAI, AI model size is doubling every 3.5 months, greatly outpacing GPU memory, which doubles only every 18 months. This means that a single GPU can no longer accommodate the numerous parameters of large models.</p><p>Therefore, developers have to split the computation across multiple GPU devices. Distributed training becomes an inevitable choice.</p><p>However, distributed training has a high technical threshold. It takes more than abundant computing resources: programming for distributed parallel training requires expertise in computer systems and architecture and rich hands-on experience, which increases the difficulty of exploring cutting-edge algorithms and new models. Thus, the development and training of large models become exclusive to tech giants. It’s of top priority to accelerate model training and make large models accessible to more engineers.</p><p><strong>But with all the model libraries available for distributed training, which one should we choose?</strong></p><p>Luckily, <a href="https://github.com/Oneflow-Inc/oneflow"><strong>OneFlow</strong></a> has released its <a href="https://github.com/Oneflow-Inc/libai"><strong>LiBai</strong></a> model library recently, making it easier to answer the above question. OneFlow is an open-source deep learning framework known for its excellent performance. <strong>The LiBai model library gathers the merits of mainstream Transformers libraries spanning Hugging Face, Megatron-LM, DeepSpeed, and FairSeq, and outperforms many of its competitors in distributed training. More importantly, its Global View Programming has lowered the bar for distributed training, thus allowing more developers to train large-scale models.</strong></p><p>Find out more about the LiBai model library: <a href="https://github.com/Oneflow-Inc/libai"><strong>https://github.com/Oneflow-Inc/libai</strong></a>.</p><p>So, how does LiBai make its way towards excellence? In the rest of this article, we will compare LiBai with other distributed training tools in terms of training performance, ease of use, and so forth, to give you a reference for your next choice.</p><h3>1. One-click auto distributed training with better performance than Megatron-LM and DeepSpeed</h3><p>Specifically, as a simple and efficient distributed model training toolkit, the LiBai library boasts the following six features:</p><ul><li><strong>Easy scaling from single-GPU training to multi-GPU training. 
</strong>The models in LiBai are aligned with those in PyTorch, which saves you the trouble of learning and getting used to new operating styles. With LiBai, scaling to parallel training only requires simple configuration. This means if users want to add a new feature to their models and put it into distributed training, all they need is to add it and debug it in the single-GPU training code, and LiBai will take care of the rest. What’s more, if users want to save effort by skipping the configuration step for distributed training, they can simply install the Auto_Parallel package (<a href="https://libai.readthedocs.io/en/latest/tutorials/basics/Auto_Parallel.html"><em>https://libai.readthedocs.io/en/latest/tutorials/basics/Auto_Parallel.html</em></a>) and add one line of configuration in LiBai: graph.auto_parallel = True. In this way, they can concentrate on the models themselves without worrying about the implementation details of distributed training while benefiting from quick training speeds.</li><li><strong>Compatibility with Hugging Face. </strong>OneFlow is highly compatible with PyTorch in the API layer. Users can import Hugging Face models with a simple modification of the code. They can easily train a large model via import oneflow as torch, utilizing mechanisms in LiBai such as Data Parallelism, Automatic Mixed Precision, Activation Checkpointing, and Zero Redundancy Optimizer (ZeRO). To train large models with 3D parallelism, users only need to replace a few layers of the model with the layers in LiBai.</li><li><strong>Modular design. </strong>In the implementation of LiBai, we offer not only reusable basic computation modules for model construction but also abstraction and modularization for data loading, training logic, metric computation, and so on. The modular design allows users to override code and integrate it as plug-ins into LiBai’s training system to cater to their own needs.</li><li><strong>Out-of-the-box.</strong> Training a large-scale model usually requires a common series of techniques, and LiBai supports features spanning Mixed Precision Training, Gradient Re-computation, Gradient Accumulation, and ZeRO, which can be easily used in combination with data parallelism, model parallelism, and pipeline parallelism.</li><li><strong>Rapid reproduction of experiments.</strong> The OneFlow team borrowed from Detectron2’s LazyConfig (<a href="https://github.com/facebookresearch/detectron2/blob/main/docs/tutorials/lazyconfigs.md"><em>https://github.com/facebookresearch/detectron2/blob/main/docs/tutorials/lazyconfigs.md</em></a>) when constructing LiBai’s configuration system. That’s why LiBai has a more flexible configuration system than traditional argparse- and yacs-based methods. The system is written in Python syntax, so it’s convenient to add new parameters and modules: adding a new module only requires importing it. Besides, the training configuration can also be serialized to a yaml file for storage, so users can conveniently search for configuration items in the file by keyword. In addition, if users want to reproduce the result of a previous experiment, they can directly import the config.yaml as the training configuration. 
In this way, LiBai avoids keeping multiple script files around, which would make it hard to track effective modifications and risk confusion between different experiment configurations.</li><li><strong>High efficiency.</strong> Through strict kernel alignment with Megatron-LM, LiBai has implemented various kinds of kernel fusion operations. Besides, benefiting from OneFlow’s static graph design, LiBai surpasses NVIDIA’s Megatron-LM and Microsoft’s DeepSpeed in terms of single-GPU performance and the efficiency of different mixed parallelism methods.</li></ul><p>Thanks to OneFlow SBP’s native support for various parallelism technologies, LiBai decouples the algorithm description from the underlying parallelism, realizing the same features with much less code.<strong> It takes NVIDIA Megatron-LM and Microsoft DeepSpeed 100,000 lines of code in total to do what LiBai can do with only around 30,000 lines of code.</strong></p><p>Data speaks for itself. The following shows how LiBai and Megatron-LM perform on various models in the same hardware environments, third-party dependencies (CUDA, cuDNN, etc.), parameters, and network structures. (All performance results are public and reproducible, <a href="https://libai.readthedocs.io/en/latest/tutorials/get_started/Benchmark.html"><em>https://libai.readthedocs.io/en/latest/tutorials/get_started/Benchmark.html</em></a>). In the future, OneFlow will release the performance of LiBai on a larger cluster of devices.</p><ul><li>Megatron-LM commit: <a href="https://github.com/NVIDIA/Megatron-LM/commit/e156d2fea7fc5c98e645f7742eb86b643956d840">https://github.com/NVIDIA/Megatron-LM/commit/e156d2fea7fc5c98e645f7742eb86b643956d840</a></li><li>LiBai commit: <a href="https://github.com/Oneflow-Inc/libai/commit/9fc504c457da4fd1e92d854c60b7271c89a55222">https://github.com/Oneflow-Inc/libai/commit/9fc504c457da4fd1e92d854c60b7271c89a55222</a></li><li>OneFlow commit: <a href="https://github.com/Oneflow-Inc/oneflow/commit/55b822e4d3c88757d11077d7546981309125c73f">https://github.com/Oneflow-Inc/oneflow/commit/55b822e4d3c88757d11077d7546981309125c73f</a></li></ul><h4>Data parallelism</h4><blockquote><em>Note: Here are the meanings of the parameters involved:</em></blockquote><blockquote><em>DP: Data Parallelism</em></blockquote><blockquote><em>MP: Model Parallelism</em></blockquote><blockquote><em>PP: Pipeline Parallelism</em></blockquote><blockquote><em>2D: 2D Parallelism</em></blockquote><blockquote><em>3D: 3D Parallelism</em></blockquote><blockquote><em>fp16: enable automatic mixed precision (amp) training</em></blockquote><blockquote><em>nl: num layers (When pipeline parallel size = 8, in order to have a relative number of layers per stage for computation, we adjust the num layers from 24 to 48.)</em></blockquote><blockquote><em>ac: enable activation checkpointing</em></blockquote><blockquote><em>mb: micro-batch size per gpu</em></blockquote><blockquote><em>gb: global batch size total</em></blockquote><blockquote><em>dxmxp:</em></blockquote><blockquote><em>d = data-parallel-size</em></blockquote><blockquote><em>m = tensor-model-parallel-size</em></blockquote><blockquote><em>p = pipeline-model-parallel-size</em></blockquote><blockquote><em>1n1g: 1 node, 1 GPU</em></blockquote><blockquote><em>1n8g: 1 node, 8 GPUs</em></blockquote><blockquote><em>2n8g: 2 nodes, 8 GPUs per node (16 GPUs in total)</em></blockquote><blockquote><em>4n8g: 4 nodes, 8 GPUs per node (32 GPUs in total)</em></blockquote><blockquote><em>Throughout the experiments, grad_acc_num_step = 
global_batch_size / (micro_batch_size * data_parallel_size).</em></blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JOW5jFVN15xDhBZl" /><figcaption>(<em>Note: In Group 1, num layers = 24, amp enabled, 1n1g micro-batch size = 24; in Group 2~5, micro-batch size = 16.</em>)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*hBEHX4JWo5bQaMXM" /><figcaption>(<em>Note: In Group 1, num layers = 24, amp enabled, 1n1g micro-batch size = 6; in Group 2~5, micro-batch size = 4.</em>)</figcaption></figure><h4>Model Parallelism</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9iiaf83UlgRSS6S9" /><figcaption>(<em>Note: num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 128, global batch size = 1024 and grad acc step = 8.</em>)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x0g9xajre1fSZGi71adHGg.png" /><figcaption>(<em>Note: num layers = 24, amp enabled.</em>)</figcaption></figure><h4>Pipeline Parallelism</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9enTE1bUtkaoC4KA" /><figcaption>(<em>Note: In Group 1&amp;2, num layers = 24, grad acc step = 8; in Group 3, num layers = 48, grad acc step = 16. Both amp and activation checkpointing are enabled in all 3 groups.</em>)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*19WckUVSFaetYQY7eCpMOA.png" /><figcaption>(<em>Note: In Group 1&amp;2, num layers = 24, grad acc step = 8; in Group 3, num layers = 48, grad acc step = 16. Both amp and activation checkpointing are enabled in all 3 groups.</em>)</figcaption></figure><h4>2D Parallelism</h4><p>Data &amp; Model Parallelism</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/794/0*ZUwgEAyp2bG_Lqq8" /><figcaption>(<em>Note: num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 128, grad acc step = 8.</em>)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/794/0*Gh13vXN_LntlBb8G" /><figcaption>(<em>Note: num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 32, grad acc step = 8.</em>)</figcaption></figure><p>Data &amp; Pipeline Parallelism</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FnAFTWdRUlfvL56G" /><figcaption>(<em>Note: num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 128, grad acc step = 8.</em>)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jpNbVUfqlvK4SqRd" /><figcaption>(<em>Note: num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 32, grad acc step = 8.</em>)</figcaption></figure><h4>3D Parallelism</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DRBL_8OzkdOXdlDxutxADw.png" /><figcaption>(<em>Note: num layers = 24, amp enabled, activation checkpointing enabled, grad acc step = 8.</em>)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/762/0*UzIRMlfcLIlUGipG" /><figcaption>(<em>Note: num layers = 24, amp enabled, activation checkpointing enabled, grad acc step = 8.</em>)</figcaption></figure><p>As is shown above, <strong>the training speeds of LiBai exceed those of Megatron-LM on both BERT and GPT-2 models in every experiment, under strictly aligned experimental environments.</strong></p><h3>2. 
LiBai: more and better</h3><p>As we mentioned, there are currently plenty of large model training solutions, such as Hugging Face, DeepSpeed, Megatron-LM, and FairSeq. Do we really need another model library?</p><p>To answer this question, let’s see what LiBai has to offer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qa1fo-PoAHBlH72u" /></figure><p><strong>HuggingFace</strong>: It provides all kinds of SOTA Transformer models, which are pretrained and only require some fine-tuning before being put into use. It also has a well-developed community and ecosystem to support developers. However, it only supports data parallelism, which makes it less handy when the model size exceeds the memory capacity of a single GPU. Plus, training models from scratch with Hugging Face is slow.</p><p><strong>FairSeq</strong>: It is targeted at sequence models and lacks support for CV models under the current merging trend of NLP and CV.</p><p><strong>Megatron-LM</strong>: Based on PyTorch, it implements data parallelism, model parallelism, and pipeline parallelism and delivers high performance. It can handle the training of super large-scale models.</p><p>However, it requires too much customization, making it unfriendly to algorithm engineers who are not distributed training experts. In addition, it provides far fewer models than Hugging Face does, so if engineers want to reproduce a large model in PyTorch, they can only wait until that model is implemented on top of Megatron-LM by someone more adept at distributed training.</p><p><strong>DeepSpeed</strong>: It is a PyTorch-based library deeply customized for model memory optimization. It supports technologies including distributed training, mixed precision training, and ZeRO, so it can largely reduce memory overhead and allow effective training of large models under data parallelism. However, DeepSpeed does not support model parallelism. Model parallelism (tensor parallelism, pipeline parallelism) is a better choice when a single GPU cannot accommodate the parameters of certain layers of the model, or when communication efficiency is dragged down by DeepSpeed’s sharding. Thus, to meet their own needs, users can only use DeepSpeed in combination with Megatron-LM and change the original code.</p><p>Megatron-LM and DeepSpeed are the earliest libraries for large model training in the PyTorch ecosystem. Some renowned organizations around the world joined the arena later and launched libraries such as FairSeq. <strong>But it’s noteworthy that the core distributed functions of the later libraries are all implemented based on Megatron-LM and DeepSpeed.</strong></p><p><strong>LiBai, instead of being a slightly upgraded version of any of the above-mentioned libraries, is a useful kit for large pretrained model development that is built on the outstanding distributed training and graph compiler performance of OneFlow. That’s why it boasts incomparable performance and ease of use in distributed training</strong>.</p><ul><li>Compatibility. LiBai is compatible with existing PyTorch-based SOTA models, so users can transfer models from PyTorch conveniently.</li><li>High Efficiency. LiBai delivers high efficiency in both single-GPU and multi-GPU training.</li><li>Ease of Use. With good extensibility, LiBai allows users to easily modify the models based on their own needs or add new features to models to speed up their prototype development work. 
It lowers the bar for distributed deep learning training so greatly that users don’t need a painful learning curve to get on board. When developing new models and new features, all you need to do is program for single-GPU training, and LiBai will help scale it to large GPU clusters for distributed training, so you don’t have to rewrite any code. What a time saver!</li></ul><p>We believe all these traits make LiBai a wise choice for distributed training.</p><h3>3. LiBai supports all regular parallel training methods</h3><p>Distributed large model training entails multiple parallel methods, including data parallelism, tensor/model parallelism, and pipeline parallelism. LiBai supports all these methods and any combination of them. (For more information on the parallel methods, please refer to: <a href="https://docs.oneflow.org/en/master/parallelism/01_introduction.html"><em>https://docs.oneflow.org/en/master/parallelism/01_introduction.html</em></a>)</p><p>It’s always a headache to learn to implement new parallel methods by yourself. For example, people had to go through all the trouble of configuring Apex to enable AMP training, DALI to support data loading pipelines, and DeepSpeed to use ZeRO to reduce memory usage. However, with LiBai, you have no such worries since it is already packed with various parallel methods and great extensibility.</p><p>The following shows how to implement parallelism with LiBai through concrete examples.</p><h4>A general way to implement parallelism</h4><p>Via the SBP interface of OneFlow, users can easily shard the input data or weights in the neural network based on their needs and GPU arrangements to implement data parallelism or tensor parallelism.</p><p>libai.layers incorporates a series of network layers, including the frequently-used Linear, MLP, and Transformer modules, which automatically adapt to various parallelism strategies. Therefore, when constructing a neural network via libai.layers, all you need is to adjust the distributed training hyperparameters in the configuration files. Then you can easily implement training strategies including data parallelism, tensor parallelism, and data &amp; tensor mixed parallelism.</p><p>The format of distributed configuration is as follows:</p><pre># configs/common/train.py<br># Distributed arguments<br>dist=dict(<br>        data_parallel_size=1,<br>        tensor_parallel_size=1,<br>        pipeline_parallel_size=1,<br>)</pre><p>data_parallel_size and tensor_parallel_size are used to determine how the input data and model weights will be sharded on different GPU groups. In the above snippet, all values are set to 1, which means training on a single GPU.</p>
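<p>To make the earlier point about libai.layers concrete, here is a minimal sketch of a module written once and re-sharded purely through the dist configuration above. It uses only the Linear layer and its layer_idx argument, both of which appear later in this article; TinyHead itself is a hypothetical example, and any other libai.layers signatures should be checked against the LiBai documentation:</p><pre># A toy two-layer head built from libai.layers (sketch).<br># The same code runs under data, tensor, or mixed parallelism;<br># only the dist dict in the config file changes, never this module.<br>from oneflow import nn<br>from libai.layers import Linear<br><br>class TinyHead(nn.Module):<br>    def __init__(self, hidden_size, num_classes):<br>        super().__init__()<br>        # layer_idx marks the pipeline stage each layer belongs to<br>        self.proj = Linear(hidden_size, hidden_size, layer_idx=0)<br>        self.head = Linear(hidden_size, num_classes, layer_idx=-1)<br>    def forward(self, x):<br>        return self.head(self.proj(x))</pre><p>Under the 8-GPU configurations discussed next, LiBai decides from the config alone how the weights of these Linear layers are sharded across devices.</p>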
<p>Now imagine that a user has 8 GPUs. In what follows, we show how to modify the configuration files to implement data parallelism, tensor parallelism, and pipeline parallelism on 8 GPUs.</p><p>This document provides detailed instructions about distributed configuration in LiBai:</p><p><a href="https://libai.readthedocs.io/en/latest/tutorials/basics/Distributed_Configuration.html"><em>https://libai.readthedocs.io/en/latest/tutorials/basics/Distributed_Configuration.html</em></a></p><h4>Data parallelism &amp; model parallelism</h4><p>To implement data (or model) parallel training on 8 GPUs, users only need to override the hyperparameters of distributed training in the configuration files.</p><ul><li>Data parallelism</li></ul><pre># your config.py<br>from libai.config import get_config<br>train = get_config(&quot;common/train.py&quot;).train<br>train.dist.data_parallel_size = 8</pre><p>In data parallel training, each rank holds a copy of the same model and processes a part of the input data.</p><ul><li>Model parallelism</li></ul><pre># your config.py<br>from libai.config import get_config<br>train = get_config(&quot;common/train.py&quot;).train<br>train.dist.tensor_parallel_size = 8</pre><p>In model parallel training, the model is partitioned across the 8 GPUs, with each GPU holding only a part of the model.</p><h4>Data &amp; model mixed parallelism</h4><p>To implement data &amp; model mixed parallelism on 8 GPUs, users only need to make simple modifications to the distributed training parameters in the configuration files.</p><pre># your config.py<br>from libai.config import get_config<br>train = get_config(&quot;common/train.py&quot;).train<br>train.dist.data_parallel_size = 2<br>train.dist.tensor_parallel_size = 4</pre><p>In this case, LiBai will automatically divide the GPUs into groups. We number the 8 GPUs from “0” to “7”. When data_parallel_size is set to &quot;2&quot; and tensor_parallel_size is set to &quot;4&quot;, the system will divide the 8 GPUs into two groups: [[0, 1, 2, 3], [4, 5, 6, 7]], with [0, 1, 2, 3] being one group and [4, 5, 6, 7] the other. Data parallelism will be implemented across the two groups and model parallelism will be implemented within each group.</p><h4>Configuration of pipeline parallelism</h4><p>In essence, pipeline parallelism can be explained as follows: 1) the neural network is divided into stages; 2) each stage is distributed to one GPU; 3) the computation result of one stage is passed to the next stage for further computation, which works like an assembly line. For more information about pipeline parallelism, please check: <a href="https://docs.oneflow.org/en/master/parallelism/01_introduction.html">https://docs.oneflow.org/en/master/parallelism/01_introduction.html</a></p><ul><li>Configuration of naive pipeline parallelism</li></ul><p>In LiBai, you can assign different layers of the network to different GPUs by setting the placement parameters. You can easily set values for the placement parameters via the get_layer_placement() interface in libai.utils.distributed. LiBai can automatically partition stages and assign placements to stages according to the distributed configuration in the configuration file (config). Therefore, to configure pipeline parallelism, you only need to configure the placement for each layer of the network.</p><p>In most networks, a Linear layer is often used as the head of the network to produce the final results for classification or other tasks. 
Therefore, here we take the Linear layer as an example to introduce the simplest pipeline parallelism configuration method in LiBai:</p><pre>from libai.layers import Linear<br>self.head = Linear(hidden_size, num_classes)</pre><ul><li>Configure the placement of network modules</li></ul><p>There are two ways to assign a network layer to the corresponding placement in LiBai:</p><ol><li>Manually specify the placement via the to_global interface and get_layer_placement(). In the following snippet, get_layer_placement(-1) means that the head layer is assigned to the last placement.</li></ol><pre>from libai.layers import Linear<br>import libai.utils.distributed as dist<br>self.head = Linear(hidden_size, num_classes).to_global(placement=dist.get_layer_placement(-1))</pre><ol start="2"><li>(Recommended) Modules implemented in libai.layers come with the layer_idx parameter, so we can specify the placement of a layer by directly setting its layer_idx parameter.</li></ol><pre>from libai.layers import Linear<br>self.head = Linear(hidden_size, num_classes, layer_idx=-1)</pre><ul><li>Configure the placement of input data</li></ul><p>After configuring the placement of the modules in the network, users need to specify the placement of the input data, because the calculation can only be carried out when the input and the network are in the same stage. The most intuitive way is to configure the same placement for the input and the network, which can be done via to_global with get_layer_placement():</p><pre>class MyModule(nn.Module):<br>    def __init__(self, ... *, layer_idx):<br>        ...<br>        self.layer_idx = layer_idx<br>        ...<br>    def forward(self, input_data):<br>        input_data = input_data.to_global(placement=dist.get_layer_placement(self.layer_idx))<br>        ...</pre><ul><li>Implement naive pipeline parallelism easily with configuration files</li></ul><p>After configuring the placement of the network layers and the input data, users only need to adjust the configuration file (config) to implement pipeline parallelism. Users need to know the number of network layers beforehand, and adjust the pipeline_num_layers in the configuration file:</p><pre># set the number of pipeline stages to be 2<br>train.dist.pipeline_parallel_size = 2<br># set model layers for pipeline<br>train.dist.pipeline_num_layers = hidden_layers</pre><p>1F1B is a new pipeline parallel training method introduced in the PipeDream paper (<a href="https://arxiv.org/pdf/1806.03377.pdf"><em>https://arxiv.org/pdf/1806.03377.pdf</em></a>): each stage alternates one forward pass with one backward pass, so activations for only a bounded number of micro-batches are kept in memory at once, which saves GPU memory and utilizes resources more efficiently. 
LiBai supports the 1F1B strategy in an easy way (<a href="https://github.com/Oneflow-Inc/libai/blob/main/docs/source/tutorials/advanced_tutorials/customize_dataloader.md"><em>https://github.com/Oneflow-Inc/libai/blob/main/docs/source/tutorials/advanced_tutorials/customize_dataloader.md</em></a>).</p><h4>The realization of 3D parallelism</h4><p>After mastering data &amp; model mixed parallelism and pipeline parallelism, you only need to combine the above-mentioned configuration changes to realize data + model + pipeline parallelism.</p><pre># your config.py<br>from libai.config import get_config<br>train = get_config(&quot;common/train.py&quot;).train<br>train.dist.data_parallel_size = 2<br>train.dist.tensor_parallel_size = 2<br>train.dist.pipeline_parallel_size = 2<br>hidden_layers = 8 # Layers of the network<br>train.dist.pipeline_num_layers = hidden_layers</pre><p>Again, let’s take 8 GPUs as an example. After setting data_parallel_size, tensor_parallel_size, and pipeline_parallel_size to &quot;2&quot;, the model will automatically be divided across the 8 GPUs according to the pipeline_num_layers set by users.</p><p>With the above-mentioned configuration, the model will be partitioned into two stages that are implemented by GPU [0, 1, 2, 3] and [4, 5, 6, 7], respectively. In Stage 0, GPU [0, 2] and [1, 3] will implement data parallelism, and GPU [0, 1] and [2, 3] will implement model parallelism. In Stage 1, GPU [4, 6] and [5, 7] will implement data parallelism, and GPU [4, 5] and [6, 7] will implement model parallelism.</p><h4>Custom parallel training</h4><p>As described above, LiBai provides encapsulated modules for users to call in libai/layers/. Using these modules as building blocks, users can construct their own parallel networks.</p><p>When the modules in LiBai are not enough to meet their needs, users can customize the parallel strategy conveniently. In PyTorch, users need to insert a complex series of communication operations such as scatter -&gt; forward -&gt; reduce, but in LiBai, users only need to define the sbp and placement when initializing the tensor. This makes implementing parallelism as easy as running code on a stand-alone device. (For details of sbp and placement, please refer to <a href="https://docs.oneflow.org/en/master/parallelism/04_2d-sbp.html">https://docs.oneflow.org/en/master/parallelism/04_2d-sbp.html</a>).</p><p>For example, when a user performs 4-GPU training, the intermediate result of the network contains a 2D parallel tensor in the shape of (16, 8), which is divided across the 4 GPUs as shown below. In LiBai, the placement distribution of that tensor is ranks = [[0, 1], [2, 3]], and the SBP is (S[0], S[1]) or (S[1], S[0]).</p><pre>[            |   <br>    X00 gpu0 |  X01 gpu1<br>--------------------------<br>    X10 gpu2 |  X11 gpu3<br>             |           ]</pre><p>Among them, the shapes of Xij are all (8, 4), which means the tensor is evenly distributed across the GPUs. If you want to add some random noise to this tensor, you can easily add the following code in LiBai:</p><blockquote><em>dist.get_nd_sbp() is encapsulated in LiBai to be compatible with the requirements of 1D parallelism, and </em><em>dist.get_layer_placement() facilitates the configuration of pipeline parallelism. 
In most cases, users can directly refer to the following code:</em></blockquote><pre># test.py<br>import oneflow as flow<br>from omegaconf import DictConfig<br>from oneflow import nn<br>from libai.utils import distributed as dist<br>cfg = DictConfig(<br>    dict(data_parallel_size=2, tensor_parallel_size=2, pipeline_parallel_size=1))<br>dist.setup_dist_util(cfg)<br>class Noise(nn.Module):<br>    def __init__(self):<br>        super().__init__()<br>        self.noise_tensor = flow.randn(<br>            16, 8,<br>            sbp=dist.get_nd_sbp([flow.sbp.split(0), flow.sbp.split(1)]),<br>            placement=dist.get_layer_placement(layer_idx=0)<br>        )<br>        # Or the following instead<br>        # self.noise_tensor = flow.randn(<br>        #     16, 8,<br>        #     sbp=(flow.sbp.split(0), flow.sbp.split(1)),<br>        #     placement=flow.placement(&quot;cuda&quot;, ranks=[[0, 1],[2, 3]])<br>        # )<br>    def forward(self, x):<br>        return x + self.noise_tensor<br>noise = Noise()  # instantiate under a new name to avoid shadowing the class<br>x = flow.zeros(<br>    16, 8,<br>    sbp=(flow.sbp.split(0), flow.sbp.split(1)),<br>    placement=flow.placement(&quot;cuda&quot;, ranks=[[0, 1],[2, 3]])<br>)<br>y = noise(x)<br>print(f&quot;rank: {flow.env.get_rank()}, global tensor: shape {y.shape} sbp {y.sbp} placement {y.placement}, local tensor shape: {y.to_local().shape}&quot;)</pre><p>Run command:</p><pre>python3 -m oneflow.distributed.launch --nproc_per_node 4 test.py</pre><p>The output is shown below. From the shape, you can see the distribution of the tensor across the ranks as well as the tensor’s information from the global perspective.</p><pre>rank: 2, global tensor: shape oneflow.Size([16, 8]) sbp (oneflow.sbp.split(axis=0), oneflow.sbp.split(axis=1)) placement oneflow.placement(type=&quot;cuda&quot;, ranks=[[0, 1], [2, 3]]), local tensor shape: oneflow.Size([8, 4])<br>rank: 3, global tensor: shape oneflow.Size([16, 8]) sbp (oneflow.sbp.split(axis=0), oneflow.sbp.split(axis=1)) placement oneflow.placement(type=&quot;cuda&quot;, ranks=[[0, 1], [2, 3]]), local tensor shape: oneflow.Size([8, 4])<br>rank: 1, global tensor: shape oneflow.Size([16, 8]) sbp (oneflow.sbp.split(axis=0), oneflow.sbp.split(axis=1)) placement oneflow.placement(type=&quot;cuda&quot;, ranks=[[0, 1], [2, 3]]), local tensor shape: oneflow.Size([8, 4])<br>rank: 0, global tensor: shape oneflow.Size([16, 8]) sbp (oneflow.sbp.split(axis=0), oneflow.sbp.split(axis=1)) placement oneflow.placement(type=&quot;cuda&quot;, ranks=[[0, 1], [2, 3]]), local tensor shape: oneflow.Size([8, 4])</pre><h3>4. Future Plan</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XvqAYd2hLWgvefu0" /></figure><p>So far, LiBai supports common models such as BERT, GPT, ViT, Swin-Transformer, and T5, as well as the latest technologies like MoCoV3 and MAE. In LiBai, these work out of the box and can be easily fine-tuned for downstream tasks.</p><p>OneFlow will improve compatibility with Hugging Face models and increase connectivity to the Hugging Face ecosystem. Meanwhile, via OneFlow’s automatic parallelization, users will enjoy the convenience of automatic scaling from single-GPU training to distributed training.</p><p>In the future, OneFlow will not only support more models but also improve its features related to inference and serving. 
From training to deployment, OneFlow aims to be a one-stop development platform for AI engineers.</p><ul><li><strong>LiBai model library: </strong><a href="https://github.com/Oneflow-Inc/libai"><strong>https://github.com/Oneflow-Inc/libai</strong></a></li><li><strong>LiBai documentation: </strong><a href="https://libai.readthedocs.io/en/latest"><strong>https://libai.readthedocs.io/en/latest</strong></a></li><li><strong>OneFlow: </strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong>https://github.com/Oneflow-Inc/oneflow</strong></a></li></ul><p><em>Welcome to visit OneFlow on</em><strong><em> </em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em>GitHub</em></strong></a><strong><em> </em></strong><em>and follow us on</em><strong><em> </em></strong><a href="https://twitter.com/home"><strong><em>Twitter</em></strong></a><strong><em> </em></strong><em>and</em><strong><em> </em></strong><a href="https://www.linkedin.com/authwall?trk=bf&amp;trkInfo=AQGXBC4pruGdygAAAXod79Hw0T9BLNvOQInaaRyy6JZVtaKYCP5E_Z4tdvh9NpIUjtmlg1JfnQplY8qYCR2iz9gpwh7BjldCWYelFgPuBgXy1TCm7YTmhZqEAfyxDtLP1k7-6ew=&amp;originalReferer=&amp;sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Foneflow-inc"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></p><p>Also, welcome to join our <a href="https://discord.gg/4kpjGA5bZY"><strong><em>Discord group</em></strong></a><strong><em> </em></strong>to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=15637c8876eb" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OneFlow v0.8.0 Came Out!]]></title>
            <link>https://oneflow2020.medium.com/oneflow-v0-8-0-came-out-8a15efd52b73?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/8a15efd52b73</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[oneflow]]></category>
            <category><![CDATA[deep-learning-framework]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Tue, 02 Aug 2022 06:40:51 GMT</pubDate>
            <atom:updated>2022-08-02T06:40:51.141Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/720/0*7UrmlIokCjkhg9YS" /></figure><p>We are thrilled to announce the release of <strong>OneFlow v0.8.0</strong>. This update contains 523 commits. For the full changelog, please check out: <a href="https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.8.0"><strong>https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.8.0</strong></a>. Welcome to install OneFlow v0.8.0 for a new user experience. Your feedback will be much appreciated!</p><p>Highlights and optimizations in this release:</p><h3>1. PyTorch API compatibility</h3><p><strong>OneFlow v0.8.0 provides more and better PyTorch compatible APIs. </strong>In v0.8.0, a series of new features and interfaces that are compatible with PyTorch 1.10.0 are in place, including 68 new APIs that are aligned with PyTorch; 84 bugs are fixed to ensure better compatibility between operators and interfaces, allowing users to transfer more PyTorch models to OneFlow with just one click.</p><h3>2. Wider support of global operators</h3><p><strong>All operators support Global Tensor more widely and efficiently. </strong>We fixed 28 bugs related to Global Tensor and added 180 Global operator unit tests, making the development of distributed models with Global Tensor faster and easier.</p><h3>3. Better performance</h3><p><strong>The advanced features of Graph have been improved for better performance</strong>:</p><ul><li>In addition to the original ZeRO-DP, ZeRO can now be used in combination with MP, 2-D, and 3-D parallelism to further reduce memory overhead.</li><li>Added a new pipeline parallelism API for Graph to simplify the configuration of pipeline parallelism and accelerate training when using pipeline parallelism and 3-D parallelism.</li><li>Added debugging features in multiple dimensions, including logical graphs, light plan physical graphs, memory analysis, and Python stack information, to further improve the efficiency of Graph.debug.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lrFDQTDL--jT-I8P.jpg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/762/0*kWnce_F0PKOyPAng.jpg" /></figure><p><strong>As is shown above, the combination of OneFlow v0.8.0 and </strong><a href="https://github.com/Oneflow-Inc/libai"><strong>LiBai v0.2.0</strong></a><strong> enables higher computation speeds for GPT and BERT under 3-D parallelism across multiple dimensions, surpassing those of Megatron-LM under the same configurations</strong>. (For more details, see: <a href="https://libai.readthedocs.io/en/latest/tutorials/get_started/Benchmark.html">https://libai.readthedocs.io/en/latest/tutorials/get_started/Benchmark.html</a>).</p><h3>4. OneEmbedding component</h3><p><strong>OneEmbedding is an extended component specifically designed for large-scale recommender systems.</strong> It boasts excellent performance, extensibility, and flexibility. Its features include:</p><ul><li>Support for tiered storage and dynamic expansion of the Embedding so users can expand the Embedding capacity at a lower cost.</li><li>Support for mixed parallelism strategies so users can easily extend models to multi-node multi-GPU scenarios.</li><li>Quantized compression of communication. It can reduce communication volume and accelerate training in parallel scenarios by quantizing the communicated data.</li><li>Efficient data pipeline. 
<h3>3. Better performance</h3><p><strong>The advanced features of Graph have been improved for better performance</strong>:</p><ul><li>In addition to the original ZeRO-DP, ZeRO can now be combined with MP, 2-D, and 3-D parallelism to further reduce memory overhead.</li><li>A new pipeline parallelism API for Graph simplifies the configuration of pipeline parallelism and accelerates training under pipeline and 3-D parallelism.</li><li>New debugging features in multiple dimensions, including logical graphs, light plan physical graphs, memory analysis, and Python stack information, further improve the efficiency of Graph.debug.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lrFDQTDL--jT-I8P.jpg" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/762/0*kWnce_F0PKOyPAng.jpg" /></figure><p><strong>As is shown above, the combination of OneFlow v0.8.0 and </strong><a href="https://github.com/Oneflow-Inc/libai"><strong>LiBai v0.2.0</strong></a><strong> enables higher computation speeds for GPT and BERT under 3-D parallelism on multiple dimensions, surpassing those of Megatron-LM with the same configurations</strong>. (For more details, see: <a href="https://libai.readthedocs.io/en/latest/tutorials/get_started/Benchmark.html">https://libai.readthedocs.io/en/latest/tutorials/get_started/Benchmark.html</a>).</p><h3>4. OneEmbedding component</h3><p><strong>OneEmbedding is an extended component specifically designed for large-scale recommender systems.</strong> It boasts excellent performance, extensibility, and flexibility. Its features include:</p><ul><li>Support for tiered storage and dynamic expansion of embedding tables, so users can grow the embedding capacity at a lower cost.</li><li>Support for mixed parallelism strategies, so users can easily extend models to multi-node multi-GPU scenarios.</li><li>Quantized compression of communication, which reduces communication volume and accelerates training in parallel scenarios.</li><li>An efficient data pipeline that hoists the parts of the model with no data dependencies for early execution to save time.</li><li>Support for Automatic Mixed Precision (AMP) training, which performs part of the computation in FP16 to reduce memory usage and accelerate training without loss of convergence accuracy.</li><li>A series of high-performance CUDA operators for common operations in recommender models.</li><li>Support for flexible model building.</li></ul><p>API documentation: <a href="https://docs.oneflow.org/en/master/cookies/one_embedding.html">https://docs.oneflow.org/en/master/cookies/one_embedding.html</a></p>
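<p>As a rough sketch of how the component is put together (our addition, loosely following the OneEmbedding documentation linked above; exact helper names and arguments may differ):</p><pre># a minimal OneEmbedding construction sketch
import oneflow as flow

tables = [
    flow.one_embedding.make_table_options(
        flow.one_embedding.make_uniform_initializer(low=-0.1, high=0.1)
    )
]
# tiered storage: hot rows cached in device memory, full table persisted on SSD
store_options = flow.one_embedding.make_cached_ssd_store_options(
    cache_budget_mb=512,
    persistent_path="/your/path/to/embedding",
    capacity=1024 * 1024,
)
embedding = flow.one_embedding.MultiTableEmbedding(
    name="sparse_embedding",
    embedding_dim=128,
    dtype=flow.float,
    key_type=flow.int64,
    tables=tables,
    store_options=store_options,
)
embedding.to("cuda")</pre>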
<h3>5. Multi-device adaptation</h3><p><strong>OneFlow v0.8.0 provides a neat, efficient, and easily extensible hardware abstraction layer called EP (Execution Provider) to adapt to different hardware.</strong> With the hardware abstraction layer in place, no framework module needs to be modified to adapt to a new hardware device, regardless of the implementation details of the underlying hardware.</p><p>To bring up a new device, users only need to implement a series of interfaces following the protocols of the hardware abstraction layer and the characteristics of the device.</p><p><strong>EP also defines a set of basic computing interface primitives, allowing kernels to be reimplemented on top of them. </strong>Primitives provide interfaces that are more flexible than the runtime interfaces provided by EP. The interfaces are independent of each other, and each one represents a computing capability that a hardware device can provide.</p><h3>6. Debugging tool stack</h3><p><strong>New debugging tools: OneFlow-Profiler and AutoProf.</strong></p><p>OneFlow-Profiler collects performance information during framework execution. It records the execution time of operators and system components, memory allocation, and the inputs and parameters of the corresponding operators. This information helps developers find the main sources of overhead in framework execution and implement targeted optimizations. (<a href="https://github.com/Oneflow-Inc/oneflow/pull/8047">Oneflow-Inc/oneflow#8047</a>)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ukVhtDdfB3HXJauY.jpg" /></figure><p>AutoProf is a framework for benchmarking OneFlow operators against their PyTorch counterparts. It provides an elegant and efficient way to check the alignment between OneFlow and PyTorch APIs while comparing their performance. (<a href="https://github.com/Oneflow-Inc/oneflow/pull/8207">Oneflow-Inc/oneflow#8207</a>)</p><h3>7. Error messages</h3><p>Error messages now carry more detail, and exception handling has been refactored.</p><h3>8. API documentation</h3><p>This release makes over 20 revisions to the OneFlow API documentation, restructures it by feature, and adds further elaboration on modules and environment variables, including oneflow.nn.graph, oneflow.embedding, and oneflow.autograd, in addition to the general operator APIs.</p><p>To view the full changelog, please check out the <a href="https://github.com/Oneflow-Inc/oneflow/releases/tag/v0.8.0">OneFlow v0.8.0 Release Note</a>.</p><p><em>Welcome to visit OneFlow on</em><strong><em> </em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em>GitHub</em></strong></a><strong><em> </em></strong><em>and follow us on</em><strong><em> </em></strong><a href="https://twitter.com/home"><strong><em>Twitter</em></strong></a><strong><em> </em></strong><em>and</em><strong><em> </em></strong><a href="https://www.linkedin.com/authwall?trk=bf&amp;trkInfo=AQGXBC4pruGdygAAAXod79Hw0T9BLNvOQInaaRyy6JZVtaKYCP5E_Z4tdvh9NpIUjtmlg1JfnQplY8qYCR2iz9gpwh7BjldCWYelFgPuBgXy1TCm7YTmhZqEAfyxDtLP1k7-6ew=&amp;originalReferer=&amp;sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Foneflow-inc"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></p><p>Also, welcome to join our <a href="https://discord.gg/4kpjGA5bZY"><strong><em>Discord group</em></strong></a><strong><em> </em></strong>to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Use GDB to Walkthrough OneFlow Source Code]]></title>
            <link>https://oneflow2020.medium.com/use-gdb-to-walkthrough-oneflow-source-code-1fc2360457a2?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/1fc2360457a2</guid>
            <category><![CDATA[oneflow]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Mon, 11 Jul 2022 02:50:00 GMT</pubDate>
            <atom:updated>2022-07-11T02:50:00.730Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*E170GDjiUwZym_j9DToZ-g.jpeg" /></figure><p><strong><em>Written by Yi Wang</em></strong></p><h3>The Trick of gdb python3</h3><p>PyTorch authors use GDB to debug C++ code triggered by Python code. The guide is <a href="https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#gdb-integration">here</a>.</p><p>The basic idea is to run the command gdb python3. In the GDB session, we can set a breakpoint given a C++ function name, say at::Tensor::neg. GDB cannot find this function yet, so it asks whether to make the breakpoint pending on a future shared library load; we answer yes. Then we type the command run, which makes GDB start the Python interpreter. At the interpreter prompt, we can type in import torch and other code. When the Python interpreter executes the import statement, it loads the related shared libraries; GDB watches the loading and sets the breakpoint. The Python code that follows then triggers the breakpoint and brings us back to the GDB prompt, where we can do the usual C++ debugging work, such as using bt to inspect the backtrace and l to show the C++ code invoked by the Python program.</p>
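<p>For example (this snippet is our addition to make the trick concrete), after setting the pending breakpoint and typing run, one could enter the following at the Python prompt:</p><pre># typed into the Python interpreter that GDB launched;
# the import loads torch's shared libraries and resolves the pending breakpoint
import torch

x = torch.ones(3)
y = x.neg()   # calls at::Tensor::neg in C++ and drops us back into GDB</pre>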
<h3>Build OneFlow in Debug Mode</h3><h4>Linux-Only</h4><p><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em>OneFlow</em></strong></a> supports only Linux, not macOS or Windows. I built OneFlow successfully on an AWS GPU host running Amazon Linux 2, which is similar to CentOS.</p><p><em>(Code snippet: <a href="https://medium.com/media/880baa43a7228cb877c0c40d3d1465f6/href">medium.com/media/880baa43…</a>)</em></p><h4>Use Conda or Docker</h4><p>The official OneFlow document recommends building with Conda or a Docker image: <a href="https://github.com/Oneflow-Inc/oneflow#option-1-build-with-conda-recommended">https://github.com/Oneflow-Inc/oneflow#option-1-build-with-conda-recommended</a>. I use Anaconda. The reason to use Conda or Docker is to pin the version of the C++ compiler and the rest of the build toolchain. Using a newer g++ might require some updates to the source code; see, for example, <a href="https://github.com/Oneflow-Inc/oneflow/issues/8397">https://github.com/Oneflow-Inc/oneflow/issues/8397</a>.</p><h4>Build the Debug Version</h4><p>Please be aware that we must build a debug version of OneFlow for this trick to work, because GDB needs the debug symbols to make sense of the output of bt and l.</p><p><em>(Code snippet: <a href="https://medium.com/media/1613f0e76a7489e37f9ce3c78e5115e4/href">medium.com/media/1613f0e7…</a>)</em></p><p>I built a CPU-only version of OneFlow, hence the cpu.cmake file. My AWS host is outside China, so I used the file in the international directory.</p><h4>Report Errors</h4><p>I ran into some <a href="https://github.com/Oneflow-Inc/oneflow/issues?q=is%3Aissue+author%3Awangkuiyi">issues</a> while building OneFlow. Once I reported them, the OneFlow authors responded promptly. Kudos to these awesome developers.</p><h4>Build Step-by-Step</h4><p>This section records my step-by-step process.</p><ol><li>Download and install Anaconda. The default installation destination is ~/anaconda3. The installation process adds environment variable settings to ~/.bashrc, so source it or reconnect to the host to make the changes take effect.</li><li>Follow <a href="https://github.com/Oneflow-Inc/conda-env">https://github.com/Oneflow-Inc/conda-env</a> to create and activate the Conda environment.</li><li>Git clone the source code:<pre>mkdir ~/w
cd ~/w
git clone https://github.com/Oneflow-Inc/oneflow</pre></li><li>Build OneFlow:<pre>cd oneflow
mkdir build
cd build
CMAKE_BUILD_TYPE=Debug cmake .. -C ../cmake/caches/international/cpu.cmake
make -k -j $(nproc)</pre></li></ol><h4>Run and Debug</h4><p>After the build, the ~/w/oneflow/build directory contains a file source.sh that sets the PYTHONPATH environment variable. Run source source.sh to make it take effect.</p><p><em>(Code snippet: <a href="https://medium.com/media/3adacde0ec18b7071f7495fb2d32a9f7/href">medium.com/media/3adacde0…</a>)</em></p><p>Then, we can run the Python interpreter using GDB, with gdb python3.</p><p><em>(Code snippet: <a href="https://medium.com/media/a6fd8f265a2d0e0d9ad762c4499068e3/href">medium.com/media/a6fd8f26…</a>)</em></p><p>At the GDB prompt, I set a breakpoint at oneflow::one::Tensor::is_eager, pending on a future shared library load.</p><p><em>(Code snippet: <a href="https://medium.com/media/1895cc2af321de01724145f2823a552e/href">medium.com/media/1895cc2a…</a>)</em></p><p>Then, I made GDB run the Python interpreter by typing the run command. At the Python prompt, I could import oneflow.</p><p><em>(Code snippet: <a href="https://medium.com/media/6d565c6039b7d57de309c0c4d3bd1e79/href">medium.com/media/6d565c60…</a>)</em></p><p>Importing under GDB takes much longer than usual. If it fails with an ImportError, make sure you have run source source.sh as described above.</p><p>Now, let us create a tensor.</p><p><em>(Code snippet: <a href="https://medium.com/media/bde6785220646034db91c326da26a80d/href">medium.com/media/bde67852…</a>)</em></p>
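<p>A minimal reconstruction of the tensor-creation step (our assumption; the exact line in the original session may differ, but any tensor-creating call that crosses the tensor/numpy boundary should work):</p><pre># typed at the Python prompt inside GDB; constructing a tensor from
# Python numbers goes through the tensor/numpy copy path and hits the
# breakpoint at oneflow::one::Tensor::is_eager
import oneflow as flow

x = flow.Tensor([1.0, 2.0, 3.0])</pre>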
<p>Pressing enter, the execution of this line triggers the breakpoint. GDB's message tells us that a function oneflow::one::CopyBetweenMirroredTensorAndNumpy called tensor-&gt;is_eager() at line 98 of a source file.</p><p>To display more context, we can type l. At line 98, there is the call to tensor-&gt;is_eager().</p><p><em>(Code snippet: <a href="https://medium.com/media/14f1795d564dd2fcabdc008b61c445ac/href">medium.com/media/14f1795d…</a>)</em></p><p>We might be curious why the creation of a tensor in the Python world triggers the call to Tensor::is_eager. We can reveal more details by typing the bt command.</p><p><em>(Code snippet: <a href="https://medium.com/media/e5a1445115cc2b651d04bd2ca96a4d87/href">medium.com/media/e5a14451…</a>)</em></p><p>At the bottom of the call stack is _start, the entry point of the Python interpreter. Looking upward, we can see the call boundary between Python and the OneFlow shared library: the function _PyMethodDef_RawFastCallKeywords in Python calls the OneFlow C++ function oneflow::one::functional::tensor, which, in turn, triggers the call to oneflow::one::Tensor::is_eager.</p><p><strong><em>Related articles:</em></strong></p><ol><li><a href="https://medium.com/codex/the-journey-of-an-operator-in-a-deep-learning-framework-60d404750cb1"><strong><em>The Journey of an Operator in a Deep Learning Framework</em></strong></a></li><li><a href="https://oneflow2020.medium.com/how-to-derive-ring-all-reduces-mathematical-property-step-by-step-9951500db96"><strong><em>How to derive ring all-reduce’s mathematical property step by step</em></strong></a></li></ol><p><em>Welcome to visit OneFlow on</em><strong><em> </em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em>GitHub</em></strong></a><strong><em> </em></strong><em>and follow us on</em><strong><em> </em></strong><a href="https://twitter.com/home"><strong><em>Twitter</em></strong></a><strong><em> </em></strong><em>and</em><strong><em> </em></strong><a href="https://www.linkedin.com/authwall?trk=bf&amp;trkInfo=AQGXBC4pruGdygAAAXod79Hw0T9BLNvOQInaaRyy6JZVtaKYCP5E_Z4tdvh9NpIUjtmlg1JfnQplY8qYCR2iz9gpwh7BjldCWYelFgPuBgXy1TCm7YTmhZqEAfyxDtLP1k7-6ew=&amp;originalReferer=&amp;sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Foneflow-inc"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></p><p>Also, welcome to join our <a href="https://discord.gg/4kpjGA5bZY"><strong><em>Discord group</em></strong></a><strong><em> </em></strong>to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Journey of an Operator in a Deep Learning Framework]]></title>
            <link>https://medium.com/codex/the-journey-of-an-operator-in-a-deep-learning-framework-60d404750cb1?source=rss-af557d74bfb1------2</link>
            <guid isPermaLink="false">https://medium.com/p/60d404750cb1</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[oneflow]]></category>
            <category><![CDATA[pytorch]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[OneFlow]]></dc:creator>
            <pubDate>Fri, 24 Jun 2022 03:28:53 GMT</pubDate>
            <atom:updated>2022-06-24T14:51:27.153Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D6fdyuVGol57-wTj8F3FNQ.jpeg" /></figure><p><strong><em>Written by Luyang Zhao; Translated by Wenwen Dong, Yanjun Hu</em></strong></p><blockquote>The previous article <a href="https://medium.com/codex/the-execution-process-of-a-tensor-in-a-deep-learning-framework-a4d853645d5b">The Execution Process of a Tensor in a Deep Learning Framework</a> introduced the execution mechanism of a tensor in <a href="https://github.com/Oneflow-Inc/oneflow">OneFlow</a>. In this article, we will take output = flow.relu(input) as an example to illustrate how an op is executed from Python to C++ in OneFlow. We hope this article will enlighten you on the system design of deep learning frameworks.</blockquote><p>Op is short for operator. Ops are the basic operations in deep learning: a deep learning framework is built from hundreds of ops covering all kinds of numerical and tensor operations. The nn.Module is the building block for constructing neural networks, and ops are the raw material those blocks are made of.</p><p>Take a demo network as an example:</p><p><em>(Code snippet: <a href="https://medium.com/media/442702edaf51725144ed6b0b83e82aa9/href">medium.com/media/442702ed…</a>)</em></p>
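<p>A minimal reconstruction of such a demo network (our sketch, not the article’s original snippet):</p><pre>import oneflow as flow
import oneflow.nn as nn

class DemoNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(784, 10)   # built on the matmul op
        self.relu = nn.ReLU()              # built on the relu op
        self.softmax = nn.Softmax(dim=1)   # built on the softmax op

    def forward(self, x):
        return self.softmax(self.relu(self.linear(x)))

net = DemoNet()
out = net(flow.randn(8, 784))   # out.shape: (8, 10)</pre>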
<p>From a structural perspective, this network is built from various nn.Module instances such as Linear, ReLU, and Softmax, and these nn.Module instances are in turn composed of basic ops such as matmul, relu, and softmax.</p><p>In OneFlow, how does an existing op complete the call, flow, and execution process from the Python layer down to the C++ layer?</p><p>This article takes output = flow.relu(input) as an example to illustrate the complete execution process of an op from Python to C++, based on the source code.</p><p>The structure of this article is illustrated in the following process diagram:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8FImLrMZy_aVNH_K" /></figure><h3>1. Binding</h3><p>Typically, we use Python to build networks, train models, and call functions. However, these functions are usually just a thin wrapper at the Python layer, with the underlying implementation done in C++. Python bindings allow you to call C++ from Python and take advantage of both languages.</p><p>In deep learning frameworks, the Python/C API and pybind11 are two practical ways to implement bindings, and OneFlow uses both.</p><ul><li>oneflow/api/python/framework/tensor.cpp</li><li>oneflow/api/python/framework/tensor_functions.cpp</li></ul><p>The tensor.xxx methods in the files above are bound via the Python/C API.</p><ul><li>oneflow/core/functional/functional_api.yaml</li></ul><p>The flow.xxx functions declared in the file above are bound via pybind11. For more details about the Python/C API and pybind11, please refer to the following documentation:</p><ul><li><a href="https://docs.python.org/zh-cn/3.8/c-api/index.html">https://docs.python.org/zh-cn/3.8/c-api/index.html</a></li><li><a href="https://pybind11.readthedocs.io/en/stable/index.html">https://pybind11.readthedocs.io/en/stable/index.html</a></li></ul><p>Back to flow.relu: what flow.relu calls at the Python layer is oneflow._C.relu, defined in python/oneflow/__init__.py, where _C indicates that we are now entering the underlying C++ layer. Similar to PyTorch, OneFlow defines a set of rules for interface export and code generation based on .yaml files. For example, in functional_api.yaml, we can see the function signature of relu&#39;s export interface:</p><p><em>(Code snippet: <a href="https://medium.com/media/3e3e2eb72d68f75b978bc85902cc520f/href">medium.com/media/3e3e2eb7…</a>)</em></p><p>As the signature shows, flow._C.relu receives two parameters, a tensor and a bool value; it is bound to the C++ Relu, and the return value is also a tensor.</p>
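<p>A quick usage sketch of this exported interface (our addition):</p><pre>import oneflow as flow

x = flow.randn(2, 3)
y = flow._C.relu(x)   # the exported functional interface; returns a new tensor
z = flow.relu(x)      # the public API wraps the same C++ entry point</pre>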
<p>When compiling OneFlow, C++ .h and .cpp files are dynamically generated by running tools/functional/generate_functional_api.py to parse functional_api.yaml:</p><ul><li>build/oneflow/core/functional/functional_api.yaml.h</li><li>build/oneflow/core/functional/functional_api.yaml.cpp</li></ul><p>The corresponding functor in the generated .cpp file is then called to complete the function call at the C++ level.</p><p>Taking flow._C.relu as an example, its corresponding functor definition is located in oneflow/core/functional/impl/activation_functor.cpp:</p><p><em>(Code snippet: <a href="https://medium.com/media/2138e5be2e693c97fadaae8d44c79cfb/href">medium.com/media/2138e5be…</a>)</em></p><p>ReluFunctor completes its registration through the following code. After being registered as a functional interface, flow._C.relu is bound to “Relu” at the Python layer. The same function can also be called directly in C++ through functional::Relu.</p><p><em>(Code snippet: <a href="https://medium.com/media/fb1bbe3a646ddc59e338f0ac37d37f4c/href">medium.com/media/fb1bbe3a…</a>)</em></p><h3>2. Functor</h3><p>The functor is not only the core of extending Python into C++ but also the first stop for op calls and for deriving and checking input parameters. Typically, an op needs to check the shape, dtype, dimension, and number of elements of the input tensor at the functor layer, and also parse and process its op-specific logic.</p><p>The code of the Relu functor is as follows:</p><p><em>(Code snippet: <a href="https://medium.com/media/647ca42a75f9d83dc75bd237aacc2bcd/href">medium.com/media/647ca42a…</a>)</em></p><p>ReluFunctor holds a private member std::shared_ptr&lt;OpExpr&gt; op_; constructed by OpBuilder, this op_ is the relu op to be executed. Inside the functor&#39;s operator(), execution takes one of two branches depending on whether inplace is true or false, and in either case the op, input tensor, and parameters are dispatched to the interpreter through OpInterpUtil::Dispatch().</p>
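<p>Seen from Python, the two branches correspond to the inplace flag (a small sketch of our own, assuming the PyTorch-style nn.functional signature that OneFlow mirrors):</p><pre>import oneflow as flow
import oneflow.nn.functional as F

x = flow.randn(4)
y = F.relu(x)                  # out-of-place branch: a new output tensor
z = F.relu(x, inplace=True)    # inplace branch: x's buffer is written in place</pre>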
<h3>3. Dispatch</h3><p>After the checking and logical processing in the functor, most ops need to be dispatched through OpInterpUtil::Dispatch(). The destination is the interpreter, where the op will be processed further.</p><p>There are various overloaded Dispatch templates in oneflow/core/framework/op_interpreter/op_interpreter_util.h:</p><p><em>(Code snippet: <a href="https://medium.com/media/813740cad667061b77a3669159b264d1/href">medium.com/media/813740ca…</a>)</em></p><p>The parameters of these overloads include the input, the output, and an OpExprInterpContext. The OpExprInterpContext is the context the op needs in the interpreter: it carries descriptive information such as the attributes required by the op (for example, the <em>kernel_size</em> and <em>padding</em> needed by the conv2d op), the device, the SBP, and the parallel description.</p><p>These overloaded Dispatch functions eventually all run to:</p><p><em>(Code snippet: <a href="https://medium.com/media/51bf7f8f5c1f3cb924807536cf8cb533/href">medium.com/media/51bf7f8f…</a>)</em></p><p>The next stop is the interpreter.</p><h3>4. Interpreter</h3><h4>Get Interpreter</h4><p>Literally, GetInterpreter means getting the interpreter needed for the subsequent execution of the op. Here is a streamlined version of the code (omitting the checking logic) from oneflow/core/framework/op_interpreter/op_interpreter_util.cpp:</p><p><em>(Code snippet: <a href="https://medium.com/media/693b1243ec48e67e546c9425031c2019/href">medium.com/media/693b1243…</a>)</em></p><p>As the code shows, interpreters can be roughly divided into two types, eager interpreters and lazy interpreters, and eager interpreters can be further subdivided into eager mirrored and eager consistent.</p><p>To sum up, there are three concrete child class implementations:</p><ul><li>EagerMirroredInterpreter</li><li>EagerConsistentInterpreter</li><li>LazyInterpreter</li></ul><p>In regular eager mode (whether single-GPU or DDP training), execution goes through the EagerMirroredInterpreter logic; if you have set the SBP and placement of the input tensor, it goes through the EagerConsistentInterpreter logic; and in lazy mode (using nn.Graph), it goes through the LazyInterpreter logic, as the sketch below illustrates.</p>
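<p>A rough user-level illustration of the three paths (our addition; it uses current OneFlow APIs and a CPU placement so it runs anywhere):</p><pre>import oneflow as flow
import oneflow.nn as nn

x = flow.randn(2, 3)

# 1. A local (mirrored) tensor in eager mode: EagerMirroredInterpreter
y = flow.relu(x)

# 2. A global tensor with placement and SBP: EagerConsistentInterpreter
xg = x.to_global(placement=flow.placement("cpu", ranks=[0]),
                 sbp=flow.sbp.broadcast)
yg = flow.relu(xg)

# 3. The same op inside nn.Graph (lazy mode): LazyInterpreter
class ReluGraph(nn.Graph):
    def __init__(self):
        super().__init__()

    def build(self, t):
        return flow.relu(t)

out = ReluGraph()(x)</pre>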
<p>Now let’s see how these three types of interpreters are built:</p><p><em>(Code snippet: <a href="https://medium.com/media/9550884664c0523cf0ca5d42fbb4094c/href">medium.com/media/95508846…</a>)</em></p><p>As you can see, each of the three interpreters is held as an “internal” private member of an AutogradInterpreter, and an AutogradInterpreter is what is eventually returned.</p><p><em>(Code snippet: <a href="https://medium.com/media/c1882bce606a12ab4c1a95ffbb3c1fec/href">medium.com/media/c1882bce…</a>)</em></p><h4>Apply()</h4><p>As shown above, EagerMirroredInterpreter, EagerConsistentInterpreter, and LazyInterpreter are wrapped in an AutogradInterpreter, whose “Apply” is what gets called. As its name implies, the AutogradInterpreter performs work closely related to autograd: in eager mode, it adds, for each op node in forward propagation, the corresponding backward nodes that compute the gradients.</p><p>The code is as follows (annotations for the key parts are provided):</p><p><em>(Code snippet: <a href="https://medium.com/media/1ae3ce899e29a8a1134d4dd26a0d5e36/href">medium.com/media/1ae3ce89…</a>)</em></p><p>Phew, this is a lot to take in, right? But don’t worry. For a simple ReLU op, you only need to focus on this part:</p><p><em>(Code snippet: <a href="https://medium.com/media/63d67b649a70e10535847e1b1fed3bf4/href">medium.com/media/63d67b64…</a>)</em></p><p>Now let’s go back to the aforementioned example of flow.relu. Since it runs in eager mode, it takes the Apply() method of EagerInterpreter:</p><p><em>(Code snippet: <a href="https://medium.com/media/488ba2f52070a1a1e0cb241ae4783929/href">medium.com/media/488ba2f5…</a>)</em></p><p>The APPLY_IF macro adds branch processing for different types of ops. Among them, UserOp is the most frequently used type, so most ops enter this branch:</p><p><em>(Code snippet: <a href="https://medium.com/media/8d5890e586a57071333e9aef0ac0e324/href">medium.com/media/8d5890e5…</a>)</em></p><p>Now let’s take a look at EagerMirroredInterpreter::ApplyImpl. It lives in oneflow/core/framework/op_interpreter/eager_mirrored_op_interpreter.cpp:</p><p><em>(Code snippet: <a href="https://medium.com/media/5a38b372b67914b1a5ed0e412988cef4/href">medium.com/media/5a38b372…</a>)</em></p><p>The ultimate implementation is NaiveInterpret.</p><h4>NaiveInterpret</h4><p>Simply put, NaiveInterpret does the following:</p><ul><li>checks that the device of the input tensor matches the default device</li><li>creates the output tensor</li><li>derives and checks the shape/stride/dtype of the output tensor</li><li>builds the op execution instruction and dispatches it to the virtual machine (VM)</li></ul><p>Here is a simplified version of the code:</p><p><em>(Code snippet: <a href="https://medium.com/media/46b0082bb095726db6d640fb3685e7e2/href">medium.com/media/46b0082b…</a>)</em></p><p>As you can see, the VM is the destination of the interpreter. OneFlow’s VM has a unique design that is worth an article of its own, but for ease of understanding, I will just put it this way: after the VM receives the dispatched instructions, the op waits in a task queue to be scheduled and executed.</p><h3>5. Compute</h3><p>The op execution instructions dispatched to the VM by the interpreter are processed by the VM’s scheduling logic, and then triggered and executed in oneflow/core/eager/opkernel_instruction_type.cpp.</p><p>The core code is as follows:</p><p><em>(Code snippet: <a href="https://medium.com/media/a4eae06564ec9cf040184a74e051aaa0/href">medium.com/media/a4eae065…</a>)</em></p><p>This line triggers the execution of the op kernel: operand-&gt;user_opkernel()-&gt;Compute(compute_ctx, state, cache);</p><p>Normally, op kernels are dispatched to different implementations depending on the device, usually in oneflow/user/kernels/xxx_kernel.cpp or oneflow/user/kernels/xxx_kernel.cu.</p><p>The ReLU op, however, is an exception, since it is implemented with primitives. (OneFlow’s primitives are another unique design, boasting good abstraction and composability.) For example, the following UnaryPrimitive is a combination of the elementwise unary template and a UnaryFunctor. The trace of UnaryPrimitive is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6nioLquWChI7OMJ5" /></figure><h4>UnaryPrimitiveKernel</h4><p><em>(Code snippet: <a href="https://medium.com/media/d406df403e7d967fe66c7b8b2d5f7b88/href">medium.com/media/d406df40…</a>)</em></p><h4>ep::primitive::ElementwiseUnary</h4><p><em>(Code snippet: <a href="https://medium.com/media/08c99967a391006b4c3adf7329d57141/href">medium.com/media/08c99967…</a>)</em></p><h4>UnaryFunctor</h4><p>The UnaryFunctor provides a specific functor implementation for each unary op type.</p>
<p>For instance, the functor implementation for the ReLU op is as follows (oneflow/core/ep/common/primitive/unary_functor.h):</p><p><em>(Code snippet: <a href="https://medium.com/media/6ece416f3f1f0b92f4e92ffccd7ddcd3/href">medium.com/media/6ece416f…</a>)</em></p><p>Now we have gone through the Python -&gt; C++ journey of an op.</p><p>You may find the details intimidatingly complicated, but if you zoom out and look at the big picture, you will see that it is simply a four-step process:</p><p><strong>Functor -&gt; Dispatch -&gt; Interpreter -&gt; Kernel Compute</strong></p><p>Usually, to implement or add an op, you don’t need to pay much attention to the intermediate steps, namely “Dispatch” and “Interpreter”. You can focus on what is strongly correlated with the op: the parameter and logic checks in the Functor step, and the op computation in the Kernel Compute step.</p><h3>Reference</h3><p>OneFlow source code:</p><p><a href="https://github.com/Oneflow-Inc/oneflow/commit/1dbdf8faed988fa7fd1a9034a4d79d5caf18512d">https://github.com/Oneflow-Inc/oneflow</a></p><p><strong><em>Related articles:</em></strong></p><ol><li><a href="https://oneflow2020.medium.com/how-to-derive-ring-all-reduces-mathematical-property-step-by-step-9951500db96"><strong><em>How to derive ring all-reduce’s mathematical property step by step</em></strong></a></li><li><a href="https://oneflow2020.medium.com/how-to-increase-computational-efficiency-for-prelu-in-cuda-oneflow-performance-optimization-20e6e336b8b8"><strong><em>How to Increase Computational Efficiency for PReLU in CUDA — OneFlow Performance Optimization</em></strong></a></li></ol><p><em>Welcome to visit OneFlow on</em><strong><em> </em></strong><a href="https://github.com/Oneflow-Inc/oneflow"><strong><em>GitHub</em></strong></a><strong><em> </em></strong><em>and follow us on</em><strong><em> </em></strong><a href="https://twitter.com/home"><strong><em>Twitter</em></strong></a><strong><em> </em></strong><em>and</em><strong><em> </em></strong><a href="https://www.linkedin.com/authwall?trk=bf&amp;trkInfo=AQGXBC4pruGdygAAAXod79Hw0T9BLNvOQInaaRyy6JZVtaKYCP5E_Z4tdvh9NpIUjtmlg1JfnQplY8qYCR2iz9gpwh7BjldCWYelFgPuBgXy1TCm7YTmhZqEAfyxDtLP1k7-6ew=&amp;originalReferer=&amp;sessionRedirect=https%3A%2F%2Fwww.linkedin.com%2Fcompany%2Foneflow-inc"><strong><em>LinkedIn</em></strong></a><strong><em>.</em></strong></p><p>Also, welcome to join our <a href="https://discord.gg/4kpjGA5bZY"><strong><em>Discord group</em></strong></a><strong><em> </em></strong>to discuss and ask OneFlow related questions, and connect with OneFlow contributors and users all around the world.</p><hr><p><a href="https://medium.com/codex/the-journey-of-an-operator-in-a-deep-learning-framework-60d404750cb1">The Journey of an Operator in a Deep Learning Framework</a> was originally published in <a href="https://medium.com/codex">CodeX</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>