opt ll dispatch layered algo #500

alpha-baby · 2025-11-21T12:25:07Z

Introduction

Algorithm optimization for dispatch in low-latency mode:

In the dispatch kernel of DeepEP's low-latency mode, the original algorithm sends data directly to the destination rank over the cross-rail RDMA network. The drawback is that the same data crosses the RDMA network multiple times when several destination ranks sit on the same remote node. Drawing on the approach used in normal mode, this PR improves the low-latency dispatch kernel by first sending the data over RDMA to the same-rail rank on the destination node (the rank with the same local GPU index), which then forwards it to the actual destination rank over NVLink.
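The difference between the two routing schemes can be illustrated with a small host-side sketch. This is not the kernel code from this PR: the function names, the rank-to-node mapping, and the assumption of 8 GPUs per node are all hypothetical and chosen only for illustration. Destinations on the sender's own node are ignored, since they do not cross RDMA in either scheme.

```python
NUM_LOCAL_RANKS = 8  # assumption: 8 GPUs per node, ranks numbered node by node


def node_of(rank: int) -> int:
    """Node index that hosts a global rank."""
    return rank // NUM_LOCAL_RANKS


def local_index(rank: int) -> int:
    """GPU index of a global rank inside its node (its 'rail')."""
    return rank % NUM_LOCAL_RANKS


def direct_rdma_sends(src_rank: int, dst_ranks: list[int]) -> list[tuple[int, int]]:
    """Original scheme: one RDMA transfer per remote destination rank."""
    return [(src_rank, d) for d in dst_ranks if node_of(d) != node_of(src_rank)]


def layered_sends(src_rank: int, dst_ranks: list[int]):
    """Layered scheme: one RDMA transfer per remote node, then NVLink forwarding."""
    rdma = set()
    nvlink = []
    for d in dst_ranks:
        if node_of(d) == node_of(src_rank):
            continue  # same-node destinations are outside this sketch
        # Relay is the rank on the destination node with the sender's local index.
        relay = node_of(d) * NUM_LOCAL_RANKS + local_index(src_rank)
        rdma.add((src_rank, relay))
        if relay != d:
            nvlink.append((relay, d))
    return sorted(rdma), nvlink


# A token on rank 0 routed to three experts that all live on node 1.
direct = direct_rdma_sends(0, [9, 10, 11])
rdma, nvlink = layered_sends(0, [9, 10, 11])
print(len(direct), "RDMA transfers (direct) vs", len(rdma), "(layered)")
```

In this example the direct scheme issues three RDMA transfers carrying the same token, while the layered scheme issues one RDMA transfer to rank 8 followed by three NVLink forwards.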

Note: This feature conflicts with the existing "Elasticity Support to DeepEP for Fault-Tolerant EP Inference" functionality; the two features cannot be enabled at the same time.

before:

after:

performance

benchmark:

Usage

This feature is enabled by default and requires no additional activation from the user. To disable it, please set the following environment variable: DEEPEP_DISABLE_LL_DISPATCH_OPT=1.
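A minimal sketch of opting out from Python, assuming the variable must be present in the process environment before DeepEP creates its buffer (the exact point at which the library reads it is not stated here). The helper `layered_dispatch_disabled` is hypothetical and only mirrors the documented flag; it is not part of DeepEP.

```python
import os

# Assumption: set this before the DeepEP buffer is created so the library sees it.
os.environ["DEEPEP_DISABLE_LL_DISPATCH_OPT"] = "1"


def layered_dispatch_disabled() -> bool:
    """Hypothetical helper: True when the opt-out flag described above is set to 1."""
    return os.environ.get("DEEPEP_DISABLE_LL_DISPATCH_OPT", "0") == "1"


print("layered dispatch disabled:", layered_dispatch_disabled())
```

Exporting the variable in the launching shell achieves the same thing and avoids any ordering concerns inside the Python process.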

alpha-baby · 2025-11-21T12:28:15Z

Benchmark tests pass in the following environments: one, two, and four nodes of 8*H200.

deep_ep/buffer.py

csrc/deep_ep.cpp

csrc/deep_ep.hpp

csrc/config.hpp

csrc/kernels/internode_ll.cu

csrc/kernels/utils.cuh

csrc/kernels/internode_ll.cu

wangfakang

LGTM. Thanks.

ywj55555 · 2026-02-02T08:54:38Z

Hello, I'd like to ask why a similar optimization wasn't made for combine?

opt ll dispatch

86f8836

wangfakang reviewed Nov 25, 2025

fix comments and format code

a3c90be

wangfakang approved these changes Dec 4, 2025

wangfakang merged commit 38dfc2d into deepseek-ai:antgroup-opt Dec 4, 2025

alpha-baby changed the title ~~opt ll dispatch~~ opt ll dispatch layered algo Dec 4, 2025
