opt ll dispatch layered algo #500
Merged
wangfakang merged 2 commits into deepseek-ai:antgroup-opt on Dec 4, 2025
Benchmark tests pass in the following environments: one, two, and four 8*H200 nodes.
wangfakang reviewed on Nov 25, 2025
Hello, I'd like to ask why a similar optimization wasn't made to combine?
introduce
algo opt for dispatch in low-latency mode:
In the dispatch kernel of DeepEP's low-latency mode, the original algorithm sends data directly to the destination rank over the cross-rail RDMA network. The drawback is that a large amount of duplicate data crosses the RDMA network, for example when one token is routed to several ranks on the same remote node. Drawing on the approach used in normal mode, this PR changes the low-latency dispatch kernel to first send the data over RDMA to the same-rail rank on the destination node, and then forward it to the actual destination rank over NVLink.
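The saving can be illustrated with a small counting model. This is only a sketch and not DeepEP code: the function names, the `Traffic` tuple, and the assumption of 8 GPUs per node with the relay rank sharing the sender's local GPU index are all hypothetical, and destinations on the sender's own node are ignored.

```python
from collections import namedtuple

# Assumption: 8 GPUs per node, matching the 8*H200 test environments.
GPUS_PER_NODE = 8

Traffic = namedtuple("Traffic", ["rdma_sends", "nvlink_forwards"])


def direct_dispatch(src_rank, dst_ranks):
    """Original scheme: one RDMA send per destination rank on a remote node."""
    src_node = src_rank // GPUS_PER_NODE
    remote = [r for r in set(dst_ranks) if r // GPUS_PER_NODE != src_node]
    return Traffic(rdma_sends=len(remote), nvlink_forwards=0)


def layered_dispatch(src_rank, dst_ranks):
    """Layered scheme: one RDMA send per remote node, then NVLink forwarding."""
    src_node = src_rank // GPUS_PER_NODE
    local_idx = src_rank % GPUS_PER_NODE

    # Group the destination ranks by the node they live on.
    by_node = {}
    for r in set(dst_ranks):
        by_node.setdefault(r // GPUS_PER_NODE, set()).add(r)

    rdma_sends = 0
    nvlink_forwards = 0
    for node, ranks in by_node.items():
        if node == src_node:
            continue  # same-node destinations are outside this model
        # Hypothetical relay: the rank on the remote node with the same local index.
        relay = node * GPUS_PER_NODE + local_idx
        rdma_sends += 1
        # The relay forwards to every destination other than itself.
        nvlink_forwards += len(ranks - {relay})
    return Traffic(rdma_sends, nvlink_forwards)


if __name__ == "__main__":
    # One token sent from rank 0 to three ranks on node 1 and one rank on node 2.
    dsts = [8, 9, 10, 17]
    print("direct: ", direct_dispatch(0, dsts))
    print("layered:", layered_dispatch(0, dsts))
```

In this model the layered scheme trades RDMA sends for NVLink forwards whenever a token has more than one destination on the same remote node.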
Note: This feature conflicts with the existing Elasticity Support to DeepEP for Fault-Tolerant EP Inference functionality, and the two features cannot be enabled simultaneously.
before / after: [figures not preserved]
performance
benchmark:
use
This feature is enabled by default and requires no additional activation from the user. To disable it, set the following environment variable:
DEEPEP_DISABLE_LL_DISPATCH_OPT=1
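When the job is launched from Python, the variable can also be set in the process environment. The sketch below assumes the variable is read when DeepEP initializes, so it would need to be set before that point. The helper name `disable_ll_dispatch_opt` is hypothetical and not part of DeepEP.

```python
import os


def disable_ll_dispatch_opt(env=None):
    """Set the opt-out variable in `env` (defaults to this process's environment)."""
    if env is None:
        env = os.environ
    env["DEEPEP_DISABLE_LL_DISPATCH_OPT"] = "1"
    return env


if __name__ == "__main__":
    # Assumption: call this before DeepEP is imported or initialized.
    disable_ll_dispatch_opt()
    print(os.environ["DEEPEP_DISABLE_LL_DISPATCH_OPT"])
```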