pd: fix local_rank in multi-node training #4811
njzjz merged 3 commits into deepmodeling:devel from
Conversation
Pull Request Overview
This PR fixes how the local rank is retrieved for multi-node training and modifies the training step to disable automatic synchronization during forward–backward passes, opting for a manual allreduce before the optimizer step.
- Retrieve local rank from the PADDLE_LOCAL_RANK environment variable with a default fallback
- Introduce a no_sync context to disable automatic synchronization and perform a manual fused allreduce of gradients for distributed training
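A minimal sketch of the local-rank retrieval described in the first bullet (the variable names and device-selection line are assumptions for illustration, not the exact code in `deepmd/pd/utils/env.py`):

```python
import os

import paddle

# The local rank identifies the GPU within the current node. Reading it from
# PADDLE_LOCAL_RANK (with a fallback of 0 for single-process runs) keeps the
# device index valid on every node, whereas the global rank keeps growing
# across nodes.
LOCAL_RANK = int(os.environ.get("PADDLE_LOCAL_RANK", "0"))
DEVICE = f"gpu:{LOCAL_RANK}" if paddle.device.is_compiled_with_cuda() else "cpu"
```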
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| deepmd/pd/utils/env.py | Updated local rank retrieval from environment variable |
| deepmd/pd/train/training.py | Added no_sync context for forward–backward operations and manual fused allreduce |
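A hedged sketch of the training-step pattern described in the second bullet and the table above, assuming the model is wrapped in `paddle.DataParallel` (the trainer code is simplified and not the actual `training.py`):

```python
import paddle
from paddle.distributed.fleet.utils.hybrid_parallel_util import (
    fused_allreduce_gradients,
)


def train_step(model, optimizer, loss_fn, batch, label):
    # Skip the per-parameter allreduce that DataParallel would normally
    # launch during backward(); gradients stay local to each process.
    with model.no_sync():
        loss = loss_fn(model(batch), label)
        loss.backward()
    # Synchronize once, fusing gradients into larger allreduce buffers,
    # right before the optimizer consumes them.
    fused_allreduce_gradients(list(model.parameters()), None)
    optimizer.step()
    optimizer.clear_grad()
```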
📝 Walkthrough

The changes update gradient synchronization logic during distributed training by explicitly disabling automatic synchronization during forward and backward passes and performing a manual allreduce of gradients. Additionally, the determination of the local rank is modified to use an environment variable instead of querying the distributed API. The parallel training documentation is extended with an alternative `mpirun`-based launch example.
Actionable comments posted: 0
🧹 Nitpick comments (2)
doc/train/parallel-training.md (2)
221-221: Fix grammatical error: use “wrap” instead of “wrapper”. Change:

```diff
-or you can wrapper the training script with `mpirun`:
+or you can wrap the training script with `mpirun`:
```
223-234: Ensure script name consistency and improve usability.

- The script header refers to `train_pp.sh`, but the `mpirun` invocation calls `run_pp.sh`. Align both to `train_pp.sh`.
- Add a shebang (`#!/usr/bin/env bash`) at the top of `train_pp.sh` and mark it executable.
- Explicitly specify `-np` (number of processes) and use the correct script path in the `mpirun` command.

Proposed diff:

````diff
@@ 224,224
-# ----- train_pp.sh -------
+#!/usr/bin/env bash
+# ----- train_pp.sh -------
@@ 231,234
-```bash
-mpirun run_pp.sh
-```
+```bash
+chmod +x train_pp.sh
+mpirun -np 8 ./train_pp.sh
+```
````
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
doc/train/parallel-training.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
doc/train/parallel-training.md
[grammar] ~221-~221: The word ‘wrapper’ is a noun or an adjective. A verb or adverb is missing or misspelled here, or maybe a comma is missing.
Context: ...p --pd train input.json or you can wrapper the training script with `mpirun`: ...
(PRP_MD_NN)
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@           Coverage Diff           @@
##            devel    #4811   +/-   ##
=======================================
  Coverage   84.79%   84.79%
=======================================
  Files         698      698
  Lines       67824    67830      +6
  Branches     3540     3540
=======================================
+ Hits        57508    57515      +7
- Misses       9181     9182      +1
+ Partials     1135     1133      -2
```

☔ View full report in Codecov by Sentry.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: HydrogenSulfate <490868991@qq.com>
1. Get the local rank from the `PADDLE_LOCAL_RANK` environment variable instead of `get_rank()` (which returns the global rank).
2. Disable gradient synchronization during the forward-backward pass and synchronize manually before the optimizer update.
3. Update the parallel training tutorial (multi-node multi-GPU) in the documentation.

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved gradient synchronization in distributed training for multi-process setups.
  - Updated local rank assignment to use environment variables for enhanced compatibility.
- **Documentation**
  - Added an example using `mpirun` and a sample shell script to the parallel training guide for distributed training launch.

Signed-off-by: HydrogenSulfate <490868991@qq.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
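For context on item 1, a hedged illustration of why the global rank is the wrong value for per-node device selection; the 2-node × 8-GPU setup is an assumed example, not taken from the PR:

```python
import os

import paddle.distributed as dist

# On the second node of a 2x8-GPU job, dist.get_rank() returns 8..15, which
# is out of range if used directly as a GPU index on an 8-GPU machine.
global_rank = dist.get_rank()

# PADDLE_LOCAL_RANK is 0..7 on every node, so it is safe to use as the
# per-node GPU index.
local_rank = int(os.environ.get("PADDLE_LOCAL_RANK", "0"))
```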