[None][chore] Add test configurable moe module#10575
Conversation
📝 WalkthroughWalkthroughIntroduces reference MoE modules with quantization support and a parameterized test suite. Adds RefGatedMLPFusedMoE base class with FP8 and NVFP4 variants for testing, abstract BaseQuantizeUtil with FP8QuantizeUtil and NVFP4QuantizeUtil implementations for weight generation, and test_moe function validating MoE output across quantization modes and backends. Changes
Sequence Diagram(s)sequenceDiagram
participant Input as Input Tensor
participant Router as Router
participant Experts as Expert Pool
participant Aggregator as Output Aggregator
Input->>Router: hidden_states, router_logits
Router->>Router: compute routing weights
Router->>Experts: select top-k experts
par Expert Processing
Experts->>Experts: expert_1 forward pass
Experts->>Experts: expert_2 forward pass
Experts->>Experts: expert_n forward pass
end
Experts->>Aggregator: per-expert outputs
Aggregator->>Aggregator: scale by routing weights
Aggregator->>Aggregator: accumulate outputs
Aggregator-->>Input: final hidden_states
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In @tests/unittest/_torch/modules/moe/quantize_utils.py:
- Line 220: The assertion message in BaseQuantizeUtil incorrectly names the
class as "NoQuantizeUtil"; update the assertion in the constructor or
initializer that reads assert self.quant_config is None, "quant_config should be
None for NoQuantizeUtil" to reference the correct class name
("BaseQuantizeUtil") so the message matches the class symbol BaseQuantizeUtil
and helps future debugging.
- Around line 34-43: The constructor parameter model_config currently uses a
mutable default ModelConfig() which is evaluated once; change the signature of
quantize_utils.Moe...__init__ to default model_config: Optional[ModelConfig] =
None and inside the constructor set model_config = ModelConfig() if model_config
is None, and use that local variable thereafter; apply the exact same change to
FP8RefGatedMLPFusedMoE.__init__ and NVFP4RefGatedMLPFusedMoE.__init__ (update
their model_config defaults to None and instantiate when None) to avoid sharing
a single ModelConfig instance across calls.
In @tests/unittest/_torch/modules/moe/test_moe_module.py:
- Around line 58-60: The skip logic currently skips all TRTLLM tests for float16
(if moe_backend == "TRTLLM" and dtype == torch.float16) but the message mentions
"NVFP4" specifically; either narrow the condition to only skip when quant_algo
== QuantAlgo.NVFP4 as well (check quant_algo and use QuantAlgo.NVFP4 alongside
moe_backend == "TRTLLM" and dtype == torch.float16) or, if the limitation
applies to all TRTLLM float16 tests, keep the condition but update the
pytest.skip message to accurately say "TRTLLM MoE backend does not support
float16 yet" so the message matches the actual condition.
🧹 Nitpick comments (4)
tests/unittest/_torch/modules/moe/test_moe_module.py (2)
104-105: Replaceassert Falsewithraise AssertionError().
assert Falsestatements are removed when Python runs with optimizations (-O). Use explicit exception raising instead.♻️ Proposed fix
else: - assert False, "unsupported quant_algo" + raise AssertionError("unsupported quant_algo")
156-157: Same issue: replaceassert Falsewithraise AssertionError().♻️ Proposed fix
else: - assert False, "unsupported quant_algo to check accuracy" + raise AssertionError("unsupported quant_algo to check accuracy")tests/unittest/_torch/modules/moe/quantize_utils.py (2)
196-200: Consider removingABCinheritance or adding@abstractmethoddecorators.
BaseQuantizeUtilinherits fromABCbut defines no abstract methods. If subclasses are expected to overridecreate_weights, consider making it abstract. Otherwise, remove theABCinheritance since the class provides concrete implementations.
123-144: Code duplication acrossload_weightsmethods.The base weight and bias loading logic (lines 127-135) is duplicated from
RefGatedMLPFusedMoE.load_weights. Consider callingsuper().load_weights()for base loading, then adding quantization-specific scales. However, for test utilities, explicitness can be acceptable.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/unittest/_torch/modules/moe/quantize_utils.pytests/unittest/_torch/modules/moe/test_moe_module.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g.,some_file.py)
Python classes should use PascalCase (e.g.,class SomeClass)
Python functions and methods should use snake_case (e.g.,def my_awesome_function():)
Python local variables should use snake_case, with prefixkfor variable names that start with a number (e.g.,k_99th_percentile)
Python global variables should use upper snake_case with prefixG(e.g.,G_MY_GLOBAL)
Python constants should use upper snake_case (e.g.,MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format"""<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic
Files:
tests/unittest/_torch/modules/moe/test_moe_module.pytests/unittest/_torch/modules/moe/quantize_utils.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Files:
tests/unittest/_torch/modules/moe/test_moe_module.pytests/unittest/_torch/modules/moe/quantize_utils.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tests/unittest/_torch/modules/moe/quantize_utils.py
🧬 Code graph analysis (1)
tests/unittest/_torch/modules/moe/quantize_utils.py (4)
tensorrt_llm/_torch/modules/fused_moe/routing.py (1)
BaseMoeRoutingMethod(162-185)tensorrt_llm/_torch/modules/gated_mlp.py (1)
GatedMLP(19-195)tensorrt_llm/quantization/mode.py (1)
QuantAlgo(23-48)tensorrt_llm/models/modeling_utils.py (2)
QuantConfig(131-271)quant_algo(550-551)
🪛 Ruff (0.14.10)
tests/unittest/_torch/modules/moe/test_moe_module.py
105-105: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
157-157: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
tests/unittest/_torch/modules/moe/quantize_utils.py
41-41: Do not perform function call ModelConfig in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
116-116: Do not perform function call ModelConfig in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
160-160: Do not perform function call ModelConfig in argument defaults; instead, perform the call within the function, or read the default from a module-level singleton variable
(B008)
196-196: BaseQuantizeUtil is an abstract base class, but it has no abstract methods or properties
(B024)
216-216: Unused method argument: quant_kwargs
(ARG002)
🔇 Additional comments (7)
tests/unittest/_torch/modules/moe/test_moe_module.py (3)
1-14: LGTM!Copyright header is correctly formatted with 2025 date.
76-84: LGTM!Good use of
torch.devicecontext manager and explicit seed setting for reproducibility.
143-150: LGTM!Clear inference mode usage and appropriate tolerance levels for different quantization modes.
tests/unittest/_torch/modules/moe/quantize_utils.py (4)
1-14: LGTM!Copyright header is correctly formatted with 2025 date.
67-84: LGTM!The forward implementation correctly applies routing weights with float32 accumulation for numerical stability. The reference implementation logic is clean and readable.
268-315: LGTM!FP8 weight creation correctly uses per-tensor quantization and properly validates required kwargs.
339-421: LGTM!NVFP4 weight creation correctly handles the constraint that w1 and w3 must share the same global scale, and properly uses the trtllm fp4_quantize op.
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
c28f9c3 to
0be5d72
Compare
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #31515 [ run ] triggered by Bot. Commit: |
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #31532 [ run ] triggered by Bot. Commit: |
|
PR_Github #31532 [ run ] completed with state
|
Signed-off-by: leslie-fang25 <leslief@nvidia.com>
|
/bot run --disable-fail-fast |
|
PR_Github #31665 [ run ] triggered by Bot. Commit: |
|
PR_Github #31665 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #31725 [ run ] triggered by Bot. Commit: |
|
PR_Github #31725 [ run ] completed with state |
Signed-off-by: leslie-fang25 <leslief@nvidia.com> Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.
Description
Add test for configurable moe module
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]to print this help message.See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-listparameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip
skip --comment COMMENTSkip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.