
[None][feat] fuse shared to sparse experts in TRT-LLM Gen MoE #11143

Merged
chzblych merged 2 commits into NVIDIA:release/1.2.0rc6.post1 from nekorobov:user/nkorobov/update-ds-fp8-cubins-fuse-shared
Feb 4, 2026

Conversation


@nekorobov commented Jan 30, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for fused shared experts in Mixture of Experts (MoE) routing, enabling more efficient handling of shared expert components in model inference.
  • API Changes

    • Added numFusedSharedExpert parameter to MoE runner interfaces to specify the number of fused shared experts.
    • Updated tensor shapes and workspace allocations to account for expanded total expert counts when shared experts are fused.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@nekorobov requested review from a team as code owners, January 30, 2026 18:43
@nekorobov changed the title from "feat: fuse shared to sparse experts in TRT-LLM Gen MoE" to "[None][feat] fuse shared to sparse experts in TRT-LLM Gen MoE", Jan 30, 2026
@nekorobov (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #34250 [ run ] triggered by Bot. Commit: 0b713c9

@tensorrt-cicd (Collaborator)

PR_Github #34250 [ run ] completed with state SUCCESS. Commit: 0b713c9
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #24 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nekorobov (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #34255 [ run ] triggered by Bot. Commit: 0b713c9

@tensorrt-cicd (Collaborator)

PR_Github #34255 [ run ] completed with state SUCCESS. Commit: 0b713c9
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #25 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nekorobov (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #34294 [ run ] triggered by Bot. Commit: 2006aac

@tensorrt-cicd (Collaborator)

PR_Github #34294 [ run ] completed with state SUCCESS. Commit: 2006aac
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #27 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@longlee0622 (Collaborator)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #34525 [ run ] triggered by Bot. Commit: 2006aac

@tensorrt-cicd (Collaborator)

PR_Github #34525 [ run ] completed with state SUCCESS. Commit: 2006aac
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #31 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@xxi-nv (Collaborator) left a comment

Overall, LGTM

@longlee0622 added the "Release Blocker" label (PRs blocking the final release build or branching out of the release branch), Feb 3, 2026
@@ -219,7 +226,7 @@ def _check_configs(self):
def _get_quant_method(self):
if self.quant_config is not None:
if self.quant_config.layer_quant_mode.has_fp8_block_scales():
return DeepSeekFP8BlockScalesFusedMoEMethod()
return DeepSeekFP8TRTLLMGenBlockScalesFusedMoEMethod()

It appears that you're overriding the original behavior. The new method will invariably attempt to fuse the shared_expert. Could I confirm whether this consistently yields better performance?

@nekorobov (Collaborator, Author):

Yes, @lishicheng1996 to confirm.

Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
@nekorobov force-pushed the user/nkorobov/update-ds-fp8-cubins-fuse-shared branch from 2006aac to 6659abf on February 3, 2026 at 13:39

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Walkthrough

This pull request adds comprehensive fused shared expert support to TensorRT-LLM's MoE routing pipeline. Changes span CUDA kernels, C++ runners and implementations, Python bindings, quantization methods, and model-specific integration, introducing new parameters to track fused expert counts and updating expert indexing, tensor shapes, and routing logic throughout.

Changes

Cohort / File(s) / Summary

  • CUDA Kernel and Core Routing
    cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
    Added new fields (mNumFusedSharedExperts, mSharedExpertTokenOffset, mSharedExpertNumTokens, mTotalExpertsPerToken) to kernel data structures. Updated the routing kernel to handle fused shared experts by expanding expert indexing, computing new indices with shared expert offsets, and adjusting per-token expert counts to accommodate shared experts.

  • Routing Runner Infrastructure
    cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu
    Extended the Runner::run signature with a new numFusedSharedExpert parameter. Updated routing data initialization to compute total expert counts and per-token sharing details. Modified GEMM workspace and kernel configurations to use expanded expert totals (totalExpertsPerToken, totalNumExperts). Added validation guards for routing methods incompatible with fused shared experts.

  • C++ MoE Kernel Call Sites
    cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp, cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp, cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
    Updated routing runner invocations to pass the numFusedSharedExpert parameter (set to 0 or a computed value). Adjusted call argument order and added new buffer pointers (cta_idx_xy_to_mn_limit, num_non_exiting_ctas) required by the extended routing kernel interface.

  • FP8 Block Scale MoE Implementation
    cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
    Introduced the num_fused_shared_experts parameter across public entry points and internal methods. Updated tensor allocations, weight/scale validation, and autotuner logic to use total expert counts (num_total_experts, num_total_local_experts, total_experts_per_token). Modified the routing kernel call to propagate the fused shared expert parameter and updated GEMM configurations accordingly.

  • Python Custom Ops Bindings
    tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
    Added the num_fused_shared_experts parameter to FP4BlockScaleMoERunner and FP8BlockScaleMoERunner constructors and public wrapper functions. Updated forward paths and tactic generation to propagate the parameter through to the underlying kernel runners. Extended AutoTuner input preparation to include the fused shared expert configuration.

  • MoE Fusion and Quantization
    tensorrt_llm/_torch/modules/fused_moe/quantization.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
    Updated create_weights method signatures to accept an n_shared_experts parameter. Added new fuse_shared_expert methods to handle fusion of shared-expert weights and scales with existing MoE weights. Modified weight shape calculations to accommodate expanded expert dimensions. Introduced a new public attribute num_fused_shared_expert in fused MoE class initialization.

  • DeepSeekV3 Model Integration
    tensorrt_llm/_torch/models/modeling_deepseekv3.py
    Extended shared expert TP size computation to consider fused shared expert presence. Added a conditional execution path for parallel shared expert computation. Modified post-load-weights to fuse shared experts into MoE layers via new fuse_shared_expert method calls. Updated MoE forward to force finalization behavior when shared experts are present.

  • Test Reference Implementation
    tests/unittest/_torch/thop/serial/test_moe.py
    Extended the routing reference implementation to support the num_fused_shared_experts parameter. Updated test data structures and tensor allocations to expand per-token expert arrays by the shared expert count. Modified test parameterizations and Moe runner invocations to propagate the fused expert parameter throughout test flows. Added backward-compatibility preservation when the parameter is zero.

Sequence Diagram(s)

sequenceDiagram
    participant Kernel as Routing Kernel
    participant Runner as MoE Runner
    participant Config as Config/Quantization
    participant Fusion as Expert Fusion

    Runner->>Kernel: Execute with numFusedSharedExperts
    Note over Kernel: Expand per-token expert indices<br/>mTotalExpertsPerToken = topK + numFusedSharedExperts
    Kernel->>Kernel: Write routed expert scores<br/>+ fused shared expert weights
    Kernel-->>Runner: Return expanded routing indices<br/>& expert counts
    
    Runner->>Runner: Compute total expert counts<br/>totalNumExperts = numExperts + numFusedSharedExperts
    Runner->>Runner: Configure GEMM with expanded dimensions<br/>totalExpertsPerToken for workspace
    
    Runner->>Config: Request valid configs/tactics<br/>with total expert counts
    Config->>Config: Allocate tensors sized by totalExpertsPerToken
    Config->>Config: Compute weight shapes:<br/>w3_w1, w2 += numFusedSharedExperts
    
    Config->>Fusion: Prepare fused shared experts
    Fusion->>Fusion: Load shared expert weights
    Fusion->>Fusion: Permute and fuse shared weights<br/>into MoE weight structure
    Fusion-->>Config: Return fused weight tensors
    
    Config-->>Runner: Return valid configurations
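The totals the runner derives in the diagram reduce to simple arithmetic. A hedged sketch, with function and key names chosen to mirror the walkthrough rather than the actual C++ API:

```python
# Sketch of the expanded count computation described in the walkthrough.
# Names are illustrative, not the real runner interface.
def expanded_counts(num_experts, top_k, local_num_experts, num_fused_shared_experts=0):
    # normalize a possibly-None or negative input once, then use it everywhere
    n = max(int(num_fused_shared_experts or 0), 0)
    return {
        "total_num_experts": num_experts + n,              # sizes expert histograms
        "total_experts_per_token": top_k + n,              # sizes per-token index/weight tensors
        "num_total_local_experts": local_num_experts + n,  # sizes local weight tensors
    }

counts = expanded_counts(num_experts=256, top_k=8, local_num_experts=32,
                         num_fused_shared_experts=1)
# → {'total_num_experts': 257, 'total_experts_per_token': 9, 'num_total_local_experts': 33}
```

Normalizing once and reusing the same value is also what the review comment on fp8BlockScaleMoe.cpp below recommends, instead of re-evaluating the optional at each use site.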

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
  • Description check (⚠️ Warning): The PR description only contains the template, with no actual content in the Description and Test Coverage sections, leaving the purpose and testing strategy unclear. Resolution: add a clear Description section explaining the feature, its purpose, and implementation approach, and a Test Coverage section documenting the tests that validate the shared-expert fusion functionality.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 8.93%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title clearly describes the main feature, fusing shared into sparse experts in TRT-LLM Gen MoE, which aligns with the extensive changes across kernel, routing, and Python integration files.


@coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (1)

646-690: ⚠️ Potential issue | 🟠 Major

Add an explicit guard for the expanded expert count.

numExperts now includes fused shared experts, but there is no check to ensure the expanded value stays within kernel limits. If numExperts exceeds MaxSupportedExpertCount, getMaxNumExperts returns 0 and leads to invalid thread configuration and kernel launches.

🐛 Suggested fix
     int const numExperts = data.mNumExperts + data.mNumFusedSharedExperts;
     int const topK = data.mTopK + data.mNumFusedSharedExperts;
+    TLLM_CHECK_WITH_INFO(numExperts <= MaxSupportedExpertCount,
+        "Routing kernel expects `#experts` %d to be <= %d", numExperts, MaxSupportedExpertCount);
     int const numThreadsHist = getMaxNumExperts(numExperts);
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)

1-2: ⚠️ Potential issue | 🟠 Major

Update the copyright year to 2026.

This file was modified in this PR but the header still ends at 2025.

✏️ Proposed fix
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)

1-2: ⚠️ Potential issue | 🟠 Major

Update the copyright year to 2026.

This file was modified in this PR but the header still ends at 2024.

✏️ Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

tests/unittest/_torch/thop/serial/test_moe.py (1)

1-2: ⚠️ Potential issue | 🟠 Major

Update the SPDX year to 2026.

The file is modified in this PR but the SPDX header still ends at 2024.

✏️ Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (3)

1-1: ⚠️ Potential issue | 🟠 Major

Add the required NVIDIA copyright header.

This TensorRT-LLM source file is missing the standard header block.

📄 Suggested header
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+ #
 from dataclasses import dataclass, replace

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.


561-563: ⚠️ Potential issue | 🟠 Major

Include num_fused_shared_experts in the AutoTuner cache key.

Tactic validity depends on the total experts-per-token. If this parameter changes, the current cache key can reuse incompatible tactics.

🔧 Proposed fix
-    def unique_id(self):
-        return (self.top_k, self.intermediate_size, self.local_num_experts)
+    def unique_id(self):
+        return (self.top_k, self.num_fused_shared_experts or 0,
+                self.intermediate_size, self.local_num_experts)

765-789: ⚠️ Potential issue | 🟡 Minor

Silence the unused num_fused_shared_experts in the fake kernel.

Ruff flags this as unused; keep the signature but mark it explicitly to avoid lint failures.

🧹 Proposed fix
 def _(routing_logits: torch.Tensor,
       routing_bias: torch.Tensor,
       hidden_states: torch.Tensor,
       hidden_states_scale: torch.Tensor,
       gemm1_weights: torch.Tensor,
       gemm1_weights_scale: torch.Tensor,
       gemm2_weights: torch.Tensor,
       gemm2_weights_scale: torch.Tensor,
       num_experts: int,
       top_k: int,
       num_fused_shared_experts: Optional[int],
       n_group: Optional[int],
       topk_group: Optional[int],
       intermediate_size: int,
       local_expert_offset: int,
       local_num_experts: int,
       routed_scaling_factor: Optional[float],
       routing_method_type: int,
       topk_weights: Optional[torch.Tensor] = None,
       topk_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
+    _ = num_fused_shared_experts
     num_tokens = hidden_states.shape[0]
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (1)

1-2: ⚠️ Potential issue | 🟠 Major

Update the copyright year to 2026.

This file was modified in this PR but the header still ends at 2025.

✏️ Proposed fix
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (2)

1-3: ⚠️ Potential issue | 🟡 Minor

Update NVIDIA copyright year to include 2026.

This file was meaningfully modified in 2026, but the header still ends at 2024.

✍️ Suggested update
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, "All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification."


49-63: ⚠️ Potential issue | 🟠 Major

Normalize/validate num_fused_shared_experts and use it consistently for shape checks and totals.

Right now, the top‑k shape checks still assume top_k, while totals clamp with a > 0 expression and the kernel args pass the raw optional value. With fused shared experts enabled, this can accept too‑small topk_* buffers (or reject expanded ones) and desync totals vs kernel args (e.g., negative values). Consider validating once and using a single normalized value everywhere.

✅ Suggested fix
     TORCH_CHECK(tensorrt_llm::common::isSM100Family(), "Only SM100f is supported by FP8 block scale MOE");
 
+    int64_t const numFusedSharedExperts = num_fused_shared_experts.value_or(0);
+    TORCH_CHECK(numFusedSharedExperts >= 0, "num_fused_shared_experts must be non-negative.");
+
     if (topk_ids.has_value() && topk_weights.has_value())
     {
+        int64_t const expectedCols = top_k + numFusedSharedExperts;
         TORCH_CHECK(topk_ids.value().scalar_type() == at::ScalarType::Int, "topk_ids must be int");
         TORCH_CHECK(topk_weights.value().scalar_type() == at::ScalarType::BFloat16, "topk_weights must be bfloat16.");
         TORCH_CHECK(topk_ids.value().dim() == 2, "topk_ids must be 2D.");
         TORCH_CHECK(topk_ids.value().sizes()[0] == hidden_states.sizes()[0],
             "topk_ids and hidden_states must have the same number of tokens.");
-        TORCH_CHECK(topk_ids.value().sizes()[1] == top_k, "topk_ids dim1 must match top_k.");
+        TORCH_CHECK(topk_ids.value().sizes()[1] == expectedCols,
+            "topk_ids dim1 must match top_k (+ num_fused_shared_experts).");
         TORCH_CHECK(topk_weights.value().dim() == 2, "topk_weights must be 2D.");
         TORCH_CHECK(topk_weights.value().sizes()[0] == hidden_states.sizes()[0],
             "topk_weights and hidden_states must have the same number of tokens.");
-        TORCH_CHECK(topk_weights.value().sizes()[1] == top_k, "topk_weights dim1 must match top_k.");
+        TORCH_CHECK(topk_weights.value().sizes()[1] == expectedCols,
+            "topk_weights dim1 must match top_k (+ num_fused_shared_experts).");
     }
@@
-    int64_t const num_total_experts
-        = num_experts + (num_fused_shared_experts.value_or(0) > 0 ? num_fused_shared_experts.value() : 0);
-    int64_t const total_experts_per_token
-        = top_k + (num_fused_shared_experts.value_or(0) > 0 ? num_fused_shared_experts.value() : 0);
-    int64_t const num_total_local_experts
-        = local_num_experts + (num_fused_shared_experts.value_or(0) > 0 ? num_fused_shared_experts.value() : 0);
+    int64_t const num_total_experts = num_experts + numFusedSharedExperts;
+    int64_t const total_experts_per_token = top_k + numFusedSharedExperts;
+    int64_t const num_total_local_experts = local_num_experts + numFusedSharedExperts;
@@
-    args.num_fused_shared_experts = num_fused_shared_experts.value_or(0);
+    args.num_fused_shared_experts = numFusedSharedExperts;

Also applies to: 145-176

🤖 Fix all issues with AI agents
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu`:
- Around line 714-721: The TLLM_CHECK_WITH_INFO call currently has a malformed
message and doesn't print the actual fused-expert count; update the check at the
TLLM_CHECK_WITH_INFO invocation to include both data.mNumFusedSharedExperts and
WarpSize in the formatted message and fix the punctuation/parentheses – e.g.,
use a format like "Number of fused shared experts (%d) must be less than warp
size (%d)." and pass data.mNumFusedSharedExperts then WarpSize as the format
arguments so the real values are logged.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu`:
- Around line 114-129: The code computes numDevices = numExperts /
localNumExperts and deviceIndex = localExpertOffset / localNumExperts without
validating localNumExperts or ensuring experts partition evenly, risking
divide-by-zero or misrouting; add guards at the start of this block to validate
localNumExperts > 0 and that numExperts >= localNumExperts, and handle
non-divisible partitions (e.g., compute numDevices as max(1, numExperts /
localNumExperts) or bail/throw via an error/log when configuration is invalid),
then compute deviceIndex safely (avoid division by zero) and adjust the token
offset/num-tokens calculation for uneven partitions so
routingData.mSharedExpertTokenOffset and routingData.mSharedExpertNumTokens
remain correct and deterministic.
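The guard that fix describes can be sketched as follows. This is a hypothetical Python rendering, not the runner.cu code; the token-split policy for uneven token counts is an assumption chosen only to show one deterministic option:

```python
# Hypothetical sketch of the validated partition math the fix asks for.
# Mirrors: numDevices = numExperts / localNumExperts,
#          deviceIndex = localExpertOffset / localNumExperts,
# with explicit guards instead of silent divide-by-zero or misrouting.
def shared_expert_token_split(num_tokens, num_experts, local_num_experts,
                              local_expert_offset):
    if local_num_experts <= 0:
        raise ValueError("localNumExperts must be > 0")
    if num_experts < local_num_experts or num_experts % local_num_experts != 0:
        raise ValueError("experts must partition evenly across devices")
    num_devices = num_experts // local_num_experts
    device_index = local_expert_offset // local_num_experts
    # split tokens contiguously across devices; earlier devices take the remainder
    base, rem = divmod(num_tokens, num_devices)
    token_offset = device_index * base + min(device_index, rem)
    num_tokens_local = base + (1 if device_index < rem else 0)
    return token_offset, num_tokens_local
```

For example, 10 tokens over 8 experts with 4 local experts gives two devices; the device at offset 4 handles tokens [5, 10).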

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h`:
- Around line 149-156: The function prototype run(...) in runner.h introduces a
new parameter numFusedSharedExpert that lacks Doxygen documentation; add a
Doxygen `@param` entry for numFusedSharedExpert immediately above the run
declaration in runner.h (within the existing Doxygen block for run) describing
what the parameter represents, expected range/units/semantics (e.g., number of
fused shared experts per group or per-token), any constraints or default
behavior, and how it affects routing/processing so callers understand its
purpose and valid values; keep the style consistent with the other `@param`
entries in that comment block.

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py`:
- Around line 169-178: The code assigns num_fused_shared_expert after calling
create_weights(), but create_weights() uses that value for FP8 block‑scale
sizing; move the assignment of self.num_fused_shared_expert (setting it from
model_config.pretrained_config.n_shared_experts when
model_config.mapping.dp_size == 1 and
self.quant_config.layer_quant_mode.has_fp8_block_scales()) so it occurs before
the call to self.create_weights(); ensure self.layer_idx and
self._weights_created initialization remain correct and unchanged.

In `@tests/unittest/_torch/thop/serial/test_moe.py`:
- Around line 1050-1051: Remove the unconditional debug prints of
output_dequant_reference and output_dequant_actual in the test; either delete
the two print(...) lines or gate them behind a runtime flag (e.g., check
os.environ["CI_DEBUG"] or a pytest config option) so they only print when
debugging is explicitly enabled, and if gating add the necessary import (os) and
use the flag check around the prints in
tests/unittest/_torch/thop/serial/test_moe.py to avoid flooding CI logs.
- Around line 169-174: Replace the deprecated torch.range usage in the
sharedIndices construction: in the block that defines sharedIndices (using
numExperts, num_fused_shared_experts, topKIndices.dtype), switch
torch.range(...) to torch.arange(...) and adjust the end parameter to remove the
`- 1` offset (so the range goes from numExperts to numExperts +
num_fused_shared_experts), then continue to unsqueeze, repeat (numTokens, 1) and
torch.cat with topKIndices as before.
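The reason the `- 1` offset disappears in that fix: the deprecated `torch.range(start, end)` includes `end`, while `torch.arange(start, end)` excludes it. The same half-open convention can be demonstrated with Python's built-in `range`:

```python
# Python's range(), like torch.arange(), excludes the end bound, while the
# deprecated torch.range() included it. Hence
#   torch.range(n, n + f - 1)  becomes  torch.arange(n, n + f).
num_experts, num_fused_shared_experts = 4, 2

# inclusive-end form (what torch.range(num_experts, num_experts + num_fused_shared_experts - 1) produced)
inclusive_end = list(range(num_experts, (num_experts + num_fused_shared_experts - 1) + 1))

# half-open form (the torch.arange equivalent)
half_open = list(range(num_experts, num_experts + num_fused_shared_experts))

assert inclusive_end == half_open == [4, 5]
```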
🧹 Nitpick comments (4)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (1)

150-154: Initialize fused‑shared fields to avoid stale values.

These new scalars are uninitialized while adjacent members have defaults; defaulting them to 0 prevents accidental use of garbage when a caller constructs params without setBaseParams().

Suggested change
-    int32_t mNumFusedSharedExperts;
-    int32_t mSharedExpertTokenOffset;
-    int32_t mSharedExpertNumTokens;
-    int32_t mTotalExpertsPerToken;
+    int32_t mNumFusedSharedExperts = 0;
+    int32_t mSharedExpertTokenOffset = 0;
+    int32_t mSharedExpertNumTokens = 0;
+    int32_t mTotalExpertsPerToken = 0;
cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp (1)

77-86: Hoist the default routed scaling factor into a named constant.

Using 1.0 directly in the call expression violates the literal-usage rule; prefer a named const with the prescribed naming scheme.

Suggested change
-    routing_runner.run(routing_logits_ptr, routing_bias_ptr, num_tokens, num_experts, top_k,
-        /* num_fused_shared_expert */ 0, n_group.value_or(0), topk_group.value_or(0), local_expert_offset,
-        local_num_experts, routed_scaling_factor.value_or(1.0), expert_indexes.data_ptr<int>(),
+    double const kDEFAULT_ROUTED_SCALING_FACTOR = 1.0;
+    routing_runner.run(routing_logits_ptr, routing_bias_ptr, num_tokens, num_experts, top_k,
+        /* num_fused_shared_expert */ 0, n_group.value_or(0), topk_group.value_or(0), local_expert_offset,
+        local_num_experts, routed_scaling_factor.value_or(kDEFAULT_ROUTED_SCALING_FACTOR),
+        expert_indexes.data_ptr<int>(),
         expert_count_histogram.data_ptr<int>(), total_num_padded_tokens.data_ptr<int>(),
         expanded_idx_to_permuted_idx.data_ptr<int>(), permuted_idx_to_expanded_idx.data_ptr<int>(),

As per coding guidelines "Except for 0, nullptr, true, and false, all other literals should only be used for variable initialization and not in comparisons or expressions."

tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)

21-23: Keep the module namespace in the GatedMLP import.

This file’s import style breaks the project’s namespace rule. Please switch to a module import and qualify the type hint accordingly.

🔧 Suggested change
-from ..gated_mlp import GatedMLP
+from .. import gated_mlp
@@
-    def fuse_shared_expert(self, module: torch.nn.Module,
-                           shared_experts: GatedMLP, n_shared_experts: int):
+    def fuse_shared_expert(self, module: torch.nn.Module,
+                           shared_experts: gated_mlp.GatedMLP, n_shared_experts: int):

As per coding guidelines, Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

tests/unittest/_torch/thop/serial/test_moe.py (1)

149-156: Use snake_case for newly added locals.

New variables like numTotalExperts / totalExpertsPerToken / numTokensPerExpert should follow snake_case to align with Python style.

♻️ Example rename (apply consistently)
-    numTotalExperts = numExperts + num_fused_shared_experts
-    totalExpertsPerToken = topK + num_fused_shared_experts
+    num_total_experts = numExperts + num_fused_shared_experts
+    total_experts_per_token = topK + num_fused_shared_experts

-    numTokensPerExpert = torch.zeros(numTotalExperts, dtype=torch.int64)
+    num_tokens_per_expert = torch.zeros(num_total_experts, dtype=torch.int64)

As per coding guidelines, Python local variables should use snake_case, with prefix k for variable names that start with a number.

Comment on lines 169 to 178
self._weights_created = False
self.num_fused_shared_expert = 0
if not model_config.skip_create_weights_in_init:
self.create_weights()
self.layer_idx = layer_idx

if model_config.mapping.dp_size == 1 and self.quant_config.layer_quant_mode.has_fp8_block_scales(
):
self.num_fused_shared_expert = model_config.pretrained_config.n_shared_experts

Copy link
Contributor


⚠️ Potential issue | 🟠 Major

Set num_fused_shared_expert before create_weights().

create_weights() now sizes FP8 block‑scale weights based on num_fused_shared_expert, but the value is assigned after the call. For DP‑size 1 + FP8 block‑scales this can allocate weights for zero fused experts and break subsequent fusion.

Suggested fix (reorder assignment)
-        self._weights_created = False
-        self.num_fused_shared_expert = 0
-        if not model_config.skip_create_weights_in_init:
-            self.create_weights()
-        self.layer_idx = layer_idx
-
-        if model_config.mapping.dp_size == 1 and self.quant_config.layer_quant_mode.has_fp8_block_scales(
-        ):
-            self.num_fused_shared_expert = model_config.pretrained_config.n_shared_experts
+        self._weights_created = False
+        self.num_fused_shared_expert = 0
+        if model_config.mapping.dp_size == 1 and self.quant_config.layer_quant_mode.has_fp8_block_scales(
+        ):
+            self.num_fused_shared_expert = model_config.pretrained_config.n_shared_experts
+        if not model_config.skip_create_weights_in_init:
+            self.create_weights()
+        self.layer_idx = layer_idx
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` around lines
169 - 178, The code assigns num_fused_shared_expert after calling
create_weights(), but create_weights() uses that value for FP8 block‑scale
sizing; move the assignment of self.num_fused_shared_expert (setting it from
model_config.pretrained_config.n_shared_experts when
model_config.mapping.dp_size == 1 and
self.quant_config.layer_quant_mode.has_fp8_block_scales()) so it occurs before
the call to self.create_weights(); ensure self.layer_idx and
self._weights_created initialization remain correct and unchanged.

Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
@nekorobov
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #34648 [ run ] triggered by Bot. Commit: 22988f8

@tensorrt-cicd
Copy link
Collaborator

PR_Github #34648 [ run ] completed with state FAILURE. Commit: 22988f8
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #35 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nekorobov
Copy link
Collaborator Author

/bot run --disable-failt-fast

@nekorobov
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #34686 [ run ] triggered by Bot. Commit: 22988f8

@tensorrt-cicd
Copy link
Collaborator

PR_Github #34685 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --disable-failt-fast

@longlee0622
Copy link
Collaborator

/bot kill

@chzblych chzblych merged commit 7c6df0e into NVIDIA:release/1.2.0rc6.post1 Feb 4, 2026
4 of 5 checks passed
@tensorrt-cicd
Copy link
Collaborator

PR_Github #34686 [ run ] completed with state SUCCESS. Commit: 22988f8
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #36 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

nekorobov added a commit to nekorobov/TensorRT-LLM that referenced this pull request Feb 13, 2026
…#11143)

Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
nekorobov added a commit to nekorobov/TensorRT-LLM that referenced this pull request Feb 13, 2026
…#11143)

Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>

Labels

Release Blocker: PRs that block the final release build or branching out of the release branch


7 participants