
[None][feat] fuse shared to sparse experts in TRT-LLM Gen MoE #11143

Merged
chzblych merged 2 commits into NVIDIA:release/1.2.0rc6.post1 from nekorobov:user/nkorobov/update-ds-fp8-cubins-fuse-shared
Feb 4, 2026

Conversation


@nekorobov commented Jan 30, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for fused shared experts in Mixture of Experts (MoE) routing, enabling more efficient handling of shared expert components in model inference.
  • API Changes

    • Added numFusedSharedExpert parameter to MoE runner interfaces to specify the number of fused shared experts.
    • Updated tensor shapes and workspace allocations to account for expanded total expert counts when shared experts are fused.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@nekorobov requested review from a team as code owners, January 30, 2026 18:43
@nekorobov changed the title from "feat: fuse shared to sparse experts in TRT-LLM Gen MoE" to "[None][feat] fuse shared to sparse experts in TRT-LLM Gen MoE", Jan 30, 2026
@nekorobov (Collaborator, Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #34250 [ run ] triggered by Bot. Commit: 0b713c9

@tensorrt-cicd (Collaborator)

PR_Github #34250 [ run ] completed with state SUCCESS. Commit: 0b713c9
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #24 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nekorobov (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #34255 [ run ] triggered by Bot. Commit: 0b713c9

@tensorrt-cicd (Collaborator)

PR_Github #34255 [ run ] completed with state SUCCESS. Commit: 0b713c9
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #25 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nekorobov (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #34294 [ run ] triggered by Bot. Commit: 2006aac

@tensorrt-cicd (Collaborator)

PR_Github #34294 [ run ] completed with state SUCCESS. Commit: 2006aac
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #27 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@longlee0622 (Collaborator)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #34525 [ run ] triggered by Bot. Commit: 2006aac

@tensorrt-cicd (Collaborator)

PR_Github #34525 [ run ] completed with state SUCCESS. Commit: 2006aac
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #31 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@xxi-nv (Collaborator) left a comment

Overall, LGTM

@longlee0622 added the "Release Blocker" label (PRs blocking the final release build or branching out of the release branch), Feb 3, 2026
@@ -219,7 +226,7 @@ def _check_configs(self):
def _get_quant_method(self):
if self.quant_config is not None:
if self.quant_config.layer_quant_mode.has_fp8_block_scales():
return DeepSeekFP8BlockScalesFusedMoEMethod()
return DeepSeekFP8TRTLLMGenBlockScalesFusedMoEMethod()

It appears that you're overriding the original behavior. The new method will invariably attempt to fuse the shared_expert. Could I confirm whether this consistently yields better performance?

@nekorobov (Collaborator, Author):

Yes, @lishicheng1996 to confirm.

Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
@nekorobov force-pushed the user/nkorobov/update-ds-fp8-cubins-fuse-shared branch from 2006aac to 6659abf on February 3, 2026 at 13:39

coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Walkthrough

This pull request adds comprehensive fused shared expert support to TensorRT-LLM's MoE routing pipeline. Changes span CUDA kernels, C++ runners and implementations, Python bindings, quantization methods, and model-specific integration, introducing new parameters to track fused expert counts and updating expert indexing, tensor shapes, and routing logic throughout.

Changes

Cohort / File(s) / Summary

  • CUDA Kernel and Core Routing
    cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h
    Added new fields (mNumFusedSharedExperts, mSharedExpertTokenOffset, mSharedExpertNumTokens, mTotalExpertsPerToken) to kernel data structures. Updated the routing kernel to handle fused shared experts by expanding expert indexing, computing new indices with shared expert offsets, and adjusting per-token expert counts to accommodate shared experts.

  • Routing Runner Infrastructure
    cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu
    Extended the Runner::run signature with a new numFusedSharedExpert parameter. Updated routing data initialization to compute total expert counts and per-token sharing details. Modified GEMM workspace and kernel configurations to use expanded expert totals (totalExpertsPerToken, totalNumExperts). Added validation guards for routing methods incompatible with fused shared experts.

  • C++ MoE Kernel Call Sites
    cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp, cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp, cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp, cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
    Updated routing runner invocations to pass the numFusedSharedExpert parameter (set to 0 or a computed value). Adjusted call argument order and added new buffer pointers (cta_idx_xy_to_mn_limit, num_non_exiting_ctas) required by the extended routing kernel interface.

  • FP8 Block Scale MoE Implementation
    cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
    Introduced the num_fused_shared_experts parameter across public entry points and internal methods. Updated tensor allocations, weight/scale validation, and autotuner logic to use total expert counts (num_total_experts, num_total_local_experts, total_experts_per_token). Modified the routing kernel call to propagate the fused shared expert parameter and updated GEMM configurations accordingly.

  • Python Custom Ops Bindings
    tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
    Added the num_fused_shared_experts parameter to FP4BlockScaleMoERunner and FP8BlockScaleMoERunner constructors and public wrapper functions. Updated forward paths and tactic generation to propagate the parameter through to the underlying kernel runners. Extended AutoTuner input preparation to include the fused shared expert configuration.

  • MoE Fusion and Quantization
    tensorrt_llm/_torch/modules/fused_moe/quantization.py, tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
    Updated create_weights method signatures to accept an n_shared_experts parameter. Added new fuse_shared_expert methods to handle fusion of shared-expert weights and scales with existing MoE weights. Modified weight shape calculations to accommodate expanded expert dimensions. Introduced a new public attribute num_fused_shared_expert in fused MoE class initialization.

  • DeepSeekV3 Model Integration
    tensorrt_llm/_torch/models/modeling_deepseekv3.py
    Extended shared expert TP size computation to consider fused shared expert presence. Added a conditional execution path for parallel shared expert computation. Modified post-load-weights to fuse shared experts into MoE layers via new fuse_shared_expert method calls. Updated MoE forward to force finalization behavior when shared experts are present.

  • Test Reference Implementation
    tests/unittest/_torch/thop/serial/test_moe.py
    Extended the routing reference implementation to support the num_fused_shared_experts parameter. Updated test data structures and tensor allocations to expand per-token expert arrays by the shared expert count. Modified test parameterizations and Moe runner invocations to propagate the fused expert parameter throughout test flows. Added backward-compatibility preservation when the parameter is zero.

Sequence Diagram(s)

sequenceDiagram
    participant Kernel as Routing Kernel
    participant Runner as MoE Runner
    participant Config as Config/Quantization
    participant Fusion as Expert Fusion

    Runner->>Kernel: Execute with numFusedSharedExperts
    Note over Kernel: Expand per-token expert indices<br/>mTotalExpertsPerToken = topK + numFusedSharedExperts
    Kernel->>Kernel: Write routed expert scores<br/>+ fused shared expert weights
    Kernel-->>Runner: Return expanded routing indices<br/>& expert counts
    
    Runner->>Runner: Compute total expert counts<br/>totalNumExperts = numExperts + numFusedSharedExperts
    Runner->>Runner: Configure GEMM with expanded dimensions<br/>totalExpertsPerToken for workspace
    
    Runner->>Config: Request valid configs/tactics<br/>with total expert counts
    Config->>Config: Allocate tensors sized by totalExpertsPerToken
    Config->>Config: Compute weight shapes:<br/>w3_w1, w2 += numFusedSharedExperts
    
    Config->>Fusion: Prepare fused shared experts
    Fusion->>Fusion: Load shared expert weights
    Fusion->>Fusion: Permute and fuse shared weights<br/>into MoE weight structure
    Fusion-->>Config: Return fused weight tensors
    
    Config-->>Runner: Return valid configurations
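The totals the runner derives in the diagram reduce to simple arithmetic. A hedged sketch, with function and key names chosen to mirror the walkthrough rather than the actual C++ API:

```python
# Sketch of the expanded count computation described in the walkthrough.
# Names are illustrative, not the real runner interface.
def expanded_counts(num_experts, top_k, local_num_experts, num_fused_shared_experts=0):
    # normalize a possibly-None or negative input once, then use it everywhere
    n = max(int(num_fused_shared_experts or 0), 0)
    return {
        "total_num_experts": num_experts + n,              # sizes expert histograms
        "total_experts_per_token": top_k + n,              # sizes per-token index/weight tensors
        "num_total_local_experts": local_num_experts + n,  # sizes local weight tensors
    }

counts = expanded_counts(num_experts=256, top_k=8, local_num_experts=32,
                         num_fused_shared_experts=1)
# → {'total_num_experts': 257, 'total_experts_per_token': 9, 'num_total_local_experts': 33}
```

Normalizing once and reusing the same value is also what the review comment on fp8BlockScaleMoe.cpp below recommends, instead of re-evaluating the optional at each use site.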

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
  • Description check (⚠️ Warning): The PR description only contains the template, with no actual content in the Description and Test Coverage sections, leaving the purpose and testing strategy unclear. Resolution: add a clear Description section explaining the feature, its purpose, and implementation approach, and a Test Coverage section documenting the tests that validate the shared-expert fusion functionality.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 8.93%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title clearly describes the main feature, fusing shared into sparse experts in TRT-LLM Gen MoE, which aligns with the extensive changes across kernel, routing, and Python integration files.


@coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (1)

646-690: ⚠️ Potential issue | 🟠 Major

Add an explicit guard for the expanded expert count.

numExperts now includes fused shared experts, but there is no check to ensure the expanded value stays within kernel limits. If numExperts exceeds MaxSupportedExpertCount, getMaxNumExperts returns 0 and leads to invalid thread configuration and kernel launches.

🐛 Suggested fix
     int const numExperts = data.mNumExperts + data.mNumFusedSharedExperts;
     int const topK = data.mTopK + data.mNumFusedSharedExperts;
+    TLLM_CHECK_WITH_INFO(numExperts <= MaxSupportedExpertCount,
+        "Routing kernel expects `#experts` %d to be <= %d", numExperts, MaxSupportedExpertCount);
     int const numThreadsHist = getMaxNumExperts(numExperts);
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)

1-2: ⚠️ Potential issue | 🟠 Major

Update the copyright year to 2026.

This file was modified in this PR but the header still ends at 2025.

✏️ Proposed fix
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)

1-2: ⚠️ Potential issue | 🟠 Major

Update the copyright year to 2026.

This file was modified in this PR but the header still ends at 2024.

✏️ Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

tests/unittest/_torch/thop/serial/test_moe.py (1)

1-2: ⚠️ Potential issue | 🟠 Major

Update the SPDX year to 2026.

The file is modified in this PR but the SPDX header still ends at 2024.

✏️ Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (3)

1-1: ⚠️ Potential issue | 🟠 Major

Add the required NVIDIA copyright header.

This TensorRT-LLM source file is missing the standard header block.

📄 Suggested header
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+ #
 from dataclasses import dataclass, replace

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.


561-563: ⚠️ Potential issue | 🟠 Major

Include num_fused_shared_experts in the AutoTuner cache key.

Tactic validity depends on the total experts-per-token. If this parameter changes, the current cache key can reuse incompatible tactics.

🔧 Proposed fix
-    def unique_id(self):
-        return (self.top_k, self.intermediate_size, self.local_num_experts)
+    def unique_id(self):
+        return (self.top_k, self.num_fused_shared_experts or 0,
+                self.intermediate_size, self.local_num_experts)

765-789: ⚠️ Potential issue | 🟡 Minor

Silence the unused num_fused_shared_experts in the fake kernel.

Ruff flags this as unused; keep the signature but mark it explicitly to avoid lint failures.

🧹 Proposed fix
 def _(routing_logits: torch.Tensor,
       routing_bias: torch.Tensor,
       hidden_states: torch.Tensor,
       hidden_states_scale: torch.Tensor,
       gemm1_weights: torch.Tensor,
       gemm1_weights_scale: torch.Tensor,
       gemm2_weights: torch.Tensor,
       gemm2_weights_scale: torch.Tensor,
       num_experts: int,
       top_k: int,
       num_fused_shared_experts: Optional[int],
       n_group: Optional[int],
       topk_group: Optional[int],
       intermediate_size: int,
       local_expert_offset: int,
       local_num_experts: int,
       routed_scaling_factor: Optional[float],
       routing_method_type: int,
       topk_weights: Optional[torch.Tensor] = None,
       topk_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
+    _ = num_fused_shared_experts
     num_tokens = hidden_states.shape[0]
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (1)

1-2: ⚠️ Potential issue | 🟠 Major

Update the copyright year to 2026.

This file was modified in this PR but the header still ends at 2025.

✏️ Proposed fix
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (2)

1-3: ⚠️ Potential issue | 🟡 Minor

Update NVIDIA copyright year to include 2026.

This file was meaningfully modified in 2026, but the header still ends at 2024.

✍️ Suggested update
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, "All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification."


49-63: ⚠️ Potential issue | 🟠 Major

Normalize/validate num_fused_shared_experts and use it consistently for shape checks and totals.

Right now, the top‑k shape checks still assume top_k, while totals clamp with a > 0 expression and the kernel args pass the raw optional value. With fused shared experts enabled, this can accept too‑small topk_* buffers (or reject expanded ones) and desync totals vs kernel args (e.g., negative values). Consider validating once and using a single normalized value everywhere.

✅ Suggested fix
     TORCH_CHECK(tensorrt_llm::common::isSM100Family(), "Only SM100f is supported by FP8 block scale MOE");
 
+    int64_t const numFusedSharedExperts = num_fused_shared_experts.value_or(0);
+    TORCH_CHECK(numFusedSharedExperts >= 0, "num_fused_shared_experts must be non-negative.");
+
     if (topk_ids.has_value() && topk_weights.has_value())
     {
+        int64_t const expectedCols = top_k + numFusedSharedExperts;
         TORCH_CHECK(topk_ids.value().scalar_type() == at::ScalarType::Int, "topk_ids must be int");
         TORCH_CHECK(topk_weights.value().scalar_type() == at::ScalarType::BFloat16, "topk_weights must be bfloat16.");
         TORCH_CHECK(topk_ids.value().dim() == 2, "topk_ids must be 2D.");
         TORCH_CHECK(topk_ids.value().sizes()[0] == hidden_states.sizes()[0],
             "topk_ids and hidden_states must have the same number of tokens.");
-        TORCH_CHECK(topk_ids.value().sizes()[1] == top_k, "topk_ids dim1 must match top_k.");
+        TORCH_CHECK(topk_ids.value().sizes()[1] == expectedCols,
+            "topk_ids dim1 must match top_k (+ num_fused_shared_experts).");
         TORCH_CHECK(topk_weights.value().dim() == 2, "topk_weights must be 2D.");
         TORCH_CHECK(topk_weights.value().sizes()[0] == hidden_states.sizes()[0],
             "topk_weights and hidden_states must have the same number of tokens.");
-        TORCH_CHECK(topk_weights.value().sizes()[1] == top_k, "topk_weights dim1 must match top_k.");
+        TORCH_CHECK(topk_weights.value().sizes()[1] == expectedCols,
+            "topk_weights dim1 must match top_k (+ num_fused_shared_experts).");
     }
@@
-    int64_t const num_total_experts
-        = num_experts + (num_fused_shared_experts.value_or(0) > 0 ? num_fused_shared_experts.value() : 0);
-    int64_t const total_experts_per_token
-        = top_k + (num_fused_shared_experts.value_or(0) > 0 ? num_fused_shared_experts.value() : 0);
-    int64_t const num_total_local_experts
-        = local_num_experts + (num_fused_shared_experts.value_or(0) > 0 ? num_fused_shared_experts.value() : 0);
+    int64_t const num_total_experts = num_experts + numFusedSharedExperts;
+    int64_t const total_experts_per_token = top_k + numFusedSharedExperts;
+    int64_t const num_total_local_experts = local_num_experts + numFusedSharedExperts;
@@
-    args.num_fused_shared_experts = num_fused_shared_experts.value_or(0);
+    args.num_fused_shared_experts = numFusedSharedExperts;

Also applies to: 145-176

🤖 Fix all issues with AI agents
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu`:
- Around line 714-721: The TLLM_CHECK_WITH_INFO call currently has a malformed
message and doesn't print the actual fused-expert count; update the check at the
TLLM_CHECK_WITH_INFO invocation to include both data.mNumFusedSharedExperts and
WarpSize in the formatted message and fix the punctuation/parentheses – e.g.,
use a format like "Number of fused shared experts (%d) must be less than warp
size (%d)." and pass data.mNumFusedSharedExperts then WarpSize as the format
arguments so the real values are logged.

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu`:
- Around line 114-129: The code computes numDevices = numExperts /
localNumExperts and deviceIndex = localExpertOffset / localNumExperts without
validating localNumExperts or ensuring experts partition evenly, risking
divide-by-zero or misrouting; add guards at the start of this block to validate
localNumExperts > 0 and that numExperts >= localNumExperts, and handle
non-divisible partitions (e.g., compute numDevices as max(1, numExperts /
localNumExperts) or bail/throw via an error/log when configuration is invalid),
then compute deviceIndex safely (avoid division by zero) and adjust the token
offset/num-tokens calculation for uneven partitions so
routingData.mSharedExpertTokenOffset and routingData.mSharedExpertNumTokens
remain correct and deterministic.
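The guard that fix describes can be sketched as follows. This is a hypothetical Python rendering, not the runner.cu code; the token-split policy for uneven token counts is an assumption chosen only to show one deterministic option:

```python
# Hypothetical sketch of the validated partition math the fix asks for.
# Mirrors: numDevices = numExperts / localNumExperts,
#          deviceIndex = localExpertOffset / localNumExperts,
# with explicit guards instead of silent divide-by-zero or misrouting.
def shared_expert_token_split(num_tokens, num_experts, local_num_experts,
                              local_expert_offset):
    if local_num_experts <= 0:
        raise ValueError("localNumExperts must be > 0")
    if num_experts < local_num_experts or num_experts % local_num_experts != 0:
        raise ValueError("experts must partition evenly across devices")
    num_devices = num_experts // local_num_experts
    device_index = local_expert_offset // local_num_experts
    # split tokens contiguously across devices; earlier devices take the remainder
    base, rem = divmod(num_tokens, num_devices)
    token_offset = device_index * base + min(device_index, rem)
    num_tokens_local = base + (1 if device_index < rem else 0)
    return token_offset, num_tokens_local
```

For example, 10 tokens over 8 experts with 4 local experts gives two devices; the device at offset 4 handles tokens [5, 10).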

In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h`:
- Around line 149-156: The function prototype run(...) in runner.h introduces a
new parameter numFusedSharedExpert that lacks Doxygen documentation; add a
Doxygen `@param` entry for numFusedSharedExpert immediately above the run
declaration in runner.h (within the existing Doxygen block for run) describing
what the parameter represents, expected range/units/semantics (e.g., number of
fused shared experts per group or per-token), any constraints or default
behavior, and how it affects routing/processing so callers understand its
purpose and valid values; keep the style consistent with the other `@param`
entries in that comment block.

In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py`:
- Around line 169-178: The code assigns num_fused_shared_expert after calling
create_weights(), but create_weights() uses that value for FP8 block‑scale
sizing; move the assignment of self.num_fused_shared_expert (setting it from
model_config.pretrained_config.n_shared_experts when
model_config.mapping.dp_size == 1 and
self.quant_config.layer_quant_mode.has_fp8_block_scales()) so it occurs before
the call to self.create_weights(); ensure self.layer_idx and
self._weights_created initialization remain correct and unchanged.

In `@tests/unittest/_torch/thop/serial/test_moe.py`:
- Around line 1050-1051: Remove the unconditional debug prints of
output_dequant_reference and output_dequant_actual in the test; either delete
the two print(...) lines or gate them behind a runtime flag (e.g., check
os.environ["CI_DEBUG"] or a pytest config option) so they only print when
debugging is explicitly enabled, and if gating add the necessary import (os) and
use the flag check around the prints in
tests/unittest/_torch/thop/serial/test_moe.py to avoid flooding CI logs.
- Around line 169-174: Replace the deprecated torch.range usage in the
sharedIndices construction: in the block that defines sharedIndices (using
numExperts, num_fused_shared_experts, topKIndices.dtype), switch
torch.range(...) to torch.arange(...) and adjust the end parameter to remove the
`- 1` offset (so the range goes from numExperts to numExperts +
num_fused_shared_experts), then continue to unsqueeze, repeat (numTokens, 1) and
torch.cat with topKIndices as before.
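The reason the `- 1` offset disappears in that fix: the deprecated `torch.range(start, end)` includes `end`, while `torch.arange(start, end)` excludes it. The same half-open convention can be demonstrated with Python's built-in `range`:

```python
# Python's range(), like torch.arange(), excludes the end bound, while the
# deprecated torch.range() included it. Hence
#   torch.range(n, n + f - 1)  becomes  torch.arange(n, n + f).
num_experts, num_fused_shared_experts = 4, 2

# inclusive-end form (what torch.range(num_experts, num_experts + num_fused_shared_experts - 1) produced)
inclusive_end = list(range(num_experts, (num_experts + num_fused_shared_experts - 1) + 1))

# half-open form (the torch.arange equivalent)
half_open = list(range(num_experts, num_experts + num_fused_shared_experts))

assert inclusive_end == half_open == [4, 5]
```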
🧹 Nitpick comments (4)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (1)

150-154: Initialize fused‑shared fields to avoid stale values.

These new scalars are uninitialized while adjacent members have defaults; defaulting them to 0 prevents accidental use of garbage when a caller constructs params without setBaseParams().

Suggested change
-    int32_t mNumFusedSharedExperts;
-    int32_t mSharedExpertTokenOffset;
-    int32_t mSharedExpertNumTokens;
-    int32_t mTotalExpertsPerToken;
+    int32_t mNumFusedSharedExperts = 0;
+    int32_t mSharedExpertTokenOffset = 0;
+    int32_t mSharedExpertNumTokens = 0;
+    int32_t mTotalExpertsPerToken = 0;
cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp (1)

77-86: Hoist the default routed scaling factor into a named constant.

Using 1.0 directly in the call expression violates the literal-usage rule; prefer a named const with the prescribed naming scheme.

Suggested change
-    routing_runner.run(routing_logits_ptr, routing_bias_ptr, num_tokens, num_experts, top_k,
-        /* num_fused_shared_expert */ 0, n_group.value_or(0), topk_group.value_or(0), local_expert_offset,
-        local_num_experts, routed_scaling_factor.value_or(1.0), expert_indexes.data_ptr<int>(),
+    double const kDEFAULT_ROUTED_SCALING_FACTOR = 1.0;
+    routing_runner.run(routing_logits_ptr, routing_bias_ptr, num_tokens, num_experts, top_k,
+        /* num_fused_shared_expert */ 0, n_group.value_or(0), topk_group.value_or(0), local_expert_offset,
+        local_num_experts, routed_scaling_factor.value_or(kDEFAULT_ROUTED_SCALING_FACTOR),
+        expert_indexes.data_ptr<int>(),
         expert_count_histogram.data_ptr<int>(), total_num_padded_tokens.data_ptr<int>(),
         expanded_idx_to_permuted_idx.data_ptr<int>(), permuted_idx_to_expanded_idx.data_ptr<int>(),

As per coding guidelines "Except for 0, nullptr, true, and false, all other literals should only be used for variable initialization and not in comparisons or expressions."

tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)

21-23: Keep the module namespace in the GatedMLP import.

This file’s import style breaks the project’s namespace rule. Please switch to a module import and qualify the type hint accordingly.

🔧 Suggested change
-from ..gated_mlp import GatedMLP
+from .. import gated_mlp
@@
-    def fuse_shared_expert(self, module: torch.nn.Module,
-                           shared_experts: GatedMLP, n_shared_experts: int):
+    def fuse_shared_expert(self, module: torch.nn.Module,
+                           shared_experts: gated_mlp.GatedMLP, n_shared_experts: int):

As per coding guidelines, Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

tests/unittest/_torch/thop/serial/test_moe.py (1)

149-156: Use snake_case for newly added locals.

New variables like numTotalExperts / totalExpertsPerToken / numTokensPerExpert should follow snake_case to align with Python style.

♻️ Example rename (apply consistently)
-    numTotalExperts = numExperts + num_fused_shared_experts
-    totalExpertsPerToken = topK + num_fused_shared_experts
+    num_total_experts = numExperts + num_fused_shared_experts
+    total_experts_per_token = topK + num_fused_shared_experts

-    numTokensPerExpert = torch.zeros(numTotalExperts, dtype=torch.int64)
+    num_tokens_per_expert = torch.zeros(num_total_experts, dtype=torch.int64)

As per coding guidelines, Python local variables should use snake_case, with prefix k for variable names that start with a number.

Comment on lines 169 to 178
self._weights_created = False
self.num_fused_shared_expert = 0
if not model_config.skip_create_weights_in_init:
self.create_weights()
self.layer_idx = layer_idx

if model_config.mapping.dp_size == 1 and self.quant_config.layer_quant_mode.has_fp8_block_scales(
):
self.num_fused_shared_expert = model_config.pretrained_config.n_shared_experts

Copy link
Contributor


⚠️ Potential issue | 🟠 Major

Set num_fused_shared_expert before create_weights().

create_weights() now sizes FP8 block‑scale weights based on num_fused_shared_expert, but the value is assigned after the call. For DP‑size 1 + FP8 block‑scales this can allocate weights for zero fused experts and break subsequent fusion.

Suggested fix (reorder assignment)
-        self._weights_created = False
-        self.num_fused_shared_expert = 0
-        if not model_config.skip_create_weights_in_init:
-            self.create_weights()
-        self.layer_idx = layer_idx
-
-        if model_config.mapping.dp_size == 1 and self.quant_config.layer_quant_mode.has_fp8_block_scales(
-        ):
-            self.num_fused_shared_expert = model_config.pretrained_config.n_shared_experts
+        self._weights_created = False
+        self.num_fused_shared_expert = 0
+        if model_config.mapping.dp_size == 1 and self.quant_config.layer_quant_mode.has_fp8_block_scales(
+        ):
+            self.num_fused_shared_expert = model_config.pretrained_config.n_shared_experts
+        if not model_config.skip_create_weights_in_init:
+            self.create_weights()
+        self.layer_idx = layer_idx
🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` around lines
169 - 178, The code assigns num_fused_shared_expert after calling
create_weights(), but create_weights() uses that value for FP8 block‑scale
sizing; move the assignment of self.num_fused_shared_expert (setting it from
model_config.pretrained_config.n_shared_experts when
model_config.mapping.dp_size == 1 and
self.quant_config.layer_quant_mode.has_fp8_block_scales()) so it occurs before
the call to self.create_weights(); ensure self.layer_idx and
self._weights_created initialization remain correct and unchanged.

Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
@nekorobov
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #34648 [ run ] triggered by Bot. Commit: 22988f8

@tensorrt-cicd
Copy link
Collaborator

PR_Github #34648 [ run ] completed with state FAILURE. Commit: 22988f8
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #35 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nekorobov
Copy link
Collaborator Author

/bot run --disable-failt-fast

@nekorobov
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #34686 [ run ] triggered by Bot. Commit: 22988f8

@tensorrt-cicd
Copy link
Collaborator

PR_Github #34685 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --disable-failt-fast

@longlee0622
Copy link
Collaborator

/bot kill

@chzblych chzblych merged commit 7c6df0e into NVIDIA:release/1.2.0rc6.post1 Feb 4, 2026
4 of 5 checks passed
@tensorrt-cicd
Copy link
Collaborator

PR_Github #34686 [ run ] completed with state SUCCESS. Commit: 22988f8
/LLM/release-1.2.0rc6.post1/L0_MergeRequest_PR pipeline #36 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

nekorobov added a commit to nekorobov/TensorRT-LLM that referenced this pull request Feb 13, 2026
…#11143)

Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
nekorobov added a commit to nekorobov/TensorRT-LLM that referenced this pull request Feb 13, 2026
…#11143)

Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>

Labels

Release Blocker: PRs that block the final release build or branching out of the release branch


7 participants