[None][feat] fuse shared to sparse experts in TRT-LLM Gen MoE (#11143)
Conversation
/bot run

PR_Github #34250 [ run ] triggered by Bot. Commit:

PR_Github #34250 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34255 [ run ] triggered by Bot. Commit:

PR_Github #34255 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34294 [ run ] triggered by Bot. Commit:

PR_Github #34294 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #34525 [ run ] triggered by Bot. Commit:

PR_Github #34525 [ run ] completed with state
```diff
@@ -219,7 +226,7 @@ def _check_configs(self):
     def _get_quant_method(self):
         if self.quant_config is not None:
             if self.quant_config.layer_quant_mode.has_fp8_block_scales():
-                return DeepSeekFP8BlockScalesFusedMoEMethod()
+                return DeepSeekFP8TRTLLMGenBlockScalesFusedMoEMethod()
```
It appears that you're overriding the original behavior. The new method will invariably attempt to fuse the `shared_expert`. Could you confirm whether this consistently yields better performance?
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
2006aac to 6659abf (Compare)
📝 Walkthrough

This pull request adds comprehensive fused shared expert support to TensorRT-LLM's MoE routing pipeline. Changes span CUDA kernels, C++ runners and implementations, Python bindings, quantization methods, and model-specific integration, introducing new parameters to track fused expert counts and updating expert indexing, tensor shapes, and routing logic throughout.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Kernel as Routing Kernel
    participant Runner as MoE Runner
    participant Config as Config/Quantization
    participant Fusion as Expert Fusion
    Runner->>Kernel: Execute with numFusedSharedExperts
    Note over Kernel: Expand per-token expert indices<br/>mTotalExpertsPerToken = topK + numFusedSharedExperts
    Kernel->>Kernel: Write routed expert scores<br/>+ fused shared expert weights
    Kernel-->>Runner: Return expanded routing indices<br/>& expert counts
    Runner->>Runner: Compute total expert counts<br/>totalNumExperts = numExperts + numFusedSharedExperts
    Runner->>Runner: Configure GEMM with expanded dimensions<br/>totalExpertsPerToken for workspace
    Runner->>Config: Request valid configs/tactics<br/>with total expert counts
    Config->>Config: Allocate tensors sized by totalExpertsPerToken
    Config->>Config: Compute weight shapes:<br/>w3_w1, w2 += numFusedSharedExperts
    Config->>Fusion: Prepare fused shared experts
    Fusion->>Fusion: Load shared expert weights
    Fusion->>Fusion: Permute and fuse shared weights<br/>into MoE weight structure
    Fusion-->>Config: Return fused weight tensors
    Config-->>Runner: Return valid configurations
```
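The per-token index expansion described in the walkthrough can be sketched in PyTorch. This is a minimal illustration only, not the actual CUDA kernel: the helper name `expand_with_shared_experts` and the weight-of-1.0 convention for shared experts are assumptions for the sketch; shared experts get indices after all routed experts, mirroring `totalExpertsPerToken = topK + numFusedSharedExperts`.

```python
import torch

def expand_with_shared_experts(topk_ids, topk_weights, num_experts,
                               num_fused_shared_experts):
    """Append fused shared-expert slots to each token's routed top-k.

    Shared experts use indices [num_experts, num_experts + n), so they sit
    after all routed experts; here they get weight 1.0 (an assumption).
    """
    num_tokens = topk_ids.shape[0]
    shared_ids = torch.arange(num_experts,
                              num_experts + num_fused_shared_experts,
                              dtype=topk_ids.dtype).unsqueeze(0).repeat(num_tokens, 1)
    shared_weights = torch.ones(num_tokens, num_fused_shared_experts,
                                dtype=topk_weights.dtype)
    return (torch.cat([topk_ids, shared_ids], dim=1),
            torch.cat([topk_weights, shared_weights], dim=1))

ids = torch.tensor([[0, 2], [1, 3]], dtype=torch.int32)   # top_k = 2 routed experts
weights = torch.tensor([[0.6, 0.4], [0.7, 0.3]])
new_ids, new_weights = expand_with_shared_experts(ids, weights, num_experts=4,
                                                  num_fused_shared_experts=1)
print(new_ids.tolist())    # [[0, 2, 4], [1, 3, 4]]
print(new_weights.shape)   # torch.Size([2, 3])
```

Every token now carries `top_k + num_fused_shared_experts` expert slots, which is why buffer shapes and GEMM dimensions throughout the PR grow by the fused count.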
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (1)
646-690: ⚠️ Potential issue | 🟠 Major — Add an explicit guard for the expanded expert count.

`numExperts` now includes fused shared experts, but there is no check that the expanded value stays within kernel limits. If `numExperts` exceeds `MaxSupportedExpertCount`, `getMaxNumExperts` returns 0, leading to an invalid thread configuration and kernel launch.

🐛 Suggested fix

```diff
 int const numExperts = data.mNumExperts + data.mNumFusedSharedExperts;
 int const topK = data.mTopK + data.mNumFusedSharedExperts;
+TLLM_CHECK_WITH_INFO(numExperts <= MaxSupportedExpertCount,
+    "Routing kernel expects `#experts` %d to be <= %d", numExperts, MaxSupportedExpertCount);
 int const numThreadsHist = getMaxNumExperts(numExperts);
```

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (1)
1-2: ⚠️ Potential issue | 🟠 Major — Update the copyright year to 2026.

This file was modified in this PR but the header still ends at 2025.

✏️ Proposed fix

```diff
- * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
1-2: ⚠️ Potential issue | 🟠 Major — Update the copyright year to 2026.

This file was modified in this PR but the header still ends at 2024.

✏️ Proposed fix

```diff
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
tests/unittest/_torch/thop/serial/test_moe.py (1)
1-2: ⚠️ Potential issue | 🟠 Major — Update the SPDX year to 2026.

The file is modified in this PR but the SPDX header still ends at 2024.

✏️ Proposed fix

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (3)
1-1: ⚠️ Potential issue | 🟠 Major — Add the required NVIDIA copyright header.

This TensorRT-LLM source file is missing the standard header block.

📄 Suggested header

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from dataclasses import dataclass, replace
```

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
561-563: ⚠️ Potential issue | 🟠 Major — Include `num_fused_shared_experts` in the AutoTuner cache key.

Tactic validity depends on the total experts-per-token. If this parameter changes, the current cache key can reuse incompatible tactics.

🔧 Proposed fix

```diff
-    def unique_id(self):
-        return (self.top_k, self.intermediate_size, self.local_num_experts)
+    def unique_id(self):
+        return (self.top_k, self.num_fused_shared_experts or 0,
+                self.intermediate_size, self.local_num_experts)
```
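To make the collision concrete, here is a standalone sketch (a hypothetical frozen dataclass, not the actual AutoTuner code) showing how two configs that differ only in the fused-expert count map to the same key under the old `unique_id` and to distinct keys once the field is included:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoEConfig:
    top_k: int
    intermediate_size: int
    local_num_experts: int
    num_fused_shared_experts: int = 0

    def unique_id_old(self):
        # Old key: omits num_fused_shared_experts
        return (self.top_k, self.intermediate_size, self.local_num_experts)

    def unique_id_new(self):
        # Fixed key: total experts-per-token now distinguishes the configs
        return (self.top_k, self.num_fused_shared_experts,
                self.intermediate_size, self.local_num_experts)

a = MoEConfig(top_k=8, intermediate_size=2048, local_num_experts=64)
b = MoEConfig(top_k=8, intermediate_size=2048, local_num_experts=64,
              num_fused_shared_experts=1)
print(a.unique_id_old() == b.unique_id_old())  # True  -> stale tactic reused
print(a.unique_id_new() == b.unique_id_new())  # False -> distinct cache entries
```

A tuner cache keyed on the old tuple would hand config `b` a tactic tuned for `a`, whose workspace sizing assumes one fewer expert slot per token.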
765-789: ⚠️ Potential issue | 🟡 Minor — Silence the unused `num_fused_shared_experts` in the fake kernel.

Ruff flags this as unused; keep the signature but mark it explicitly to avoid lint failures.

🧹 Proposed fix

```diff
 def _(routing_logits: torch.Tensor, routing_bias: torch.Tensor,
       hidden_states: torch.Tensor, hidden_states_scale: torch.Tensor,
       gemm1_weights: torch.Tensor, gemm1_weights_scale: torch.Tensor,
       gemm2_weights: torch.Tensor, gemm2_weights_scale: torch.Tensor,
       num_experts: int, top_k: int, num_fused_shared_experts: Optional[int],
       n_group: Optional[int], topk_group: Optional[int],
       intermediate_size: int, local_expert_offset: int,
       local_num_experts: int, routed_scaling_factor: Optional[float],
       routing_method_type: int,
       topk_weights: Optional[torch.Tensor] = None,
       topk_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
+    _ = num_fused_shared_experts
     num_tokens = hidden_states.shape[0]
```

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (1)
1-2: ⚠️ Potential issue | 🟠 Major — Update the copyright year to 2026.

This file was modified in this PR but the header still ends at 2025.

✏️ Proposed fix

```diff
- * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines, all TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification.
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (2)
1-3: ⚠️ Potential issue | 🟡 Minor — Update NVIDIA copyright year to include 2026.

This file was meaningfully modified in 2026, but the header still ends at 2024.

✍️ Suggested update

```diff
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION. All rights reserved.
```

As per coding guidelines, "All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification."
49-63: ⚠️ Potential issue | 🟠 Major — Normalize/validate `num_fused_shared_experts` and use it consistently for shape checks and totals.

Right now, the top-k shape checks still assume `top_k`, while totals clamp with a `> 0` expression and the kernel args pass the raw optional value. With fused shared experts enabled, this can accept too-small `topk_*` buffers (or reject expanded ones) and desync totals vs kernel args (e.g., negative values). Consider validating once and using a single normalized value everywhere.

✅ Suggested fix

```diff
     TORCH_CHECK(tensorrt_llm::common::isSM100Family(), "Only SM100f is supported by FP8 block scale MOE");
+    int64_t const numFusedSharedExperts = num_fused_shared_experts.value_or(0);
+    TORCH_CHECK(numFusedSharedExperts >= 0, "num_fused_shared_experts must be non-negative.");
+
     if (topk_ids.has_value() && topk_weights.has_value())
     {
+        int64_t const expectedCols = top_k + numFusedSharedExperts;
         TORCH_CHECK(topk_ids.value().scalar_type() == at::ScalarType::Int, "topk_ids must be int");
         TORCH_CHECK(topk_weights.value().scalar_type() == at::ScalarType::BFloat16, "topk_weights must be bfloat16.");
         TORCH_CHECK(topk_ids.value().dim() == 2, "topk_ids must be 2D.");
         TORCH_CHECK(topk_ids.value().sizes()[0] == hidden_states.sizes()[0],
             "topk_ids and hidden_states must have the same number of tokens.");
-        TORCH_CHECK(topk_ids.value().sizes()[1] == top_k, "topk_ids dim1 must match top_k.");
+        TORCH_CHECK(topk_ids.value().sizes()[1] == expectedCols,
+            "topk_ids dim1 must match top_k (+ num_fused_shared_experts).");
         TORCH_CHECK(topk_weights.value().dim() == 2, "topk_weights must be 2D.");
         TORCH_CHECK(topk_weights.value().sizes()[0] == hidden_states.sizes()[0],
             "topk_weights and hidden_states must have the same number of tokens.");
-        TORCH_CHECK(topk_weights.value().sizes()[1] == top_k, "topk_weights dim1 must match top_k.");
+        TORCH_CHECK(topk_weights.value().sizes()[1] == expectedCols,
+            "topk_weights dim1 must match top_k (+ num_fused_shared_experts).");
     }
@@
-    int64_t const num_total_experts
-        = num_experts + (num_fused_shared_experts.value_or(0) > 0 ? num_fused_shared_experts.value() : 0);
-    int64_t const total_experts_per_token
-        = top_k + (num_fused_shared_experts.value_or(0) > 0 ? num_fused_shared_experts.value() : 0);
-    int64_t const num_total_local_experts
-        = local_num_experts + (num_fused_shared_experts.value_or(0) > 0 ? num_fused_shared_experts.value() : 0);
+    int64_t const num_total_experts = num_experts + numFusedSharedExperts;
+    int64_t const total_experts_per_token = top_k + numFusedSharedExperts;
+    int64_t const num_total_local_experts = local_num_experts + numFusedSharedExperts;
@@
-    args.num_fused_shared_experts = num_fused_shared_experts.value_or(0);
+    args.num_fused_shared_experts = numFusedSharedExperts;
```

Also applies to: 145-176
🤖 Fix all issues with AI agents
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu`:
- Around line 714-721: The TLLM_CHECK_WITH_INFO call currently has a malformed
message and doesn't print the actual fused-expert count; update the check at the
TLLM_CHECK_WITH_INFO invocation to include both data.mNumFusedSharedExperts and
WarpSize in the formatted message and fix the punctuation/parentheses – e.g.,
use a format like "Number of fused shared experts (%d) must be less than warp
size (%d)." and pass data.mNumFusedSharedExperts then WarpSize as the format
arguments so the real values are logged.
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu`:
- Around line 114-129: The code computes numDevices = numExperts /
localNumExperts and deviceIndex = localExpertOffset / localNumExperts without
validating localNumExperts or ensuring experts partition evenly, risking
divide-by-zero or misrouting; add guards at the start of this block to validate
localNumExperts > 0 and that numExperts >= localNumExperts, and handle
non-divisible partitions (e.g., compute numDevices as max(1, numExperts /
localNumExperts) or bail/throw via an error/log when configuration is invalid),
then compute deviceIndex safely (avoid division by zero) and adjust the token
offset/num-tokens calculation for uneven partitions so
routingData.mSharedExpertTokenOffset and routingData.mSharedExpertNumTokens
remain correct and deterministic.
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h`:
- Around line 149-156: The function prototype run(...) in runner.h introduces a
new parameter numFusedSharedExpert that lacks Doxygen documentation; add a
Doxygen `@param` entry for numFusedSharedExpert immediately above the run
declaration in runner.h (within the existing Doxygen block for run) describing
what the parameter represents, expected range/units/semantics (e.g., number of
fused shared experts per group or per-token), any constraints or default
behavior, and how it affects routing/processing so callers understand its
purpose and valid values; keep the style consistent with the other `@param`
entries in that comment block.
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py`:
- Around line 169-178: The code assigns num_fused_shared_expert after calling
create_weights(), but create_weights() uses that value for FP8 block‑scale
sizing; move the assignment of self.num_fused_shared_expert (setting it from
model_config.pretrained_config.n_shared_experts when
model_config.mapping.dp_size == 1 and
self.quant_config.layer_quant_mode.has_fp8_block_scales()) so it occurs before
the call to self.create_weights(); ensure self.layer_idx and
self._weights_created initialization remain correct and unchanged.
In `@tests/unittest/_torch/thop/serial/test_moe.py`:
- Around line 1050-1051: Remove the unconditional debug prints of
output_dequant_reference and output_dequant_actual in the test; either delete
the two print(...) lines or gate them behind a runtime flag (e.g., check
os.environ["CI_DEBUG"] or a pytest config option) so they only print when
debugging is explicitly enabled, and if gating add the necessary import (os) and
use the flag check around the prints in
tests/unittest/_torch/thop/serial/test_moe.py to avoid flooding CI logs.
- Around line 169-174: Replace the deprecated torch.range usage in the
sharedIndices construction: in the block that defines sharedIndices (using
numExperts, num_fused_shared_experts, topKIndices.dtype), switch
torch.range(...) to torch.arange(...) and adjust the end parameter to remove the
`- 1` offset (so the range goes from numExperts to numExperts +
num_fused_shared_experts), then continue to unsqueeze, repeat (numTokens, 1) and
torch.cat with topKIndices as before.
🧹 Nitpick comments (4)
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (1)
150-154: Initialize fused-shared fields to avoid stale values.

These new scalars are uninitialized while adjacent members have defaults; defaulting them to `0` prevents accidental use of garbage when a caller constructs params without `setBaseParams()`.

Suggested change

```diff
-    int32_t mNumFusedSharedExperts;
-    int32_t mSharedExpertTokenOffset;
-    int32_t mSharedExpertNumTokens;
-    int32_t mTotalExpertsPerToken;
+    int32_t mNumFusedSharedExperts = 0;
+    int32_t mSharedExpertTokenOffset = 0;
+    int32_t mSharedExpertNumTokens = 0;
+    int32_t mTotalExpertsPerToken = 0;
```

cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp (1)
77-86: Hoist the default routed scaling factor into a named constant.

Using `1.0` directly in the call expression violates the literal-usage rule; prefer a named `const` with the prescribed naming scheme.

Suggested change

```diff
-    routing_runner.run(routing_logits_ptr, routing_bias_ptr, num_tokens, num_experts, top_k,
-        /* num_fused_shared_expert */ 0, n_group.value_or(0), topk_group.value_or(0), local_expert_offset,
-        local_num_experts, routed_scaling_factor.value_or(1.0), expert_indexes.data_ptr<int>(),
+    double const kDEFAULT_ROUTED_SCALING_FACTOR = 1.0;
+    routing_runner.run(routing_logits_ptr, routing_bias_ptr, num_tokens, num_experts, top_k,
+        /* num_fused_shared_expert */ 0, n_group.value_or(0), topk_group.value_or(0), local_expert_offset,
+        local_num_experts, routed_scaling_factor.value_or(kDEFAULT_ROUTED_SCALING_FACTOR),
+        expert_indexes.data_ptr<int>(),
         expert_count_histogram.data_ptr<int>(), total_num_padded_tokens.data_ptr<int>(),
         expanded_idx_to_permuted_idx.data_ptr<int>(), permuted_idx_to_expanded_idx.data_ptr<int>(),
```

As per coding guidelines, "Except for `0`, `nullptr`, `true`, and `false`, all other literals should only be used for variable initialization and not in comparisons or expressions."

tensorrt_llm/_torch/modules/fused_moe/quantization.py (1)
21-23: Keep the module namespace in the GatedMLP import.

This file's import style breaks the project's namespace rule. Please switch to a module import and qualify the type hint accordingly.

🔧 Suggested change

```diff
-from ..gated_mlp import GatedMLP
+from .. import gated_mlp
@@
-    def fuse_shared_expert(self, module: torch.nn.Module,
-                           shared_experts: GatedMLP, n_shared_experts: int):
+    def fuse_shared_expert(self, module: torch.nn.Module,
+                           shared_experts: gated_mlp.GatedMLP, n_shared_experts: int):
```

As per coding guidelines, always maintain the namespace when importing Python modules, even if only one class or function from a module is used.
tests/unittest/_torch/thop/serial/test_moe.py (1)
149-156: Use snake_case for newly added locals.

New variables like `numTotalExperts`/`totalExpertsPerToken`/`numTokensPerExpert` should follow snake_case to align with Python style.

♻️ Example rename (apply consistently)

```diff
-    numTotalExperts = numExperts + num_fused_shared_experts
-    totalExpertsPerToken = topK + num_fused_shared_experts
+    num_total_experts = numExperts + num_fused_shared_experts
+    total_experts_per_token = topK + num_fused_shared_experts
-    numTokensPerExpert = torch.zeros(numTotalExperts, dtype=torch.int64)
+    num_tokens_per_expert = torch.zeros(num_total_experts, dtype=torch.int64)
```

As per coding guidelines, Python local variables should use snake_case, with prefix `k` for variable names that start with a number.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
```python
        self._weights_created = False
        self.num_fused_shared_expert = 0
        if not model_config.skip_create_weights_in_init:
            self.create_weights()
        self.layer_idx = layer_idx

        if model_config.mapping.dp_size == 1 and self.quant_config.layer_quant_mode.has_fp8_block_scales(
        ):
            self.num_fused_shared_expert = model_config.pretrained_config.n_shared_experts
```
Set num_fused_shared_expert before create_weights().
create_weights() now sizes FP8 block‑scale weights based on num_fused_shared_expert, but the value is assigned after the call. For DP‑size 1 + FP8 block‑scales this can allocate weights for zero fused experts and break subsequent fusion.
Suggested fix (reorder assignment)

```diff
-        self._weights_created = False
-        self.num_fused_shared_expert = 0
-        if not model_config.skip_create_weights_in_init:
-            self.create_weights()
-        self.layer_idx = layer_idx
-
-        if model_config.mapping.dp_size == 1 and self.quant_config.layer_quant_mode.has_fp8_block_scales(
-        ):
-            self.num_fused_shared_expert = model_config.pretrained_config.n_shared_experts
+        self._weights_created = False
+        self.num_fused_shared_expert = 0
+        if model_config.mapping.dp_size == 1 and self.quant_config.layer_quant_mode.has_fp8_block_scales(
+        ):
+            self.num_fused_shared_expert = model_config.pretrained_config.n_shared_experts
+        if not model_config.skip_create_weights_in_init:
+            self.create_weights()
+        self.layer_idx = layer_idx
```
+ self.layer_idx = layer_idx🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` around lines
169 - 178, The code assigns num_fused_shared_expert after calling
create_weights(), but create_weights() uses that value for FP8 block‑scale
sizing; move the assignment of self.num_fused_shared_expert (setting it from
model_config.pretrained_config.n_shared_experts when
model_config.mapping.dp_size == 1 and
self.quant_config.layer_quant_mode.has_fp8_block_scales()) so it occurs before
the call to self.create_weights(); ensure self.layer_idx and
self._weights_created initialization remain correct and unchanged.
Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #34648 [ run ] triggered by Bot. Commit:

PR_Github #34648 [ run ] completed with state

/bot run --disable-failt-fast

/bot run --disable-fail-fast

PR_Github #34686 [ run ] triggered by Bot. Commit:

PR_Github #34685 Bot args parsing error: usage: /bot [-h]

/bot kill

PR_Github #34686 [ run ] completed with state
…#11143) Signed-off-by: Nikita Korobov <14355239+nekorobov@users.noreply.github.com>
Summary by CodeRabbit
Release Notes
New Features
API Changes
`numFusedSharedExpert` parameter to MoE runner interfaces to specify the number of fused shared experts.

Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`: Kill all running builds associated with the pull request.
skip

`skip --comment COMMENT`: Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.