[TRTLLM-10062][feat] Enable MTP for Nemotron Super #10754
mikeiovine merged 22 commits into NVIDIA:main
Conversation
Signed-off-by: qgai <qgai@nvidia.com>
5c0a3ad to a7830c0
📝 Walkthrough

Adds MTP/speculative decoding support to Nemotron‑H and Mamba: weight key remapping, speculative metadata threading, Triton causal-conv kernels and update path, multi-token/state buffering, intermediate caches, and cache promotion logic for draft verification.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Model as NemotronH Model
    participant Mixer as Mamba2Mixer
    participant Conv as CausalConv1D (Triton)
    participant Cache as MambaCacheManager
    participant Verify as Verification Engine
    Model->>Mixer: forward(hidden_states, spec_metadata)
    alt speculative (spec_metadata present)
        Mixer->>Cache: mamba_layer_cache(layer_idx) -> intermediate buffers
        Mixer->>Conv: causal_conv1d_update(speculative inputs, intermediate_state_indices)
        Conv->>Cache: read/update intermediate_conv_window / conv_state
        Mixer->>Mixer: selective_state_update(out=prealloc, intermediate_states_buffer)
        Mixer->>Verify: emit outputs + intermediate buffers for verification
        Verify->>Cache: update_resources(num_accepted_tokens)
        Cache->>Cache: promote intermediate_* -> main caches
    else standard
        Mixer->>Cache: read conv/temporal states
        Mixer->>Conv: causal_conv1d_fwd(standard path)
        Mixer->>Mixer: selective_state_update(standard, writeback)
        Mixer->>Cache: update main caches
    end
    Mixer->>Model: return processed output
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
📝 Walkthrough

This PR introduces speculative decoding and Multi-Token Prediction (MTP) support to the Nemotron H model and Mamba-based components. It extends the model architecture with SpecMetadata propagation through layers, adds MTP decoder components, implements Triton-based causal convolution with continuous batching, and enhances state management with intermediate caching for speculative verification.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Forward Pass
    participant Model as NemotronHModel
    participant Layer as NemotronHLayer
    participant Mixer as Mamba2Mixer
    participant Cache as MambaCacheManager
    Client->>Model: forward(input_ids, spec_metadata)
    Model->>Model: set num_hidden_layers limit
    loop For each active layer
        Model->>Layer: forward(position_ids, hidden_states, spec_metadata)
        Layer->>Mixer: forward(hidden_states, spec_metadata)
        Mixer->>Cache: mamba_layer_cache(layer_idx)
        Cache-->>Mixer: State{conv, temporal}
        alt Speculative Path (spec_metadata active)
            Mixer->>Mixer: causal_conv1d_update_triton(intermediate_buffers)
            Mixer->>Mixer: selective_state_update(disable_state_update=False)
            Cache->>Cache: update intermediate_ssm cache
        else Standard Path
            Mixer->>Mixer: causal_conv1d_update(conv_states)
            Mixer->>Mixer: selective_state_update(standard)
            Cache->>Cache: update main conv/temporal states
        end
        Mixer-->>Layer: hidden_states
        Layer-->>Model: hidden_states
    end
    Model-->>Client: output_tensor
```
```mermaid
sequenceDiagram
    participant MTP as NemotronHMTP
    participant Layer as NemotronHMTPDecoderLayer
    participant Proj as Projection
    participant Mixer as Mixer
    participant Cache as CacheManager
    MTP->>MTP: for each pattern layer
    MTP->>Layer: forward(fused_hiddens, spec_metadata)
    opt Has start projection
        Layer->>Proj: apply_start_projection()
        Proj-->>Layer: projected_states
    end
    Layer->>Mixer: forward(projected_states, spec_metadata)
    Mixer->>Cache: retrieve layer_cache
    Cache-->>Mixer: conv_states, temporal_states
    Mixer-->>Layer: mixed_output
    opt Has end norm
        Layer->>Layer: apply_end_norm()
    end
    Layer-->>MTP: updated_residuals
    MTP->>MTP: accumulate for next layer
    MTP-->>Client: final_output
```
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
tensorrt_llm/_torch/modules/mamba/selective_state_update.py (1)
360-379: Validate cache_steps and cache indices before kernel launch.

When `intermediate_states_buffer` is provided, `cache_steps` can be `None`/0 and indices can be negative or out-of-range, which risks out-of-bounds GPU writes. Add explicit checks for `cache_steps`, index bounds, and buffer capacity.

🛡️ Suggested validation
```diff
 if state_batch_indices is not None:
     assert state_batch_indices.shape == (batch, )
 assert out.shape == x.shape
+
+if intermediate_states_buffer is not None:
+    assert cache_steps is not None and cache_steps >= T
+    if intermediate_state_indices is not None:
+        min_idx = int(intermediate_state_indices.min().item())
+        max_idx = int(intermediate_state_indices.max().item())
+        assert min_idx >= 0, "intermediate_state_indices must be >= 0"
+    elif state_batch_indices is not None:
+        valid = state_batch_indices != pad_slot_id
+        max_idx = int(state_batch_indices[valid].max().item()) if valid.any() else -1
+    else:
+        max_idx = batch - 1
+    required = (max_idx + 1) * cache_steps * nheads * dim * dstate
+    assert intermediate_states_buffer.numel() >= required
```

tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py (3)
4-5: Update NVIDIA copyright year range.

This file was modified in this PR; please extend the year range to include 2026 to satisfy the header requirement. As per coding guidelines, please update the header to the latest modification year.
501-545: Guard against `out=None` to avoid AttributeError.

`out` is optional in the signature but immediately dereferenced. This will crash if a caller hasn't been updated. Either enforce a clear assertion or allocate a fallback buffer.

✅ Safe fallback allocation

```diff
-    assert out.shape == x.shape
+    if out is None:
+        out = torch.empty_like(x)
+    else:
+        assert out.shape == x.shape
```
419-425: Potential numerical blow-up from unclamped exponent in decay calculation.

The line `cb *= tl.exp(dA_cs_m[:, None] - dA_cs_k[None, :])` can produce positive arguments to `exp()` when iteration index `k` is large enough that all `k` indices exceed all `m` indices. Since `dA_log` values represent log-space decay (≤ 0), their cumsum is monotone decreasing, which means `dA_cs_m - dA_cs_k` becomes positive in such cases, causing `exp()` to amplify beyond 1 and destabilizing the scan.

Either add a clamp to constrain the exponent to non-positive values, or document/assert that the index relationship guarantees `dA_cs_m - dA_cs_k ≤ 0`:

🛠️ Suggested safe clamp

```diff
-    cb *= tl.exp(dA_cs_m[:, None] - dA_cs_k[None, :])
+    cb *= tl.exp(tl.minimum(dA_cs_m[:, None] - dA_cs_k[None, :], 0.0))
```
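The effect of the clamp can be illustrated with a scalar sketch in plain Python, with `math.exp` standing in for `tl.exp` (names and values are illustrative only):

```python
import math

def clamped_decay(delta: float) -> float:
    # delta plays the role of dA_cs_m - dA_cs_k. Clamping the exponent
    # to <= 0 keeps the decay factor in (0, 1], so a positive delta can
    # no longer amplify the scan.
    return math.exp(min(delta, 0.0))
```

With the clamp, a positive argument yields exactly 1.0 instead of an exponentially amplified factor, while negative arguments (the normal decay case) are unchanged.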
4-5: Update NVIDIA copyright year range.

This file was modified in this PR; please extend the year range to include 2026. As per coding guidelines, update the header to the latest modification year.
tensorrt_llm/_torch/models/modeling_speculative.py (1)
1-5: Add NVIDIA copyright header (2026).

This file is missing the required NVIDIA SPDX header. Please add/update it to reflect 2026.
As per coding guidelines, please include the header.tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (2)
1-2: Update NVIDIA copyright year to 2026.

The SPDX header still ends at 2024 despite new modifications. Please bump the year to 2026.
As per coding guidelines, the header should reflect the latest modification year.
83-88: Fix undefined `self.tp_size` (runtime AttributeError).

`self.tp_size` is used when allocating `preallocated_ssm_out`, but it's never set in `__init__`. This will raise at runtime. Store `tp_size` on `self` (or use `self.mapping.tp_size`).

🔧 Proposed fix

```diff
 if config.mapping.enable_attention_dp:
     self.mapping = Mapping(
@@
     )
     tp_size = 1
 else:
     self.mapping = config.mapping
     tp_size = config.mapping.tp_size
+self.tp_size = tp_size
```

Also applies to: 208-214
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)
1-1: Update NVIDIA copyright year to 2026.

The header still ends at 2025 despite new changes in this file. Please update to 2026.
As per coding guidelines, the latest modification year must be reflected.tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)
1-1: Update the NVIDIA copyright year to 2026.

The SPDX header still ends at 2024 even though this file was modified in 2026. As per coding guidelines, please keep the header year current.

✏️ Suggested update

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py`:
- Around line 59-67: The MTP key remapping uses re.match which only matches at
the start so keys like "model.mtp.layers.0..." (after earlier backbone->model
replacement) are missed; update the logic in the block handling "mtp.layers."
(the variable key and the regex match call) to use re.search (or a regex that
allows a prefix) e.g. search for r'mtp\.layers\.(\d+)\.(.*)' and, when found,
extract sublayer_idx and rest and rebuild key as before (key =
f"model.layers.{config.num_hidden_layers}.layers.{sublayer_idx}.{rest}'),
otherwise keep the existing logger.warning behavior.
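A minimal sketch of the `re.search`-based remap described above (the layer count and keys are illustrative stand-ins, not values from the real config):

```python
import re

# Hypothetical value standing in for config.num_hidden_layers.
NUM_HIDDEN_LAYERS = 56

def remap_mtp_key(key: str) -> str:
    # re.match anchors at the start of the string, so a key that already
    # gained a "model." prefix would be missed; re.search matches the
    # "mtp.layers." pattern anywhere in the key.
    m = re.search(r"mtp\.layers\.(\d+)\.(.*)", key)
    if m is None:
        return key  # the real mapper logs a warning here
    sublayer_idx, rest = m.group(1), m.group(2)
    return f"model.layers.{NUM_HIDDEN_LAYERS}.layers.{sublayer_idx}.{rest}"
```

With this, a prefixed key like `model.mtp.layers.0.mixer.weight` is rewritten to `model.layers.56.layers.0.mixer.weight`, whereas an `re.match`-based version would leave it untouched.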
- Around line 1-3: Add the required NVIDIA copyright/SPDX header (updated year
2026) at the very top of the file before any imports; insert the standard NVIDIA
header block containing the copyright statement with year 2026 and the
SPDX-License-Identifier so the file
(tensorrt_llm._torch.models.checkpoints.hf.nemotron_h_weight_mapper.py) begins
with the header followed by the existing imports (e.g., the current "import
torch" and "import tensorrt_llm.logger as logger").
In `@tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py`:
- Around line 946-963: The parameter `metadata` on causal_conv1d_update is
unused and should be removed (or renamed to `_metadata` to intentionally mark it
unused); update the function signature of causal_conv1d_update to drop
`metadata` (or change it to `_metadata`) and adjust any internal uses or callers
accordingly (search for causal_conv1d_update references to update call sites),
and run linter to ensure Ruff warnings are resolved.
- Around line 357-370: The function causal_conv1d_fn currently accepts an unused
**kwargs which triggers lint warnings; remove **kwargs from the signature or
rename it to _kwargs (or add a targeted noqa) and update any callers if
necessary so the function signature matches; locate causal_conv1d_fn in
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py and either delete the
**kwargs parameter or rename it to _kwargs (or add a noqa comment adjacent to
the parameter) to explicitly mark it as intentionally ignored.
- Around line 31-55: The kernel currently declares unused parameters
(stride_x_seq, stride_o_seq, HAS_CACHE and the other unused args around lines
96-99) and uses a blanket `# noqa`; fix by either renaming each unused argument
to start with an underscore (e.g. `_stride_x_seq`, `_stride_o_seq`,
`_HAS_CACHE`) so linters treat them as intentionally unused, or add targeted
per-argument noqa comments (`# noqa: ARG001`) on those specific parameters and
remove the blanket `# noqa`; apply the same change to the other unused
parameters referenced in the signature so Ruff no longer flags ARG001.
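The underscore-rename convention that fix describes can be sketched with a toy stand-in (this is not the actual Triton kernel signature):

```python
def kernel_stub(x, _stride_x_seq=0, _stride_o_seq=0, _HAS_CACHE=False):
    # Only `x` is consumed; the leading underscores keep the call
    # signature stable for existing callers while signalling to Ruff
    # that the arguments are intentionally unused (no ARG001).
    return x * 2
```

Call sites keep passing the extra arguments positionally or by the underscored names; only the linter's view of the parameters changes.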
- Around line 550-610: Remove the unused cache_seqlens_ptr from the
_causal_conv1d_update_kernel signature and any internal references; if callers
require the same signature, instead rename the parameter to _cache_seqlens_ptr
to mark it intentionally unused. Also remove the broad "# ruff: noqa: E501" / "#
noqa" suppression and either delete it entirely or replace it with a targeted
suppression (e.g., "# noqa: E501" only where a long-line warning actually
occurs). Apply the same cleanup (remove/rename cache_seqlens_ptr and replace the
blanket noqa) to the other analogous function/signature mentioned in the diff.
- Around line 1-5: Add the standard NVIDIA SPDX copyright header (with current
year 2026) to the top of the new module causal_conv1d_triton.py; specifically
prepend the file with the NVIDIA copyright notice and the
SPDX-License-Identifier line required by our guidelines (e.g.,
"SPDX-License-Identifier: NVIDIA-Scale-2.0" or the project’s mandated SPDX tag),
ensuring the header replaces or precedes the existing comment block so the file
contains the official NVIDIA header and year.
In `@tensorrt_llm/_torch/modules/mamba/selective_state_update.py`:
- Around line 389-395: The tie_hdim computation reads dt_bias.stride(-1)
unconditionally which will crash when dt_bias is None; update the tie_hdim
expression in selective_state_update.py to guard dt_bias (e.g., only call
dt_bias.stride(-1) when dt_bias is not None, or compute a dt_bias_stride
variable defaulting to 0 and use that) so tie_hdim uses safe values for
A.stride(-1), A.stride(-2), dt.stride(-1) and the dt_bias stride only if dt_bias
exists; keep the rest of the logic identical and adjust the conditional form
near the existing tie_hdim definition.
- Around line 1-7: Add the standard NVIDIA source header (including the NVIDIA
copyright notice and the current modification year 2026) at the top of the
module selective_state_update.py (above the existing SPDX and copyright lines)
so the file uses the repo's required NVIDIA header format; ensure the header
text follows the project's canonical wording and includes the 2026 year and any
required legal phrases used elsewhere in the repo.
- Around line 329-358: The code calls out.dim() unconditionally even though out
may be None; update the shape-normalization block in selective_state_update (the
section handling x, dt, A, B, C, D, z, dt_bias, out) to guard against None by
wrapping the out-specific lines in "if out is not None:" (or initialize out
before normalization), i.e., replace the unconditional "if out.dim() == 2/3: out
= out.unsqueeze(1)" checks with a conditional block that only calls out.dim()
when out is not None so no attribute access occurs on None.
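The None-guard the fix asks for can be sketched with a tiny tensor stand-in (`FakeTensor` is purely illustrative; the real code operates on torch tensors and also unsqueezes the 3-D case):

```python
class FakeTensor:
    """Minimal stand-in: only dim() and unsqueeze() are modelled."""

    def __init__(self, ndim: int):
        self.ndim = ndim

    def dim(self) -> int:
        return self.ndim

    def unsqueeze(self, _axis: int) -> "FakeTensor":
        return FakeTensor(self.ndim + 1)

def normalize_out(out):
    # Only touch `out` when the caller actually provided it; the
    # unconditional out.dim() call is what crashed on None.
    if out is not None and out.dim() == 2:
        out = out.unsqueeze(1)
    return out
```

The key point is simply that every attribute access on `out` sits behind the `out is not None` check.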
In `@tensorrt_llm/_torch/modules/mamba/ssd_combined.py`:
- Around line 237-271: The wrapper currently discards the output tensor from
_mamba_chunk_scan_combined_fwd and only returns state values; restore the
original return contract by ensuring the output tensor (out_x) is returned first
along with state(s). Specifically, if out is None allocate or accept the
produced out_x from _mamba_chunk_scan_combined_fwd and return (out_x, states)
when return_varlen_states is False and return_final_states is False; when
return_varlen_states is True return (out_x, varlen_states) or (out_x,
final_states, varlen_states) depending on return_final_states. Keep references
to out_x, varlen_states, final_states, return_varlen_states, return_final_states
and _mamba_chunk_scan_combined_fwd so callers receive the expected (out,
state...) tuple and avoid unused-variable linter warnings.
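The restored return contract can be sketched as a plain-Python dispatcher; argument names mirror the review text, the values are stand-ins, and the `return_final_states`-only case is an assumption not spelled out above:

```python
def pack_outputs(out_x, states, final_states, varlen_states,
                 return_final_states=False, return_varlen_states=False):
    # The output tensor always comes first, followed by whichever
    # state(s) the caller requested.
    if not return_varlen_states:
        return (out_x, final_states) if return_final_states else (out_x, states)
    if return_final_states:
        return (out_x, final_states, varlen_states)
    return (out_x, varlen_states)
```

Keeping the output tensor in slot 0 for every branch is what preserves the original `(out, state...)` contract for callers.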
In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py`:
- Around line 373-378: The update_resources method currently ignores the
kv_cache_dtype_byte_size and num_accepted_draft_tokens parameters; either
remove/suppress them or use them as the source of truth. Modify update_resources
(function name: update_resources) so that if num_accepted_draft_tokens is
provided you use it directly instead of recomputing from scheduled_batch, and if
kv_cache_dtype_byte_size is provided you use it to calculate/assign the KV cache
byte size instead of ignoring it; alternatively, if these parameters are
intentionally unused, rename them to _kv_cache_dtype_byte_size and
_num_accepted_draft_tokens (or remove them) to silence the unused-parameter
warning. Ensure scheduled_batch remains the fallback when passed-in values are
None.
🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/mamba/selective_state_update.py (1)
380-380: Use `def` for `grid` to satisfy lint.

Ruff flags lambda assignments (E731). Switching to `def` removes the lint warning without behavior change.

♻️ Suggested refactor

```diff
-    grid = lambda META: (triton.cdiv(dim, META["BLOCK_SIZE_M"]), batch, nheads)
+    def grid(META):
+        return (triton.cdiv(dim, META["BLOCK_SIZE_M"]), batch, nheads)
```

tensorrt_llm/_torch/modules/mamba/ssd_combined.py (1)
20-31: Use module-qualified `packaging` import to keep namespace.

Current import style violates the namespace guideline. Please import the module and qualify `version.parse`.

♻️ Suggested change

```diff
-import triton
-from einops import rearrange
-from packaging import version
+import triton
+import packaging.version
+from einops import rearrange
@@
-TRITON_22 = version.parse(triton.__version__) >= version.parse("2.2.0")
+TRITON_22 = packaging.version.parse(triton.__version__) >= packaging.version.parse("2.2.0")
```

Note that `import packaging.version` (not a bare `import packaging`) is needed, since `packaging` does not re-export the `version` submodule from its top level.

As per coding guidelines.
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)
296-314: Avoid introducing new instance state in `forward`.

`self.intermediate_state_indices` is created inside `forward` without being initialized in `__init__`. Either keep it local or initialize it in the constructor to comply with class-member initialization rules. As per coding guidelines, all externally visible members should be initialized in `__init__`.

🔧 Example adjustment

```diff
 class Mamba2Mixer(nn.Module):
     def __init__(...):
         super().__init__()
+        self.intermediate_state_indices: Optional[torch.Tensor] = None
@@
-        self.intermediate_state_indices = torch.arange(
+        intermediate_state_indices = torch.arange(
             num_decodes, dtype=torch.int32, device=state_indices_d.device)
@@
-            intermediate_state_indices=self.intermediate_state_indices,
+            intermediate_state_indices=intermediate_state_indices,
```

tensorrt_llm/_torch/models/modeling_nemotron_h.py (2)
17-45: Preserve module namespaces in the new imports.

The new direct imports (typing, speculative, modeling_deepseekv3, modeling_speculative, modeling_utils) break the "keep module namespace" rule; prefer module imports and qualify usages to avoid name collisions. As per coding guidelines, please keep module namespaces for imports.
549-615: Initialize optional members and add a Google-style class docstring.

`self.enorm`, `self.hnorm`, `self.eh_proj`, and `self.final_layernorm` are only set conditionally; initializing them to `None` improves interface stability. The new public class also needs a Google-style docstring. As per coding guidelines, please keep public classes documented and members initialized in the constructor.

♻️ Suggested shape

```diff
 class NemotronHMTPDecoderLayer(NemotronHLayer):
+    """MTP decoder layer for NemotronH.
+
+    Args:
+        model_config: Model configuration.
+        layer_idx: Layer index.
+        aux_stream_dict: Auxiliary CUDA streams.
+        has_start_projections: Whether to apply start projections.
+        has_end_norm: Whether to apply end-layer norm.
+        layer_type: Layer kind token.
+    """
     def __init__(
         ...
     ) -> None:
         super().__init__(...)
         config = model_config.pretrained_config
         self.model_config = model_config
         self.has_start_projections = has_start_projections
         self.has_end_norm = has_end_norm
+        self.enorm = None
+        self.hnorm = None
+        self.eh_proj = None
+        self.final_layernorm = None
         if has_start_projections:
             ...
         if has_end_norm:
             self.final_layernorm = RMSNorm(...)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
tensorrt_llm/_torch/models/modeling_nemotron_h.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
tensorrt_llm/_torch/modules/mamba/selective_state_update.py
tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py
tensorrt_llm/_torch/modules/mamba/ssd_combined.py
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., `some_file.py`)
Python classes should use PascalCase (e.g., `class SomeClass`)
Python functions and methods should use snake_case (e.g., `def my_awesome_function():`)
Python local variables should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format `"""<type>: Description"""`
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic
Files:
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py
tensorrt_llm/_torch/modules/mamba/ssd_combined.py
tensorrt_llm/_torch/models/modeling_nemotron_h.py
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
tensorrt_llm/_torch/modules/mamba/selective_state_update.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Files:
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py
tensorrt_llm/_torch/modules/mamba/ssd_combined.py
tensorrt_llm/_torch/models/modeling_nemotron_h.py
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
tensorrt_llm/_torch/modules/mamba/selective_state_update.py
🧠 Learnings (8)
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
tensorrt_llm/_torch/modules/mamba/ssd_combined.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.
Applied to files:
tensorrt_llm/_torch/modules/mamba/ssd_combined.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
tensorrt_llm/_torch/modules/mamba/ssd_combined.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-12-12T03:27:08.565Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.
Applied to files:
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/modules/mamba/selective_state_update.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/modules/mamba/selective_state_update.py
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.
Applied to files:
tensorrt_llm/_torch/modules/mamba/selective_state_update.py
🧬 Code graph analysis (5)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
  tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)
    NemotronHMTP (664-759)
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)
  tensorrt_llm/_torch/speculative/interface.py (1)
    SpecMetadata (186-361)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (2)
  tensorrt_llm/_torch/models/modeling_utils.py (1)
    config (527-528)
  tests/integration/defs/perf/disagg/utils/logger.py (1)
    warning (85-87)
tensorrt_llm/_torch/modules/mamba/ssd_combined.py (2)
  tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py (1)
    _chunk_scan_fwd (501-634)
  tensorrt_llm/runtime/generation.py (1)
    state_dtype (1355-1358)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (3)
  tensorrt_llm/_torch/attention_backend/interface.py (1)
    AttentionMetadata (54-420)
  tensorrt_llm/_torch/pyexecutor/resource_manager.py (10)
    BaseResourceManager (62-85)
    shutdown (84-85)
    shutdown (411-412)
    shutdown (1308-1313)
    shutdown (1563-1564)
    update_resources (78-79)
    update_resources (593-609)
    update_resources (1413-1424)
    update_resources (1557-1558)
    KVCacheManager (145-1267)
  tensorrt_llm/_torch/pyexecutor/scheduler.py (1)
    ScheduledRequests (21-42)
🪛 Ruff (0.14.11)
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py
31-31: Unused function argument: stride_x_seq
(ARG001)
39-39: Unused function argument: stride_o_seq
(ARG001)
49-49: Unused function argument: HAS_CACHE
(ARG001)
96-96: Unused blanket noqa directive
Remove unused noqa directive
(RUF100)
369-369: Unused function argument: kwargs
(ARG001)
556-556: Unused function argument: cache_seqlens_ptr
(ARG001)
610-610: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
629-629: Unused blanket noqa directive
Remove unused noqa directive
(RUF100)
961-961: Unused function argument: metadata
(ARG001)
tensorrt_llm/_torch/modules/mamba/ssd_combined.py
242-242: Unpacked variable out_x is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
242-242: Unpacked variable dt_out is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
242-242: Unpacked variable dA_cumsum is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
242-242: Unpacked variable states is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
tensorrt_llm/_torch/models/modeling_nemotron_h.py
619-619: Unused method argument: positions
(ARG002)
671-671: Unused method argument: is_separate_draft_engine
(ARG002)
672-672: Unused method argument: prefix
(ARG002)
720-720: Unused method argument: layer_idx
(ARG002)
744-744: Unused method argument: all_rank_num_tokens
(ARG002)
745-745: Unused method argument: spec_metadata
(ARG002)
746-746: Unused method argument: kwargs
(ARG002)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
131-131: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
137-137: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
147-148: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
155-156: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
377-377: Unused method argument: kv_cache_dtype_byte_size
(ARG002)
377-377: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
511-511: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
tensorrt_llm/_torch/modules/mamba/selective_state_update.py
71-71: Unused function argument: batch
(ARG001)
380-380: Do not assign a lambda expression, use a def
Rewrite grid as a def
(E731)
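For the E731 finding on `grid`, the fix is mechanical: a Triton launch grid is just a callable taking the autotuner's META dict, and a named `def` serves the same purpose as the lambda while being debuggable. A minimal sketch (the sequence length and block size here are illustrative, not the kernel's real values):

```python
import math

seq_len, nheads = 1024, 16

# Flagged form (E731):
#   grid = lambda META: (math.ceil(seq_len / META["BLOCK_M"]), nheads)

# Preferred form: same signature and body, but a proper def.
def grid(META):
    # One program per (BLOCK_M-sized tile, head) pair.
    return (math.ceil(seq_len / META["BLOCK_M"]), nheads)

print(grid({"BLOCK_M": 128}))  # (8, 16)
```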
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/models/modeling_speculative.py (2)
692-694: Nemotron H MTP routing looks correct. Adding `NemotronHMTP` to the MTP dispatch keeps the model-type routing aligned with the new MTP path.
736-741: Verify the `is_separate_draft_engine` flag for the Nemotron H draft path. Other draft-model paths pass `is_separate_draft_engine=True`, but this branch sets it to `False`. If `NemotronHMTP` uses this flag to alter caching/weight sharing, this could misconfigure the separate draft engine. Please confirm the intended behavior; if it should be `True`, update accordingly.

tensorrt_llm/_torch/models/modeling_nemotron_h.py (6)
414-455: Good guard: limit the backbone forward to `config.num_hidden_layers`. Slicing the layer list keeps appended MTP layers out of the backbone path, which is what we want for the base forward.
463-491: `SpecDecOneEngineForCausalLM` wiring looks coherent. The new base class, wrapper forward, and load_weights preprocessing align cleanly with the spec-decoding engine flow.
Also applies to: 523-544
347-362: Verify mixer forward signatures accept `spec_metadata`. The code passes `spec_metadata` to `self.mixer(...)`. Confirm that all mixer implementations (`Mamba2Mixer` and others) either explicitly accept `spec_metadata` or use `**kwargs` in their forward signatures to avoid a runtime `TypeError`.
492-518: Confirm whether `self.model_nextn` should be assigned the computed value. The computed `model_nextn` at line 495 is used only as a local variable in the loop, while `self.model_nextn` remains 0. Verify whether this attribute should store the computed value for later use, or if keeping it at 0 is intentional.
616-661: Clarify the Python version requirement in the guidelines; `positions` parameter usage needs verification. The code uses PEP 604 union syntax (`torch.Tensor | None`) and PEP 585 generics (`tuple[...]`), which are appropriate for Python 3.10+ projects. However, CODING_GUIDELINES.md states the project targets 3.8+, while setup.py requires 3.10+. These guidelines should be reconciled with the actual runtime requirement. Additionally, verify whether the `positions` parameter is genuinely unused in this method.
664-759: Fix the unused-parameter claim and correct the QuantConfig import. The parameter `layer_idx` is not unused: it is assigned to `self.layer_idx` and passed to the sublayer initialization. However, `is_separate_draft_engine`, `prefix`, `all_rank_num_tokens`, `spec_metadata`, and `**kwargs` should be reviewed for removal if not intended to be used. Additionally, per coding guidelines, maintain the namespace when importing `QuantConfig`: change `from tensorrt_llm.models.modeling_utils import QuantConfig` to `from tensorrt_llm.models import modeling_utils` and reference it as `modeling_utils.QuantConfig`.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
mikeiovine
left a comment
Great work @sunnyqgg. Shall we add an accuracy test to tests/integration/defs/accuracy/test_llm_api_pytorch.py? We can choose any small mamba hybrid for pre-merge and put the bigger ones in post-merge
Signed-off-by: qgai <qgai@nvidia.com>
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)
1-2: Update the NVIDIA copyright year to include 2026. This source file was modified in 2026, but the header still ends at 2024. Please update the year to reflect the latest meaningful modification. As per coding guidelines, please update the header year.

📝 Suggested change

-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)

1-2: Update copyright year to include 2026. Per coding guidelines, the copyright header should include the year of latest meaningful modification. This file has significant changes (MTP support).

🔧 Suggested fix

-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py`:
- Around line 621-623: Replace Python 3.10+ union and PEP585 generic usage in
the function signature: change parameters like "residual: torch.Tensor | None =
None" to "residual: Optional[torch.Tensor] = None" and change the return
annotation "-> tuple[torch.Tensor, torch.Tensor | None]" to "->
Tuple[torch.Tensor, Optional[torch.Tensor]]"; ensure you import Optional and
Tuple from typing at the top of the module (e.g., add "from typing import
Optional, Tuple") and update any other occurrences of X | None or built-in
generic "tuple[...]" in the same file to use typing.Optional and typing.Tuple so
the code is compatible with Python 3.8.
In `@tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py`:
- Around line 869-905: The non-eagle update kernel branch is missing handling
for KERNEL_WIDTH == 5; extend the conv loop and state-shift logic in the
function (where acc, matrix_x, matrix_w, col0/col1/col2 are used) to include the
j==4 case and shift an extra column: inside the static_range(KERNEL_WIDTH) loop
add the KERNEL_WIDTH == 5 branch mapping j==1..4 to w_col1..w_col4 and loading
matrix_x from x_base_1d + idx_token * stride_x_token with mask_x_1d for the last
column, and in the state-shifting block add the KERNEL_WIDTH == 5 clause that
sets col0=col1; col1=col2; col2=col3; col3=matrix_x (ensure w_col4 and col3
variables are declared/used consistently like the eagle branch).
- Around line 173-174: The variable loaded_x is assigned from tl.load(x_ptrs,
mask_x, 0.0) but never used because new_conv_state immediately re-loads the same
data; fix by collapsing the duplicate load—either remove loaded_x and keep the
single tl.load into new_conv_state, or assign new_conv_state = loaded_x so the
value is reused; update the code around the tl.load calls (references: loaded_x,
new_conv_state, tl.load, x_ptrs, mask_x) accordingly to eliminate the dead
store.
- Around line 481-487: In the grid function, guard against seq_lens_cpu being
empty before calling max(seq_lens_cpu): add an early check (e.g., if not
seq_lens_cpu:) and return a safe grid tuple (such as (0, 0, 0)) or otherwise
raise a clear ValueError with a descriptive message; update references in grid
that compute max_seq_len and the returned tuple so the function never calls
max() on an empty sequence.
- Around line 302-354: The convolution loop and state-shifting branches miss
cases for KERNEL_WIDTH == 5 causing incorrect results; update the inner j-loop
in the function using symbols acc_preload, matrix_w, matrix_x, w_col0..w_col3
(and add w_col4), col0..col3 (and add col4) so that when KERNEL_WIDTH == 5 you
set matrix_w = w_col1..w_col4 for j==1..4 and matrix_x = col1..col3 or perform
the tl.load for the j that reads the new input (mirror the pattern used for
KERNEL_WIDTH==4), and then extend the state shift block (the if KERNEL_WIDTH ==
... section that assigns col0/col1/col2) to handle KERNEL_WIDTH == 5 by shifting
col0=col1, col1=col2, col2=col3, col3=matrix_x; keep existing mask_1d,
SILU_ACTIVATION, tl.store usage unchanged and ensure any preloaded w_col4/col4
are initialized consistently with the other columns.
In `@tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py`:
- Around line 501-515: The function _chunk_scan_fwd currently defaults out=None
but then unconditionally dereferences out.shape, causing an AttributeError if
the caller omits out; fix by making out required (remove the default and update
any call sites) or by allocating/guarding when out is None: inside
_chunk_scan_fwd (e.g., at start of the function) check if out is None and if so
allocate a tensor with the same shape and dtype/device as x (or use
torch.empty_like(x)) before any assertions/assignments so the subsequent assert
out.shape == x.shape is valid; update function signature or initial guard
accordingly.
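The `seq_lens_cpu` guard requested above comes down to one Python footgun: `max()` on an empty sequence raises `ValueError`. Either pre-check or pass the `default=` keyword. A dependency-free sketch (plain lists stand in for the CPU sequence-length values; function names are illustrative):

```python
def safe_max_seq_len(seq_lens_cpu):
    # Explicit guard: an empty batch yields length 0, so the caller
    # can produce an empty (0, 0, 0) launch grid instead of crashing.
    if not seq_lens_cpu:
        return 0
    return max(seq_lens_cpu)

def safe_max_seq_len_default(seq_lens_cpu):
    # Same result via max()'s default= keyword (Python 3.4+).
    return max(seq_lens_cpu, default=0)

print(safe_max_seq_len([]), safe_max_seq_len_default([3, 7, 5]))  # 0 7
```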
♻️ Duplicate comments (9)
tensorrt_llm/_torch/modules/mamba/ssd_combined.py (1)
237-271: Function breaks the return contract: callers expect an `(out, state)` tuple. The wrapper unpacks `out_x`, `dt_out`, `dA_cumsum`, `states`, and `final_states` but discards them all, returning only state value(s). Callers (tests, production code) that expect `out, state = mamba_chunk_scan_combined(...)` will fail with unpacking errors. Additionally, static analysis correctly flags `out_x`, `dt_out`, `dA_cumsum`, and `states` as unused. Restore the output tensor in the return values to maintain the expected API contract.

Suggested fix

+ if out is None:
+     out = torch.empty_like(x)
- out_x, dt_out, dA_cumsum, states, final_states, *rest = (
+ _out_x, _dt_out, _dA_cumsum, _states, final_states, *rest = (
      _mamba_chunk_scan_combined_fwd(
          ...
          out=out,
          state_dtype=state_dtype,
      ))
  if not return_varlen_states:
      if not return_final_states:
-         return
+         return out
      else:
-         return final_states
+         return out, final_states
  else:
      varlen_states = rest[0]
-     return ((varlen_states) if not return_final_states else
-             (final_states, varlen_states))
+     return (out, varlen_states) if not return_final_states else (
+         out, final_states, varlen_states)

tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (2)
1-3: Add NVIDIA copyright header. This file is missing the required NVIDIA SPDX header. As per coding guidelines, all TensorRT-LLM source files should contain a copyright header with the year of latest meaningful modification.

Suggested header

+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
  import torch
59-67: `re.match` won't match keys with the `model.` prefix after remapping. After the `backbone -> model` replacement on line 53, keys like `model.mtp.layers.0.xxx` won't match the pattern `r'mtp\.layers\.(\d+)\.(.*)'` because `re.match` only matches at the start of the string. Use `re.search` instead to find the pattern anywhere in the key.

Proposed fix

  if "mtp.layers." in key:
      import re
-     match = re.match(r'mtp\.layers\.(\d+)\.(.*)', key)
+     match = re.search(r'mtp\.layers\.(\d+)\.(.*)', key)
      if match:
          sublayer_idx, rest = match.groups()
          key = f"model.layers.{config.num_hidden_layers}.layers.{sublayer_idx}.{rest}"

tensorrt_llm/_torch/modules/mamba/selective_state_update.py (3)
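The `re.match` vs `re.search` point above matters because `re.match` anchors at position 0: once the checkpoint key gains a `model.` prefix, the pattern can no longer match from the start. A standalone reproduction using the same pattern (the key below is a made-up example):

```python
import re

pattern = r'mtp\.layers\.(\d+)\.(.*)'
key = "model.mtp.layers.0.mixer.weight"  # key after the backbone -> model rename

# re.match anchors at position 0, so the "model." prefix defeats it.
print(re.match(pattern, key))  # None

# re.search scans the whole string and finds the embedded pattern.
m = re.search(pattern, key)
print(m.groups())  # ('0', 'mixer.weight')
```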
1-7: Add the required NVIDIA copyright header (still missing). This source file still lacks the NVIDIA header with the latest modification year, which is required for TensorRT-LLM source files. As per coding guidelines, please add the standard header.
355-358: Guard `out` when it's `None`. `out` defaults to `None`, but `out.dim()` is accessed unconditionally. This will raise at runtime for callers that omit `out`.

🐛 Proposed fix

- if out.dim() == 2:
-     out = out.unsqueeze(1)
- if out.dim() == 3:
-     out = out.unsqueeze(1)
+ if out is None:
+     out = torch.empty_like(x)
+ elif out.dim() == 2:
+     out = out.unsqueeze(1)
+ elif out.dim() == 3:
+     out = out.unsqueeze(1)
389-390: Avoid `dt_bias` attribute access when it's `None`. `dt_bias` is optional, but `dt_bias.stride(-1)` is accessed unconditionally in `tie_hdim`. This will crash if `dt_bias` is omitted.

🐛 Proposed fix

- tie_hdim = (A.stride(-1) == 0 and A.stride(-2) == 0 and dt.stride(-1) == 0
-             and dt_bias.stride(-1) == 0)
+ tie_hdim = (A.stride(-1) == 0 and A.stride(-2) == 0 and dt.stride(-1) == 0
+             and (dt_bias is None or dt_bias.stride(-1) == 0))

tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py (2)
1-5: Add NVIDIA copyright header. This new source file requires the standard NVIDIA SPDX header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification (2026).

946-963: Unused parameters should be removed or prefixed with an underscore. The parameters `metadata` (line 961) and `**kwargs` (line 369 in `causal_conv1d_fn`) are unused. Per Python conventions, prefix them with `_` or remove them.

tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)
34-43: Consider reusing an existing utility function. There may be an existing `get_tensor_size_bytes` utility in `tensorrt_llm/_torch/pyexecutor/_util.py`. Consider reusing it to avoid code duplication.
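Whichever helper ends up being used, the computation is just element count times per-element byte size (the two `torch.Tensor` accessors involved are `numel()` and `element_size()`). A dependency-free sketch with a stand-in object; `FakeTensor` and the shape below are hypothetical, not taken from the PR:

```python
import math
from dataclasses import dataclass

@dataclass
class FakeTensor:
    # Minimal stand-in for the two torch.Tensor methods the helper needs.
    shape: tuple
    elem_bytes: int  # e.g. 2 for fp16/bf16, 4 for fp32

    def numel(self):
        return math.prod(self.shape)

    def element_size(self):
        return self.elem_bytes

def get_tensor_size_bytes(t):
    # Same arithmetic a shared size utility would perform.
    return t.numel() * t.element_size()

# A (16, 128, 64) bf16 state buffer: 131072 elements * 2 bytes each.
print(get_tensor_size_bytes(FakeTensor((16, 128, 64), 2)))  # 262144
```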
🧹 Nitpick comments (7)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)
61-61: Consider moving `import re` to module level. The `re` module import is inside the loop. For better readability and a slight performance improvement (avoiding repeated import checks), consider moving it to the top of the file with the other imports.

Suggested change

  import torch
+ import re

  import tensorrt_llm.logger as logger

Then remove line 61.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
5440-5451: Avoid permanently skipping the MTP test. The unconditional skip means the test never runs even when the model is present. Consider skipping only when the model path is unavailable so developers can run it locally.

♻️ Suggested change

- @pytest.mark.skip(reason="Skip MTP test due to no model path file in CI")
  @skip_pre_blackwell
  @pytest.mark.skip_less_mpi_world_size(8)
  def test_nvfp4_8gpus_mtp(self):
      # Test MTP (Multi-Token Prediction) accuracy with nvfp4-fp8kv model.
      # This test uses MTP with max_draft_len=3 and one_model mode.
      mtp_config = MTPDecodingConfig(
          num_nextn_predict_layers=3,
          mtp_eagle_one_model=True,
      )
      model_path = f"{llm_models_root()}/nemotron-super-sft-repeated-mtp-iter-0010600-nvfp4-fp8kv"
+     if not os.path.exists(model_path):
+         pytest.skip("MTP model path not available")

tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)
297-315: Avoid storing per-call intermediate indices on the module. `self.intermediate_state_indices` is only used within this forward pass. Keeping it local avoids unintended statefulness across concurrent/overlapping calls.

♻️ Suggested change

- self.intermediate_state_indices = torch.arange(
+ intermediate_state_indices = torch.arange(
      num_decodes, dtype=torch.int32, device=state_indices_d.device)
  ...
- intermediate_state_indices=self.intermediate_state_indices,
+ intermediate_state_indices=intermediate_state_indices,
  ...
- intermediate_state_indices=self.intermediate_state_indices,
+ intermediate_state_indices=intermediate_state_indices,
tensorrt_llm/_torch/models/modeling_nemotron_h.py (2)
667-672: Consider removing or prefixing unused constructor parameters. `is_separate_draft_engine` and `prefix` are unused. If they're for interface compatibility, prefix them with `_` to indicate intentional non-use.

♻️ Suggested fix

  def __init__(self,
               model_config: ModelConfig[NemotronHConfig],
               layer_idx: int,
               aux_stream_dict: Dict[AuxStreamType, torch.cuda.Stream],
-              is_separate_draft_engine: bool = False,
-              prefix: str = ""):
+              _is_separate_draft_engine: bool = False,
+              _prefix: str = ""):
750-758: `positions` is passed to the layer but not used in `NemotronHMTPDecoderLayer.forward`. Line 754 passes `positions=position_ids` to the layer, but `NemotronHMTPDecoderLayer.forward` (line 619) accepts but ignores the `positions` parameter. Either use it or remove it from both signatures for clarity.

tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (2)
373-416: Variable shadowing and implicit `Optional` type hints.

Line 412: The local variable `num_accepted_draft_tokens` shadows the parameter of the same name. Since the parameter is never used, consider renaming the parameter to `_num_accepted_draft_tokens` or removing it if the interface allows.

Lines 376-377: Use explicit `Optional` type hints per PEP 484:

♻️ Suggested fix

  def update_resources(
          self,
          scheduled_batch: ScheduledRequests,
-         attn_metadata: "AttentionMetadata" = None,
-         kv_cache_dtype_byte_size: float = None,
+         attn_metadata: Optional["AttentionMetadata"] = None,
+         _kv_cache_dtype_byte_size: Optional[float] = None,
          num_accepted_draft_tokens: Optional[torch.Tensor] = None):
507-516: Apply the same `Optional` type hint fixes. For consistency with PEP 484, update the type hints here as well:

♻️ Suggested fix

  def update_resources(
          self,
          scheduled_batch: ScheduledRequests,
-         attn_metadata: "AttentionMetadata" = None,
-         kv_cache_dtype_byte_size: float = None,
+         attn_metadata: Optional["AttentionMetadata"] = None,
+         kv_cache_dtype_byte_size: Optional[float] = None,
          num_accepted_draft_tokens: Optional[torch.Tensor] = None):
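The `Optional` rewrites above follow PEP 484: a `None` default does not implicitly widen the annotation, so the optionality must be spelled out. A minimal before/after (the function names here are hypothetical):

```python
from typing import Optional

# Flagged form (implicit Optional): annotated float, defaulted to None.
def update_bad(kv_cache_dtype_byte_size: float = None):
    return kv_cache_dtype_byte_size

# PEP 484-compliant: Optional[float] (or `float | None` on Python 3.10+).
def update_ok(kv_cache_dtype_byte_size: Optional[float] = None):
    return kv_cache_dtype_byte_size

print(update_ok(), update_ok(2.0))  # None 2.0
```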
/bot run

PR_Github #32459 [ run ] triggered by Bot. Commit:

PR_Github #32459 [ run ] completed with state
Signed-off-by: qgai <qgai@nvidia.com>
/bot run

PR_Github #32598 [ run ] triggered by Bot. Commit:

PR_Github #32598 [ run ] completed with state
Signed-off-by: qgai <qgai@nvidia.com>
Signed-off-by: qgai <qgai@nvidia.com>
PR_Github #33449 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33478 [ run ] triggered by Bot. Commit:

PR_Github #33478 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33498 [ run ] triggered by Bot. Commit:

PR_Github #33498 [ run ] completed with state
/bot run --disable-fail-fast

PR_Github #33516 [ run ] triggered by Bot. Commit:

PR_Github #33516 [ run ] completed with state
Force-pushed 9774823 to a956572 (Compare)
/bot reuse-pipeline

PR_Github #33578 [ reuse-pipeline ] triggered by Bot. Commit:
Signed-off-by: qgai <qgai@nvidia.com>
Force-pushed a956572 to a1166bb (Compare)
/bot reuse-pipeline

PR_Github #33580 [ reuse-pipeline ] triggered by Bot. Commit:

PR_Github #33578 [ reuse-pipeline ] completed with state

PR_Github #33580 [ reuse-pipeline ] completed with state
Summary by CodeRabbit
New Features
Improvements
Tests
✏️ Tip: You can customize this high-level summary in your review settings.
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.
skip
skip --comment COMMENT

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.