[TRTLLM-10062][feat] Enable MTP for Nemotron Super #10754

Merged

mikeiovine merged 22 commits into NVIDIA:main from sunnyqgg:nemotron-super-mtp on Jan 26, 2026
Conversation

@sunnyqgg (Collaborator) commented Jan 16, 2026

Summary by CodeRabbit

  • New Features

    • Speculative decoding support for Nemotron H with MTP layers
    • Triton-optimized causal convolution path for faster Mamba inference
    • Multi-token buffering and Eagle-tree-style attention retrieval for speculative flows
  • Improvements

    • Preallocated output buffers to reduce memory copies and improve performance
    • Structured intermediate-state caching for verified speculative updates
    • Safer tile-size selection to avoid edge-case runtime errors
  • Tests

    • Added MTP-enabled accuracy/config variants and a gated integration test (CI-skipped)


Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from the specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
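Pulling the options above together, a typical invocation might look like the following (the flag values are illustrative, not taken from this PR):

```
/bot run --disable-fail-fast --extra-stage "H100_PCIe-TensorRT-Post-Merge-1"
```

This runs the ordinary L0 pre-merge pipeline without fail-fast, plus the named extra stage.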

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@sunnyqgg sunnyqgg requested review from a team as code owners January 16, 2026 08:20
Signed-off-by: qgai <qgai@nvidia.com>
coderabbitai bot (Contributor) commented Jan 16, 2026

📝 Walkthrough

Walkthrough

Adds MTP/speculative decoding support to Nemotron‑H and Mamba: weight key remapping, speculative metadata threading, Triton causal‑conv kernels and update path, multi‑token/state buffering, intermediate caches, and cache promotion logic for draft verification.

Changes

Cohort / File(s) Summary
Nemotron‑H model & weight mapping
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py, tensorrt_llm/_torch/models/modeling_nemotron_h.py, tensorrt_llm/_torch/models/modeling_speculative.py
Add MTP key remapping; thread SpecMetadata through forward paths; change NemotronHForCausalLM base to speculative engine type; add NemotronHMTP components and wire Nemotron‑H into MTP dispatch.
Triton causal conv kernels & wrappers
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py
New Triton kernels and Python wrappers for causal 1D conv forward and fused update supporting continuous/varlen batching, state caching, optional Eagle‑tree masking, and speculative update flows.
Mamba2 mixer speculative flow
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
Add spec_metadata arg; route speculative decode via Triton update path; replace direct cache reads with layer cache API; preallocate shared SSM output tensor; add num_heads attribute.
Selective state update (multi‑token & caching)
tensorrt_llm/_torch/modules/mamba/selective_state_update.py
Extend kernel and wrapper for multi‑token T dimension, intermediate state buffers/indices, Eagle‑tree parent retrieval, optional out/in‑place behavior, and per‑step caching logic.
SSD chunk scan: out preallocation & semantics
tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py
Add out parameter, assert/out usage, adjust initial_states validation, remove previous min‑clamp in activation scaling, and change _chunk_scan_fwd to return single tensor.
SSD combined: state_dtype, out, returns
tensorrt_llm/_torch/modules/mamba/ssd_combined.py
Add state_dtype and out args, require chunk_size power‑of‑two, adapt state dtype/shape handling, and adjust return shapes for varlen/final states.
Mamba cache manager: speculative caches & promotion
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
Introduce State / SpeculativeState dataclasses, allocate intermediate buffers when speculative, expose per‑layer cache accessors, add update_resources to promote accepted draft states, and extend hybrid manager to propagate speculative config.
FMHA kernel guard
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h
Skip candidate tileSizeQ values smaller than mNumHeadsQPerKv to avoid invalid tile selection.
Integration tests / references
tests/integration/defs/accuracy/references/gsm8k.yaml, tests/integration/defs/accuracy/references/mmlu.yaml, tests/integration/defs/accuracy/test_llm_api_pytorch.py
Add MTP‑enabled reference entries and a skipped CI test exercising NVFP4+MTP decode config (num_nextn_predict_layers=3) for Nemotron‑H.
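The update_resources promotion step listed for mamba_cache_manager.py above can be illustrated with a minimal sketch. promote_accepted_states and the plain Python lists are hypothetical stand-ins for the real GPU-resident buffers:

```python
def promote_accepted_states(intermediate, main, num_accepted):
    """Sketch of cache promotion after draft verification.

    For each cache slot, copy the intermediate state written by the
    last accepted draft step into the main per-slot cache. Slots with
    no accepted tokens keep their existing main-cache entry.
    """
    for slot, accepted in enumerate(num_accepted):
        if accepted > 0:
            main[slot] = intermediate[slot][accepted - 1]
    return main
```

The real manager does this per layer for both conv and SSM intermediate buffers; this sketch only shows the indexing idea.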

Sequence Diagram(s)

sequenceDiagram
    participant Model as NemotronH Model
    participant Mixer as Mamba2Mixer
    participant Conv as CausalConv1D (Triton)
    participant Cache as MambaCacheManager
    participant Verify as Verification Engine

    Model->>Mixer: forward(hidden_states, spec_metadata)
    alt speculative (spec_metadata present)
        Mixer->>Cache: mamba_layer_cache(layer_idx) -> intermediate buffers
        Mixer->>Conv: causal_conv1d_update(speculative inputs, intermediate_state_indices)
        Conv->>Cache: read/update intermediate_conv_window / conv_state
        Mixer->>Mixer: selective_state_update(out=prealloc, intermediate_states_buffer)
        Mixer->>Verify: emit outputs + intermediate buffers for verification
        Verify->>Cache: update_resources(num_accepted_tokens)
        Cache->>Cache: promote intermediate_* -> main caches
    else standard
        Mixer->>Cache: read conv/temporal states
        Mixer->>Conv: causal_conv1d_fwd(standard path)
        Mixer->>Mixer: selective_state_update(standard, writeback)
        Mixer->>Cache: update main caches
    end
    Mixer->>Model: return processed output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is almost entirely the repository template with unfilled placeholders; it lacks any explanation of what was changed, why it was changed, and what tests safeguard the modifications. Fill in the Description section explaining the MTP implementation for Nemotron Super and the Test Coverage section documenting which tests validate these changes.
Docstring Coverage ⚠️ Warning Docstring coverage is 23.40% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title '[TRTLLM-10062][feat] Enable MTP for Nemotron Super' clearly summarizes the main change: adding MTP support to Nemotron Super model, matching the changeset content.


@sunnyqgg changed the title [TRTLLM-10062][feat] Enable MTP for Nemotron Super Jan 16, 2026
coderabbitai bot (Contributor) commented Jan 16, 2026

📝 Walkthrough

Walkthrough

This PR introduces speculative decoding and multi-token prediction (MTP) support to the Nemotron H model and Mamba-based components. It extends the model architecture with SpecMetadata propagation through layers, adds MTP decoder components, implements Triton-based causal convolution with continuous batching, and enhances state management with intermediate caching for speculative verification.

Changes

Cohort / File(s) Summary
Nemotron H Model Architecture
tensorrt_llm/_torch/models/modeling_nemotron_h.py
Extended forward signatures with spec_metadata parameter across NemotronHLayer, NemotronHModel, and NemotronHForCausalLM. Added new NemotronHMTPDecoderLayer and NemotronHMTP classes for MTP support. Changed base class to SpecDecOneEngineForCausalLM. Introduced MTP layer extension logic with quantization config overrides. (+281/-13)
Nemotron H Integration
tensorrt_llm/_torch/models/nemotron_h_weight_mapper.py, tensorrt_llm/_torch/models/modeling_speculative.py
Added key remapping for MTP layers in weight mapper with pattern matching and fallback warnings. Extended MTPForCausalLM and MTPDraftModel to recognize and route "nemotron_h" model type with corresponding MTP layer instantiation. (+20/-0)
Mamba Causal Convolution Kernels
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py
New comprehensive Triton-based implementation with _causal_conv1d_fwd_kernel and _causal_conv1d_update_kernel supporting continuous batching, optional state caching, bias, SiLU activation, and Eagle Tree attention mask constructs. Includes Python wrappers causal_conv1d_fn and causal_conv1d_update. (+1150/-0)
Mamba2 Mixer State Management
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
Added spec_metadata parameter to forward signature. Replaced direct tensor access with layer_cache via kv_cache_manager.mamba_layer_cache(). Introduced output preallocation and speculative path branching with intermediate buffer handling. Added num_heads class attribute. (+121/-34)
Speculative State Update Kernels
tensorrt_llm/_torch/modules/mamba/selective_state_update.py
Extended _selective_scan_update_kernel with intermediate state caching, Eagle Tree custom attention support, and multi-token (time-dimension) processing. Added parameters for intermediate buffers, cache steps, parent token retrieval, and disable_state_update flag. Refactored for 4D/5D tensor shapes. (+221/-91)
Chunk Scan Output Aliasing
tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py
Added out parameter for externally-allocated output reuse. Changed return value from (out, out_x) to out_x only. Modified initial state handling and replaced masked exponential computation. (+41/-49)
Combined Chunk Scan Integration
tensorrt_llm/_torch/modules/mamba/ssd_combined.py
Introduced is_int_pow_2() check and TRITON_22 flag. Replaced mamba_ssm_cache_dtype with state_dtype parameter. Added out parameter propagation and refactored return semantics to exclude out from returns. Updated shape assertions for initial states. (+91/-72)
Mamba Cache Management
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
Introduced dataclass-based State and SpeculativeState containers for unified cache management. Added intermediate state accessors (get_intermediate_ssm_states, get_intermediate_conv_states), mamba_layer_cache() method, and update_resources() for speculative state synchronization. Extended both MambaCacheManager and MambaHybridCacheManager. (+255/-25)

Sequence Diagram(s)

sequenceDiagram
    participant Client as Forward Pass
    participant Model as NemotronHModel
    participant Layer as NemotronHLayer
    participant Mixer as Mamba2Mixer
    participant Cache as MambaCacheManager

    Client->>Model: forward(input_ids, spec_metadata)
    Model->>Model: set num_hidden_layers limit
    
    loop For each active layer
        Model->>Layer: forward(position_ids, hidden_states, spec_metadata)
        Layer->>Mixer: forward(hidden_states, spec_metadata)
        
        Mixer->>Cache: mamba_layer_cache(layer_idx)
        Cache-->>Mixer: State{conv, temporal}
        
        alt Speculative Path (spec_metadata active)
            Mixer->>Mixer: causal_conv1d_update_triton(intermediate_buffers)
            Mixer->>Mixer: selective_state_update(disable_state_update=False)
            Cache->>Cache: update intermediate_ssm cache
        else Standard Path
            Mixer->>Mixer: causal_conv1d_update(conv_states)
            Mixer->>Mixer: selective_state_update(standard)
            Cache->>Cache: update main conv/temporal states
        end
        
        Mixer-->>Layer: hidden_states
        Layer-->>Model: hidden_states
    end
    
    Model-->>Client: output_tensor
sequenceDiagram
    participant MTP as NemotronHMTP
    participant Layer as NemotronHMTPDecoderLayer
    participant Proj as Projection
    participant Mixer as Mixer
    participant Cache as CacheManager

    MTP->>MTP: for each pattern layer
    MTP->>Layer: forward(fused_hiddens, spec_metadata)
    
    opt Has start projection
        Layer->>Proj: apply_start_projection()
        Proj-->>Layer: projected_states
    end
    
    Layer->>Mixer: forward(projected_states, spec_metadata)
    Mixer->>Cache: retrieve layer_cache
    Cache-->>Mixer: conv_states, temporal_states
    Mixer-->>Layer: mixed_output
    
    opt Has end norm
        Layer->>Layer: apply_end_norm()
    end
    
    Layer-->>MTP: updated_residuals
    MTP->>MTP: accumulate for next layer
    
    MTP-->>Client: final_output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is entirely empty, containing only the template boilerplate with no actual content filled in for Description, Test Coverage, or PR Checklist items. Complete the PR description by explaining what MTP is, why it's being added for Nemotron Super, what test coverage exists, and check the PR checklist items.
Docstring Coverage ⚠️ Warning Docstring coverage is 24.44% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main feature: enabling MTP (multi-token prediction) for Nemotron Super, which aligns with the substantial changes across multiple modeling files.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (10)
tensorrt_llm/_torch/modules/mamba/selective_state_update.py (1)

360-379: Validate cache_steps and cache indices before kernel launch.

When intermediate_states_buffer is provided, cache_steps can be None/0 and indices can be negative/out-of-range, which risks out-of-bounds GPU writes. Add explicit checks for cache_steps, index bounds, and buffer capacity.

🛡️ Suggested validation
     if state_batch_indices is not None:
         assert state_batch_indices.shape == (batch, )
     assert out.shape == x.shape
+
+    if intermediate_states_buffer is not None:
+        assert cache_steps is not None and cache_steps >= T
+        if intermediate_state_indices is not None:
+            min_idx = int(intermediate_state_indices.min().item())
+            max_idx = int(intermediate_state_indices.max().item())
+            assert min_idx >= 0, "intermediate_state_indices must be >= 0"
+        elif state_batch_indices is not None:
+            valid = state_batch_indices != pad_slot_id
+            max_idx = int(state_batch_indices[valid].max().item()) if valid.any() else -1
+        else:
+            max_idx = batch - 1
+        required = (max_idx + 1) * cache_steps * nheads * dim * dstate
+        assert intermediate_states_buffer.numel() >= required
tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py (3)

4-5: Update NVIDIA copyright year range.

This file was modified in this PR; please extend the year range to include 2026 to satisfy the header requirement. As per coding guidelines, please update the header to the latest modification year.


501-545: Guard against out=None to avoid AttributeError.

out is optional in the signature but immediately dereferenced. This will crash if a caller hasn’t been updated. Either enforce a clear assertion or allocate a fallback buffer.

✅ Safe fallback allocation
-    assert out.shape == x.shape
+    if out is None:
+        out = torch.empty_like(x)
+    else:
+        assert out.shape == x.shape

419-425: Potential numerical blow-up from unclamped exponent in decay calculation.

The line cb *= tl.exp(dA_cs_m[:, None] - dA_cs_k[None, :]) can produce positive arguments to exp() when iteration index k is large enough that all k indices exceed all m indices. Since dA_log values represent log-space decay (≤ 0), their cumsum is monotone decreasing, which means dA_cs_m - dA_cs_k becomes positive in such cases, causing exp() to amplify beyond 1 and destabilizing the scan.

Either add a clamp to constrain the exponent to non-positive values, or document/assert that the index relationship guarantees dA_cs_m - dA_cs_k ≤ 0:

🛠️ Suggested safe clamp
-        cb *= tl.exp(dA_cs_m[:, None] - dA_cs_k[None, :])
+        cb *= tl.exp(tl.minimum(dA_cs_m[:, None] - dA_cs_k[None, :], 0.0))
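The effect of the suggested clamp can be shown with a scalar sketch. stable_decay is a hypothetical helper; the real kernel operates on Triton blocks with tl.exp and tl.minimum:

```python
import math

def stable_decay(dA_cs_m, dA_cs_k):
    # Clamp the exponent to <= 0 so the decay factor never exceeds 1,
    # mirroring the tl.minimum(..., 0.0) guard suggested above.
    return math.exp(min(dA_cs_m - dA_cs_k, 0.0))
```

When the cumsum values are ordered as expected (dA_cs_m <= dA_cs_k), the clamp is a no-op; otherwise it caps the factor at 1 instead of amplifying.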
tensorrt_llm/_torch/modules/mamba/ssd_combined.py (1)

4-5: Update NVIDIA copyright year range.

This file was modified in this PR; please extend the year range to include 2026. As per coding guidelines, update the header to the latest modification year.

tensorrt_llm/_torch/models/modeling_speculative.py (1)

1-5: Add NVIDIA copyright header (2026).

This file is missing the required NVIDIA SPDX header. Please add/update it to reflect 2026.
As per coding guidelines, please include the header.

tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (2)

1-2: Update NVIDIA copyright year to 2026.

The SPDX header still ends at 2024 despite new modifications. Please bump the year to 2026.
As per coding guidelines, the header should reflect the latest modification year.


83-88: Fix undefined self.tp_size (runtime AttributeError).

self.tp_size is used when allocating preallocated_ssm_out, but it’s never set in __init__. This will raise at runtime. Store tp_size on self (or use self.mapping.tp_size).

🔧 Proposed fix
         if config.mapping.enable_attention_dp:
             self.mapping = Mapping(
@@
             )
             tp_size = 1
         else:
             self.mapping = config.mapping
             tp_size = config.mapping.tp_size
+        self.tp_size = tp_size

Also applies to: 208-214

tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)

1-1: Update NVIDIA copyright year to 2026.

The header still ends at 2025 despite new changes in this file. Please update to 2026.
As per coding guidelines, the latest modification year must be reflected.

tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)

1-1: Update the NVIDIA copyright year to 2026.

The SPDX header still ends at 2024 even though this file was modified in 2026.

✏️ Suggested update
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines, please keep the header year current.
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py`:
- Around line 59-67: The MTP key remapping uses re.match which only matches at
the start so keys like "model.mtp.layers.0..." (after earlier backbone->model
replacement) are missed; update the logic in the block handling "mtp.layers."
(the variable key and the regex match call) to use re.search (or a regex that
allows a prefix) e.g. search for r'mtp\.layers\.(\d+)\.(.*)' and, when found,
extract sublayer_idx and rest and rebuild key as before (key =
f"model.layers.{config.num_hidden_layers}.layers.{sublayer_idx}.{rest}"),
otherwise keep the existing logger.warning behavior.
- Around line 1-3: Add the required NVIDIA copyright/SPDX header (updated year
2026) at the very top of the file before any imports; insert the standard NVIDIA
header block containing the copyright statement with year 2026 and the
SPDX-License-Identifier so the file
(tensorrt_llm._torch.models.checkpoints.hf.nemotron_h_weight_mapper.py) begins
with the header followed by the existing imports (e.g., the current "import
torch" and "import tensorrt_llm.logger as logger").
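A minimal sketch of the first fix above (re.search-based MTP key remapping); remap_mtp_key and num_hidden_layers are illustrative names, not the mapper's actual interface:

```python
import re

def remap_mtp_key(key, num_hidden_layers):
    # re.search tolerates a "model." prefix left by the earlier
    # backbone -> model replacement, which re.match would miss.
    m = re.search(r"mtp\.layers\.(\d+)\.(.*)", key)
    if m is None:
        return None  # caller keeps the existing logger.warning path
    sublayer_idx, rest = m.group(1), m.group(2)
    return f"model.layers.{num_hidden_layers}.layers.{sublayer_idx}.{rest}"
```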

In `@tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py`:
- Around line 946-963: The parameter `metadata` on causal_conv1d_update is
unused and should be removed (or renamed to `_metadata` to intentionally mark it
unused); update the function signature of causal_conv1d_update to drop
`metadata` (or change it to `_metadata`) and adjust any internal uses or callers
accordingly (search for causal_conv1d_update references to update call sites),
and run linter to ensure Ruff warnings are resolved.
- Around line 357-370: The function causal_conv1d_fn currently accepts an unused
**kwargs which triggers lint warnings; remove **kwargs from the signature or
rename it to _kwargs (or add a targeted noqa) and update any callers if
necessary so the function signature matches; locate causal_conv1d_fn in
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py and either delete the
**kwargs parameter or rename it to _kwargs (or add a noqa comment adjacent to
the parameter) to explicitly mark it as intentionally ignored.
- Around line 31-55: The kernel currently declares unused parameters
(stride_x_seq, stride_o_seq, HAS_CACHE and the other unused args around lines
96-99) and uses a blanket `# noqa`; fix by either renaming each unused argument
to start with an underscore (e.g. `_stride_x_seq`, `_stride_o_seq`,
`_HAS_CACHE`) so linters treat them as intentionally unused, or add targeted
per-argument noqa comments (`# noqa: ARG001`) on those specific parameters and
remove the blanket `# noqa`; apply the same change to the other unused
parameters referenced in the signature so Ruff no longer flags ARG001.
- Around line 550-610: Remove the unused cache_seqlens_ptr from the
_causal_conv1d_update_kernel signature and any internal references; if callers
require the same signature, instead rename the parameter to _cache_seqlens_ptr
to mark it intentionally unused. Also remove the broad "# ruff: noqa: E501" / "#
noqa" suppression and either delete it entirely or replace it with a targeted
suppression (e.g., "# noqa: E501" only where a long-line warning actually
occurs). Apply the same cleanup (remove/rename cache_seqlens_ptr and replace the
blanket noqa) to the other analogous function/signature mentioned in the diff.
- Around line 1-5: Add the standard NVIDIA SPDX copyright header (with current
year 2026) to the top of the new module causal_conv1d_triton.py; specifically
prepend the file with the NVIDIA copyright notice and the
SPDX-License-Identifier line required by our guidelines (e.g.,
"SPDX-License-Identifier: NVIDIA-Scale-2.0" or the project’s mandated SPDX tag),
ensuring the header replaces or precedes the existing comment block so the file
contains the official NVIDIA header and year.

In `@tensorrt_llm/_torch/modules/mamba/selective_state_update.py`:
- Around line 389-395: The tie_hdim computation reads dt_bias.stride(-1)
unconditionally which will crash when dt_bias is None; update the tie_hdim
expression in selective_state_update.py to guard dt_bias (e.g., only call
dt_bias.stride(-1) when dt_bias is not None, or compute a dt_bias_stride
variable defaulting to 0 and use that) so tie_hdim uses safe values for
A.stride(-1), A.stride(-2), dt.stride(-1) and the dt_bias stride only if dt_bias
exists; keep the rest of the logic identical and adjust the conditional form
near the existing tie_hdim definition.
- Around line 1-7: Add the standard NVIDIA source header (including the NVIDIA
copyright notice and the current modification year 2026) at the top of the
module selective_state_update.py (above the existing SPDX and copyright lines)
so the file uses the repo's required NVIDIA header format; ensure the header
text follows the project's canonical wording and includes the 2026 year and any
required legal phrases used elsewhere in the repo.
- Around line 329-358: The code calls out.dim() unconditionally even though out
may be None; update the shape-normalization block in selective_state_update (the
section handling x, dt, A, B, C, D, z, dt_bias, out) to guard against None by
wrapping the out-specific lines in "if out is not None:" (or initialize out
before normalization), i.e., replace the unconditional "if out.dim() == 2/3: out
= out.unsqueeze(1)" checks with a conditional block that only calls out.dim()
when out is not None so no attribute access occurs on None.
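The guarded tie_hdim check from the first fix above can be sketched in plain Python. compute_tie_hdim is hypothetical; the real code reads tensor .stride() values directly:

```python
def compute_tie_hdim(A_strides, dt_strides, dt_bias_strides=None):
    # The head dimension is "tied" when it is broadcast (stride 0) in
    # A and dt, and in dt_bias only when dt_bias is actually provided,
    # avoiding the None-dereference the review points out.
    tied = (A_strides[-1] == 0 and A_strides[-2] == 0
            and dt_strides[-1] == 0)
    if dt_bias_strides is not None:
        tied = tied and dt_bias_strides[-1] == 0
    return tied
```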

In `@tensorrt_llm/_torch/modules/mamba/ssd_combined.py`:
- Around line 237-271: The wrapper currently discards the output tensor from
_mamba_chunk_scan_combined_fwd and only returns state values; restore the
original return contract by ensuring the output tensor (out_x) is returned first
along with state(s). Specifically, if out is None allocate or accept the
produced out_x from _mamba_chunk_scan_combined_fwd and return (out_x, states)
when return_varlen_states is False and return_final_states is False; when
return_varlen_states is True return (out_x, varlen_states) or (out_x,
final_states, varlen_states) depending on return_final_states. Keep references
to out_x, varlen_states, final_states, return_varlen_states, return_final_states
and _mamba_chunk_scan_combined_fwd so callers receive the expected (out,
state...) tuple and avoid unused-variable linter warnings.
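The restored return contract described above can be sketched as follows; pack_returns is an illustrative helper, not the actual wrapper:

```python
def pack_returns(out_x, final_states, varlen_states,
                 return_final_states, return_varlen_states):
    # Always return the output tensor first, followed by whichever
    # state tensors the caller asked for.
    if return_varlen_states:
        if return_final_states:
            return out_x, final_states, varlen_states
        return out_x, varlen_states
    if return_final_states:
        return out_x, final_states
    return out_x
```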

In `@tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py`:
- Around line 373-378: The update_resources method currently ignores the
kv_cache_dtype_byte_size and num_accepted_draft_tokens parameters; either
remove/suppress them or use them as the source of truth. Modify update_resources
(function name: update_resources) so that if num_accepted_draft_tokens is
provided you use it directly instead of recomputing from scheduled_batch, and if
kv_cache_dtype_byte_size is provided you use it to calculate/assign the KV cache
byte size instead of ignoring it; alternatively, if these parameters are
intentionally unused, rename them to _kv_cache_dtype_byte_size and
_num_accepted_draft_tokens (or remove them) to silence the unused-parameter
warning. Ensure scheduled_batch remains the fallback when passed-in values are
None.
🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/mamba/selective_state_update.py (1)

380-380: Use def for grid to satisfy lint.

Ruff flags lambda assignments (E731). Switching to def removes the lint warning without behavior change.

♻️ Suggested refactor
-    grid = lambda META: (triton.cdiv(dim, META["BLOCK_SIZE_M"]), batch, nheads)
+    def grid(META):
+        return (triton.cdiv(dim, META["BLOCK_SIZE_M"]), batch, nheads)
tensorrt_llm/_torch/modules/mamba/ssd_combined.py (1)

20-31: Use module-qualified packaging import to keep namespace.

Current import style violates the namespace guideline. Please import the module and qualify version.parse.

♻️ Suggested change
-import triton
-from einops import rearrange
-from packaging import version
+import triton
+import packaging.version
+from einops import rearrange
@@
-TRITON_22 = version.parse(triton.__version__) >= version.parse("2.2.0")
+TRITON_22 = packaging.version.parse(triton.__version__) >= packaging.version.parse("2.2.0")

As per coding guidelines.

tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)

296-314: Avoid introducing new instance state in forward.

self.intermediate_state_indices is created inside forward without being initialized in __init__. Either keep it local or initialize it in the constructor to comply with class-member initialization rules.
As per coding guidelines, all externally visible members should be initialized in __init__.

🔧 Example adjustment
 class Mamba2Mixer(nn.Module):
     def __init__(...):
         super().__init__()
+        self.intermediate_state_indices: Optional[torch.Tensor] = None
@@
-                self.intermediate_state_indices = torch.arange(
+                intermediate_state_indices = torch.arange(
                     num_decodes,
                     dtype=torch.int32,
                     device=state_indices_d.device)
@@
-                    intermediate_state_indices=self.intermediate_state_indices,
+                    intermediate_state_indices=intermediate_state_indices,
tensorrt_llm/_torch/models/modeling_nemotron_h.py (2)

17-45: Preserve module namespaces in the new imports.

The new direct imports (typing, speculative, modeling_deepseekv3, modeling_speculative, modeling_utils) break the “keep module namespace” rule; prefer module imports and qualify usages to avoid name collisions.

As per coding guidelines, please keep module namespaces for imports.


549-615: Initialize optional members and add a Google-style class docstring.

self.enorm, self.hnorm, self.eh_proj, and self.final_layernorm are only set conditionally; initializing them to None improves interface stability. The new public class also needs a Google-style docstring.

♻️ Suggested shape
 class NemotronHMTPDecoderLayer(NemotronHLayer):
+    """MTP decoder layer for NemotronH.
+
+    Args:
+        model_config: Model configuration.
+        layer_idx: Layer index.
+        aux_stream_dict: Auxiliary CUDA streams.
+        has_start_projections: Whether to apply start projections.
+        has_end_norm: Whether to apply end-layer norm.
+        layer_type: Layer kind token.
+    """
 
     def __init__(
         ...
     ) -> None:
 
         super().__init__(...)
 
         config = model_config.pretrained_config
         self.model_config = model_config
         self.has_start_projections = has_start_projections
         self.has_end_norm = has_end_norm
+        self.enorm = None
+        self.hnorm = None
+        self.eh_proj = None
+        self.final_layernorm = None
 
         if has_start_projections:
             ...
 
         if has_end_norm:
             self.final_layernorm = RMSNorm(...)
As per coding guidelines, please keep public classes documented and members initialized in the constructor.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 722978b and a7830c0.

📒 Files selected for processing (9)
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/modules/mamba/selective_state_update.py
  • tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py
  • tensorrt_llm/_torch/modules/mamba/ssd_combined.py
  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py
  • tensorrt_llm/_torch/modules/mamba/ssd_combined.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/modules/mamba/selective_state_update.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:

  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py
  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py
  • tensorrt_llm/_torch/modules/mamba/ssd_combined.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
  • tensorrt_llm/_torch/modules/mamba/selective_state_update.py
🧠 Learnings (8)
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py
  • tensorrt_llm/_torch/modules/mamba/ssd_combined.py
📚 Learning: 2025-08-14T06:36:40.701Z
Learnt from: timlee0212
Repo: NVIDIA/TensorRT-LLM PR: 6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.701Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/modules/mamba/ssd_combined.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.

Applied to files:

  • tensorrt_llm/_torch/modules/mamba/ssd_combined.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
📚 Learning: 2025-12-12T03:27:08.565Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/modules/mamba/selective_state_update.py
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/modules/mamba/selective_state_update.py
📚 Learning: 2025-08-15T06:46:54.897Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:54.897Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp addToken function, newly allocated blocks are unshared by design. The beam search path in addToken (when sequence.getNumTokens() > windowSize) is currently broken/non-functional with SWA, so the block allocation doesn't follow a shared-then-unshared pattern.

Applied to files:

  • tensorrt_llm/_torch/modules/mamba/selective_state_update.py
🧬 Code graph analysis (5)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)
  • NemotronHMTP (664-759)
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)
tensorrt_llm/_torch/speculative/interface.py (1)
  • SpecMetadata (186-361)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (2)
tensorrt_llm/_torch/models/modeling_utils.py (1)
  • config (527-528)
tests/integration/defs/perf/disagg/utils/logger.py (1)
  • warning (85-87)
tensorrt_llm/_torch/modules/mamba/ssd_combined.py (2)
tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py (1)
  • _chunk_scan_fwd (501-634)
tensorrt_llm/runtime/generation.py (1)
  • state_dtype (1355-1358)
tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (3)
tensorrt_llm/_torch/attention_backend/interface.py (1)
  • AttentionMetadata (54-420)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (10)
  • BaseResourceManager (62-85)
  • shutdown (84-85)
  • shutdown (411-412)
  • shutdown (1308-1313)
  • shutdown (1563-1564)
  • update_resources (78-79)
  • update_resources (593-609)
  • update_resources (1413-1424)
  • update_resources (1557-1558)
  • KVCacheManager (145-1267)
tensorrt_llm/_torch/pyexecutor/scheduler.py (1)
  • ScheduledRequests (21-42)
🪛 Ruff (0.14.11)
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py

31-31: Unused function argument: stride_x_seq

(ARG001)


39-39: Unused function argument: stride_o_seq

(ARG001)


49-49: Unused function argument: HAS_CACHE

(ARG001)


96-96: Unused blanket noqa directive

Remove unused noqa directive

(RUF100)


369-369: Unused function argument: kwargs

(ARG001)


556-556: Unused function argument: cache_seqlens_ptr

(ARG001)


610-610: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


629-629: Unused blanket noqa directive

Remove unused noqa directive

(RUF100)


961-961: Unused function argument: metadata

(ARG001)

tensorrt_llm/_torch/modules/mamba/ssd_combined.py

242-242: Unpacked variable out_x is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


242-242: Unpacked variable dt_out is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


242-242: Unpacked variable dA_cumsum is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


242-242: Unpacked variable states is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

tensorrt_llm/_torch/models/modeling_nemotron_h.py

619-619: Unused method argument: positions

(ARG002)


671-671: Unused method argument: is_separate_draft_engine

(ARG002)


672-672: Unused method argument: prefix

(ARG002)


720-720: Unused method argument: layer_idx

(ARG002)


744-744: Unused method argument: all_rank_num_tokens

(ARG002)


745-745: Unused method argument: spec_metadata

(ARG002)


746-746: Unused method argument: kwargs

(ARG002)

tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py

131-131: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


137-137: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


147-148: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


155-156: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)
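For reference on the RUF005 items above: iterable unpacking builds the same list as `+` concatenation and also splices in scalars without wrapping them in a temporary list. A small illustration:

```python
head = [1, 2]
tail = [3, 4]

# Concatenation builds the result via list.__add__:
concatenated = head + tail

# Iterable unpacking (RUF005's preferred form) produces the same list:
unpacked = [*head, *tail]

# Extra scalars can be spliced in without wrapping them in a list first:
with_extra = [*head, 99, *tail]
```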


377-377: Unused method argument: kv_cache_dtype_byte_size

(ARG002)


377-377: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


511-511: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

tensorrt_llm/_torch/modules/mamba/selective_state_update.py

71-71: Unused function argument: batch

(ARG001)


380-380: Do not assign a lambda expression, use a def

Rewrite grid as a def

(E731)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/models/modeling_speculative.py (2)

692-694: Nemotron H MTP routing looks correct.

Adding NemotronHMTP to the MTP dispatch keeps the model-type routing aligned with the new MTP path.


736-741: Verify is_separate_draft_engine flag for Nemotron H draft path.

Other draft-model paths pass is_separate_draft_engine=True, but this branch sets it to False. If NemotronHMTP uses this flag to alter caching/weight sharing, this could misconfigure the separate draft engine. Please confirm the intended behavior; if it should be True, update accordingly.

tensorrt_llm/_torch/models/modeling_nemotron_h.py (6)

414-455: Good guard: limit backbone forward to config.num_hidden_layers.

Slicing the layer list keeps appended MTP layers out of the backbone path, which is what we want for the base forward.


463-491: SpecDecOneEngineForCausalLM wiring looks coherent.

The new base class, wrapper forward, and load_weights preprocessing align cleanly with the spec-decoding engine flow.

Also applies to: 523-544


347-362: Verify mixer forward signatures accept spec_metadata.

The code passes spec_metadata to self.mixer(...). Confirm that all mixer implementations (Mamba2Mixer and others) either explicitly accept spec_metadata or use **kwargs in their forward signatures to avoid runtime TypeError.


492-518: Confirm whether self.model_nextn should be assigned the computed value.

The computed model_nextn at line 495 is used only as a local variable in the loop, while self.model_nextn remains 0. Verify whether this attribute should store the computed value for later use, or if keeping it at 0 is intentional.


616-661: Clarify Python version requirement in guidelines; positions parameter usage needs verification.

The code uses PEP 604 union syntax (torch.Tensor | None) and PEP 585 (tuple[...]), which are appropriate for Python 3.10+ projects. However, CODING_GUIDELINES.md states the project targets 3.8+, while setup.py requires 3.10+. These guidelines should be reconciled with the actual runtime requirement. Additionally, verify whether the positions parameter is genuinely unused in this method.


664-759: Fix the unused parameter claim and correct the QuantConfig import.

The parameter layer_idx is not unused—it is assigned to self.layer_idx and passed to the sublayer initialization. However, is_separate_draft_engine, prefix, all_rank_num_tokens, spec_metadata, and **kwargs should be reviewed for removal if not intended to be used. Additionally, per coding guidelines, maintain the namespace when importing QuantConfig; change from tensorrt_llm.models.modeling_utils import QuantConfig to use from tensorrt_llm.models import modeling_utils and reference it as modeling_utils.QuantConfig.


@mikeiovine mikeiovine left a comment

Great work @sunnyqgg. Shall we add an accuracy test to tests/integration/defs/accuracy/test_llm_api_pytorch.py? We can choose any small mamba hybrid for pre-merge and put the bigger ones in post-merge

Signed-off-by: qgai <qgai@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)

1-2: Update the NVIDIA copyright year to include 2026.

This source file was modified in 2026, but the header still ends at 2024. Please update the year to reflect the latest meaningful modification. As per coding guidelines, please update the header year.

📝 Suggested change
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)

1-2: Update copyright year to include 2026.

Per coding guidelines, the copyright header should include the year of latest meaningful modification. This file has significant changes (MTP support).

🔧 Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py`:
- Around line 621-623: Replace Python 3.10+ union and PEP585 generic usage in
the function signature: change parameters like "residual: torch.Tensor | None =
None" to "residual: Optional[torch.Tensor] = None" and change the return
annotation "-> tuple[torch.Tensor, torch.Tensor | None]" to "->
Tuple[torch.Tensor, Optional[torch.Tensor]]"; ensure you import Optional and
Tuple from typing at the top of the module (e.g., add "from typing import
Optional, Tuple") and update any other occurrences of X | None or built-in
generic "tuple[...]" in the same file to use typing.Optional and typing.Tuple so
the code is compatible with Python 3.8.

In `@tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py`:
- Around line 869-905: The non-eagle update kernel branch is missing handling
for KERNEL_WIDTH == 5; extend the conv loop and state-shift logic in the
function (where acc, matrix_x, matrix_w, col0/col1/col2 are used) to include the
j==4 case and shift an extra column: inside the static_range(KERNEL_WIDTH) loop
add the KERNEL_WIDTH == 5 branch mapping j==1..4 to w_col1..w_col4 and loading
matrix_x from x_base_1d + idx_token * stride_x_token with mask_x_1d for the last
column, and in the state-shifting block add the KERNEL_WIDTH == 5 clause that
sets col0=col1; col1=col2; col2=col3; col3=matrix_x (ensure w_col4 and col3
variables are declared/used consistently like the eagle branch).
- Around line 173-174: The variable loaded_x is assigned from tl.load(x_ptrs,
mask_x, 0.0) but never used because new_conv_state immediately re-loads the same
data; fix by collapsing the duplicate load—either remove loaded_x and keep the
single tl.load into new_conv_state, or assign new_conv_state = loaded_x so the
value is reused; update the code around the tl.load calls (references: loaded_x,
new_conv_state, tl.load, x_ptrs, mask_x) accordingly to eliminate the dead
store.
- Around line 481-487: In the grid function, guard against seq_lens_cpu being
empty before calling max(seq_lens_cpu): add an early check (e.g., if not
seq_lens_cpu:) and return a safe grid tuple (such as (0, 0, 0)) or otherwise
raise a clear ValueError with a descriptive message; update references in grid
that compute max_seq_len and the returned tuple so the function never calls
max() on an empty sequence.
- Around line 302-354: The convolution loop and state-shifting branches miss
cases for KERNEL_WIDTH == 5 causing incorrect results; update the inner j-loop
in the function using symbols acc_preload, matrix_w, matrix_x, w_col0..w_col3
(and add w_col4), col0..col3 (and add col4) so that when KERNEL_WIDTH == 5 you
set matrix_w = w_col1..w_col4 for j==1..4 and matrix_x = col1..col3 or perform
the tl.load for the j that reads the new input (mirror the pattern used for
KERNEL_WIDTH==4), and then extend the state shift block (the if KERNEL_WIDTH ==
... section that assigns col0/col1/col2) to handle KERNEL_WIDTH == 5 by shifting
col0=col1, col1=col2, col2=col3, col3=matrix_x; keep existing mask_1d,
SILU_ACTIVATION, tl.store usage unchanged and ensure any preloaded w_col4/col4
are initialized consistently with the other columns.
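The empty-sequence guard requested for the grid function (around line 481-487) can be sketched without the Triton launch machinery; the tile size and grid shape here are placeholders:

```python
def grid(seq_lens_cpu: list) -> tuple:
    # Guard: max() on an empty sequence raises ValueError, so return a
    # zero-sized launch grid instead of crashing.
    if not seq_lens_cpu:
        return (0, 0, 0)
    max_seq_len = max(seq_lens_cpu)
    # Hypothetical tiling: one program per 128-token tile, per sequence.
    block = 128
    num_tiles = -(-max_seq_len // block)  # ceiling division
    return (num_tiles, len(seq_lens_cpu), 1)
```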

In `@tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py`:
- Around line 501-515: The function _chunk_scan_fwd currently defaults out=None
but then unconditionally dereferences out.shape, causing an AttributeError if
the caller omits out; fix by making out required (remove the default and update
any call sites) or by allocating/guarding when out is None: inside
_chunk_scan_fwd (e.g., at start of the function) check if out is None and if so
allocate a tensor with the same shape and dtype/device as x (or use
torch.empty_like(x)) before any assertions/assignments so the subsequent assert
out.shape == x.shape is valid; update function signature or initial guard
accordingly.
♻️ Duplicate comments (9)
tensorrt_llm/_torch/modules/mamba/ssd_combined.py (1)

237-271: Function breaks return contract — callers expect (out, state) tuple.

The wrapper unpacks out_x, dt_out, dA_cumsum, states, and final_states but discards them all, returning only state value(s). Callers (tests, production code) that expect out, state = mamba_chunk_scan_combined(...) will fail with unpacking errors. Additionally, static analysis correctly flags out_x, dt_out, dA_cumsum, and states as unused.

Restore the output tensor in the return values to maintain the expected API contract.

Suggested fix
+    if out is None:
+        out = torch.empty_like(x)
-    out_x, dt_out, dA_cumsum, states, final_states, *rest = (
+    _out_x, _dt_out, _dA_cumsum, _states, final_states, *rest = (
         _mamba_chunk_scan_combined_fwd(
             ...
             out=out,
             state_dtype=state_dtype,
         ))
     if not return_varlen_states:
         if not return_final_states:
-            return
+            return out
         else:
-            return final_states
+            return out, final_states
     else:
         varlen_states = rest[0]
-        return ((varlen_states) if not return_final_states else
-                (final_states, varlen_states))
+        return (out, varlen_states) if not return_final_states else (
+            out, final_states, varlen_states)
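The failure mode described above is easy to reproduce: a callee that drops a value from its return contract breaks every `out, state = ...` call site. A toy illustration (not the real kernel wrapper):

```python
def scan_broken():
    # Returns only the final state, discarding the output tensor.
    return "final_state"

def scan_fixed():
    # Preserves the (out, final_state) contract callers rely on.
    return "out", "final_state"

broken_raises = False
try:
    # Unpacking a single string into two names fails (it iterates characters).
    out, state = scan_broken()
except ValueError:
    broken_raises = True

# Caller written against the documented contract works with the fixed version:
out, state = scan_fixed()
```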
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (2)

1-3: Add NVIDIA copyright header.

This file is missing the required NVIDIA SPDX header. As per coding guidelines, all TensorRT-LLM source files should contain a copyright header with the year of latest meaningful modification.

Suggested header
+ # SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # SPDX-License-Identifier: Apache-2.0
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
  import torch

59-67: re.match won't match keys with model. prefix after remapping.

After the backbone -> model replacement on line 53, keys like model.mtp.layers.0.xxx won't match the pattern r'mtp\.layers\.(\d+)\.(.*)' because re.match only matches at the start of the string. Use re.search instead to find the pattern anywhere in the key.

Proposed fix
             if "mtp.layers." in key:
                 import re
-                match = re.match(r'mtp\.layers\.(\d+)\.(.*)', key)
+                match = re.search(r'mtp\.layers\.(\d+)\.(.*)', key)
                 if match:
                     sublayer_idx, rest = match.groups()
                     key = f"model.layers.{config.num_hidden_layers}.layers.{sublayer_idx}.{rest}"
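The distinction driving the fix above: re.match anchors at position 0, while re.search scans the whole string, so a model. prefix defeats match but not search.

```python
import re

pattern = r'mtp\.layers\.(\d+)\.(.*)'
key = "model.mtp.layers.0.mixer.weight"  # example key after backbone -> model remap

# re.match anchors at the start of the string, so the "model." prefix
# prevents a match here:
assert re.match(pattern, key) is None

# re.search finds the pattern anywhere in the string:
found = re.search(pattern, key)
sublayer_idx, rest = found.groups()
```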
tensorrt_llm/_torch/modules/mamba/selective_state_update.py (3)

1-7: Add the required NVIDIA copyright header (still missing).

This source file still lacks the NVIDIA header with the latest modification year, which is required for TensorRT‑LLM source files. As per coding guidelines, please add the standard header.


355-358: Guard out when it’s None.

out defaults to None, but out.dim() is accessed unconditionally. This will raise at runtime for callers that omit out.

🐛 Proposed fix
-    if out.dim() == 2:
-        out = out.unsqueeze(1)
-    if out.dim() == 3:
-        out = out.unsqueeze(1)
+    if out is None:
+        out = torch.empty_like(x)
+    elif out.dim() == 2:
+        out = out.unsqueeze(1)
+    elif out.dim() == 3:
+        out = out.unsqueeze(1)

389-390: Avoid dt_bias attribute access when it’s None.

dt_bias is optional, but dt_bias.stride(-1) is accessed unconditionally in tie_hdim. This will crash if dt_bias is omitted.

🐛 Proposed fix
-    tie_hdim = (A.stride(-1) == 0 and A.stride(-2) == 0 and dt.stride(-1) == 0
-                and dt_bias.stride(-1) == 0)
+    tie_hdim = (A.stride(-1) == 0 and A.stride(-2) == 0 and dt.stride(-1) == 0
+                and (dt_bias is None or dt_bias.stride(-1) == 0))
tensorrt_llm/_torch/modules/mamba/causal_conv1d_triton.py (2)

1-5: Add NVIDIA copyright header.

This new source file requires the standard NVIDIA SPDX header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification (2026).


946-963: Unused parameters should be removed or prefixed with underscore.

Parameters metadata (line 961) and **kwargs (line 369 in causal_conv1d_fn) are unused. Per Python conventions, prefix with _ or remove them.

tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (1)

34-43: Consider reusing existing utility function.

There may be an existing get_tensor_size_bytes utility in tensorrt_llm/_torch/pyexecutor/_util.py. Consider reusing it to avoid code duplication.

🧹 Nitpick comments (7)
tensorrt_llm/_torch/models/checkpoints/hf/nemotron_h_weight_mapper.py (1)

61-61: Consider moving import re to module level.

The re module import is inside the loop. For better readability and slight performance improvement (avoiding repeated import checks), consider moving it to the top of the file with other imports.

Suggested change
  import torch
+ import re
 
  import tensorrt_llm.logger as logger

Then remove line 61.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

5440-5451: Avoid permanently skipping the MTP test.

The unconditional skip means the test never runs even when the model is present. Consider skipping only when the model path is unavailable so developers can run it locally.

♻️ Suggested change
-    @pytest.mark.skip(reason="Skip MTP test due to no model path file in CI")
     @skip_pre_blackwell
     @pytest.mark.skip_less_mpi_world_size(8)
     def test_nvfp4_8gpus_mtp(self):
         # Test MTP (Multi-Token Prediction) accuracy with nvfp4-fp8kv model.
         # This test uses MTP with max_draft_len=3 and one_model mode.
         mtp_config = MTPDecodingConfig(
             num_nextn_predict_layers=3,
             mtp_eagle_one_model=True,
         )
         model_path = f"{llm_models_root()}/nemotron-super-sft-repeated-mtp-iter-0010600-nvfp4-fp8kv"
+        if not os.path.exists(model_path):
+            pytest.skip("MTP model path not available")
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)

297-315: Avoid storing per-call intermediate indices on the module.

self.intermediate_state_indices is only used within this forward pass. Keeping it local avoids unintended statefulness across concurrent/overlapping calls.

♻️ Suggested change
-                self.intermediate_state_indices = torch.arange(
+                intermediate_state_indices = torch.arange(
                     num_decodes,
                     dtype=torch.int32,
                     device=state_indices_d.device)
...
-                    intermediate_state_indices=self.intermediate_state_indices,
+                    intermediate_state_indices=intermediate_state_indices,
...
-                    intermediate_state_indices=self.intermediate_state_indices,
+                    intermediate_state_indices=intermediate_state_indices,

Also applies to: 371-382

tensorrt_llm/_torch/models/modeling_nemotron_h.py (2)

667-672: Consider removing or prefixing unused constructor parameters.

is_separate_draft_engine and prefix are unused. If they're for interface compatibility, prefix with _ to indicate intentional non-use.

♻️ Suggested fix
     def __init__(self,
                  model_config: ModelConfig[NemotronHConfig],
                  layer_idx: int,
                  aux_stream_dict: Dict[AuxStreamType, torch.cuda.Stream],
-                 is_separate_draft_engine: bool = False,
-                 prefix: str = ""):
+                 _is_separate_draft_engine: bool = False,
+                 _prefix: str = ""):

750-758: positions is passed to layer but not used in NemotronHMTPDecoderLayer.forward.

Line 754 passes positions=position_ids to the layer, but NemotronHMTPDecoderLayer.forward (line 619) accepts but ignores the positions parameter. Either use it or remove it from both signatures for clarity.

tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (2)

373-416: Variable shadowing and implicit Optional type hints.

  1. Line 412: The local variable num_accepted_draft_tokens shadows the parameter of the same name. Since the parameter is never used, consider renaming it to _num_accepted_draft_tokens, or removing it if the interface allows.

  2. Lines 376-377: Use explicit Optional type hints per PEP 484:

♻️ Suggested fix
     def update_resources(
             self,
             scheduled_batch: ScheduledRequests,
-            attn_metadata: "AttentionMetadata" = None,
-            kv_cache_dtype_byte_size: float = None,
+            attn_metadata: Optional["AttentionMetadata"] = None,
+            _kv_cache_dtype_byte_size: Optional[float] = None,
             num_accepted_draft_tokens: Optional[torch.Tensor] = None):

507-516: Apply same Optional type hint fixes.

For consistency with PEP 484, update the type hints here as well:

♻️ Suggested fix
     def update_resources(
             self,
             scheduled_batch: ScheduledRequests,
-            attn_metadata: "AttentionMetadata" = None,
-            kv_cache_dtype_byte_size: float = None,
+            attn_metadata: Optional["AttentionMetadata"] = None,
+            kv_cache_dtype_byte_size: Optional[float] = None,
             num_accepted_draft_tokens: Optional[torch.Tensor] = None):
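For reference, the implicit-Optional behavior these hint fixes address is visible at runtime: `typing.get_type_hints` rewrites a `None`-defaulted parameter's annotation to `Optional[...]`, which is exactly what PEP 484 asks to be spelled out explicitly. A small standalone check (parameter name borrowed from the diff above for illustration only):

```python
from typing import Optional, get_type_hints


def implicit(kv_cache_dtype_byte_size: float = None):  # discouraged spelling
    pass


def explicit(kv_cache_dtype_byte_size: Optional[float] = None):  # PEP 484
    pass


# get_type_hints already treats the None default as Optional; spelling it
# out just makes the signature honest about that. Both report Optional[float]:
print(get_type_hints(implicit)["kv_cache_dtype_byte_size"])
print(get_type_hints(explicit)["kv_cache_dtype_byte_size"])
```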

@IzzyPutterman
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #32459 [ run ] triggered by Bot. Commit: 205503c

@tensorrt-cicd
Collaborator

PR_Github #32459 [ run ] completed with state SUCCESS. Commit: 205503c
/LLM/main/L0_MergeRequest_PR pipeline #25146 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg sunnyqgg requested review from a team as code owners January 19, 2026 16:49
@sunnyqgg sunnyqgg requested a review from achartier January 19, 2026 16:49
@IzzyPutterman
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #32598 [ run ] triggered by Bot. Commit: 88ce470

@tensorrt-cicd
Collaborator

PR_Github #32598 [ run ] completed with state SUCCESS. Commit: 88ce470
/LLM/main/L0_MergeRequest_PR pipeline #25234 completed with status: 'FAILURE'


Signed-off-by: qgai <qgai@nvidia.com>
Signed-off-by: qgai <qgai@nvidia.com>
@tensorrt-cicd
Collaborator

PR_Github #33449 [ run ] completed with state SUCCESS. Commit: 7121db8
/LLM/main/L0_MergeRequest_PR pipeline #25819 completed with status: 'FAILURE'


@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #33478 [ run ] triggered by Bot. Commit: 7121db8

@tensorrt-cicd
Collaborator

PR_Github #33478 [ run ] completed with state SUCCESS. Commit: 7121db8
/LLM/main/L0_MergeRequest_PR pipeline #25845 completed with status: 'FAILURE'


@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #33498 [ run ] triggered by Bot. Commit: 7121db8

@tensorrt-cicd
Collaborator

PR_Github #33498 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 7 PM PST on 1/25.

@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #33516 [ run ] triggered by Bot. Commit: 7121db8

@tensorrt-cicd
Collaborator

PR_Github #33516 [ run ] completed with state SUCCESS. Commit: 7121db8
/LLM/main/L0_MergeRequest_PR pipeline #25854 completed with status: 'SUCCESS'

Collaborator

@QiJune QiJune left a comment


LGTM

@sunnyqgg
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #33578 [ reuse-pipeline ] triggered by Bot. Commit: a956572

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #33580 [ reuse-pipeline ] triggered by Bot. Commit: a1166bb

@tensorrt-cicd
Collaborator

PR_Github #33578 [ reuse-pipeline ] completed with state ABORTED. Commit: a956572
Can't reuse PR_Github #33516 with status: SUCCESS

@tensorrt-cicd
Collaborator

PR_Github #33580 [ reuse-pipeline ] completed with state SUCCESS. Commit: a1166bb
Reusing PR_Github #33516 for commit a1166bb

@mikeiovine mikeiovine merged commit ff0dd60 into NVIDIA:main Jan 26, 2026
5 checks passed
