[None][feat] sm100 weight-only kernel #10190
Conversation
Signed-off-by: Cheng Hang <chang@nvidia.com>
📝 Walkthrough

This PR adds SM100 (Blackwell) GPU architecture support to TensorRT LLM's CUTLASS-based mixed-precision weight-only GEMM pipeline. It introduces SM100-specific collective builders, MMA kernels, dispatchers, launchers, and heuristics, while adjusting architecture gating logic across existing paths. The changes enable weight-only quantized operations on Blackwell hardware through a new CUTLASS collective workflow.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Areas requiring extra attention:
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
2-2: Update the copyright year to include 2025.

The copyright header should reflect the year of the latest meaningful modification.
As per coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.
🔎 Proposed fix
```diff
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
```

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h (1)
2-2: Update copyright year to 2025.

The copyright header should reflect the year of the latest meaningful modification.
🔎 Proposed fix
```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
185-189: Dead code: unreachable branch.

The condition `elif self.arch > 100` on line 187 is unreachable because `if self.arch >= 90` on line 185 already handles all architectures ≥ 90, including 100 and above. This branch will never execute.

🔎 Proposed fix: remove dead code
```diff
 if self.arch >= 90:
     return kernel_prefix + hopper_suffix
-elif self.arch > 100:
-    raise ValueError(f"SM{self.arch} not supported yet.")
 return kernel_prefix
```
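The dead branch is easy to see in isolation. Below is a minimal C++ rendition of the same control flow (function name and return values are illustrative, not from the PR): any `arch >= 90` returns on the first test, so the later `arch > 100` test can never be reached.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Mirrors the reviewed control flow: every arch >= 90 returns early,
// so the second condition below is dead code.
std::string kernelSuffix(int arch)
{
    if (arch >= 90)
    {
        return "_sm90"; // taken for SM90, SM100, and anything above
    }
    else if (arch > 100) // unreachable: arch > 100 implies arch >= 90
    {
        throw std::runtime_error("unsupported arch");
    }
    return "";
}
```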
🧹 Nitpick comments (11)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
482-505: Consider extracting CUDA kernel profiler config creation.

The pattern of creating a CUDA kernel profiler configuration is repeated across three architecture paths:
- SM90 (lines 358-367)
- SM100 (lines 482-505)
- Older architectures (lines 619-627)
While each architecture may have different tile configuration requirements, the CUDA kernel creation logic is similar. Consider extracting this into a helper function to reduce duplication and improve maintainability.
Example refactor approach
```cpp
// Helper function to add CUDA kernel profiler config
template <typename TileConfigType>
void add_cuda_kernel_profiler_config(
    std::vector<CutlassGemmConfig>& candidate_configs, TileConfigType tile_config,
    /* other architecture-specific params */)
{
    CutlassGemmConfig cudaKernelConfig(/* construct based on arch */);
    cudaKernelConfig.enableCudaKernel = true;
    candidate_configs.push_back(cudaKernelConfig);
}
```

This would centralize the CUDA kernel profiler logic while allowing architecture-specific customization.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h (1)
536-540: LGTM – SM100 correctly routed away from legacy kernel.

The change from `>= 1000` to `> 1000` intentionally excludes SM100 (Blackwell) from this legacy CUTLASS 2.x path, routing it instead to the new SM100 weight-only collective/kernel infrastructure being added in this PR. SM101+ falls back to SM80 as a conservative default.

Consider updating the comment on line 539 to explicitly mention SM100:
- CUTLASS_NOT_IMPLEMENTED(); // Don't compile these for Hopper or later. Use CUTLASS 3.x kernels. + CUTLASS_NOT_IMPLEMENTED(); // Don't compile these for Hopper (SM90) or Blackwell (SM100). Use CUTLASS 3.x kernels.cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp (2)
1-15: Update copyright year to include 2025.

As per coding guidelines, the NVIDIA copyright header should include the year of its latest meaningful modification. This file is being modified in 2025.
🔎 Proposed fix
```diff
 /*
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
  *
```
598-622: Consider using `TLLM_LOG_TRACE` instead of `TLLM_LOG_INFO` for consistency.

The existing preprocessing functions in this file use `TLLM_LOG_TRACE` (e.g., lines 190, 255, 393, 494, 516), but these new logging statements use `TLLM_LOG_INFO`. Using `TLLM_LOG_INFO` for each preprocessing step could be verbose in production, especially for models with many weight tensors. Consider `TLLM_LOG_TRACE` or `TLLM_LOG_DEBUG` for consistency with the established pattern.

🔎 Proposed fix
```diff
 if (details.uses_imma_ldsm)
 {
-    TLLM_LOG_INFO("permute_B_rows_for_mixed_gemm");
+    TLLM_LOG_TRACE("permute_B_rows_for_mixed_gemm");
     permute_B_rows_for_mixed_gemm(dst_buf.data(), src_buf.data(), shape, quant_type, arch);
     src_buf.swap(dst_buf);
 }
 if (details.layoutB == LayoutDetails::Layout::COLUMN_MAJOR)
 {
-    TLLM_LOG_INFO("subbyte_transpose");
+    TLLM_LOG_TRACE("subbyte_transpose");
     subbyte_transpose(dst_buf.data(), src_buf.data(), shape, quant_type);
     src_buf.swap(dst_buf);
 }
 if (details.columns_interleaved > 1 && (arch != 90 && arch != 100))
 {
-    TLLM_LOG_INFO("interleave_column_major_tensor");
+    TLLM_LOG_TRACE("interleave_column_major_tensor");
     interleave_column_major_tensor(dst_buf.data(), src_buf.data(), shape, quant_type, details);
     src_buf.swap(dst_buf);
 }
 if (arch != 100)
 {
-    TLLM_LOG_INFO("add_bias_and_interleave_quantized_tensor_inplace");
+    TLLM_LOG_TRACE("add_bias_and_interleave_quantized_tensor_inplace");
     add_bias_and_interleave_quantized_tensor_inplace(src_buf.data(), num_elts, quant_type);
 }
-TLLM_LOG_INFO("copy");
+TLLM_LOG_TRACE("copy");
 std::copy(src_buf.begin(), src_buf.end(), preprocessed_quantized_weight);
```
1-15: Update copyright year to include 2025.

As per coding guidelines, the NVIDIA copyright header should include the year of its latest meaningful modification.
🔎 Proposed fix
```diff
 /*
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
  *
```
242-246: Inconsistent error output: uses `std::cout` alongside logger.

Line 244 uses `std::cout` for error output, while the rest of the file uses `TLLM_LOG_ERROR`. Consider using consistent logging.

🔎 Proposed fix
```diff
 if (can_implement != cutlass::Status::kSuccess)
 {
     std::string err_msg = "fpA_intB cutlass kernel will fail for params. Error: "
         + std::string(cutlass::cutlassGetStatusString(can_implement));
-    std::cout << err_msg << std::endl;
+    TLLM_LOG_ERROR(err_msg);
     throw std::runtime_error("[TensorRT LLM Error][fpA_intB Runner] " + err_msg);
 }
```
38-38: Avoid `using namespace` directive in header files.

Placing `using namespace cute;` in a header file pollutes the namespace of every translation unit that includes this header. This can lead to name collisions and ambiguity.

🔎 Suggested fix
```diff
-using namespace cute;
```
Then qualify `cute::` types explicitly where needed (e.g., `cute::Shape`, `cute::Int`, `cute::_1`, `cute::_64`, `cute::_128`).
59-64: Unreachable `break` after `throw`.

The `break` statement on line 63 is unreachable since `throw` already transfers control. The same pattern appears in the other dispatch functions (lines 89, 113, 147). While harmless, removing these dead statements improves clarity.

🔎 Suggested fix for this switch case
```diff
 default:
     throw std::runtime_error(
         "[TensorRT LLM Error][fpA_intB][sm100_dispatch_epilogue_schedules] Unsupported epilogue schedule for mixed "
         "type GEMM.");
-    break;
 }
```
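For illustration, here is a minimal standalone example of the same pattern (function and strings are ours, not from the PR): a statement placed after `throw` in the same branch can never execute, so the trailing `break` is dead code.

```cpp
#include <cassert>
#include <stdexcept>

// The throw already leaves the switch, so a break after it would be unreachable.
int dispatchOrThrow(int schedule)
{
    switch (schedule)
    {
    case 0:
    {
        return 1; // supported schedule
    }
    default:
    {
        throw std::runtime_error("unsupported schedule");
        // break;  // unreachable: control never gets past the throw
    }
    }
}
```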
51-58: Wrap case body in braces for compound statement.

Per coding guidelines, the body of a `switch` case should be a compound (brace-delimited) statement. The `using` declaration inside the `case` also requires a scope to avoid potential issues with variable declarations crossing case labels.

🔎 Suggested fix
```diff
 case tkc::EpilogueScheduleType::AUTO:
+{
     // TODO: use heuristics to select the epilogue schedule, depending on the CTA shape and cluster shape
     using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecialized1Sm;
     sm100_generic_mixed_gemm_kernelLauncher<ActivationType, WeightType, ScaleZeroType, BiasType, OutputType, QuantOp,
         EpilogueTag, CTAShape, ClusterShape, MainloopSchedule, EpilogueSchedule>(A, B, weight_scales,
         weight_zero_points, biases, alpha, C, m, n, k, group_size, gemm_config, workspace, workspace_bytes, stream,
         occupancy);
     break;
+}
```
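The scoping issue can be demonstrated in isolation (a hypothetical sketch, not the PR's code): without braces, a case-local declaration with an initializer would be jumped over by the later case labels, which C++ rejects; braces give the declaration its own scope.

```cpp
#include <cassert>

int pickTile(int mode)
{
    switch (mode)
    {
    case 0:
    {
        // Brace-delimited body: this declaration is scoped to the case.
        // Without the braces, `case 1:` below would jump past its
        // initialization and the compiler would reject the switch.
        int const tileM = 128;
        return tileM;
    }
    case 1:
    {
        return 64;
    }
    default:
    {
        return -1;
    }
    }
}
```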
78-84: Wrap case body in braces.

Same issue as the epilogue dispatcher: the `using` declaration requires a scope.

🔎 Suggested fix
```diff
 case tkc::MainloopScheduleType::AUTO:
+{
     // TODO: use heuristics to select the mainloop schedule, depending on the CTA shape and cluster shape
     using MainloopSchedule = cutlass::gemm::KernelTmaWarpSpecialized1SmSm100;
     sm100_dispatch_epilogue_schedules<ActivationType, WeightType, ScaleZeroType, BiasType, OutputType, QuantOp,
         EpilogueTag, CTAShape, ClusterShape, MainloopSchedule>(A, B, weight_scales, weight_zero_points, biases, alpha,
         C, m, n, k, group_size, gemm_config, workspace, workspace_bytes, stream, occupancy);
     break;
+}
```
256-259: Consider documenting the dummy scale layout dimensions.

The magic numbers `128` and `64` in `VoidShapeScale` are used to create a placeholder when scaling is disabled. A brief comment explaining these choices would improve maintainability.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/quantization/functional.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: `some_file.py`
Python classes should use PascalCase naming: `class SomeClass`
Python functions and methods should use snake_case naming: `def my_awesome_function():`
Python local variables should use snake_case naming: `my_variable = ...`
Python variable names that start with a number should be prefixed with 'k': `k_99th_percentile = ...`
Python global variables should use upper snake_case with prefix 'G': `G_MY_GLOBAL = ...`
Python constants should use upper snake_case naming: `MY_CONSTANT = ...`
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/quantization/functional.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- tensorrt_llm/quantization/functional.py
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
**/*.{cpp,h,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu,cuh}: Closing braces of namespaces should have a comment saying the namespace it closes: `} // namespace foo`
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
For naming of constants in C++, follow the naming section conventions
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization in C++
Use the Allman indentation style in C++
Put the semicolon for an empty `for` or `while` loop in a new line in C++
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements) in C++
If and else should always be followed by brace-delimited statements, even if empty or a single statement in C++
C++ filenames should use camel case with first letter lowercase: `thisIsASubDir` and `thisIsAFilename.cpp`
All files involved in the compilation of a compilation target (.exe/.so) must have filenames that are case-insensitive unique in C++
All types (including class names) in C++ should use camel case with uppercase first letter: `FooBarClass`
Local variables, methods and namespaces in C++ should use camel case with first letter lowercase: `localFooBar`
Non-magic-number global variables that are non-static and not defined in anonymous namespace in C++ should use camel case prefixed by a lower case 'g': `gDontUseGlobalFoos`
Non-magic-number global variables that are static or defined in an anonymous namespace in C++ should use camel case prefixed by a lower case 's': `sMutableStaticGlobal`
Locally visible static variables in C++ should use camel case with lowercase prefix 's' as the first letter: `static std::once_flag sFlag;`
Public, private and protected class member variables in C++ should use camel case prefi...
Files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
**/*.h
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.h: Use a preprocessor guard in C++ header files with the format `TRTLLM_<FILENAME>_H` derived from the filename in all caps
The preprocessor guard name in C++ must have prefix `TRTLLM_` followed by the filename, all in caps. Only use the file name, not directory names
Do not use prefix with underscore in C++ preprocessor guard symbols as such symbols are reserved in C++ standard for compilers or implementation
Do not use trailing underscore in C++ preprocessor guard symbols (unlike Google C++ guideline)
Files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
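The `TRTLLM_<FILENAME>_H` guard convention from the header guidelines above can be illustrated with a minimal sketch for a hypothetical header `fooBar.h` (file name and function are ours, for demonstration only):

```cpp
#include <cassert>

// Guard symbol: TRTLLM_ prefix + file name in caps, no leading or trailing
// underscore, and no directory names in the symbol.
#ifndef TRTLLM_FOOBAR_H
#define TRTLLM_FOOBAR_H

inline int fooBarVersion()
{
    return 1;
}

#endif // TRTLLM_FOOBAR_H
```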
🧠 Learnings (20)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp
📚 Learning: 2025-08-21T02:41:10.565Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h:141-145
Timestamp: 2025-08-21T02:41:10.565Z
Learning: In TensorRT-LLM MOE GEMM kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h), the stride_act and stride_weight pointers in TmaWarpSpecializedGroupedGemmInput are intentionally declared as void* rather than typed pointers because the actual stride type is determined at runtime based on factors like the swap_ab flag and layout decisions. This runtime type determination makes compile-time type safety impossible, so void* is the correct approach.
Applied to files:
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
📚 Learning: 2025-08-08T22:03:28.403Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h:152-154
Timestamp: 2025-08-08T22:03:28.403Z
Learning: In CUTLASS, `TagToStrideC_t` template is defined for both pointer types (e.g., `Layout*`) and non-pointer types (e.g., `Layout`). When used with pointer types, it's often wrapped with `std::remove_pointer_t`, while non-pointer usage is direct. Both `cutlass::detail::TagToStrideC_t` and `cutlass::gemm::TagToStrideC_t` support both forms.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
🧬 Code graph analysis (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h (3)
- tensorrt_llm (29-115)
- kernels (31-91)
- cutlass_kernels_oss (33-65)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.h (1)
- cutlass_kernels_oss (25-37)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h (3)
- tensorrt_llm (21-39)
- kernels (23-38)
- cutlass_kernels_oss (25-37)
🪛 Ruff (0.14.8)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
821-829: DataType may be undefined, or defined from star imports (30 occurrences across these lines)
(F405)
858-858: Consider (*cta_shape_mn, cta_shape_k) instead of concatenation
Replace with (*cta_shape_mn, cta_shape_k)
(RUF005)
861-861: KernelScheduleType may be undefined, or defined from star imports
(F405)
862-862: EpilogueScheduleType may be undefined, or defined from star imports
(F405)
864-864: GemmKind may be undefined, or defined from star imports
(F405)
954-954: GemmKind may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (32)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)
411-420: LGTM: Tile configuration expansion looks correct. The modifications to the `tile_configs` vector expand the candidate set for dynamic cluster shapes. The syntax and structure follow the existing pattern correctly.
482-505: LGTM: SM100 weight-only configuration implementation follows established patterns. The new WEIGHT_ONLY branch correctly implements candidate configuration generation for SM100 architecture:
- Selects appropriate tile configurations (64x128x128B and 128x128x128B)
- Follows the same pattern as SM90 implementation (lines 358-367)
- Includes CUDA kernel profiler configuration for tactic selection
- Variable naming and structure adhere to coding guidelines
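As a rough illustration of the candidate generation described above, the heuristic can be thought of as crossing tile shapes with cluster shapes; only the two tile names come from the review, while the cluster names below are assumed placeholders:

```python
# Illustrative sketch only, not the actual heuristic code.
# The two tile shapes are from the review above; the cluster shapes
# are assumed placeholders for the dynamic-cluster candidates.
tiles = ["CtaShape64x128x128B", "CtaShape128x128x128B"]
clusters = ["ClusterShape_1x1x1", "ClusterShape_2x1x1"]  # assumed
candidate_configs = [(tile, cluster) for tile in tiles for cluster in clusters]
```

Each resulting pair would then be profiled during tactic selection.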
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu (1)
1-29: LGTM! New dispatcher instantiation for FP16 input with per-channel Int8 weights follows the established pattern. Copyright header, namespace structure with closing comments, and macro usage all conform to project conventions.
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2)
1322-1322: LGTM – SM100 support enabled. Bumping `MAX_SUPPORTED_SM_VERSION` to 100 aligns with the PR's objective to add Blackwell (SM100) support for weight-only/mixed-input GEMM operations.
1355-1358: LGTM – improved error message. The updated message correctly reflects the broader capability coverage (W4A16/W4A8) for this GEMM path.
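A minimal sketch of what such a version gate amounts to; the constant name is taken from the review, while the helper name is hypothetical:

```python
MAX_SUPPORTED_SM_VERSION = 100  # constant name from the review comment

def gemm_path_supported(sm_version: int) -> bool:
    # Hypothetical helper: the op rejects SM versions above the maximum,
    # so bumping the constant to 100 admits Blackwell.
    return sm_version <= MAX_SUPPORTED_SM_VERSION
```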
tensorrt_llm/quantization/functional.py (1)
993-998: LGTM – row interleave correctly disabled for SM100+. The added `sm_ < 100` condition correctly disables the row permutation for Blackwell, which uses a different weight layout. This aligns with the SM100 weight-only kernel requirements.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu (1)
1-29: LGTM! New dispatcher instantiation for BF16 input with per-channel Int4 weights follows the established pattern and mirrors the parallel FP16Int8 dispatcher file structure.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu (1)
1-32: LGTM! The file correctly instantiates weight-only CUDA dispatchers for FP16Int4Groupwise with both KTile=64 and KTile=128 configurations. The copyright header, include, namespace structure, and closing comments all follow the coding guidelines.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu (1)
1-32: LGTM! The file correctly instantiates weight-only CUDA dispatchers for BF16Int4Groupwise with both KTile=64 and KTile=128 configurations. Structure and conventions match the FP16 variant appropriately.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu (1)
1-29: LGTM! The file correctly instantiates the weight-only CUDA dispatcher for BF16Int8PerChannel with KTile=64. The absence of KTile=128 is consistent with the w4a8 (4-bit weight, 8-bit activation) pattern seen in other files where KTile=128 is specifically for Int4 groupwise kernels.
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp (1)
137-144: Verify arch > 100 fallback to Sm80 layout is intentional. The logic uses Sm100 for `arch == 100` but falls back to Sm80 for `arch > 100`. Combined with the earlier code that forces `arch = 80` for `arch >= 120` (GB20x), this means any hypothetical architectures 101-119 would use Sm80 layouts. Please confirm this fallback behavior is intentional.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h (1)
86-98: LGTM! The SM100 dispatch branch correctly uses `ColumnMajor` layout with `ConverterInterleave=false`, which aligns with the preprocessing changes in `cutlass_preprocessors.cpp` that skip interleaving and bias-interleave for arch 100. All expected kernel type combinations (FP16/BF16 × Int4/Int8 × Groupwise/PerChannel) are covered.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu (1)
1-29: LGTM! The dispatcher instantiation file is correctly structured with:
- Proper 2025 copyright header
- Correct include for the dispatcher header
- Proper namespace hierarchy with closing comments
- Clean macro instantiation for FP16Int4PerChannel with ColumnMajor layout
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp (1)
1-42: LGTM! The header follows CUTLASS extension conventions correctly:
- Uses `#pragma once` as per CUTLASS style (per learnings about CUTLASS extensions)
- Primary template with `static_assert(dependent_false<...>)` is the standard CUTLASS pattern to require specialization
- Including the specialization file after the primary template declaration is the correct approach
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp (1)
57-59: Consider the scope of `using namespace cute`. While `using namespace cute;` at namespace scope is common in CUTLASS code, it's worth noting this pulls all `cute` symbols into `cutlass::gemm::collective`. This is acceptable for CUTLASS extensions following upstream conventions.
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h (3)
43-43: LGTM! The include for the new SM100 template header is correctly placed alongside the SM90 template.
462-467: LGTM! The SM100 dispatch branch correctly routes to `sm100_dispatch_gemm_to_cutlass` for the Blackwell architecture path.
547-549: LGTM! The config selection correctly uses BLACKWELL for SM100+ and falls back to HOPPER otherwise. This is the expected behavior for selecting candidate configurations.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu (1)
1-29: LGTM! The dispatcher instantiation file follows the established pattern correctly with:
- Proper 2025 copyright header
- Correct include for the dispatcher header
- Proper namespace hierarchy with closing comments
- Clean macro instantiation for FP16Int8Groupwise with ColumnMajor layout
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu (1)
1-29: LGTM! The file correctly instantiates the weight-only CUDA dispatcher for BF16/Int8 groupwise configuration. Copyright header, namespace structure, and closing comments all follow the coding guidelines.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp (1)
1-42: LGTM! The collective builder template follows CUTLASS extension conventions: `#pragma once` header guard (per CUTLASS style), static_assert fallback for unsupported parameter combinations, and inl file inclusion for specializations.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
944-960: LGTM! The SM100 launcher inl file path and inl_map entry follow the established pattern for SM90 integration.
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl (1)
61-274: SM100 launcher implementation is well-structured. The implementation correctly:
- Uses SFINAE to filter unsupported configurations
- Handles type remapping for SM100 weight layouts
- Configures collective builder with proper transpose trick
- Validates quantization parameters at runtime
- Provides descriptive error messages
The mainloop and epilogue schedule selection based on cluster shape (1SM vs 2SM) is appropriately handled.
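The runtime validation of quantization parameters mentioned above amounts to precondition checks of roughly this shape; this is a hypothetical sketch, and the actual checks and error messages live in the `.inl` file:

```python
def validate_groupwise_params(k: int, group_size: int) -> None:
    # Hypothetical launcher-style precondition: the quantization group
    # size must be positive and evenly divide the K dimension.
    if group_size <= 0 or k % group_size != 0:
        raise ValueError(
            f"group_size={group_size} must be positive and divide K={k}")

validate_groupwise_params(4096, 128)  # passes silently
```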
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h (3)
93-115: LGTM with minor style suggestion. The dispatch logic is correct. For consistency with the coding guidelines, consider wrapping the case body in braces. The TODO for additional cluster shapes is noted.
117-149: LGTM. The tile size dispatch logic is straightforward. The `Ktile` calculation correctly derives the K-dimension tile count from a 128-byte tile width.
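The derivation is simple byte arithmetic: a 128-byte-wide tile covers `128 * 8 / element_bits` elements along K. A sketch of that arithmetic (illustrative, not the template code):

```python
def k_elems_per_tile(element_bits: int, tile_width_bytes: int = 128) -> int:
    # Number of K elements covered by one 128-byte-wide tile row.
    return tile_width_bytes * 8 // element_bits

assert k_elems_per_tile(16) == 64    # fp16 / bf16
assert k_elems_per_tile(8) == 128    # int8
assert k_elems_per_tile(4) == 256    # int4
```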
151-153: Namespace closing comments are correct.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl (6)
1-31: Copyright and header guard follow CUTLASS conventions. The BSD-3-Clause license and `#pragma once` are appropriate for CUTLASS extension files. Based on learnings, CUTLASS extensions follow CUTLASS coding style conventions.
45-128: Stage count computation logic looks correct. The two overloads handle explicit stage count vs. auto-carveout appropriately. The auto-carveout version correctly:
- Accounts for TMEM layout constraints (line 77)
- Calculates pipeline storage requirements (lines 89-111)
- Derives stage counts from available capacity (lines 113-123)
- Validates minimum stage requirements (line 125-126)
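The auto-carveout path boils down to dividing the remaining shared memory by the per-stage footprint and enforcing a minimum. A simplified model (the names and example figures are illustrative; the real builder also accounts for pipeline-barrier and TMEM storage):

```python
def derive_stage_count(smem_capacity_kib: int, carveout_kib: int,
                       per_stage_kib: int, min_stages: int = 2) -> int:
    # Illustrative: how many pipeline stages fit after reserving the
    # epilogue/descriptor carveout from shared-memory capacity.
    stages = (smem_capacity_kib - carveout_kib) // per_stage_kib
    if stages < min_stages:
        raise ValueError("shared memory too small for the minimum stage count")
    return stages

# e.g. 228 KiB SMEM, 32 KiB carveout, 48 KiB per stage -> 4 stages
```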
173-188: Static assertions provide good guardrails. The assertions at lines 177, 180-182, and 188 correctly enforce the invariants for mixed-input weight-only GEMM:
- Input widths must differ (mixed input)
- Either A or B must be a tuple, or widths must differ
- A must be the transformed operand
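The first of these invariants can be expressed as a one-line predicate (hypothetical helper for illustration only):

```python
def is_mixed_input(a_bits: int, b_bits: int) -> bool:
    # Mixed-input weight-only GEMM requires operands of different widths,
    # e.g. fp16 activations (16 bits) with int4 weights (4 bits).
    return a_bits != b_bits

assert is_mixed_input(16, 4)        # fp16 x int4: valid mixed input
assert not is_mixed_input(16, 16)   # equal widths: not mixed input
```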
265-291: SMEM carveout accounting is comprehensive. The breakdown correctly accounts for all pipeline and descriptor storage requirements. The calculations follow standard CUTLASS resource accounting patterns.
306-314: DispatchPolicy and CollectiveOp wiring looks correct. The `MainloopSm100TmaUmmaWarpSpecializedMixedInput` dispatch policy is correctly parameterized with the computed stage counts, and `CollectiveMmaSm100WeightOnly` aggregates all the layout, copy, and element configurations.
316-316: Namespace closing comment is correct.
Resolved review threads on:
- ...nsions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl (outdated)
- ...ensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp (3 threads, 2 outdated)
- ...tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h (outdated)
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
|
/bot run |
|
PR_Github #29395 [ run ] triggered by Bot. Commit: |
|
PR_Github #29395 [ run ] completed with state
|
Signed-off-by: Cheng Hang <chang@nvidia.com>
|
/bot run |
|
PR_Github #29435 [ run ] triggered by Bot. Commit: |
|
PR_Github #29435 [ run ] completed with state
|
Signed-off-by: Cheng Hang <chang@nvidia.com>
|
/bot run |
|
PR_Github #30160 [ run ] triggered by Bot. Commit: |
|
PR_Github #30160 [ run ] completed with state
|
|
/bot run |
|
/bot cancel |
|
/bot kill |
|
/bot run |
|
PR_Github #30183 [ kill ] triggered by Bot. Commit: |
|
PR_Github #30184 [ run ] triggered by Bot. Commit: |
|
PR_Github #30183 [ kill ] completed with state |
|
PR_Github #30182 [ run ] completed with state |
|
PR_Github #30184 [ run ] completed with state
|
|
/bot run |
|
PR_Github #30191 [ run ] triggered by Bot. Commit: |
|
PR_Github #30191 [ run ] completed with state
|
Signed-off-by: Cheng Hang <chang@nvidia.com>
|
/bot run |
|
PR_Github #30239 [ run ] triggered by Bot. Commit: |
|
PR_Github #30239 [ run ] completed with state
|
Signed-off-by: Cheng Hang <chang@nvidia.com>
|
/bot run |
|
PR_Github #30267 [ run ] triggered by Bot. Commit: |
|
PR_Github #30267 [ run ] completed with state
|
|
/bot run |
|
PR_Github #30279 [ run ] triggered by Bot. Commit: |
|
PR_Github #30279 [ run ] completed with state |
Shixiaowei02
left a comment
Approved for the doc changes.
Signed-off-by: Cheng Hang <chang@nvidia.com> Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
Summary by CodeRabbit
Release Notes
New Features
Performance
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.