[None][feat] sm100 weight-only kernel #10190
Conversation
Signed-off-by: Cheng Hang <chang@nvidia.com>
📝 Walkthrough

This PR adds SM100 (Blackwell) GPU architecture support to TensorRT LLM's CUTLASS-based mixed-precision weight-only GEMM pipeline. It introduces SM100-specific collective builders, MMA kernels, dispatchers, launchers, and heuristics, while adjusting architecture gating logic across existing paths. The changes enable weight-only quantized operations on Blackwell hardware through a new CUTLASS collective workflow.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Areas requiring extra attention:
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 12
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
2-2: Update the copyright year to include 2025.

The copyright header should reflect the year of the latest meaningful modification.
As per coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification.
🔎 Proposed fix
```diff
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
```

cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h (1)
2-2: Update copyright year to 2025.

The copyright header should reflect the year of the latest meaningful modification.
🔎 Proposed fix
```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
185-189: Dead code: unreachable branch.

The condition `elif self.arch > 100` on line 187 is unreachable because `if self.arch >= 90` on line 185 already handles all architectures ≥ 90, including 100 and above. This branch will never execute.

🔎 Proposed fix: remove dead code
```diff
 if self.arch >= 90:
     return kernel_prefix + hopper_suffix
-elif self.arch > 100:
-    raise ValueError(f"SM{self.arch} not supported yet.")
 return kernel_prefix
```
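The dead branch is easy to see in isolation. Below is a minimal C++ rendition of the same control flow (function name and return values are illustrative, not from the PR): any `arch >= 90` returns on the first test, so the later `arch > 100` test can never be reached.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Mirrors the reviewed control flow: every arch >= 90 returns early,
// so the second condition below is dead code.
std::string kernelSuffix(int arch)
{
    if (arch >= 90)
    {
        return "_sm90"; // taken for SM90, SM100, and anything above
    }
    else if (arch > 100) // unreachable: arch > 100 implies arch >= 90
    {
        throw std::runtime_error("unsupported arch");
    }
    return "";
}
```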
🧹 Nitpick comments (11)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (1)
482-505: Consider extracting CUDA kernel profiler config creation.

The pattern of creating a CUDA kernel profiler configuration is repeated across three architecture paths:
- SM90 (lines 358-367)
- SM100 (lines 482-505)
- Older architectures (lines 619-627)
While each architecture may have different tile configuration requirements, the CUDA kernel creation logic is similar. Consider extracting this into a helper function to reduce duplication and improve maintainability.
Example refactor approach
```cpp
// Helper function to add CUDA kernel profiler config
template <typename TileConfigType>
void add_cuda_kernel_profiler_config(
    std::vector<CutlassGemmConfig>& candidate_configs, TileConfigType tile_config,
    /* other architecture-specific params */)
{
    CutlassGemmConfig cudaKernelConfig(/* construct based on arch */);
    cudaKernelConfig.enableCudaKernel = true;
    candidate_configs.push_back(cudaKernelConfig);
}
```

This would centralize the CUDA kernel profiler logic while allowing architecture-specific customization.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h (1)
536-540: LGTM – SM100 correctly routed away from legacy kernel.

The change from `>= 1000` to `> 1000` intentionally excludes SM100 (Blackwell) from this legacy CUTLASS 2.x path, routing it instead to the new SM100 weight-only collective/kernel infrastructure being added in this PR. SM101+ falls back to SM80 as a conservative default.

Consider updating the comment on line 539 to explicitly mention SM100:
- CUTLASS_NOT_IMPLEMENTED(); // Don't compile these for Hopper or later. Use CUTLASS 3.x kernels. + CUTLASS_NOT_IMPLEMENTED(); // Don't compile these for Hopper (SM90) or Blackwell (SM100). Use CUTLASS 3.x kernels.cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp (2)
1-15: Update copyright year to include 2025.

As per coding guidelines, the NVIDIA copyright header should include the year of its latest meaningful modification. This file is being modified in 2025.
🔎 Proposed fix
```diff
 /*
- * Copyright (c) 2020-2023, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
  *
```
598-622: Consider using `TLLM_LOG_TRACE` instead of `TLLM_LOG_INFO` for consistency.

The existing preprocessing functions in this file use `TLLM_LOG_TRACE` (e.g., lines 190, 255, 393, 494, 516), but these new logging statements use `TLLM_LOG_INFO`. Using `TLLM_LOG_INFO` for each preprocessing step could be verbose in production, especially for models with many weight tensors. Consider `TLLM_LOG_TRACE` or `TLLM_LOG_DEBUG` for consistency with the established pattern.

🔎 Proposed fix
```diff
 if (details.uses_imma_ldsm)
 {
-    TLLM_LOG_INFO("permute_B_rows_for_mixed_gemm");
+    TLLM_LOG_TRACE("permute_B_rows_for_mixed_gemm");
     permute_B_rows_for_mixed_gemm(dst_buf.data(), src_buf.data(), shape, quant_type, arch);
     src_buf.swap(dst_buf);
 }
 if (details.layoutB == LayoutDetails::Layout::COLUMN_MAJOR)
 {
-    TLLM_LOG_INFO("subbyte_transpose");
+    TLLM_LOG_TRACE("subbyte_transpose");
     subbyte_transpose(dst_buf.data(), src_buf.data(), shape, quant_type);
     src_buf.swap(dst_buf);
 }
 if (details.columns_interleaved > 1 && (arch != 90 && arch != 100))
 {
-    TLLM_LOG_INFO("interleave_column_major_tensor");
+    TLLM_LOG_TRACE("interleave_column_major_tensor");
     interleave_column_major_tensor(dst_buf.data(), src_buf.data(), shape, quant_type, details);
     src_buf.swap(dst_buf);
 }
 if (arch != 100)
 {
-    TLLM_LOG_INFO("add_bias_and_interleave_quantized_tensor_inplace");
+    TLLM_LOG_TRACE("add_bias_and_interleave_quantized_tensor_inplace");
     add_bias_and_interleave_quantized_tensor_inplace(src_buf.data(), num_elts, quant_type);
 }
-TLLM_LOG_INFO("copy");
+TLLM_LOG_TRACE("copy");
 std::copy(src_buf.begin(), src_buf.end(), preprocessed_quantized_weight);
```
1-15: Update copyright year to include 2025.

As per coding guidelines, the NVIDIA copyright header should include the year of its latest meaningful modification.
🔎 Proposed fix
```diff
 /*
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION. All rights reserved.
  *
```
242-246: Inconsistent error output: uses `std::cout` alongside logger.

Line 244 uses `std::cout` for error output, while the rest of the file uses `TLLM_LOG_ERROR`. Consider using consistent logging.

🔎 Proposed fix
```diff
 if (can_implement != cutlass::Status::kSuccess)
 {
     std::string err_msg = "fpA_intB cutlass kernel will fail for params. Error: "
         + std::string(cutlass::cutlassGetStatusString(can_implement));
-    std::cout << err_msg << std::endl;
+    TLLM_LOG_ERROR(err_msg);
     throw std::runtime_error("[TensorRT LLM Error][fpA_intB Runner] " + err_msg);
 }
```
38-38: Avoid `using namespace` directive in header files.

Placing `using namespace cute;` in a header file pollutes the namespace of every translation unit that includes this header. This can lead to name collisions and ambiguity.

🔎 Suggested fix
```diff
-using namespace cute;
```
Then qualify `cute::` types explicitly where needed (e.g., `cute::Shape`, `cute::Int`, `cute::_1`, `cute::_64`, `cute::_128`).
59-64: Unreachable `break` after `throw`.

The `break` statement on line 63 is unreachable since `throw` already transfers control. The same pattern appears in the other dispatch functions (lines 89, 113, 147). While harmless, removing these dead statements improves clarity.

🔎 Suggested fix for this switch case
```diff
 default:
     throw std::runtime_error(
         "[TensorRT LLM Error][fpA_intB][sm100_dispatch_epilogue_schedules] Unsupported epilogue schedule for mixed "
         "type GEMM.");
-    break;
 }
```
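For illustration, here is a minimal standalone example of the same pattern (function and strings are ours, not from the PR): a statement placed after `throw` in the same branch can never execute, so the trailing `break` is dead code.

```cpp
#include <cassert>
#include <stdexcept>

// The throw already leaves the switch, so a break after it would be unreachable.
int dispatchOrThrow(int schedule)
{
    switch (schedule)
    {
    case 0:
    {
        return 1; // supported schedule
    }
    default:
    {
        throw std::runtime_error("unsupported schedule");
        // break;  // unreachable: control never gets past the throw
    }
    }
}
```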
51-58: Wrap case body in braces for compound statement.

Per coding guidelines, the body of a `switch` case should be a compound (brace-delimited) statement. The `using` declaration inside the `case` also requires a scope to avoid potential issues with variable declarations crossing case labels.

🔎 Suggested fix
```diff
 case tkc::EpilogueScheduleType::AUTO:
+{
     // TODO: use heuristics to select the epilogue schedule, depending on the CTA shape and cluster shape
     using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecialized1Sm;
     sm100_generic_mixed_gemm_kernelLauncher<ActivationType, WeightType, ScaleZeroType, BiasType, OutputType, QuantOp,
         EpilogueTag, CTAShape, ClusterShape, MainloopSchedule, EpilogueSchedule>(A, B, weight_scales,
         weight_zero_points, biases, alpha, C, m, n, k, group_size, gemm_config, workspace, workspace_bytes, stream,
         occupancy);
     break;
+}
```
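The scoping issue can be demonstrated in isolation (a hypothetical sketch, not the PR's code): without braces, a case-local declaration with an initializer would be jumped over by the later case labels, which C++ rejects; braces give the declaration its own scope.

```cpp
#include <cassert>

int pickTile(int mode)
{
    switch (mode)
    {
    case 0:
    {
        // Brace-delimited body: this declaration is scoped to the case.
        // Without the braces, `case 1:` below would jump past its
        // initialization and the compiler would reject the switch.
        int const tileM = 128;
        return tileM;
    }
    case 1:
    {
        return 64;
    }
    default:
    {
        return -1;
    }
    }
}
```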
78-84: Wrap case body in braces.

Same issue as the epilogue dispatcher: the `using` declaration requires a scope.

🔎 Suggested fix
```diff
 case tkc::MainloopScheduleType::AUTO:
+{
     // TODO: use heuristics to select the mainloop schedule, depending on the CTA shape and cluster shape
     using MainloopSchedule = cutlass::gemm::KernelTmaWarpSpecialized1SmSm100;
     sm100_dispatch_epilogue_schedules<ActivationType, WeightType, ScaleZeroType, BiasType, OutputType, QuantOp,
         EpilogueTag, CTAShape, ClusterShape, MainloopSchedule>(A, B, weight_scales, weight_zero_points, biases, alpha,
         C, m, n, k, group_size, gemm_config, workspace, workspace_bytes, stream, occupancy);
     break;
+}
```
256-259: Consider documenting the dummy scale layout dimensions.

The magic numbers `128` and `64` in `VoidShapeScale` are used to create a placeholder when scaling is disabled. A brief comment explaining these choices would improve maintainability.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/quantization/functional.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: `some_file.py`
Python classes should use PascalCase naming: `class SomeClass`
Python functions and methods should use snake_case naming: `def my_awesome_function():`
Python local variables should use snake_case naming: `my_variable = ...`
Python variable names that start with a number should be prefixed with 'k': `k_99th_percentile = ...`
Python global variables should use upper snake_case with prefix 'G': `G_MY_GLOBAL = ...`
Python constants should use upper snake_case naming: `MY_CONSTANT = ...`
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/quantization/functional.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- tensorrt_llm/quantization/functional.py
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
**/*.{cpp,h,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu,cuh}: Closing braces of namespaces should have a comment saying the namespace it closes: `} // namespace foo`
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
For naming of constants in C++, follow the naming section conventions
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization in C++
Use the Allman indentation style in C++
Put the semicolon for an empty `for` or `while` loop in a new line in C++
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements) in C++
If and else should always be followed by brace-delimited statements, even if empty or a single statement in C++
C++ filenames should use camel case with first letter lowercase: `thisIsASubDir` and `thisIsAFilename.cpp`
All files involved in the compilation of a compilation target (.exe/.so) must have filenames that are case-insensitive unique in C++
All types (including class names) in C++ should use camel case with uppercase first letter: `FooBarClass`
Local variables, methods and namespaces in C++ should use camel case with first letter lowercase: `localFooBar`
Non-magic-number global variables that are non-static and not defined in anonymous namespace in C++ should use camel case prefixed by a lower case 'g': `gDontUseGlobalFoos`
Non-magic-number global variables that are static or defined in an anonymous namespace in C++ should use camel case prefixed by a lower case 's': `sMutableStaticGlobal`
Locally visible static variables in C++ should use camel case with lowercase prefix 's' as the first letter: `static std::once_flag sFlag;`
Public, private and protected class member variables in C++ should use camel case prefi...
Files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
**/*.h
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.h: Use a preprocessor guard in C++ header files with the format `TRTLLM_<FILENAME>_H` derived from the filename in all caps
The preprocessor guard name in C++ must have prefix `TRTLLM_` followed by the filename, all in caps. Only use the file name, not directory names
Do not use prefix with underscore in C++ preprocessor guard symbols as such symbols are reserved in C++ standard for compilers or implementation
Do not use trailing underscore in C++ preprocessor guard symbols (unlike Google C++ guideline)
Files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
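The `TRTLLM_<FILENAME>_H` guard convention from the header guidelines above can be illustrated with a minimal sketch for a hypothetical header `fooBar.h` (file name and function are ours, for demonstration only):

```cpp
#include <cassert>

// Guard symbol: TRTLLM_ prefix + file name in caps, no leading or trailing
// underscore, and no directory names in the symbol.
#ifndef TRTLLM_FOOBAR_H
#define TRTLLM_FOOBAR_H

inline int fooBarVersion()
{
    return 1;
}

#endif // TRTLLM_FOOBAR_H
```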
🧠 Learnings (20)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-12-12T10:07:31.564Z
Learnt from: lirundong
Repo: NVIDIA/TensorRT-LLM PR: 9725
File: tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py:110-178
Timestamp: 2025-12-12T10:07:31.564Z
Learning: In PyTorch custom operators registered with torch.library.custom_op, mutable operators that return None and specify mutates_args do not require a register_fake decorator. Mutation tracking is handled automatically without needing a FakeTensor kernel. This applies to Python custom op definitions in tensorrt_llm/_torch/custom_ops that use mutates_args and return None; verify you are not relying on register_fake in these cases.
Applied to files:
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
📚 Learning: 2025-08-08T05:06:31.596Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.596Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl
📚 Learning: 2025-08-21T21:48:35.135Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:399-417
Timestamp: 2025-08-21T21:48:35.135Z
Learning: CUTLASS extensions in TensorRT-LLM (located under cpp/tensorrt_llm/cutlass_extensions/) are designed to integrate with and extend functionality in the external CUTLASS repository. When analyzing these extensions, their consumers and functionality wiring may exist in the CUTLASS codebase rather than within TensorRT-LLM itself.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-08-08T05:10:38.906Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:0-0
Timestamp: 2025-08-08T05:10:38.906Z
Learning: The ScaledAccPerRowBiasPerColScaleScatter fusion in CUTLASS extensions (cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp) is specifically designed for per-column scaling factors only, so it uses a fixed Stride<_0,_1,int64_t> rather than conditional stride logic.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/fpA_intB_gemm.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp
📚 Learning: 2025-08-21T02:41:10.565Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h:141-145
Timestamp: 2025-08-21T02:41:10.565Z
Learning: In TensorRT-LLM MOE GEMM kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_gemm_kernels.h), the stride_act and stride_weight pointers in TmaWarpSpecializedGroupedGemmInput are intentionally declared as void* rather than typed pointers because the actual stride type is determined at runtime based on factors like the swap_ab flag and layout decisions. This runtime type determination makes compile-time type safety impossible, so void* is the correct approach.
Applied to files:
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/config.cu), std::ostringstream is used but <sstream> doesn't need to be explicitly included because it's provided transitively through other headers like tensorrt_llm/common/cudaUtils.h or config.h. Local compilation testing confirms this works without the explicit include.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h
- cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
- cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp
- cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
📚 Learning: 2025-09-23T14:58:05.372Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:42-49
Timestamp: 2025-09-23T14:58:05.372Z
Learning: In TensorRT-LLM NCCL device kernels (cpp/tensorrt_llm/kernels/nccl_device/), the token partitioning intentionally uses ceil-like distribution (same token_per_rank for all ranks) to ensure all ranks launch the same number of blocks. This is required for optimal NCCL device API barrier performance, even though it may launch extra blocks for non-existent tokens on later ranks. Runtime bounds checking in the kernel (blockID validation) handles the overshoot cases.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device allreduce implementation (cpp/tensorrt_llm/thop/allreduceOp.cpp), the goto pattern in runNCCLAllReduceDeviceFusion is intentionally used for future extensibility, allowing multiple switch cases to fallback to the default handler. While not aesthetically ideal, this pattern supports adding more fusion cases later that can reuse the same fallback logic.
Applied to files:
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
📚 Learning: 2025-08-08T22:03:28.403Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_gemm_kernels.h:152-154
Timestamp: 2025-08-08T22:03:28.403Z
Learning: In CUTLASS, `TagToStrideC_t` template is defined for both pointer types (e.g., `Layout*`) and non-pointer types (e.g., `Layout`). When used with pointer types, it's often wrapped with `std::remove_pointer_t`, while non-pointer usage is direct. Both `cutlass::detail::TagToStrideC_t` and `cutlass::gemm::TagToStrideC_t` support both forms.
Applied to files:
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h
🧬 Code graph analysis (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h (2)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h (3)
- tensorrt_llm (29-115)
- kernels (31-91)
- cutlass_kernels_oss (33-65)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm90.h (1)
- cutlass_kernels_oss (25-37)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h (1)
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h (3)
- tensorrt_llm (21-39)
- kernels (23-38)
- cutlass_kernels_oss (25-37)
🪛 Ruff (0.14.8)
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py
821-829: DataType may be undefined, or defined from star imports (30 occurrences across these lines)
(F405)
858-858: Consider (*cta_shape_mn, cta_shape_k) instead of concatenation
Replace with (*cta_shape_mn, cta_shape_k)
(RUF005)
861-861: KernelScheduleType may be undefined, or defined from star imports
(F405)
862-862: EpilogueScheduleType may be undefined, or defined from star imports
(F405)
864-864: GemmKind may be undefined, or defined from star imports
(F405)
954-954: GemmKind may be undefined, or defined from star imports
(F405)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (32)
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp (2)
411-420: LGTM: Tile configuration expansion looks correct. The modifications to the `tile_configs` vector expand the candidate set for dynamic cluster shapes. The syntax and structure follow the existing pattern correctly.
482-505: LGTM: SM100 weight-only configuration implementation follows established patterns. The new WEIGHT_ONLY branch correctly implements candidate configuration generation for SM100 architecture:
- Selects appropriate tile configurations (64x128x128B and 128x128x128B)
- Follows the same pattern as SM90 implementation (lines 358-367)
- Includes CUDA kernel profiler configuration for tactic selection
- Variable naming and structure adhere to coding guidelines
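As a rough illustration of the candidate generation described above, the heuristic can be thought of as crossing tile shapes with cluster shapes; only the two tile names come from the review, while the cluster names below are assumed placeholders:

```python
# Illustrative sketch only, not the actual heuristic code.
# The two tile shapes are from the review above; the cluster shapes
# are assumed placeholders for the dynamic-cluster candidates.
tiles = ["CtaShape64x128x128B", "CtaShape128x128x128B"]
clusters = ["ClusterShape_1x1x1", "ClusterShape_2x1x1"]  # assumed
candidate_configs = [(tile, cluster) for tile in tiles for cluster in clusters]
```

Each resulting pair would then be profiled during tactic selection.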
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8PerChannelColumnMajorFalse.cu (1)
1-29: LGTM! New dispatcher instantiation for FP16 input with per-channel Int8 weights follows the established pattern. Copyright header, namespace structure with closing comments, and macro usage all conform to project conventions.
tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (2)
1322-1322: LGTM – SM100 support enabled. Bumping `MAX_SUPPORTED_SM_VERSION` to 100 aligns with the PR's objective to add Blackwell (SM100) support for weight-only/mixed-input GEMM operations.
1355-1358: LGTM – improved error message. The updated message correctly reflects the broader capability coverage (W4A16/W4A8) for this GEMM path.
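A minimal sketch of what such a version gate amounts to; the constant name is taken from the review, while the helper name is hypothetical:

```python
MAX_SUPPORTED_SM_VERSION = 100  # constant name from the review comment

def gemm_path_supported(sm_version: int) -> bool:
    # Hypothetical helper: the op rejects SM versions above the maximum,
    # so bumping the constant to 100 admits Blackwell.
    return sm_version <= MAX_SUPPORTED_SM_VERSION
```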
tensorrt_llm/quantization/functional.py (1)
993-998: LGTM – row interleave correctly disabled for SM100+. The added `sm_ < 100` condition correctly disables the row permutation for Blackwell, which uses a different weight layout. This aligns with the SM100 weight-only kernel requirements.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4PerChannelColumnMajorFalse.cu (1)
1-29: LGTM! New dispatcher instantiation for BF16 input with per-channel Int4 weights follows the established pattern and mirrors the parallel FP16Int8 dispatcher file structure.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4GroupwiseColumnMajorFalse.cu (1)
1-32: LGTM! The file correctly instantiates weight-only CUDA dispatchers for FP16Int4Groupwise with both KTile=64 and KTile=128 configurations. The copyright header, include, namespace structure, and closing comments all follow the coding guidelines.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int4GroupwiseColumnMajorFalse.cu (1)
1-32: LGTM! The file correctly instantiates weight-only CUDA dispatchers for BF16Int4Groupwise with both KTile=64 and KTile=128 configurations. Structure and conventions match the FP16 variant appropriately.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8PerChannelColumnMajorFalse.cu (1)
1-29: LGTM! The file correctly instantiates the weight-only CUDA dispatcher for BF16Int8PerChannel with KTile=64. The absence of KTile=128 is consistent with the w4a8 (4-bit weight, 8-bit activation) pattern seen in other files where KTile=128 is specifically for Int4 groupwise kernels.
cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_preprocessors.cpp (1)
137-144: Verify arch > 100 fallback to Sm80 layout is intentional. The logic uses Sm100 for `arch == 100` but falls back to Sm80 for `arch > 100`. Combined with the earlier code that forces `arch = 80` for `arch >= 120` (GB20x), this means any hypothetical architectures 101-119 would use Sm80 layouts. Please confirm this fallback behavior is intentional.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelLauncher.h (1)
86-98: LGTM! The SM100 dispatch branch correctly uses `ColumnMajor` layout with `ConverterInterleave=false`, which aligns with the preprocessing changes in `cutlass_preprocessors.cpp` that skip interleaving and bias-interleave for arch 100. All expected kernel type combinations (FP16/BF16 × Int4/Int8 × Groupwise/PerChannel) are covered.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int4PerChannelColumnMajorFalse.cu (1)
1-29: LGTM! The dispatcher instantiation file is correctly structured with:
- Proper 2025 copyright header
- Correct include for the dispatcher header
- Proper namespace hierarchy with closing comments
- Clean macro instantiation for FP16Int4PerChannel with ColumnMajor layout
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_mma_sm100_weightonly.hpp (1)
1-42: LGTM! The header follows CUTLASS extension conventions correctly:
- Uses `#pragma once` as per CUTLASS style (per learnings about CUTLASS extensions)
- Primary template with `static_assert(dependent_false<...>)` is the standard CUTLASS pattern to require specialization
- Including the specialization file after the primary template declaration is the correct approach
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp (1)
57-59: Consider the scope of `using namespace cute`. While `using namespace cute;` at namespace scope is common in CUTLASS code, it's worth noting this pulls all `cute` symbols into `cutlass::gemm::collective`. This is acceptable for CUTLASS extensions following upstream conventions.
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template.h (3)
43-43: LGTM! The include for the new SM100 template header is correctly placed alongside the SM90 template.
462-467: LGTM! The SM100 dispatch branch correctly routes to `sm100_dispatch_gemm_to_cutlass` for the Blackwell architecture path.
547-549: LGTM! The config selection correctly uses BLACKWELL for SM100+ and falls back to HOPPER otherwise. This is the expected behavior for selecting candidate configurations.
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherFp16Int8GroupwiseColumnMajorFalse.cu (1)
1-29: LGTM! The dispatcher instantiation file follows the established pattern correctly with:
- Proper 2025 copyright header
- Correct include for the dispatcher header
- Proper namespace hierarchy with closing comments
- Clean macro instantiation for FP16Int8Groupwise with ColumnMajor layout
cpp/tensorrt_llm/kernels/weightOnlyBatchedGemv/kernelDispatcherBf16Int8GroupwiseColumnMajorFalse.cu (1)
1-29: LGTM! The file correctly instantiates the weight-only CUDA dispatcher for BF16/Int8 groupwise configuration. Copyright header, namespace structure, and closing comments all follow the coding guidelines.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/collective_builder_sm100_weightonly.hpp (1)
1-42: LGTM! The collective builder template follows CUTLASS extension conventions: `#pragma once` header guard (per CUTLASS style), static_assert fallback for unsupported parameter combinations, and inl file inclusion for specializations.
cpp/tensorrt_llm/kernels/cutlass_kernels/python/generate_kernels.py (1)
944-960: LGTM! The SM100 launcher inl file path and inl_map entry follow the established pattern for SM90 integration.
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl (1)
61-274: SM100 launcher implementation is well-structured. The implementation correctly:
- Uses SFINAE to filter unsupported configurations
- Handles type remapping for SM100 weight layouts
- Configures collective builder with proper transpose trick
- Validates quantization parameters at runtime
- Provides descriptive error messages
The mainloop and epilogue schedule selection based on cluster shape (1SM vs 2SM) is appropriately handled.
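The runtime validation of quantization parameters mentioned above amounts to precondition checks of roughly this shape; this is a hypothetical sketch, and the actual checks and error messages live in the `.inl` file:

```python
def validate_groupwise_params(k: int, group_size: int) -> None:
    # Hypothetical launcher-style precondition: the quantization group
    # size must be positive and evenly divide the K dimension.
    if group_size <= 0 or k % group_size != 0:
        raise ValueError(
            f"group_size={group_size} must be positive and divide K={k}")

validate_groupwise_params(4096, 128)  # passes silently
```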
cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/fpA_intB_gemm_template_sm100.h (3)
93-115: LGTM with minor style suggestion. The dispatch logic is correct. For consistency with the coding guidelines, consider wrapping the case body in braces. The TODO for additional cluster shapes is noted.
117-149: LGTM. The tile size dispatch logic is straightforward. The `Ktile` calculation correctly derives the K-dimension tile count from a 128-byte tile width.
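The derivation is simple byte arithmetic: a 128-byte-wide tile covers `128 * 8 / element_bits` elements along K. A sketch of that arithmetic (illustrative, not the template code):

```python
def k_elems_per_tile(element_bits: int, tile_width_bytes: int = 128) -> int:
    # Number of K elements covered by one 128-byte-wide tile row.
    return tile_width_bytes * 8 // element_bits

assert k_elems_per_tile(16) == 64    # fp16 / bf16
assert k_elems_per_tile(8) == 128    # int8
assert k_elems_per_tile(4) == 256    # int4
```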
151-153: Namespace closing comments are correct.
cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl (6)
1-31: Copyright and header guard follow CUTLASS conventions. The BSD-3-Clause license and `#pragma once` are appropriate for CUTLASS extension files. Based on learnings, CUTLASS extensions follow CUTLASS coding style conventions.
45-128: Stage count computation logic looks correct. The two overloads handle explicit stage count vs. auto-carveout appropriately. The auto-carveout version correctly:
- Accounts for TMEM layout constraints (line 77)
- Calculates pipeline storage requirements (lines 89-111)
- Derives stage counts from available capacity (lines 113-123)
- Validates minimum stage requirements (line 125-126)
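The auto-carveout path boils down to dividing the remaining shared memory by the per-stage footprint and enforcing a minimum. A simplified model (the names and example figures are illustrative; the real builder also accounts for pipeline-barrier and TMEM storage):

```python
def derive_stage_count(smem_capacity_kib: int, carveout_kib: int,
                       per_stage_kib: int, min_stages: int = 2) -> int:
    # Illustrative: how many pipeline stages fit after reserving the
    # epilogue/descriptor carveout from shared-memory capacity.
    stages = (smem_capacity_kib - carveout_kib) // per_stage_kib
    if stages < min_stages:
        raise ValueError("shared memory too small for the minimum stage count")
    return stages

# e.g. 228 KiB SMEM, 32 KiB carveout, 48 KiB per stage -> 4 stages
```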
173-188: Static assertions provide good guardrails. The assertions at lines 177, 180-182, and 188 correctly enforce the invariants for mixed-input weight-only GEMM:
- Input widths must differ (mixed input)
- Either A or B must be a tuple, or widths must differ
- A must be the transformed operand
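The first of these invariants can be expressed as a one-line predicate (hypothetical helper for illustration only):

```python
def is_mixed_input(a_bits: int, b_bits: int) -> bool:
    # Mixed-input weight-only GEMM requires operands of different widths,
    # e.g. fp16 activations (16 bits) with int4 weights (4 bits).
    return a_bits != b_bits

assert is_mixed_input(16, 4)        # fp16 x int4: valid mixed input
assert not is_mixed_input(16, 16)   # equal widths: not mixed input
```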
265-291: SMEM carveout accounting is comprehensive. The breakdown correctly accounts for all pipeline and descriptor storage requirements. The calculations follow standard CUTLASS resource accounting patterns.
306-314: DispatchPolicy and CollectiveOp wiring looks correct. The `MainloopSm100TmaUmmaWarpSpecializedMixedInput` dispatch policy is correctly parameterized with the computed stage counts, and `CollectiveMmaSm100WeightOnly` aggregates all the layout, copy, and element configurations.
316-316: Namespace closing comment is correct.
Resolved review threads on:
- ...nsions/include/cutlass_extensions/gemm/collective/builders/sm100_umma_builder_weightonly.inl (outdated)
- ...ensions/include/cutlass_extensions/gemm/collective/sm100_mma_warpspecialized_mixed_input.hpp (3 threads, 2 outdated)
- ...tensorrt_llm/cutlass_extensions/include/cutlass_extensions/gemm/kernel/mixed_gemm_B_layout.h (outdated)
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.h
- cpp/tensorrt_llm/kernels/cutlass_kernels/fpA_intB_gemm/launchers/fpA_intB_launcher_sm100.inl
|
/bot run |
|
PR_Github #29395 [ run ] triggered by Bot. Commit: |
|
PR_Github #29395 [ run ] completed with state
|
Signed-off-by: Cheng Hang <chang@nvidia.com>
|
/bot run |
|
PR_Github #29435 [ run ] triggered by Bot. Commit: |
|
PR_Github #29435 [ run ] completed with state
|
Signed-off-by: Cheng Hang <chang@nvidia.com>
|
/bot run |
|
PR_Github #30160 [ run ] triggered by Bot. Commit: |
|
PR_Github #30160 [ run ] completed with state
|
|
/bot run |
|
/bot cancel |
|
/bot kill |
|
/bot run |
|
PR_Github #30183 [ kill ] triggered by Bot. Commit: |
|
PR_Github #30184 [ run ] triggered by Bot. Commit: |
|
PR_Github #30183 [ kill ] completed with state |
|
PR_Github #30182 [ run ] completed with state |
|
PR_Github #30184 [ run ] completed with state
|
|
/bot run |
|
PR_Github #30191 [ run ] triggered by Bot. Commit: |
|
PR_Github #30191 [ run ] completed with state
|
Signed-off-by: Cheng Hang <chang@nvidia.com>
|
/bot run |
|
PR_Github #30239 [ run ] triggered by Bot. Commit: |
|
PR_Github #30239 [ run ] completed with state
|
Signed-off-by: Cheng Hang <chang@nvidia.com>
|
/bot run |
|
PR_Github #30267 [ run ] triggered by Bot. Commit: |
|
PR_Github #30267 [ run ] completed with state
|
|
/bot run |
|
PR_Github #30279 [ run ] triggered by Bot. Commit: |
|
PR_Github #30279 [ run ] completed with state |
Shixiaowei02
left a comment
Approved for the doc changes.
Signed-off-by: Cheng Hang <chang@nvidia.com> Signed-off-by: Daniil Kulko <kulkodaniil@gmail.com>
Summary by CodeRabbit
Release Notes
New Features
Performance
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.