[None][feat] Initial PR for trtllm-gen attention backend #10784
yihwang-nv merged 19 commits into NVIDIA:main from
Conversation
5d09597 to 0c0c796
/bot run
Caution: Docstrings generation FAILED. An unexpected error occurred while opening a pull request: Reference update failed - https://docs.github.com/rest/git/refs#create-a-reference
## 📝 Walkthrough

Introduces a feature-flag-controlled alternative attention backend (TRTLLM-Gen) that conditionally routes attention operations based on an environment variable. The new backend function is defined with extensive documentation but remains unimplemented (raises NotImplementedError).

## Changes

| Cohort / File(s) | Summary |
|---|---|
| Conditional routing in TrtllmAttentionWrapper: tensorrt_llm/_torch/attention_backend/trtllm.py | Adds module-level feature flag `_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION` (sourced from an environment variable, default "0"). Conditionally imports `trtllm_gen_attention` from `.trtllm_gen`. In `TrtllmAttentionWrapper.run()`, routes attention calls to either `trtllm_gen_attention()` or the existing `thop.attention()` based on the flag state, maintaining argument parity. |
| New attention backend stub: tensorrt_llm/_torch/attention_backend/trtllm_gen.py | Adds a new public `attention()` function with a comprehensive docstring covering prefill/decode phases, paged KV cache, MLA, speculative decoding, and quantization support. The function signature accepts an extensive tensor and parameter list but raises NotImplementedError, marking this as a placeholder awaiting implementation. |

## Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1 — Failed checks (1 warning)
✅ Passed checks (2 passed)
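The stub file described in the walkthrough can be pictured with a minimal sketch. This is hypothetical and heavily abbreviated: the real `attention()` in `trtllm_gen.py` accepts a much longer tensor and parameter list.

```python
# Hypothetical, abbreviated sketch of the placeholder backend described in
# the walkthrough; the real signature accepts many more tensors/parameters.
def attention(q, k, v, **kwargs):
    """Compute attention with the TRTLLM-Gen backend.

    Per the PR's docstring, this is intended to cover prefill/decode
    phases, paged KV cache, MLA, speculative decoding, and quantization,
    but it is currently an unimplemented placeholder.
    """
    raise NotImplementedError("TRTLLM-Gen attention is not implemented yet")


# Calling the stub raises, which is the expected behavior until the
# kernels are wired up.
try:
    attention(None, None, None)
except NotImplementedError as err:
    print(err)  # → TRTLLM-Gen attention is not implemented yet
```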
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/attention_backend/trtllm_gen.py`:
- Around line 1-4: This file is missing the required NVIDIA copyright header;
add the standard NVIDIA copyright block at the very top of trtllm_gen.py (before
any imports) with the appropriate year of latest meaningful modification and
owner information consistent with other TensorRT-LLM source files (include
SPDX/license tag if your project uses it) so the header matches the project
coding guidelines.
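For reference, such a header typically looks like the following. This is a sketch only: the exact year, owner line, and SPDX tag are assumptions and should be copied from the headers used by other TensorRT-LLM source files.

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
# All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```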
🧹 Nitpick comments (2)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
28-34: Global variable naming convention. Per the coding guidelines, Python global variables should use upper snake_case with the prefix `G_` (e.g., `G_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION`). The leading underscore indicates internal use, which is fine, but the `G_` prefix is the project convention for globals.

Suggested naming:

```diff
-# Enable TRTLLM-Gen attention backend via environment variable.
-_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = os.environ.get(
-    "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "0") == "1"
+# Enable TRTLLM-Gen attention backend via environment variable.
+G_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = os.environ.get(
+    "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "0") == "1"

-if _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION:
+if G_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION:
     from .trtllm_gen import attention as trtllm_gen_attention
```

As per coding guidelines, Python global variables should use upper snake_case with the prefix `G_`.
507-588: LGTM - Feature-gated dispatch with functional parity. The conditional dispatch correctly routes to `trtllm_gen_attention` when the feature flag is enabled, with identical arguments passed to both paths. This ensures functional parity once the TRTLLM-Gen backend is implemented.

Note that enabling `TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION=1` will currently raise `NotImplementedError` since the backend is a placeholder. Consider adding a log message when the flag is enabled to inform users that this is an experimental/unimplemented path.

Optional: add an informational log at import time:

```diff
 if _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION:
     from .trtllm_gen import attention as trtllm_gen_attention
+    logger.info("TRTLLM-Gen attention backend enabled (experimental)")
```
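The gating pattern under review can be sketched in isolation. The names below are hypothetical stand-ins, not the actual TensorRT-LLM code, which passes a long argument list through `TrtllmAttentionWrapper.run()`.

```python
import os

# Hypothetical minimal sketch of the feature-flag dispatch described above.
# In the PR, the flag is read once at module import time.
FLAG_ENABLED = os.environ.get(
    "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "0") == "1"


def trtllm_gen_attention(*args, **kwargs):
    # Placeholder backend: documented but unimplemented, as in the PR.
    raise NotImplementedError("TRTLLM-Gen attention is not implemented yet")


def thop_attention(*args, **kwargs):
    # Stand-in for the existing thop.attention() path.
    return "thop-attention"


def run_attention(flag_enabled, *args, **kwargs):
    # Both branches receive identical arguments (argument parity), so the
    # backends stay interchangeable once TRTLLM-Gen is implemented.
    if flag_enabled:
        return trtllm_gen_attention(*args, **kwargs)
    return thop_attention(*args, **kwargs)


print(run_attention(False))  # → thop-attention
```

With the flag disabled the existing path is taken; with it enabled, the call currently raises `NotImplementedError`, which is why the reviewer suggests logging the experimental status at import time.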
PR_Github #32465 [ run ] triggered by Bot. Commit:
PR_Github #32465 [ run ] completed with state
0c0c796 to 88db82c
/bot run
PR_Github #32922 [ run ] triggered by Bot. Commit:
PR_Github #32922 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #33062 [ run ] triggered by Bot. Commit:
PR_Github #33062 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #33746 [ run ] triggered by Bot. Commit:
PR_Github #33746 [ run ] completed with state
88db82c to d21470e
/bot run --disable-fail-fast
PR_Github #33801 [ run ] triggered by Bot. Commit:
PR_Github #33801 [ run ] completed with state
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
/bot run --disable-fail-fast
PR_Github #35182 [ run ] triggered by Bot. Commit:
PR_Github #35182 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #35195 [ run ] triggered by Bot. Commit:
PR_Github #35195 [ run ] completed with state
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
/bot run --disable-fail-fast
PR_Github #35220 [ run ] triggered by Bot. Commit:
PR_Github #35220 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #35259 [ run ] triggered by Bot. Commit:
PR_Github #35259 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #35297 [ run ] triggered by Bot. Commit:
PR_Github #35297 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #35596 [ run ] triggered by Bot. Commit:
PR_Github #35596 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #35607 [ run ] triggered by Bot. Commit:
PR_Github #35607 [ run ] completed with state
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
Description

This is the initial PR to introduce the trtllm-gen attention backend. It will only use trtllm-gen fmha kernels and will be an experimental path of the `TrtllmAttention` backend. This backend is enabled iff `TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION=1`.
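The enabling condition can be illustrated with a short sketch. This is hypothetical, not the actual TensorRT-LLM module; only the environment variable name matches the PR. Because the flag is read at module import time, it must be set in the environment before the backend module is imported.

```python
import os

# Hypothetical sketch: set the flag before any import of the backend
# module, since the real module reads it once at import time.
os.environ["TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION"] = "1"


def trtllm_gen_enabled() -> bool:
    # Mirrors the module-level check: only the string "1" enables the
    # experimental TRTLLM-Gen path; anything else keeps the default.
    return os.environ.get("TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "0") == "1"


print(trtllm_gen_enabled())  # → True
```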