[None][feat] Add priority-based KV cache offload filtering support #10751
pcastonguay merged 17 commits into NVIDIA:main from
Conversation
Force-pushed d1c2d0d to 2ab02d6
jthomson04 left a comment
Can save this for the end, but would also be nice to include an integration test in https://github.com/NVIDIA/TensorRT-LLM/blob/main/tests/integration/defs/llmapi/test_llm_api_connector.py
No worries. Added directly. PTAL
/bot run
PR_Github #33412 [ run ] triggered by Bot. Commit:
PR_Github #33412 [ run ] completed with state
Force-pushed 9029f30 to 7c74b75
/bot run
PR_Github #33546 [ run ] triggered by Bot. Commit:
📝 Walkthrough
This change introduces a new API method.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h`:
- Around lines 1686-1692: Replace the three-line Doxygen `///` comments above the
  virtual method `getPriorityByBlockId` with the header-style single-line `//!`
  comments: change each `/// @brief`, `/// @param`, and `/// @return` line to
  `//! @brief`, `//! @param`, and `//! @return` respectively for the declaration of
  `executor::RetentionPriority getPriorityByBlockId(KVCacheBlock::IdType blockId,
  SizeType32 windowSize) const = 0;` so the header follows the project's `//!`
  Doxygen style.
In `cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp`:
- Around lines 2578-2587: `KVCacheManager::getPriorityByBlockId` calls
  `mBlockManager.getBlockById`, which uses `.at()` and can throw `std::out_of_range`
  before the warning log and default return; update `getPriorityByBlockId` to guard
  against an invalid `blockId`/`windowSize` by either (A) performing an existence/bounds
  check on `mBlockManager` (e.g., a `hasBlock` or `validWindow`/`validId` method) before
  calling `getBlockById` and returning
  `tle::KvCacheRetentionConfig::kDefaultRetentionPriority` when missing, or (B)
  wrapping the `getBlockById` call in a try/catch for `std::out_of_range` and on
  exception logging the warning and returning
  `tle::KvCacheRetentionConfig::kDefaultRetentionPriority`; keep the existing return
  type `tle::RetentionPriority` and preserve the `TLLM_LOG_WARNING` message.
In `tests/integration/defs/llmapi/test_llm_api_connector.py`:
- Around lines 446-528: Remove the unused `enforce_single_worker` parameter from
  the test function signatures (`test_connector_priorities` and
  `test_connector_priorities_default`) and instead mark each test to use the fixture
  by adding `@pytest.mark.usefixtures("enforce_single_worker")` above the function;
  ensure `pytest` is imported at the top of the file if not already present so the
  decorator resolves.
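A minimal sketch of that fixture change; the fixture body here is hypothetical (only the name `enforce_single_worker` appears in the review), so treat it as an illustration of the `usefixtures` pattern rather than the project's actual fixture:

```python
import pytest


@pytest.fixture
def enforce_single_worker(monkeypatch):
    # Hypothetical body: the real fixture lives in the test module.
    # Here we just pin an illustrative worker-count setting to 1.
    monkeypatch.setenv("TLLM_NUM_WORKERS", "1")


# Requesting the fixture via the marker keeps the test signature free of
# an unused parameter while still running the fixture for this test.
@pytest.mark.usefixtures("enforce_single_worker")
def test_connector_priorities():
    assert True  # placeholder for the real connector-priority assertions
```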
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/kv_cache_connector.py (1)
318-325: Consider preserving the "None means default" contract. Right now
`priorities` is always a list (even when all defaults), so `None` is never used.
If downstream treats `None` as "feature disabled," this could change behavior.
Consider gating `priorities` (e.g., only populating it when a retention config is
present) or updating the field comment to match the behavior.
tests/integration/defs/llmapi/test_llm_api_connector.py (1)
25-25: Keep module namespace on import. Codebase rule prefers module-qualified access (avoids name clashes and keeps the namespace explicit).
As per coding guidelines, update to a module import and qualify usages.
🔧 Suggested refactor

```diff
-from tensorrt_llm.llmapi.llm_utils import KvCacheRetentionConfig
+import tensorrt_llm.llmapi.llm_utils as llm_utils
@@
-    retention_config = KvCacheRetentionConfig(
+    retention_config = llm_utils.KvCacheRetentionConfig(
         token_range_retention_priorities=[
-            KvCacheRetentionConfig.TokenRangeRetentionConfig(
+            llm_utils.KvCacheRetentionConfig.TokenRangeRetentionConfig(
                 token_start=0,
                 token_end=32,
                 priority=80,
             ),
-            KvCacheRetentionConfig.TokenRangeRetentionConfig(
+            llm_utils.KvCacheRetentionConfig.TokenRangeRetentionConfig(
                 token_start=32,
                 token_end=None,  # Extend to end of sequence
                 priority=10,
             ),
         ],
         decode_retention_priority=10,
     )
```

Also applies to: 467-478
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
853-865: Avoid hardcoding the default priority value in the docstring. The default is defined in C++ and could change; the docstring risks drifting.
✏️ Suggested tweak

```diff
-    Returns:
-        The retention priority of the block (0-100), or default priority (35) if not found.
+    Returns:
+        The retention priority of the block (0-100), or the default priority if not found.
```
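The gating suggested for kv_cache_connector.py could look roughly like the sketch below. The helper name and signature are hypothetical; the point is that `None` keeps meaning "feature disabled" unless a retention config is actually attached:

```python
from typing import Callable, List, Optional


def build_block_priorities(
    retention_config: Optional[object],
    block_ids: List[int],
    priority_of: Callable[[int], int],
) -> Optional[List[int]]:
    """Hypothetical helper preserving the 'None means default' contract.

    Returns None when no retention config is present, so downstream code
    can still treat None as "feature disabled"; otherwise returns one
    retention priority per block.
    """
    if retention_config is None:
        return None
    return [priority_of(block_id) for block_id in block_ids]
```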
PR_Github #33546 [ run ] completed with state
Force-pushed 7c74b75 to 719ccbd
/bot run
PR_Github #33621 [ run ] triggered by Bot. Commit:
PR_Github #33621 [ run ] completed with state
/bot run
PR_Github #33634 [ run ] triggered by Bot. Commit:
PR_Github #33634 [ run ] completed with state
/bot run
/bot run --disable-fail-fast
PR_Github #34679 [ run ] triggered by Bot. Commit:
PR_Github #34679 [ run ] completed with state
/bot run --disable-fail-fast
/bot run
/bot run --disable-fail-fast
PR_Github #34729 [ run ] triggered by Bot. Commit:
/bot kill
PR_Github #34797 [ kill ] triggered by Bot. Commit:
PR_Github #34797 [ kill ] completed with state
/bot run --disable-fail-fast
PR_Github #34810 [ run ] triggered by Bot. Commit:
PR_Github #34810 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #34872 [ run ] triggered by Bot. Commit:
/bot kill
PR_Github #34891 [ kill ] triggered by Bot. Commit:
PR_Github #34872 [ run ] completed with state
PR_Github #34891 [ kill ] completed with state
/bot run --disable-fail-fast
PR_Github #34894 [ run ] triggered by Bot. Commit:
PR_Github #34894 [ run ] completed with state
…VIDIA#10751) Signed-off-by: Yuewei Na <yna@nvidia.com> Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com> Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com>
…VIDIA#10751) Signed-off-by: Yuewei Na <yna@nvidia.com> Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com> Co-authored-by: Yuewei Na <nv-yna@users.noreply.github.com> Signed-off-by: Ahmet Inci <ainci@nvidia.com>
Summary
- Adds a getPriorityByBlockId method to KVCacheManager to expose block retention priorities
- Extends RequestData with a priorities field for the KV cache connector
- Populates priorities in update_and_build_data to pass to downstream KVBM

This enables KVBM to filter which blocks get offloaded based on retention priority (e.g., only offload high-priority system prompt blocks for explicit prompt caching).
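As a rough illustration of the filtering this enables, a consumer of the new priorities data might keep only blocks above a threshold. The function name and the threshold below are hypothetical; the real policy lives on the KVBM side:

```python
from typing import List, Optional


def select_blocks_to_offload(
    block_ids: List[int],
    priorities: Optional[List[int]],
    min_priority: int = 50,  # hypothetical threshold, not from the PR
) -> List[int]:
    """Keep only blocks whose retention priority meets the threshold.

    A None priorities list means the feature is disabled, in which case
    every block remains eligible for offload.
    """
    if priorities is None:
        return list(block_ids)
    return [b for b, p in zip(block_ids, priorities) if p >= min_priority]
```

For example, with priorities [80, 10, 60] and the default threshold, only the first and third blocks would be offloaded.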
Companion PR
Test plan
- get_priority_by_block_id returns correct priority values

Summary by CodeRabbit
New Features
Tests