[TRTLLM-7738][feat] Adding implementation of KVCacheManagerV2 #10736
lowsfer merged 1 commit into NVIDIA:main
Conversation
yizhang-nv
left a comment
Have the unit tests for kv cache manager v2 been added to the CI?
📝 Walkthrough

This pull request introduces a comprehensive KV Cache Manager V2 system for TensorRT LLM, featuring multi-tier memory management (GPU/Host/Disk), a custom reference C extension (rawref), batched copy operations, page-based eviction, and lifecycle-aware cache orchestration across multiple Python and C++ modules.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant User as User / Client
  participant KVMgr as KVCacheManager
  participant RadixTree as BlockRadixTree
  participant StorageMgr as StorageManager
  participant EvictionCtrl as EvictionController
  participant GPU as GPU Memory
  participant CopyEngine as CopyEngine
  User->>KVMgr: create_kv_cache()
  KVMgr->>KVMgr: init with config
  activate KVMgr
  KVMgr->>RadixTree: add_or_get_existing(lora_task_id)
  RadixTree-->>KVMgr: RootBlock
  deactivate KVMgr
  User->>KVMgr: (KV cache created)
  rect rgba(100, 150, 200, 0.5)
    Note over User,KVMgr: Prefill Phase
    User->>KVMgr: commit(tokens)
    activate KVMgr
    KVMgr->>StorageMgr: new_gpu_slots()
    StorageMgr->>GPU: allocate pages
    GPU-->>StorageMgr: slots
    StorageMgr-->>KVMgr: slots
    KVMgr->>RadixTree: match/add blocks
    KVMgr->>RadixTree: mark committed
    deactivate KVMgr
  end
  rect rgba(150, 200, 100, 0.5)
    Note over User,KVMgr: Decode Phase with Eviction
    User->>KVMgr: commit(new_tokens)
    activate KVMgr
    KVMgr->>StorageMgr: new_slots()
    StorageMgr->>StorageMgr: check free slots
    alt insufficient GPU memory
      StorageMgr->>EvictionCtrl: evict(min_pages)
      activate EvictionCtrl
      EvictionCtrl->>EvictionCtrl: pop from LRU
      EvictionCtrl-->>StorageMgr: evicted pages
      deactivate EvictionCtrl
      StorageMgr->>CopyEngine: migrate to host/disk
      CopyEngine->>GPU: batched copy (GPU→Host)
      GPU-->>CopyEngine: done
      StorageMgr->>GPU: release evicted GPU slots
    end
    StorageMgr->>GPU: allocate new slots
    StorageMgr-->>KVMgr: slots
    KVMgr->>RadixTree: match blocks for reuse
    KVMgr->>RadixTree: add new/committed blocks
    deactivate KVMgr
  end
  User->>KVMgr: suspend()
  activate KVMgr
  KVMgr->>StorageMgr: evict held blocks
  StorageMgr-->>KVMgr: freed slots
  deactivate KVMgr
  User->>KVMgr: resume()
  KVMgr-->>User: ready
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 4
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/nanobind/common/customCasters.h (1)
1-3: Update the copyright year to include 2026. The header still ends at 2025 even though this file was modified in 2026. As per coding guidelines, please update the NVIDIA copyright header to reflect the latest modification year.
tensorrt_llm/runtime/__init__.py (1)
1-1: Update copyright year to include 2025. The copyright header shows 2022-2024, but this file is being modified in 2025/2026. Per coding guidelines, the year of latest meaningful modification should be reflected.

📝 Suggested fix

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

scripts/build_wheel.py (1)
1027-1039: Duplicate wheel build commands - the first build ignores the mypyc setting. Lines 1028-1030 build the wheel without setting the `TRTLLM_ENABLE_MYPYC` environment variable, then lines 1031-1039 set the environment variable and build again. This causes two wheel builds, where only the second respects the `--mypyc` flag. The first `build_run` call should be removed.

Suggested fix

```diff
     extra_wheel_build_args = os.getenv("EXTRA_WHEEL_BUILD_ARGS", "")
-    build_run(
-        f'\"{venv_python}\" -m build {project_dir} --skip-dependency-check {extra_wheel_build_args} --no-isolation --wheel --outdir "{dist_dir}"'
-    )
     env = os.environ.copy()
     if mypyc:
         env["TRTLLM_ENABLE_MYPYC"] = "1"
     else:
         env["TRTLLM_ENABLE_MYPYC"] = "0"
     build_run(
-        f'\"{venv_python}\" -m build {project_dir} --skip-dependency-check --no-isolation --wheel --outdir "{dist_dir}"',
+        f'\"{venv_python}\" -m build {project_dir} --skip-dependency-check {extra_wheel_build_args} --no-isolation --wheel --outdir "{dist_dir}"',
         env=env)
```
🤖 Fix all issues with AI agents
In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManagerV2Utils.h`:
- Around line 18-30: Replace the top-level `#pragma once` with a traditional
include guard named TRTLLM_KV_CACHE_MANAGER_V2_UTILS_H (add `#ifndef`
TRTLLM_KV_CACHE_MANAGER_V2_UTILS_H / `#define` ... / `#endif`) and add Doxygen
comments for the public class and its method: document the
KVCacheManagerV2UtilsBindings class and the static initBindings(nb::module_&
module) method with brief `@brief` and `@param` tags so the public interface is
documented; reference the class name KVCacheManagerV2UtilsBindings and the
method initBindings when adding the comments.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py`:
- Around line 593-597: In the put method of the pool class (method put), avoid
appending an item after calling _destroy_func when the pool is at _max_size:
check the size and if the item must be destroyed, call self._destroy_func(item)
and return early instead of executing self.items.appendleft(item); alternatively
move the size-check before decrementing/adjusting and only append when not
destroyed so that _destroy_func is never followed by items.appendleft.
- Around line 395-401: This file is missing the required NVIDIA copyright
header; add the standard NVIDIA copyright header at the top of
tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py with the year of latest
meaningful modification, then run formatting/linting; ensure the header sits
above the module-level code that defines symbols like _CHUNKED_REGISTRATION,
_CHUNK_SIZE, and _num_registered_chunks so it is the first thing in the file,
and optionally add a brief inline comment near the existing
_CHUNKED_REGISTRATION declaration referencing the Linux bug (e.g., Ubuntu bug
`#2068818` or the LKML patch "mm/gup: stop leaking pinned pages in low memory
conditions") for future maintainers.
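The pool `put` fix requested in the first `_utils.py` item above (destroy instead of re-append when the pool is full) can be sketched as follows; `BoundedPool` and its members are illustrative stand-ins, not the real `_utils.py` class:

```python
from collections import deque


class BoundedPool:
    """Illustrative bounded object pool: once _max_size is reached, a
    returned item must be destroyed and must NOT also be appended back."""

    def __init__(self, max_size, destroy_func):
        self._max_size = max_size
        self._destroy_func = destroy_func
        self.items = deque()

    def put(self, item):
        if len(self.items) >= self._max_size:
            # Destroy and return early; never fall through to appendleft().
            self._destroy_func(item)
            return
        self.items.appendleft(item)


destroyed = []
pool = BoundedPool(max_size=1, destroy_func=destroyed.append)
pool.put("a")  # stored
pool.put("b")  # pool full: destroyed, not stored
```

The key point is the early `return`: the original bug was calling `_destroy_func(item)` and then still appending the destroyed item.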
In `@tensorrt_llm/runtime/kv_cache_manager_v2/rawref/rawrefmodule.c`:
- Around line 6-10: ReferenceTypeObject currently stores a raw integer object_id
and dereferences it in ReferenceType_call, causing use-after-free; change the
struct to hold a proper PyObject* weakref (created with PyWeakref_NewRef)
instead of Py_ssize_t object_id, update ReferenceType_call to retrieve the
target via PyWeakref_GetObject and check for Py_None, ensure you create the
weakref when initializing the ReferenceType (where object_id is currently
assigned) and add a tp_dealloc (and tp_traverse/tp_clear if necessary) to DECREF
the weakref to avoid leaks and keep correct lifetime semantics.
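The weakref-based lifetime fix requested for `rawrefmodule.c` has a direct Python-level analogue. This sketch (with an illustrative `Ref` class, not the real C type) shows the intended semantics: hold a weak reference and detect a dead target instead of dereferencing a stale object id:

```python
import weakref


class Ref:
    """Illustrative Python analogue of the C-level fix: hold a weak
    reference (PyWeakref_NewRef in the C API) and raise rather than
    touch a dead target (PyWeakref_GetObject returning Py_None)."""

    def __init__(self, target):
        self._wr = weakref.ref(target)

    def __call__(self):
        obj = self._wr()
        if obj is None:  # referent was collected: no use-after-free
            raise ReferenceError("referent no longer exists")
        return obj


class Payload:
    pass


p = Payload()
r = Ref(p)
alive = r() is p
del p  # drop the only strong reference (CPython frees it immediately)
dead = False
try:
    r()
except ReferenceError:
    dead = True
```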
🟠 Major comments (28)
tensorrt_llm/runtime/kv_cache_manager_v2/_cuda_virt_mem.py-1-1 (1)
1-1: Add the NVIDIA copyright header for this new source file. Line 1 starts with imports but the required NVIDIA header (latest year 2026) is missing. As per coding guidelines, please add the standard header before any code.
tensorrt_llm/runtime/kv_cache_manager_v2/setup_mypyc.py-1-8 (1)
1-8: Add the NVIDIA copyright header for this new source file. This file begins with a docstring (Line 1) but is missing the required NVIDIA header with the latest modification year (2026). As per coding guidelines, please add the standard header before the docstring.
cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.h-26-54 (1)
26-54: Make the header self-contained for signed-size types. `ssize_t` isn't provided by `<cstdint>`, so this header may not compile on some toolchains. Consider switching to `std::ptrdiff_t` (with `<cstddef>`) or another project-standard signed size type and update the .cpp accordingly.

cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.h-18-18 (1)
18-18: Add the required preprocessor guard. The project requires a `TRTLLM_<FILENAME>_H` include guard for headers; `#pragma once` alone doesn't satisfy the guideline.

♻️ Proposed refactor

```diff
-#pragma once
+#pragma once
+#ifndef TRTLLM_KVCACHEMANAGERV2UTILS_H
+#define TRTLLM_KVCACHEMANAGERV2UTILS_H
+
+// ... existing contents ...
+
+#endif // TRTLLM_KVCACHEMANAGERV2UTILS_H
```

As per coding guidelines, please add the mandated header guard.
tensorrt_llm/runtime/kv_cache_manager_v2/rawref/setup.py-1-1 (1)
1-1: Add the NVIDIA copyright header for this new source file. Line 1 starts with imports, but TensorRT-LLM source files must include the NVIDIA header with the latest modification year (2026) before any code. As per coding guidelines, please add the standard header.
cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.h-1-3 (1)
1-3: Update the copyright year to include 2026. The header currently ends at 2025 despite this 2026 change. As per coding guidelines, please update the NVIDIA copyright header to reflect the latest modification year.
tensorrt_llm/runtime/kv_cache_manager_v2/_cuda_virt_mem.py-37-48 (1)
37-48: Don't drop handles from tracking before a successful release. Line 40 removes the handle from `_outstanding_handles` before `cuMemRelease` succeeds; on failure, tracking is lost and leaks become harder to diagnose. Also, the bare `except:` on line 43 violates coding guidelines; narrow the exception type to `except Exception:`. Move the handle removal to after the successful release attempt.

🐛 Proposed fix

```diff
     assert int(handle) in self._outstanding_handles
-    self._outstanding_handles.remove(int(handle))
     try:
         _unwrap(drv.cuMemRelease(handle))
-    except:
+    except Exception:
         print(
             f"Failed to release handle {handle}. num_oustanding = {len(self._outstanding_handles)}"
         )
         raise
+    self._outstanding_handles.remove(int(handle))
```
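The ordering rule in the `_cuda_virt_mem.py` comment above (untrack a handle only after its release succeeds) can be sketched generically; `HandleTracker` and `flaky_release` are hypothetical stand-ins, not the real driver wrappers:

```python
class HandleTracker:
    """Sketch of the ordering fix: keep a handle in the outstanding set
    until the release call has actually succeeded, so a failed release
    still leaves evidence for leak diagnosis."""

    def __init__(self, release_fn):
        self._release_fn = release_fn
        self._outstanding = set()

    def track(self, handle):
        self._outstanding.add(handle)

    def release(self, handle):
        assert handle in self._outstanding
        self._release_fn(handle)  # may raise; handle stays tracked
        self._outstanding.discard(handle)  # untrack only after success


def flaky_release(handle):
    if handle == 13:
        raise RuntimeError("driver error")


t = HandleTracker(flaky_release)
t.track(7)
t.track(13)
t.release(7)  # succeeds and is untracked
try:
    t.release(13)  # fails: stays in the outstanding set
except RuntimeError:
    pass
```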
53-65: Address the code inconsistency: `_exceptions.py` is marked as excluded but remains in the modules list. Lines 53-54 state `_exceptions.py` should be excluded due to a mypyc limitation, yet it's listed in `modules` at line 64, and line 83 prints "Excluded: None" instead of listing the actual exclusions. Resolve this one of two ways:
Remove from modules if exclusion is intentional:
- Remove `"kv_cache_manager_v2/_exceptions.py"` from the modules list (line 64)
- Update line 83 to print `"Excluded: _exceptions.py"`

Keep in modules if inclusion is correct:
- Remove the exclusion comment (lines 53‑54) since mypyc supports classes inheriting from built-in Exception types
Note: The stated mypyc limitation with Exception classes is not accurate; mypyc documents Exception and BaseException as supported non-native base classes.
cpp/tensorrt_llm/nanobind/common/customCasters.h-319-335 (1)
319-335: Handle `PyLong_AsVoidPtr` errors and remove the duplicate return. Line 327 calls `PyLong_AsVoidPtr` without checking for errors. According to the Python C API, this function returns NULL on error (e.g., OverflowError) and sets a Python exception. The current code ignores this, returning true even if the conversion fails. Additionally, line 329 contains an unreachable duplicate return statement.

Proposed fix

```diff
 bool from_python([[maybe_unused]] handle src, uint8_t flags, cleanup_list* cleanup)
 {
-    value = reinterpret_cast<CUstream>(PyLong_AsVoidPtr(src.ptr()));
-    return true;
-    return true;
+    void* ptr = PyLong_AsVoidPtr(src.ptr());
+    if (ptr == nullptr && PyErr_Occurred())
+    {
+        PyErr_Clear();
+        return false;
+    }
+    value = reinterpret_cast<CUstream>(ptr);
+    return true;
 }
```
140-146: Add error handling in `_push()` with specific exception types to prevent resource leaks. If `cuMemSetAccess` fails after `cuMemMap` succeeds, the virtual range stays mapped with no rollback and the physical memory is not returned to the pool. The `extend()` rollback cannot clean this up since `_pm_stack` is never updated.

Add a try-except block in `_push` to unmap and close the physical memory on failure. Use specific exception types (`CuError`, `CuOOMError`) from `_exceptions` instead of bare `Exception` per coding guidelines:

Proposed fix

```diff
 def _push(self, phy_mem: PhysMem) -> None:
     phys_mem_size = self.phys_mem_size
     assert phys_mem_size * (len(self._pm_stack) + 1) <= self._vm_size
     vm_ptr = drv.CUdeviceptr(self.address + phys_mem_size * len(self._pm_stack))
-    _unwrap(drv.cuMemMap(vm_ptr, phys_mem_size, 0, phy_mem.handle, 0))
-    _unwrap(drv.cuMemSetAccess(vm_ptr, phys_mem_size, (self._access_desc,), 1))
+    try:
+        _unwrap(drv.cuMemMap(vm_ptr, phys_mem_size, 0, phy_mem.handle, 0))
+        _unwrap(drv.cuMemSetAccess(vm_ptr, phys_mem_size, (self._access_desc,), 1))
+    except (CuError, CuOOMError):
+        try:
+            _unwrap(drv.cuMemUnmap(vm_ptr, phys_mem_size))
+        except (CuError, CuOOMError):
+            pass
+        phy_mem.close()
+        raise
     self._pm_stack.append(phy_mem)
```

Also add the required NVIDIA copyright header at the top of the file.
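The map-then-set-access rollback suggested above follows a general acquire-with-rollback shape. This sketch uses hypothetical callables in place of the CUDA driver calls, and rolls back the first step only when the second fails:

```python
class MapError(Exception):
    pass


def map_with_access(do_map, set_access, unmap, on_fail_close):
    """If the second step fails after the first succeeded, undo the
    mapping (best effort) and close the backing resource, then re-raise.
    All four callables stand in for driver calls like cuMemMap."""
    do_map()
    try:
        set_access()
    except MapError:
        try:
            unmap()
        except MapError:
            pass  # best-effort cleanup
        on_fail_close()
        raise


calls = []


def fail_access():
    calls.append("set_access")
    raise MapError


try:
    map_with_access(
        do_map=lambda: calls.append("map"),
        set_access=fail_access,
        unmap=lambda: calls.append("unmap"),
        on_fail_close=lambda: calls.append("close"),
    )
except MapError:
    calls.append("raised")
```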
tensorrt_llm/runtime/kv_cache_manager_v2/_core/__init__.py-1-5 (1)
1-5: Add NVIDIA SPDX header to this new module. All TensorRT-LLM .py files must carry the NVIDIA SPDX header with the latest modification year. As per coding guidelines, please add it.

♻️ Proposed fix

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from .._common import BeamIndex
 from ._kv_cache import _KVCache
 from ._kv_cache_manager import KVCacheManager
```

tensorrt_llm/runtime/kv_cache_manager_v2/rawref/__init__.py-1-20 (1)
1-20: Add NVIDIA SPDX header to the rawref package initializer. All TensorRT-LLM .py files must carry the NVIDIA SPDX header with the latest modification year. As per coding guidelines, please add it.

♻️ Proposed fix

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 """rawref - Mutable reference with singleton pattern.
 @@
 """
```

setup.py-277-287 (1)
277-287: Local-directory extraction whitelist is missing `rawref/__init__.py` while wheel extraction includes it. Wheel extraction whitelists `tensorrt_llm/runtime/kv_cache_manager_v2/rawref/__init__.py` (lines 278-282), but the local-directory branch (lines 216-220) does not. This causes `rawref/__init__.py` to be filtered out when using `TRTLLM_USE_PRECOMPILED` with a local directory, creating an inconsistency. Update the local-directory allowed_dirs to match the wheel extraction logic.

♻️ Proposed fix

```diff
 if dst_file.endswith(".py"):
-    allowed_dirs = ("tensorrt_llm/deep_gemm/",
-                    "tensorrt_llm/deep_ep/",
-                    "tensorrt_llm/flash_mla/")
+    allowed_dirs = (
+        "tensorrt_llm/deep_gemm/",
+        "tensorrt_llm/deep_ep/",
+        "tensorrt_llm/flash_mla/",
+        "tensorrt_llm/runtime/kv_cache_manager_v2/rawref/__init__.py",
+    )
```

tensorrt_llm/runtime/kv_cache_manager_v2/_common.py-1-5 (1)
1-5: Add NVIDIA SPDX header for the new runtime module. This file is missing the required NVIDIA copyright header.

✅ Proposed header

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
```

As per coding guidelines.
tests/unittest/kv_cache_manager_v2/test_kv_cache_manager_v2.py-1-14 (1)
1-14: Add NVIDIA SPDX header for this new test module. This file is missing the required NVIDIA copyright header.

✅ Proposed header

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
```

As per coding guidelines.
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManagerV2Utils.cpp-1-4 (1)
1-4: Update the SPDX header year to 2026. This is a 2026 change; the header should reflect the latest modification year.

🛠️ Proposed fix

```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines.
tests/unittest/kv_cache_manager_v2/test_kv_cache_manager_v2.py-103-113 (1)
103-113: Remove `pdb.set_trace()` from the GC callback to avoid hanging tests. A debugger breakpoint can stall CI and violates linting (Ruff T100). Prefer logging and assertions.

🛠️ Proposed fix

```diff
-    if collected != 0 or uncollectable != 0:
-        import pdb
-
-        pdb.set_trace()
     assert collected == 0 and uncollectable == 0
```
1-8: Add NVIDIA SPDX header for this new CUDA source. The required NVIDIA copyright header is missing. Please add the standard SPDX/Apache-2.0 header with the latest year.

✅ Proposed header

```diff
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
```

As per coding guidelines, please add the standard header.
tensorrt_llm/runtime/kv_cache_manager_v2/rawref/__init__.pyi-1-3 (1)
1-3: Add NVIDIA SPDX header to the stub file. The required NVIDIA copyright header is missing.

✅ Proposed header

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
```

As per coding guidelines, please add the standard header.
cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.cu-115-162 (1)
115-162: Validate `numBytes` before narrowing to `uint32_t`. Public APIs (`copyHostToDevice`, `copyDeviceToHost`, `copyDeviceToDevice`) accept `ssize_t`, but `launchBatchedCopy` takes `uint32_t`. Negative values will wrap to large positive integers, and sizes >4 GiB will silently truncate. Add a range check at the wrapper layer.

🛠️ Proposed fix

```diff
 CUresult launchBatchedCopy(bool lowBandwidth, std::vector<MMTask> const& tasks, uint32_t nbBytes, cudaStream_t stream)
 {
+    TLLM_CHECK_WITH_INFO(numBytes >= 0 && numBytes <= std::numeric_limits<uint32_t>::max(),
+        "numBytes must be in [0, %u], got %zd.", std::numeric_limits<uint32_t>::max(), numBytes);
+    uint32_t const nbBytes = static_cast<uint32_t>(numBytes);
     constexpr uint32_t maxN = 256;
```

tests/unittest/kv_cache_manager_v2/test_kv_cache_manager_v2.py-11-14 (1)
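As a standalone illustration of the signed-to-`uint32_t` range check recommended in the `kvCacheManagerV2Utils.cu` comment above, here is a Python model of what an unchecked narrowing cast does and what the guard must reject (names are illustrative, not TensorRT-LLM helpers):

```python
U32_MAX = 2**32 - 1


def checked_u32(num_bytes: int) -> int:
    """Model of the range check: reject values a C++ cast to uint32_t
    would wrap (negatives) or truncate (above UINT32_MAX)."""
    if not 0 <= num_bytes <= U32_MAX:
        raise OverflowError(f"num_bytes {num_bytes} out of uint32_t range")
    return num_bytes


ok = checked_u32(4096)
wrapped = (-1) % 2**32  # what an unchecked cast of -1 would produce

neg_rejected = False
try:
    checked_u32(-1)
except OverflowError:
    neg_rejected = True

big_rejected = False
try:
    checked_u32(2**32)  # one past UINT32_MAX: would truncate to 0
except OverflowError:
    big_rejected = True
```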
11-14: Fix Python 3.8 incompatibilities: `randbytes` and PEP 585/604 type hints. The file violates the 3.8+ requirement specified in `tensorrt_llm/runtime/kv_cache_manager_v2/setup_mypyc.py`:

- `random.randbytes` (line 11) is only available in Python 3.9+
- Direct PEP 585/604 syntax is used: `int | None` type hints (PEP 604, Python 3.10+); `list[...]`, `dict[...]` instantiations (PEP 585, Python 3.9+)

Replace `randbytes` with `os.urandom`, add `from __future__ import annotations`, and convert generic type instantiations to non-parameterized forms (`[]` or `list()`, `{}` or `dict()`).

🛠️ Proposed fix

```diff
+from __future__ import annotations
+
 import array
 import functools
 import gc
 import itertools
 import os
 import random
 import time
 import unittest
 from contextlib import contextmanager
 from importlib.util import find_spec
-from random import randbytes
 from statistics import median
 from typing import TYPE_CHECKING, Iterator, NamedTuple, cast
```

Then replace direct generic instantiations:

- `self.past_sequences = list[list[TokenIdExt]]()` → `self.past_sequences = []`
- `self.seq_len_dict = dict[_KVCache, int]()` → `self.seq_len_dict = {}`
- `self.batch = list[Step]()` → `self.batch = []`
- `self.suspended = list[Step]()` → `self.suspended = []`

And replace usage:

```diff
-        return randbytes(32)
+        return os.urandom(32)
```

cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.cu-54-75 (1)
54-75: Move pointer arithmetic inside the bounds check to avoid undefined behavior. The code creates pointers via `reinterpret_cast<Grain*>(task.dst)[idxGrain]` and `&reinterpret_cast<Grain const*>(task.src)[idxGrain]` before verifying `idxGrain < idxGrainEnd`. When `idxGrain >= idxGrainEnd`, this performs out-of-bounds pointer arithmetic, which is undefined behavior even though the actual access is guarded. The loop intentionally runs extra iterations (the `+ nbBufs` component) for async group ordering, guaranteeing that some iterations exceed the bounds. These pointers are then passed to `cp.async` instructions that will attempt to dereference them.

Move the pointer creation inside the guard:

🛠️ Proposed fix

```diff
-    Grain const& src = data[idxBuf][tid];
-    uint32_t const idxGrain = idxGrainBeg + ctaSize * stIter;
-    Grain& dst = reinterpret_cast<Grain*>(task.dst)[idxGrain];
+    Grain const& src = data[idxBuf][tid];
+    uint32_t const idxGrain = idxGrainBeg + ctaSize * stIter;
     asm volatile("cp.async.wait_group %0;\n" ::"n"(nbBufs - 1) : "memory");
     if (idxGrain < idxGrainEnd)
     {
-        dst = src;
+        reinterpret_cast<Grain*>(task.dst)[idxGrain] = src;
     }
 ...
-    Grain* const dst = &data[idxBuf][tid];
-    uint32_t const idxGrain = idxGrainBeg + ctaSize * ldIter;
-    Grain const* const src = &reinterpret_cast<Grain const*>(task.src)[idxGrain];
-    if (idxGrain < idxGrainEnd)
-    {
-        uint32_t const size = grainBytes;
-        asm volatile("cp.async.cg.shared.global [%0], [%1], %2, %3;\n" ::"l"(__cvta_generic_to_shared(dst)),
-            "l"(src), "n"(grainBytes), "r"(size)
-            : "memory");
-    }
+    Grain* const dst = &data[idxBuf][tid];
+    uint32_t const idxGrain = idxGrainBeg + ctaSize * ldIter;
+    if (idxGrain < idxGrainEnd)
+    {
+        Grain const* const src = &reinterpret_cast<Grain const*>(task.src)[idxGrain];
+        uint32_t const size = grainBytes;
+        asm volatile("cp.async.cg.shared.global [%0], [%1], %2, %3;\n" ::"l"(__cvta_generic_to_shared(dst)),
+            "l"(src), "n"(grainBytes), "r"(size)
+            : "memory");
+    }
```

cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.cpp-1-9 (1)
1-9: Add the required NVIDIA copyright header. This new C++ source file is missing the mandatory NVIDIA copyright header with the year of latest modification (2026). As per coding guidelines, please add it at the top of the file.
cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.cpp-117-143 (1)
117-143: Use smart pointers to prevent a memory leak when `cuLaunchHostFunc` fails. If `cuLaunchHostFunc` returns an error, the heap-allocated `UserData` is never freed because the callback is never executed. This violates the coding guideline to use smart pointers for heap allocations. Apply the suggested fix pattern (using `std::unique_ptr` with a conditional `release()` on success) to all four copy functions, and add `#include <memory>` to the includes.

Suggested fix (apply to all copy* functions)

```diff
+#include <memory>

 CUresult copyDiskToDisk(
     std::vector<Task<DiskAddress, DiskAddress>> const& tasks, ssize_t numBytes, CUstream stream) noexcept
 {
-    auto const data = new UserData<DiskAddress, DiskAddress>{std::move(tasks), numBytes};
-    return cuLaunchHostFunc(stream, hostFnDiskToDiskCopy, data);
+    auto data = std::make_unique<UserData<DiskAddress, DiskAddress>>(UserData{tasks, numBytes});
+    auto* raw = data.get();
+    CUresult const res = cuLaunchHostFunc(stream, hostFnDiskToDiskCopy, raw);
+    if (res == CUDA_SUCCESS)
+    {
+        data.release();
+    }
+    return res;
 }
```

tensorrt_llm/runtime/kv_cache_manager_v2/_copy_engine.py-221-246 (1)
221-246: Exception during `__enter__` may leave grains locked without releasing them. If an exception occurs after `grain.mutex.acquire()` but before `__enter__` returns, the acquired mutexes won't be released since `__exit__` won't be called. Consider using a try/except to release locks on failure.

🔒 Proposed fix

```diff
 def __enter__(self) -> "StagingBuffer":
     manager = self.manager
     if self.min_size > manager.size:
         raise ValueError(f"Requested min_size {self.min_size} is too large for the manager")
     with manager.mutex:
         self._size = min(self.max_size, manager._suggest_next_max_size_unsafe())
         self.start_grain = manager.next
         manager.next += self.num_grains
         assert manager.next <= manager.num_grains
         if manager.next == manager.num_grains:
             manager.next = 0
-    def lock_and_consume_events() -> Iterator[CachedCudaEvent]:
-        for grain in self.grains:
-            grain.mutex.acquire()
-            yield grain.ready_event
-            grain.ready_event = CachedCudaEvent.NULL
-
-    stream_wait_events(self.stream, lock_and_consume_events())
-    return self
+    acquired_grains: list[GrainMetadata] = []
+    try:
+        events: list[CachedCudaEvent] = []
+        for grain in self.grains:
+            grain.mutex.acquire()
+            acquired_grains.append(grain)
+            events.append(grain.ready_event)
+            grain.ready_event = CachedCudaEvent.NULL
+        stream_wait_events(self.stream, events)
+        return self
+    except Exception:
+        for grain in reversed(acquired_grains):
+            grain.mutex.release()
+        raise
```

tensorrt_llm/runtime/kv_cache_manager_v2/_copy_engine.py-323-345 (1)
323-345: Potential infinite loop if `num_bytes` exceeds the staging buffer capacity. If `num_bytes > manager.size`, `buf.size // num_bytes` will be 0, causing `n = 0` and `remaining = remaining[0:]`, which equals `remaining`, creating an infinite loop. The `min_size` check in `__enter__` raises an exception, but the exception message context could be clearer.

🐛 Proposed fix - add explicit check

```diff
 def transfer(
     self,
     dst_cache_tier: CacheTier,
     src_cache_tier: CacheTier,
     num_bytes: int,
     tasks: Sequence[CopyTask],
     stream: CudaStream,
 ) -> None:
     copier = get_copier(dst_cache_tier, src_cache_tier)
     if not isinstance(copier, tuple):
         return copier(tasks, num_bytes, stream)
     assert len(copier) == 2, "for now, we only support 2 copiers via host memory"
     manager = self.staging_buffer_manager
+    if num_bytes > manager.size:
+        raise ValueError(f"Transfer size {num_bytes} exceeds staging buffer capacity {manager.size}")
     remaining = tasks
     while remaining:
         with manager.new(num_bytes, num_bytes * len(remaining), stream) as buf:
```

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py-185-201 (1)
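The zero-progress hazard flagged for the `_copy_engine.py` `transfer` loop can be reproduced and guarded in isolation; `chunked_transfer` and `copy_chunk` are hypothetical stand-ins for the real two-hop copier:

```python
def chunked_transfer(tasks, num_bytes, buffer_capacity, copy_chunk):
    """If a single task's num_bytes exceeds the staging-buffer capacity,
    the per-pass task count would be 0 and the remaining list would never
    shrink, so fail fast instead of looping forever."""
    if num_bytes > buffer_capacity:
        raise ValueError(
            f"Transfer size {num_bytes} exceeds staging buffer capacity {buffer_capacity}")
    remaining = list(tasks)
    while remaining:
        n = max(1, buffer_capacity // num_bytes)  # tasks per staging pass
        chunk, remaining = remaining[:n], remaining[n:]
        copy_chunk(chunk)


chunks = []
chunked_transfer(["t1", "t2", "t3"], num_bytes=4, buffer_capacity=8,
                 copy_chunk=chunks.append)

raised = False
try:
    chunked_transfer(["t"], num_bytes=16, buffer_capacity=8,
                     copy_chunk=chunks.append)
except ValueError:
    raised = True
```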
185-201: Binary search may skip valid values due to an off-by-one error. When `is_enough(mid)` is `False`, setting `ub = mid - 1` may skip `mid` as a potential answer. The standard binary-search pattern for finding the largest value that satisfies a condition should use `ub = mid` when the condition fails, or adjust the midpoint calculation.

Additionally, the `zip()` on line 188 lacks `strict=True`, which could silently mask length mismatches.

🐛 Proposed fix

```diff
 assert is_enough(1)
 lb = 1
 ub = div_up(model_max_seq_len, tokens_per_block)
 if is_enough(ub):
     return model_max_seq_len
-while lb < ub:
-    mid = (lb + ub) // 2
+while lb + 1 < ub:
+    mid = (lb + ub) // 2
     if is_enough(mid):
         lb = mid
     else:
-        ub = mid - 1
+        ub = mid
 return min(lb * tokens_per_block, model_max_seq_len)
```

And for the `zip()`:

```diff
-    for cnt, rem in zip(get_num_slots(num_blocks * tokens_per_block), remaining_slots)
+    for cnt, rem in zip(get_num_slots(num_blocks * tokens_per_block), remaining_slots, strict=True)
```

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py-860-891 (1)
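The "largest value satisfying a monotone predicate" search that the `_kv_cache_manager.py` comment asks for can be written as a small standalone helper (illustrative name, not the real code); the invariant is that `pred(lb)` stays true and `pred(ub)` stays false, so no candidate is ever skipped:

```python
def largest_satisfying(lo, hi, pred):
    """Return the largest v in [lo, hi] with pred(v) true, assuming pred
    is monotone (true up to some point, then false) and pred(lo) holds."""
    assert pred(lo)
    lb, ub = lo, hi
    if pred(ub):
        return ub
    # invariant: pred(lb) is True and pred(ub) is False
    while lb + 1 < ub:
        mid = (lb + ub) // 2
        if pred(mid):
            lb = mid
        else:
            ub = mid  # keep mid as the exclusive upper bound, never mid - 1
    return lb
```

With `ub = mid - 1` instead, the loop can converge on a value for which the predicate was never checked, which is the off-by-one the review describes.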
860-891: Late-binding closure bug: loop variable `lc_idx` is not captured correctly. The nested functions `check_no_page_lc`, `check_has_page_lc`, and `check_no_page_stale` capture `lc_idx` by reference, not by value. This is a classic Python closure pitfall: when these predicates are called (via `find_index`), `lc_idx` may have changed to its final loop value.

However, examining the code more closely, `find_index` is called immediately after the function definition within the same loop iteration, so the bug may not manifest in practice. Still, this pattern is fragile and should be fixed for correctness and maintainability.

Proposed fix using default-argument binding

```diff
 for lc_idx, lc in swa_life_cycles:
-    def check_no_page_lc(b: tuple[Block, int]):
-        return not has_page(b[0], lc_idx)
+    def check_no_page_lc(b: tuple[Block, int], _lc_idx=lc_idx):
+        return not has_page(b[0], _lc_idx)

     n = find_index(matched[: lc.num_sink_blocks], check_no_page_lc)
     if n < lc.num_sink_blocks:
@@
     if lc.window_size is None:
         continue

-    def check_has_page_lc(b: tuple[Block, int]):
-        return has_page(b[0], lc_idx)
+    def check_has_page_lc(b: tuple[Block, int], _lc_idx=lc_idx):
+        return has_page(b[0], _lc_idx)

     n = find_index(reversed(matched), check_has_page_lc)
     if n != 0:
         matched = matched[:-n]
         break
     _, stale_end = _KVCache._get_stale_range(tokens_per_block, num_tokens, lc)

-    def check_no_page_stale(b: tuple[Block, int]):
-        return not has_page(b[0], lc_idx)
+    def check_no_page_stale(b: tuple[Block, int], _lc_idx=lc_idx):
+        return not has_page(b[0], _lc_idx)
```
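The late-binding behavior described above is easy to demonstrate in isolation:

```python
# Both lists build three closures over the loop variable i. The first set
# shares the live variable (late binding); the second pins the current
# value via a default argument, which is the fix the review proposes.
late = [lambda: i for i in range(3)]
bound = [lambda _i=i: _i for i in range(3)]

late_results = [f() for f in late]    # every closure sees the final i
bound_results = [f() for f in bound]  # each closure kept its own value
```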
🟡 Minor comments (18)
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/__init__.py-1-4 (1)
1-4: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files (including .py files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

📝 Suggested fix

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 from ._core import CacheLevelStorage
 # These are re-exported for external use
 __all__ = ["CacheLevelStorage"]
```

tensorrt_llm/runtime/kv_cache_manager_v2/rawref/test_rawref.py-219-225 (1)
219-225: Replace `assert False` with `raise AssertionError()`. Same issue as above: `assert False` is removed with the `-O` flag.

Suggested fix

```diff
 # Try to set is_valid - should raise AttributeError
 try:
     r.is_valid = False
-    assert False, "Should not be able to set is_valid"
+    raise AssertionError("Should not be able to set is_valid")
 except AttributeError:
     print("✓ is_valid is read-only (AttributeError raised)")
```

tensorrt_llm/runtime/kv_cache_manager_v2/rawref/test_rawref.py-200-206 (1)
200-206: Replace `assert False` with `raise AssertionError()`. `assert False` statements are removed when Python runs with optimizations (the `-O` flag), which would cause the test to silently pass even when it should fail.

Suggested fix

```diff
 # Try to access object_id - should raise AttributeError
 try:
     _ = r.object_id
-    assert False, "Should not be able to access object_id"
+    raise AssertionError("Should not be able to access object_id")
 except AttributeError:
     print("✓ object_id is hidden (AttributeError raised)")
```

tests/unittest/kv_cache_manager_v2/test_kv_cache_manager_v2.py-245-247 (1)
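The `-O` pitfall behind the two `assert False` comments above can be captured in a small helper; `expect_raises` is an illustrative name, not part of the test suite:

```python
def expect_raises(exc_type, fn):
    """Run fn() and require that it raises exc_type. Unlike `assert False`,
    the explicit `raise AssertionError(...)` below survives `python -O`,
    which strips assert statements entirely."""
    try:
        fn()
    except exc_type:
        return True
    raise AssertionError(f"{exc_type.__name__} was not raised")


ok = expect_raises(AttributeError, lambda: object().no_such_attribute)

failed = False
try:
    expect_raises(ValueError, lambda: None)  # no exception: must fail loudly
except AssertionError:
    failed = True
```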
245-247: Prefix unused `req_id` with an underscore. Ruff RUF059 flags this unused unpacked variable.

🧹 Proposed fix

```diff
-    req_id, kv_cache, prompt, decode_len = req
+    _req_id, kv_cache, prompt, decode_len = req
```

tests/unittest/kv_cache_manager_v2/kernels.py-133-177 (1)
133-177: Add `from __future__ import annotations` to match codebase patterns. The file uses PEP 585 syntax (`tuple[Kernel, int]`, `type[ctypes.Structure]`) without the postponed annotations import, which is inconsistent with the widespread codebase pattern. While TensorRT-LLM requires Python 3.10+ (where this syntax is natively supported), the project consistently uses `from __future__ import annotations` for consistency. Add this import at the top of the file.

Suggested fix

```diff
+from __future__ import annotations
+
 import contextlib
 import ctypes
```

tensorrt_llm/runtime/kv_cache_manager_v2/_life_cycle_registry.py-30-33 (1)
30-33: Sort `__slots__` to satisfy Ruff RUF023. Ruff flags the slot list as unsorted.

🔧 Suggested ordering

```diff
-    __slots__ = ("_life_cycle_list", "_life_cycle_id_dict")
+    __slots__ = ("_life_cycle_id_dict", "_life_cycle_list")
```

tensorrt_llm/runtime/kv_cache_manager_v2/__init__.py-27-52 (1)
tensorrt_llm/runtime/kv_cache_manager_v2/__init__.py-27-52 (1)

27-52: Sort `__all__` to satisfy Ruff RUF022. The list is currently unsorted per RUF022 requirements (e.g., `NDEBUG` appears in the middle of the list rather than at the start with other SCREAMING_SNAKE_CASE items). Ruff will flag and can auto-fix this with `ruff check --select RUF022 --fix`.

🔧 Suggested ordering

```diff
 __all__ = [
-    "LifeCycleId",
-    "LayerGroupId",
-    "TokenId",
-    "TokenIdExt",
-    "KVCacheManager",
-    "_KVCache",
-    "BeamIndex",
-    "LayerId",
-    "Priority",
-    "CacheLevel",
-    "CacheTier",
-    "CudaStream",
-    "MemAddress",
-    "NDEBUG",
-    "KVCacheManagerConfig",
     "AttentionLayerConfig",
+    "BeamIndex",
     "BufferConfig",
+    "CacheLevel",
+    "CacheTier",
     "CacheTierConfig",
+    "CudaStream",
     "DataRole",
     "DiskCacheTierConfig",
+    "GpuCacheTierConfig",
     "HostCacheTierConfig",
-    "CacheTierConfig",
+    "KVCacheManager",
+    "KVCacheManagerConfig",
+    "LayerGroupId",
+    "LayerId",
+    "LifeCycleId",
+    "MemAddress",
+    "NDEBUG",
+    "Priority",
+    "TokenId",
+    "TokenIdExt",
+    "_KVCache",
     "gen_multi_modal_tokens",
     "rawref",
 ]
```

tensorrt_llm/runtime/kv_cache_manager_v2/_page.py-1-5 (1)
1-5: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py-1-2 (1)

1-2: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.

tensorrt_llm/runtime/kv_cache_manager_v2/_eviction_controller/_eviction_controller.py-1-5 (1)

1-5: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_config.py-1-3 (1)

1-3: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py-1-7 (1)

1-7: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_copy_engine.py-1-17 (1)

1-17: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py-1-6 (1)

1-6: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py-1-7 (1)

1-7: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.
Proposed fix

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import array
 import enum
 from collections.abc import Sequence
```

tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py-1-11 (1)
1-11: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py-1-9 (1)
1-9: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py-499-503 (1)

499-503: Potential division by zero if `stats` is empty or all pool group totals are zero. The denominator `sum(sum(s.slot_size) * s.total for s in stats)` can be zero if `num_pool_groups` is zero (resulting in an empty `stats` list) or if all pool groups have `total == 0`. While unlikely in normal operation, a defensive check would improve robustness.
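A possible shape for that defensive check; the `PoolGroupStats` dataclass and `utilization` helper below are hypothetical stand-ins for the real stats objects in `_storage_manager.py`, mirroring only the field names the review mentions:

```python
from dataclasses import dataclass


@dataclass
class PoolGroupStats:
    """Hypothetical per-pool-group stats (names mirror the review comment)."""
    slot_size: tuple[int, ...]  # bytes per slot, one entry per pool in the group
    total: int                  # total slots in the group
    used: int                   # slots currently in use


def utilization(stats: list[PoolGroupStats]) -> float:
    denom = sum(sum(s.slot_size) * s.total for s in stats)
    if denom == 0:  # empty stats list, or every group has total == 0
        return 0.0
    return sum(sum(s.slot_size) * s.used for s in stats) / denom


print(utilization([]))                                           # -> 0.0
print(utilization([PoolGroupStats((16, 16), total=4, used=1)]))  # -> 0.25
```

Returning `0.0` for an empty or all-zero configuration is one reasonable convention; raising a `ValueError` would be another, depending on how callers interpret the ratio.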
Force-pushed 2efa101 to 2be8f4a
/bot run

PR_Github #32700 [ run ] triggered by Bot. Commit:
Force-pushed 2be8f4a to a4737c5
/bot run

PR_Github #32703 [ run ] triggered by Bot. Commit:

/bot help
GitHub Bot Help

Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

run

Launch build/test pipelines. All previously running jobs will be killed.

kill

Kill all running builds associated with pull request.

skip

Skip testing for latest commit on pull request.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
PR_Github #32703 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #32732 Bot args parsing error: usage: /bot [-h]

PR_Github #32733 [ run ] triggered by Bot. Commit:

PR_Github #32733 [ run ] completed with state
Force-pushed a4737c5 to 6551208
/bot run --disable-fail-fast

PR_Github #32854 [ run ] triggered by Bot. Commit:
Force-pushed 6551208 to 9638ea8
/bot run --disable-fail-fast --stage-list "A10-CPP-1, A10-TensorRT-3"

PR_Github #33096 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33116 [ run ] triggered by Bot. Commit:
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManagerV2Utils.cpp (Outdated; resolved)

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManagerV2Utils.cpp (Outdated; resolved)
/bot run --disable-fail-fast

PR_Github #33165 [ run ] triggered by Bot. Commit:

PR_Github #33165 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33241 [ run ] triggered by Bot. Commit:

PR_Github #33241 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33297 [ run ] triggered by Bot. Commit:

PR_Github #33297 [ run ] completed with state
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py (Outdated; resolved)

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py (Outdated; resolved)
Force-pushed 3712c52 to 654978e
/bot run --disable-fail-fast

PR_Github #33341 [ run ] triggered by Bot. Commit:
KVCacheManagerV2 is a new python-based implementation of the KV cache manager, featuring cleaner API, better abstraction and better code quality without the accumulated legacy. Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
Force-pushed 654978e to 63cb381
/bot run --disable-fail-fast

PR_Github #33373 [ run ] triggered by Bot. Commit:

PR_Github #33373 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33418 [ run ] triggered by Bot. Commit:

PR_Github #33418 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33440 [ run ] triggered by Bot. Commit:

PR_Github #33440 [ run ] completed with state
…NVIDIA#10736)" This reverts commit 6f07fa8.
KVCacheManagerV2 is a new python-based implementation of the KV cache manager, featuring cleaner API, better abstraction and better code quality without the accumulated legacy.
Summary by CodeRabbit
New Features
Build & Configuration