[TRTLLM-7738][feat] Adding implementation of KVCacheManagerV2 #10736
lowsfer merged 1 commit into NVIDIA:main
Conversation
yizhang-nv
left a comment
Have the unit tests for kv cache manager v2 been added to the CI?
📝 Walkthrough

This pull request introduces a comprehensive KV Cache Manager V2 system for TensorRT LLM, featuring multi-tier memory management (GPU/Host/Disk), a custom reference C extension (rawref), batched copy operations, page-based eviction, and lifecycle-aware cache orchestration across multiple Python and C++ modules.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant User as User / Client
  participant KVMgr as KVCacheManager
  participant RadixTree as BlockRadixTree
  participant StorageMgr as StorageManager
  participant EvictionCtrl as EvictionController
  participant GPU as GPU Memory
  participant CopyEngine as CopyEngine
  User->>KVMgr: create_kv_cache()
  KVMgr->>KVMgr: init with config
  activate KVMgr
  KVMgr->>RadixTree: add_or_get_existing(lora_task_id)
  RadixTree-->>KVMgr: RootBlock
  deactivate KVMgr
  User->>KVMgr: (KV cache created)
  rect rgba(100, 150, 200, 0.5)
    Note over User,KVMgr: Prefill Phase
    User->>KVMgr: commit(tokens)
    activate KVMgr
    KVMgr->>StorageMgr: new_gpu_slots()
    StorageMgr->>GPU: allocate pages
    GPU-->>StorageMgr: slots
    StorageMgr-->>KVMgr: slots
    KVMgr->>RadixTree: match/add blocks
    KVMgr->>RadixTree: mark committed
    deactivate KVMgr
  end
  rect rgba(150, 200, 100, 0.5)
    Note over User,KVMgr: Decode Phase with Eviction
    User->>KVMgr: commit(new_tokens)
    activate KVMgr
    KVMgr->>StorageMgr: new_slots()
    StorageMgr->>StorageMgr: check free slots
    alt insufficient GPU memory
      StorageMgr->>EvictionCtrl: evict(min_pages)
      activate EvictionCtrl
      EvictionCtrl->>EvictionCtrl: pop from LRU
      EvictionCtrl-->>StorageMgr: evicted pages
      deactivate EvictionCtrl
      StorageMgr->>CopyEngine: migrate to host/disk
      CopyEngine->>GPU: batched copy (GPU→Host)
      GPU-->>CopyEngine: done
      StorageMgr->>GPU: release evicted GPU slots
    end
    StorageMgr->>GPU: allocate new slots
    StorageMgr-->>KVMgr: slots
    KVMgr->>RadixTree: match blocks for reuse
    KVMgr->>RadixTree: add new/committed blocks
    deactivate KVMgr
  end
  User->>KVMgr: suspend()
  activate KVMgr
  KVMgr->>StorageMgr: evict held blocks
  StorageMgr-->>KVMgr: freed slots
  deactivate KVMgr
  User->>KVMgr: resume()
  KVMgr-->>User: ready
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 4
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
cpp/tensorrt_llm/nanobind/common/customCasters.h (1)
1-3: Update the copyright year to include 2026. The header still ends at 2025 even though this file was modified in 2026. As per coding guidelines, please update the NVIDIA copyright header to reflect the latest modification year.
tensorrt_llm/runtime/__init__.py (1)
1-1: Update copyright year to include 2025. The copyright header shows 2022-2024, but this file is being modified in 2025/2026. Per coding guidelines, the year of latest meaningful modification should be reflected.

📝 Suggested fix

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

scripts/build_wheel.py (1)
1027-1039: Duplicate wheel build commands - the first build ignores the mypyc setting. Lines 1028-1030 build the wheel without setting the `TRTLLM_ENABLE_MYPYC` environment variable, then lines 1031-1039 set the environment variable and build again. This causes two wheel builds, where only the second respects the `--mypyc` flag. The first `build_run` call should be removed.

Suggested fix

```diff
     extra_wheel_build_args = os.getenv("EXTRA_WHEEL_BUILD_ARGS", "")
-    build_run(
-        f'\"{venv_python}\" -m build {project_dir} --skip-dependency-check {extra_wheel_build_args} --no-isolation --wheel --outdir "{dist_dir}"'
-    )
     env = os.environ.copy()
     if mypyc:
         env["TRTLLM_ENABLE_MYPYC"] = "1"
     else:
         env["TRTLLM_ENABLE_MYPYC"] = "0"
     build_run(
-        f'\"{venv_python}\" -m build {project_dir} --skip-dependency-check --no-isolation --wheel --outdir "{dist_dir}"',
+        f'\"{venv_python}\" -m build {project_dir} --skip-dependency-check {extra_wheel_build_args} --no-isolation --wheel --outdir "{dist_dir}"',
         env=env)
```
🤖 Fix all issues with AI agents
In `@cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManagerV2Utils.h`:
- Around line 18-30: Replace the top-level `#pragma once` with a traditional
include guard named TRTLLM_KV_CACHE_MANAGER_V2_UTILS_H (add `#ifndef`
TRTLLM_KV_CACHE_MANAGER_V2_UTILS_H / `#define` ... / `#endif`) and add Doxygen
comments for the public class and its method: document the
KVCacheManagerV2UtilsBindings class and the static initBindings(nb::module_&
module) method with brief `@brief` and `@param` tags so the public interface is
documented; reference the class name KVCacheManagerV2UtilsBindings and the
method initBindings when adding the comments.
In `@tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py`:
- Around line 593-597: In the put method of the pool class (method put), avoid
appending an item after calling _destroy_func when the pool is at _max_size:
check the size and if the item must be destroyed, call self._destroy_func(item)
and return early instead of executing self.items.appendleft(item); alternatively
move the size-check before decrementing/adjusting and only append when not
destroyed so that _destroy_func is never followed by items.appendleft.
- Around line 395-401: This file is missing the required NVIDIA copyright
header; add the standard NVIDIA copyright header at the top of
tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py with the year of latest
meaningful modification, then run formatting/linting; ensure the header sits
above the module-level code that defines symbols like _CHUNKED_REGISTRATION,
_CHUNK_SIZE, and _num_registered_chunks so it is the first thing in the file,
and optionally add a brief inline comment near the existing
_CHUNKED_REGISTRATION declaration referencing the Linux bug (e.g., Ubuntu bug
`#2068818` or the LKML patch "mm/gup: stop leaking pinned pages in low memory
conditions") for future maintainers.
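The pool `put` fix requested in the first `_utils.py` item above (destroy instead of re-append when the pool is full) can be sketched as follows; `BoundedPool` and its members are illustrative stand-ins, not the real `_utils.py` class:

```python
from collections import deque


class BoundedPool:
    """Illustrative bounded object pool: once _max_size is reached, a
    returned item must be destroyed and must NOT also be appended back."""

    def __init__(self, max_size, destroy_func):
        self._max_size = max_size
        self._destroy_func = destroy_func
        self.items = deque()

    def put(self, item):
        if len(self.items) >= self._max_size:
            # Destroy and return early; never fall through to appendleft().
            self._destroy_func(item)
            return
        self.items.appendleft(item)


destroyed = []
pool = BoundedPool(max_size=1, destroy_func=destroyed.append)
pool.put("a")  # stored
pool.put("b")  # pool full: destroyed, not stored
```

The key point is the early `return`: the original bug was calling `_destroy_func(item)` and then still appending the destroyed item.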
In `@tensorrt_llm/runtime/kv_cache_manager_v2/rawref/rawrefmodule.c`:
- Around line 6-10: ReferenceTypeObject currently stores a raw integer object_id
and dereferences it in ReferenceType_call, causing use-after-free; change the
struct to hold a proper PyObject* weakref (created with PyWeakref_NewRef)
instead of Py_ssize_t object_id, update ReferenceType_call to retrieve the
target via PyWeakref_GetObject and check for Py_None, ensure you create the
weakref when initializing the ReferenceType (where object_id is currently
assigned) and add a tp_dealloc (and tp_traverse/tp_clear if necessary) to DECREF
the weakref to avoid leaks and keep correct lifetime semantics.
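The weakref-based lifetime fix requested for `rawrefmodule.c` has a direct Python-level analogue. This sketch (with an illustrative `Ref` class, not the real C type) shows the intended semantics: hold a weak reference and detect a dead target instead of dereferencing a stale object id:

```python
import weakref


class Ref:
    """Illustrative Python analogue of the C-level fix: hold a weak
    reference (PyWeakref_NewRef in the C API) and raise rather than
    touch a dead target (PyWeakref_GetObject returning Py_None)."""

    def __init__(self, target):
        self._wr = weakref.ref(target)

    def __call__(self):
        obj = self._wr()
        if obj is None:  # referent was collected: no use-after-free
            raise ReferenceError("referent no longer exists")
        return obj


class Payload:
    pass


p = Payload()
r = Ref(p)
alive = r() is p
del p  # drop the only strong reference (CPython frees it immediately)
dead = False
try:
    r()
except ReferenceError:
    dead = True
```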
🟠 Major comments (28)
tensorrt_llm/runtime/kv_cache_manager_v2/_cuda_virt_mem.py-1-1 (1)
1-1: Add the NVIDIA copyright header for this new source file. Line 1 starts with imports but the required NVIDIA header (latest year 2026) is missing. As per coding guidelines, please add the standard header before any code.
tensorrt_llm/runtime/kv_cache_manager_v2/setup_mypyc.py-1-8 (1)
1-8: Add the NVIDIA copyright header for this new source file. This file begins with a docstring (Line 1) but is missing the required NVIDIA header with the latest modification year (2026). As per coding guidelines, please add the standard header before the docstring.
cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.h-26-54 (1)
26-54: Make the header self-contained for signed-size types. `ssize_t` isn't provided by `<cstdint>`, so this header may not compile on some toolchains. Consider switching to `std::ptrdiff_t` (with `<cstddef>`) or another project-standard signed size type and update the .cpp accordingly.

cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.h-18-18 (1)
18-18: Add the required preprocessor guard. The project requires a `TRTLLM_<FILENAME>_H` include guard for headers; `#pragma once` alone doesn't satisfy the guideline.

♻️ Proposed refactor

```diff
-#pragma once
+#pragma once
+#ifndef TRTLLM_KVCACHEMANAGERV2UTILS_H
+#define TRTLLM_KVCACHEMANAGERV2UTILS_H
+
+// ... existing contents ...
+
+#endif // TRTLLM_KVCACHEMANAGERV2UTILS_H
```

As per coding guidelines, please add the mandated header guard.
tensorrt_llm/runtime/kv_cache_manager_v2/rawref/setup.py-1-1 (1)
1-1: Add the NVIDIA copyright header for this new source file. Line 1 starts with imports, but TensorRT-LLM source files must include the NVIDIA header with the latest modification year (2026) before any code. As per coding guidelines, please add the standard header.
cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.h-1-3 (1)
1-3: Update the copyright year to include 2026. The header currently ends at 2025 despite this 2026 change. As per coding guidelines, please update the NVIDIA copyright header to reflect the latest modification year.
tensorrt_llm/runtime/kv_cache_manager_v2/_cuda_virt_mem.py-37-48 (1)
37-48: Don't drop handles from tracking before a successful release. Line 40 removes the handle from `_outstanding_handles` before `cuMemRelease` succeeds; on failure, tracking is lost and leaks become harder to diagnose. Also, the bare `except:` on line 43 violates coding guidelines; narrow the exception type to `except Exception:`. Move the handle removal to after the successful release attempt.

🐛 Proposed fix

```diff
     assert int(handle) in self._outstanding_handles
-    self._outstanding_handles.remove(int(handle))
     try:
         _unwrap(drv.cuMemRelease(handle))
-    except:
+    except Exception:
         print(
             f"Failed to release handle {handle}. num_oustanding = {len(self._outstanding_handles)}"
         )
         raise
+    self._outstanding_handles.remove(int(handle))
```
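The ordering rule in the `_cuda_virt_mem.py` comment above (untrack a handle only after its release succeeds) can be sketched generically; `HandleTracker` and `flaky_release` are hypothetical stand-ins, not the real driver wrappers:

```python
class HandleTracker:
    """Sketch of the ordering fix: keep a handle in the outstanding set
    until the release call has actually succeeded, so a failed release
    still leaves evidence for leak diagnosis."""

    def __init__(self, release_fn):
        self._release_fn = release_fn
        self._outstanding = set()

    def track(self, handle):
        self._outstanding.add(handle)

    def release(self, handle):
        assert handle in self._outstanding
        self._release_fn(handle)  # may raise; handle stays tracked
        self._outstanding.discard(handle)  # untrack only after success


def flaky_release(handle):
    if handle == 13:
        raise RuntimeError("driver error")


t = HandleTracker(flaky_release)
t.track(7)
t.track(13)
t.release(7)  # succeeds and is untracked
try:
    t.release(13)  # fails: stays in the outstanding set
except RuntimeError:
    pass
```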
53-65: Address the code inconsistency: `_exceptions.py` is marked as excluded but remains in the modules list. Lines 53-54 state `_exceptions.py` should be excluded due to a mypyc limitation, yet it's listed in `modules` at line 64, and line 83 prints "Excluded: None" instead of listing the actual exclusions. Resolve this one of two ways:
Remove from modules if exclusion is intentional:
- Remove `"kv_cache_manager_v2/_exceptions.py"` from the modules list (line 64)
- Update line 83 to print `"Excluded: _exceptions.py"`

Keep in modules if inclusion is correct:
- Remove the exclusion comment (lines 53‑54) since mypyc supports classes inheriting from built-in Exception types
Note: The stated mypyc limitation with Exception classes is not accurate; mypyc documents Exception and BaseException as supported non-native base classes.
cpp/tensorrt_llm/nanobind/common/customCasters.h-319-335 (1)
319-335: Handle `PyLong_AsVoidPtr` errors and remove the duplicate return. Line 327 calls `PyLong_AsVoidPtr` without checking for errors. According to the Python C API, this function returns NULL on error (e.g., OverflowError) and sets a Python exception. The current code ignores this, returning true even if the conversion fails. Additionally, line 329 contains an unreachable duplicate return statement.

Proposed fix

```diff
 bool from_python([[maybe_unused]] handle src, uint8_t flags, cleanup_list* cleanup)
 {
-    value = reinterpret_cast<CUstream>(PyLong_AsVoidPtr(src.ptr()));
-    return true;
-    return true;
+    void* ptr = PyLong_AsVoidPtr(src.ptr());
+    if (ptr == nullptr && PyErr_Occurred())
+    {
+        PyErr_Clear();
+        return false;
+    }
+    value = reinterpret_cast<CUstream>(ptr);
+    return true;
 }
```
140-146: Add error handling in `_push()` with specific exception types to prevent resource leaks. If `cuMemSetAccess` fails after `cuMemMap` succeeds, the virtual range stays mapped with no rollback and the physical memory is not returned to the pool. The `extend()` rollback cannot clean this up since `_pm_stack` is never updated.

Add a try-except block in `_push` to unmap and close the physical memory on failure. Use specific exception types (`CuError`, `CuOOMError`) from `_exceptions` instead of bare `Exception` per coding guidelines:

Proposed fix

```diff
 def _push(self, phy_mem: PhysMem) -> None:
     phys_mem_size = self.phys_mem_size
     assert phys_mem_size * (len(self._pm_stack) + 1) <= self._vm_size
     vm_ptr = drv.CUdeviceptr(self.address + phys_mem_size * len(self._pm_stack))
-    _unwrap(drv.cuMemMap(vm_ptr, phys_mem_size, 0, phy_mem.handle, 0))
-    _unwrap(drv.cuMemSetAccess(vm_ptr, phys_mem_size, (self._access_desc,), 1))
+    try:
+        _unwrap(drv.cuMemMap(vm_ptr, phys_mem_size, 0, phy_mem.handle, 0))
+        _unwrap(drv.cuMemSetAccess(vm_ptr, phys_mem_size, (self._access_desc,), 1))
+    except (CuError, CuOOMError):
+        try:
+            _unwrap(drv.cuMemUnmap(vm_ptr, phys_mem_size))
+        except (CuError, CuOOMError):
+            pass
+        phy_mem.close()
+        raise
     self._pm_stack.append(phy_mem)
```

Also add the required NVIDIA copyright header at the top of the file.
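The map-then-set-access rollback suggested above follows a general acquire-with-rollback shape. This sketch uses hypothetical callables in place of the CUDA driver calls, and rolls back the first step only when the second fails:

```python
class MapError(Exception):
    pass


def map_with_access(do_map, set_access, unmap, on_fail_close):
    """If the second step fails after the first succeeded, undo the
    mapping (best effort) and close the backing resource, then re-raise.
    All four callables stand in for driver calls like cuMemMap."""
    do_map()
    try:
        set_access()
    except MapError:
        try:
            unmap()
        except MapError:
            pass  # best-effort cleanup
        on_fail_close()
        raise


calls = []


def fail_access():
    calls.append("set_access")
    raise MapError


try:
    map_with_access(
        do_map=lambda: calls.append("map"),
        set_access=fail_access,
        unmap=lambda: calls.append("unmap"),
        on_fail_close=lambda: calls.append("close"),
    )
except MapError:
    calls.append("raised")
```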
tensorrt_llm/runtime/kv_cache_manager_v2/_core/__init__.py-1-5 (1)
1-5: Add NVIDIA SPDX header to this new module. All TensorRT-LLM .py files must carry the NVIDIA SPDX header with the latest modification year. As per coding guidelines, please add it.

♻️ Proposed fix

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 from .._common import BeamIndex
 from ._kv_cache import _KVCache
 from ._kv_cache_manager import KVCacheManager
```

tensorrt_llm/runtime/kv_cache_manager_v2/rawref/__init__.py-1-20 (1)
1-20: Add NVIDIA SPDX header to the rawref package initializer. All TensorRT-LLM .py files must carry the NVIDIA SPDX header with the latest modification year. As per coding guidelines, please add it.

♻️ Proposed fix

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
 """rawref - Mutable reference with singleton pattern.
 @@
 """
```

setup.py-277-287 (1)
277-287: Local-directory extraction whitelist is missing `rawref/__init__.py` while wheel extraction includes it. Wheel extraction whitelists `tensorrt_llm/runtime/kv_cache_manager_v2/rawref/__init__.py` (lines 278-282), but the local-directory branch (lines 216-220) does not. This causes `rawref/__init__.py` to be filtered out when using `TRTLLM_USE_PRECOMPILED` with a local directory, creating an inconsistency. Update the local-directory allowed_dirs to match the wheel extraction logic.

♻️ Proposed fix

```diff
 if dst_file.endswith(".py"):
-    allowed_dirs = ("tensorrt_llm/deep_gemm/",
-                    "tensorrt_llm/deep_ep/",
-                    "tensorrt_llm/flash_mla/")
+    allowed_dirs = (
+        "tensorrt_llm/deep_gemm/",
+        "tensorrt_llm/deep_ep/",
+        "tensorrt_llm/flash_mla/",
+        "tensorrt_llm/runtime/kv_cache_manager_v2/rawref/__init__.py",
+    )
```

tensorrt_llm/runtime/kv_cache_manager_v2/_common.py-1-5 (1)
1-5: Add NVIDIA SPDX header for the new runtime module. This file is missing the required NVIDIA copyright header.

✅ Proposed header

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
```

As per coding guidelines.
tests/unittest/kv_cache_manager_v2/test_kv_cache_manager_v2.py-1-14 (1)
1-14: Add NVIDIA SPDX header for this new test module. This file is missing the required NVIDIA copyright header.

✅ Proposed header

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
```

As per coding guidelines.
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManagerV2Utils.cpp-1-4 (1)
1-4: Update the SPDX header year to 2026. This is a 2026 change; the header should reflect the latest modification year.

🛠️ Proposed fix

```diff
- * SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

As per coding guidelines.
tests/unittest/kv_cache_manager_v2/test_kv_cache_manager_v2.py-103-113 (1)
103-113: Remove `pdb.set_trace()` from the GC callback to avoid hanging tests. A debugger breakpoint can stall CI and violates linting (Ruff T100). Prefer logging and assertions.

🛠️ Proposed fix

```diff
-    if collected != 0 or uncollectable != 0:
-        import pdb
-
-        pdb.set_trace()
     assert collected == 0 and uncollectable == 0
```
1-8: Add NVIDIA SPDX header for this new CUDA source. The required NVIDIA copyright header is missing. Please add the standard SPDX/Apache-2.0 header with the latest year.

✅ Proposed header

```diff
+/*
+ * SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: Apache-2.0
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
```

As per coding guidelines, please add the standard header.
tensorrt_llm/runtime/kv_cache_manager_v2/rawref/__init__.pyi-1-3 (1)
1-3: Add NVIDIA SPDX header to the stub file. The required NVIDIA copyright header is missing.

✅ Proposed header

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
```

As per coding guidelines, please add the standard header.
cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.cu-115-162 (1)
115-162: Validate `numBytes` before narrowing to `uint32_t`. Public APIs (`copyHostToDevice`, `copyDeviceToHost`, `copyDeviceToDevice`) accept `ssize_t`, but `launchBatchedCopy` takes `uint32_t`. Negative values will wrap to large positive integers, and sizes >4 GiB will silently truncate. Add a range check at the wrapper layer.

🛠️ Proposed fix

```diff
 CUresult launchBatchedCopy(bool lowBandwidth, std::vector<MMTask> const& tasks, uint32_t nbBytes, cudaStream_t stream)
 {
+    TLLM_CHECK_WITH_INFO(numBytes >= 0 && numBytes <= std::numeric_limits<uint32_t>::max(),
+        "numBytes must be in [0, %u], got %zd.", std::numeric_limits<uint32_t>::max(), numBytes);
+    uint32_t const nbBytes = static_cast<uint32_t>(numBytes);
     constexpr uint32_t maxN = 256;
```

tests/unittest/kv_cache_manager_v2/test_kv_cache_manager_v2.py-11-14 (1)
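As a standalone illustration of the signed-to-`uint32_t` range check recommended in the `kvCacheManagerV2Utils.cu` comment above, here is a Python model of what an unchecked narrowing cast does and what the guard must reject (names are illustrative, not TensorRT-LLM helpers):

```python
U32_MAX = 2**32 - 1


def checked_u32(num_bytes: int) -> int:
    """Model of the range check: reject values a C++ cast to uint32_t
    would wrap (negatives) or truncate (above UINT32_MAX)."""
    if not 0 <= num_bytes <= U32_MAX:
        raise OverflowError(f"num_bytes {num_bytes} out of uint32_t range")
    return num_bytes


ok = checked_u32(4096)
wrapped = (-1) % 2**32  # what an unchecked cast of -1 would produce

neg_rejected = False
try:
    checked_u32(-1)
except OverflowError:
    neg_rejected = True

big_rejected = False
try:
    checked_u32(2**32)  # one past UINT32_MAX: would truncate to 0
except OverflowError:
    big_rejected = True
```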
11-14: Fix Python 3.8 incompatibilities: `randbytes` and PEP 585/604 type hints. The file violates the 3.8+ requirement specified in `tensorrt_llm/runtime/kv_cache_manager_v2/setup_mypyc.py`:

- `random.randbytes` (line 11) is only available in Python 3.9+
- Direct PEP 585/604 syntax is used: `int | None` type hints (PEP 604, Python 3.10+); `list[...]`, `dict[...]` instantiations (PEP 585, Python 3.9+)

Replace `randbytes` with `os.urandom`, add `from __future__ import annotations`, and convert generic type instantiations to non-parameterized forms (`[]` or `list()`, `{}` or `dict()`).

🛠️ Proposed fix

```diff
+from __future__ import annotations
+
 import array
 import functools
 import gc
 import itertools
 import os
 import random
 import time
 import unittest
 from contextlib import contextmanager
 from importlib.util import find_spec
-from random import randbytes
 from statistics import median
 from typing import TYPE_CHECKING, Iterator, NamedTuple, cast
```

Then replace direct generic instantiations:

- `self.past_sequences = list[list[TokenIdExt]]()` → `self.past_sequences = []`
- `self.seq_len_dict = dict[_KVCache, int]()` → `self.seq_len_dict = {}`
- `self.batch = list[Step]()` → `self.batch = []`
- `self.suspended = list[Step]()` → `self.suspended = []`

And replace usage:

```diff
-        return randbytes(32)
+        return os.urandom(32)
```

cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.cu-54-75 (1)
54-75: Move pointer arithmetic inside the bounds check to avoid undefined behavior. The code creates pointers via `reinterpret_cast<Grain*>(task.dst)[idxGrain]` and `&reinterpret_cast<Grain const*>(task.src)[idxGrain]` before verifying `idxGrain < idxGrainEnd`. When `idxGrain >= idxGrainEnd`, this performs out-of-bounds pointer arithmetic, which is undefined behavior even though the actual access is guarded. The loop intentionally runs extra iterations (the `+ nbBufs` component) for async group ordering, guaranteeing that some iterations exceed the bounds. These pointers are then passed to `cp.async` instructions that will attempt to dereference them.

Move the pointer creation inside the guard:

🛠️ Proposed fix

```diff
-    Grain const& src = data[idxBuf][tid];
-    uint32_t const idxGrain = idxGrainBeg + ctaSize * stIter;
-    Grain& dst = reinterpret_cast<Grain*>(task.dst)[idxGrain];
+    Grain const& src = data[idxBuf][tid];
+    uint32_t const idxGrain = idxGrainBeg + ctaSize * stIter;
     asm volatile("cp.async.wait_group %0;\n" ::"n"(nbBufs - 1) : "memory");
     if (idxGrain < idxGrainEnd)
     {
-        dst = src;
+        reinterpret_cast<Grain*>(task.dst)[idxGrain] = src;
     }
 ...
-    Grain* const dst = &data[idxBuf][tid];
-    uint32_t const idxGrain = idxGrainBeg + ctaSize * ldIter;
-    Grain const* const src = &reinterpret_cast<Grain const*>(task.src)[idxGrain];
-    if (idxGrain < idxGrainEnd)
-    {
-        uint32_t const size = grainBytes;
-        asm volatile("cp.async.cg.shared.global [%0], [%1], %2, %3;\n" ::"l"(__cvta_generic_to_shared(dst)),
-            "l"(src), "n"(grainBytes), "r"(size)
-            : "memory");
-    }
+    Grain* const dst = &data[idxBuf][tid];
+    uint32_t const idxGrain = idxGrainBeg + ctaSize * ldIter;
+    if (idxGrain < idxGrainEnd)
+    {
+        Grain const* const src = &reinterpret_cast<Grain const*>(task.src)[idxGrain];
+        uint32_t const size = grainBytes;
+        asm volatile("cp.async.cg.shared.global [%0], [%1], %2, %3;\n" ::"l"(__cvta_generic_to_shared(dst)),
+            "l"(src), "n"(grainBytes), "r"(size)
+            : "memory");
+    }
```

cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.cpp-1-9 (1)
1-9: Add the required NVIDIA copyright header. This new C++ source file is missing the mandatory NVIDIA copyright header with the year of latest modification (2026). As per coding guidelines, please add it at the top of the file.
cpp/tensorrt_llm/batch_manager/kvCacheManagerV2Utils.cpp-117-143 (1)
117-143: Use smart pointers to prevent a memory leak when `cuLaunchHostFunc` fails. If `cuLaunchHostFunc` returns an error, the heap-allocated `UserData` is never freed because the callback is never executed. This violates the coding guideline to use smart pointers for heap allocations. Apply the suggested fix pattern (using `std::unique_ptr` with a conditional `release()` on success) to all four copy functions, and add `#include <memory>` to the includes.

Suggested fix (apply to all copy* functions)

```diff
+#include <memory>

 CUresult copyDiskToDisk(
     std::vector<Task<DiskAddress, DiskAddress>> const& tasks, ssize_t numBytes, CUstream stream) noexcept
 {
-    auto const data = new UserData<DiskAddress, DiskAddress>{std::move(tasks), numBytes};
-    return cuLaunchHostFunc(stream, hostFnDiskToDiskCopy, data);
+    auto data = std::make_unique<UserData<DiskAddress, DiskAddress>>(UserData{tasks, numBytes});
+    auto* raw = data.get();
+    CUresult const res = cuLaunchHostFunc(stream, hostFnDiskToDiskCopy, raw);
+    if (res == CUDA_SUCCESS)
+    {
+        data.release();
+    }
+    return res;
 }
```

tensorrt_llm/runtime/kv_cache_manager_v2/_copy_engine.py-221-246 (1)
221-246: Exception during `__enter__` may leave grains locked without releasing them. If an exception occurs after `grain.mutex.acquire()` but before `__enter__` returns, the acquired mutexes won't be released since `__exit__` won't be called. Consider using a try/except to release locks on failure.

🔒 Proposed fix

```diff
 def __enter__(self) -> "StagingBuffer":
     manager = self.manager
     if self.min_size > manager.size:
         raise ValueError(f"Requested min_size {self.min_size} is too large for the manager")
     with manager.mutex:
         self._size = min(self.max_size, manager._suggest_next_max_size_unsafe())
         self.start_grain = manager.next
         manager.next += self.num_grains
         assert manager.next <= manager.num_grains
         if manager.next == manager.num_grains:
             manager.next = 0
-    def lock_and_consume_events() -> Iterator[CachedCudaEvent]:
-        for grain in self.grains:
-            grain.mutex.acquire()
-            yield grain.ready_event
-            grain.ready_event = CachedCudaEvent.NULL
-
-    stream_wait_events(self.stream, lock_and_consume_events())
-    return self
+    acquired_grains: list[GrainMetadata] = []
+    try:
+        events: list[CachedCudaEvent] = []
+        for grain in self.grains:
+            grain.mutex.acquire()
+            acquired_grains.append(grain)
+            events.append(grain.ready_event)
+            grain.ready_event = CachedCudaEvent.NULL
+        stream_wait_events(self.stream, events)
+        return self
+    except Exception:
+        for grain in reversed(acquired_grains):
+            grain.mutex.release()
+        raise
```

tensorrt_llm/runtime/kv_cache_manager_v2/_copy_engine.py-323-345 (1)
323-345: Potential infinite loop if `num_bytes` exceeds the staging buffer capacity. If `num_bytes > manager.size`, `buf.size // num_bytes` will be 0, causing `n = 0` and `remaining = remaining[0:]`, which equals `remaining`, creating an infinite loop. The `min_size` check in `__enter__` raises an exception, but the exception message context could be clearer.

🐛 Proposed fix - add explicit check

```diff
 def transfer(
     self,
     dst_cache_tier: CacheTier,
     src_cache_tier: CacheTier,
     num_bytes: int,
     tasks: Sequence[CopyTask],
     stream: CudaStream,
 ) -> None:
     copier = get_copier(dst_cache_tier, src_cache_tier)
     if not isinstance(copier, tuple):
         return copier(tasks, num_bytes, stream)
     assert len(copier) == 2, "for now, we only support 2 copiers via host memory"
     manager = self.staging_buffer_manager
+    if num_bytes > manager.size:
+        raise ValueError(f"Transfer size {num_bytes} exceeds staging buffer capacity {manager.size}")
     remaining = tasks
     while remaining:
         with manager.new(num_bytes, num_bytes * len(remaining), stream) as buf:
```

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py-185-201 (1)
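The zero-progress hazard flagged for the `_copy_engine.py` `transfer` loop can be reproduced and guarded in isolation; `chunked_transfer` and `copy_chunk` are hypothetical stand-ins for the real two-hop copier:

```python
def chunked_transfer(tasks, num_bytes, buffer_capacity, copy_chunk):
    """If a single task's num_bytes exceeds the staging-buffer capacity,
    the per-pass task count would be 0 and the remaining list would never
    shrink, so fail fast instead of looping forever."""
    if num_bytes > buffer_capacity:
        raise ValueError(
            f"Transfer size {num_bytes} exceeds staging buffer capacity {buffer_capacity}")
    remaining = list(tasks)
    while remaining:
        n = max(1, buffer_capacity // num_bytes)  # tasks per staging pass
        chunk, remaining = remaining[:n], remaining[n:]
        copy_chunk(chunk)


chunks = []
chunked_transfer(["t1", "t2", "t3"], num_bytes=4, buffer_capacity=8,
                 copy_chunk=chunks.append)

raised = False
try:
    chunked_transfer(["t"], num_bytes=16, buffer_capacity=8,
                     copy_chunk=chunks.append)
except ValueError:
    raised = True
```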
185-201: Binary search may skip valid values due to an off-by-one error. When `is_enough(mid)` is `False`, setting `ub = mid - 1` may skip `mid` as a potential answer. The standard binary-search pattern for finding the largest value that satisfies a condition should use `ub = mid` when the condition fails, or adjust the midpoint calculation.

Additionally, the `zip()` on line 188 lacks `strict=True`, which could silently mask length mismatches.

🐛 Proposed fix

```diff
 assert is_enough(1)
 lb = 1
 ub = div_up(model_max_seq_len, tokens_per_block)
 if is_enough(ub):
     return model_max_seq_len
-while lb < ub:
-    mid = (lb + ub) // 2
+while lb + 1 < ub:
+    mid = (lb + ub) // 2
     if is_enough(mid):
         lb = mid
     else:
-        ub = mid - 1
+        ub = mid
 return min(lb * tokens_per_block, model_max_seq_len)
```

And for the `zip()`:

```diff
-    for cnt, rem in zip(get_num_slots(num_blocks * tokens_per_block), remaining_slots)
+    for cnt, rem in zip(get_num_slots(num_blocks * tokens_per_block), remaining_slots, strict=True)
```

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py-860-891 (1)
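The "largest value satisfying a monotone predicate" search that the `_kv_cache_manager.py` comment asks for can be written as a small standalone helper (illustrative name, not the real code); the invariant is that `pred(lb)` stays true and `pred(ub)` stays false, so no candidate is ever skipped:

```python
def largest_satisfying(lo, hi, pred):
    """Return the largest v in [lo, hi] with pred(v) true, assuming pred
    is monotone (true up to some point, then false) and pred(lo) holds."""
    assert pred(lo)
    lb, ub = lo, hi
    if pred(ub):
        return ub
    # invariant: pred(lb) is True and pred(ub) is False
    while lb + 1 < ub:
        mid = (lb + ub) // 2
        if pred(mid):
            lb = mid
        else:
            ub = mid  # keep mid as the exclusive upper bound, never mid - 1
    return lb
```

With `ub = mid - 1` instead, the loop can converge on a value for which the predicate was never checked, which is the off-by-one the review describes.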
860-891: Late-binding closure bug: loop variable `lc_idx` is not captured correctly. The nested functions `check_no_page_lc`, `check_has_page_lc`, and `check_no_page_stale` capture `lc_idx` by reference, not by value. This is a classic Python closure pitfall: when these predicates are called (via `find_index`), `lc_idx` may have changed to its final loop value.

However, examining the code more closely, `find_index` is called immediately after the function definition within the same loop iteration, so the bug may not manifest in practice. Still, this pattern is fragile and should be fixed for correctness and maintainability.

Proposed fix using default-argument binding

```diff
 for lc_idx, lc in swa_life_cycles:
-    def check_no_page_lc(b: tuple[Block, int]):
-        return not has_page(b[0], lc_idx)
+    def check_no_page_lc(b: tuple[Block, int], _lc_idx=lc_idx):
+        return not has_page(b[0], _lc_idx)

     n = find_index(matched[: lc.num_sink_blocks], check_no_page_lc)
     if n < lc.num_sink_blocks:
@@
     if lc.window_size is None:
         continue

-    def check_has_page_lc(b: tuple[Block, int]):
-        return has_page(b[0], lc_idx)
+    def check_has_page_lc(b: tuple[Block, int], _lc_idx=lc_idx):
+        return has_page(b[0], _lc_idx)

     n = find_index(reversed(matched), check_has_page_lc)
     if n != 0:
         matched = matched[:-n]
         break
     _, stale_end = _KVCache._get_stale_range(tokens_per_block, num_tokens, lc)

-    def check_no_page_stale(b: tuple[Block, int]):
-        return not has_page(b[0], lc_idx)
+    def check_no_page_stale(b: tuple[Block, int], _lc_idx=lc_idx):
+        return not has_page(b[0], _lc_idx)
```
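The late-binding behavior described above is easy to demonstrate in isolation:

```python
# Both lists build three closures over the loop variable i. The first set
# shares the live variable (late binding); the second pins the current
# value via a default argument, which is the fix the review proposes.
late = [lambda: i for i in range(3)]
bound = [lambda _i=i: _i for i in range(3)]

late_results = [f() for f in late]    # every closure sees the final i
bound_results = [f() for f in bound]  # each closure kept its own value
```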
🟡 Minor comments (18)
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/__init__.py-1-4 (1)
1-4: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files (including .py files) should contain an NVIDIA copyright header with the year of latest meaningful modification.

📝 Suggested fix

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 from ._core import CacheLevelStorage
 # These are re-exported for external use
 __all__ = ["CacheLevelStorage"]
```

tensorrt_llm/runtime/kv_cache_manager_v2/rawref/test_rawref.py-219-225 (1)
219-225: Replace `assert False` with `raise AssertionError()`. Same issue as above: `assert False` is removed with the `-O` flag.

Suggested fix

```diff
 # Try to set is_valid - should raise AttributeError
 try:
     r.is_valid = False
-    assert False, "Should not be able to set is_valid"
+    raise AssertionError("Should not be able to set is_valid")
 except AttributeError:
     print("✓ is_valid is read-only (AttributeError raised)")
```

tensorrt_llm/runtime/kv_cache_manager_v2/rawref/test_rawref.py-200-206 (1)
200-206: Replace `assert False` with `raise AssertionError()`. `assert False` statements are removed when Python runs with optimizations (the `-O` flag), which would cause the test to silently pass even when it should fail.

Suggested fix

```diff
 # Try to access object_id - should raise AttributeError
 try:
     _ = r.object_id
-    assert False, "Should not be able to access object_id"
+    raise AssertionError("Should not be able to access object_id")
 except AttributeError:
     print("✓ object_id is hidden (AttributeError raised)")
```

tests/unittest/kv_cache_manager_v2/test_kv_cache_manager_v2.py-245-247 (1)
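The `-O` pitfall behind the two `assert False` comments above can be captured in a small helper; `expect_raises` is an illustrative name, not part of the test suite:

```python
def expect_raises(exc_type, fn):
    """Run fn() and require that it raises exc_type. Unlike `assert False`,
    the explicit `raise AssertionError(...)` below survives `python -O`,
    which strips assert statements entirely."""
    try:
        fn()
    except exc_type:
        return True
    raise AssertionError(f"{exc_type.__name__} was not raised")


ok = expect_raises(AttributeError, lambda: object().no_such_attribute)

failed = False
try:
    expect_raises(ValueError, lambda: None)  # no exception: must fail loudly
except AssertionError:
    failed = True
```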
245-247: Prefix unused `req_id` with an underscore. Ruff RUF059 flags this unused unpacked variable.

🧹 Proposed fix

```diff
-    req_id, kv_cache, prompt, decode_len = req
+    _req_id, kv_cache, prompt, decode_len = req
```

tests/unittest/kv_cache_manager_v2/kernels.py-133-177 (1)
133-177: Add `from __future__ import annotations` to match codebase patterns. The file uses PEP 585 syntax (`tuple[Kernel, int]`, `type[ctypes.Structure]`) without the postponed annotations import, which is inconsistent with the widespread codebase pattern. While TensorRT-LLM requires Python 3.10+ (where this syntax is natively supported), the project consistently uses `from __future__ import annotations` for consistency. Add this import at the top of the file.

Suggested fix

```diff
+from __future__ import annotations
+
 import contextlib
 import ctypes
```

tensorrt_llm/runtime/kv_cache_manager_v2/_life_cycle_registry.py-30-33 (1)
30-33: Sort `__slots__` to satisfy Ruff RUF023. Ruff flags the slot list as unsorted.

🔧 Suggested ordering

```diff
-    __slots__ = ("_life_cycle_list", "_life_cycle_id_dict")
+    __slots__ = ("_life_cycle_id_dict", "_life_cycle_list")
```

tensorrt_llm/runtime/kv_cache_manager_v2/__init__.py-27-52 (1)
tensorrt_llm/runtime/kv_cache_manager_v2/__init__.py-27-52 (1)

27-52: Sort `__all__` to satisfy Ruff RUF022. The list is currently unsorted per RUF022 requirements (e.g., `NDEBUG` appears in the middle of the list rather than at the start with other SCREAMING_SNAKE_CASE items). Ruff will flag and can auto-fix this with `ruff check --select RUF022 --fix`.

🔧 Suggested ordering

```diff
 __all__ = [
-    "LifeCycleId",
-    "LayerGroupId",
-    "TokenId",
-    "TokenIdExt",
-    "KVCacheManager",
-    "_KVCache",
-    "BeamIndex",
-    "LayerId",
-    "Priority",
-    "CacheLevel",
-    "CacheTier",
-    "CudaStream",
-    "MemAddress",
-    "NDEBUG",
-    "KVCacheManagerConfig",
     "AttentionLayerConfig",
+    "BeamIndex",
     "BufferConfig",
+    "CacheLevel",
+    "CacheTier",
     "CacheTierConfig",
+    "CudaStream",
     "DataRole",
     "DiskCacheTierConfig",
+    "GpuCacheTierConfig",
     "HostCacheTierConfig",
-    "CacheTierConfig",
+    "KVCacheManager",
+    "KVCacheManagerConfig",
+    "LayerGroupId",
+    "LayerId",
+    "LifeCycleId",
+    "MemAddress",
+    "NDEBUG",
+    "Priority",
+    "TokenId",
+    "TokenIdExt",
+    "_KVCache",
     "gen_multi_modal_tokens",
     "rawref",
 ]
```

tensorrt_llm/runtime/kv_cache_manager_v2/_page.py-1-5 (1)
1-5: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py-1-2 (1)

1-2: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.

tensorrt_llm/runtime/kv_cache_manager_v2/_eviction_controller/_eviction_controller.py-1-5 (1)

1-5: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_config.py-1-3 (1)

1-3: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_block_radix_tree.py-1-7 (1)

1-7: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_copy_engine.py-1-17 (1)

1-17: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py-1-6 (1)

1-6: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache.py-1-7 (1)

1-7: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest meaningful modification.
Proposed fix

```diff
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import array
 import enum
 from collections.abc import Sequence
```

tensorrt_llm/runtime/kv_cache_manager_v2/_utils.py-1-11 (1)
1-11: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
tensorrt_llm/runtime/kv_cache_manager_v2/_storage/_core.py-1-9 (1)
1-9: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
tensorrt_llm/runtime/kv_cache_manager_v2/_storage_manager.py-499-503 (1)

499-503: Potential division by zero if `stats` is empty or all pool group totals are zero. The denominator `sum(sum(s.slot_size) * s.total for s in stats)` can be zero if `num_pool_groups` is zero (resulting in an empty `stats` list) or if all pool groups have `total == 0`. While unlikely in normal operation, a defensive check would improve robustness.
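A possible shape for that defensive check; the `PoolGroupStats` dataclass and `utilization` helper below are hypothetical stand-ins for the real stats objects in `_storage_manager.py`, mirroring only the field names the review mentions:

```python
from dataclasses import dataclass


@dataclass
class PoolGroupStats:
    """Hypothetical per-pool-group stats (names mirror the review comment)."""
    slot_size: tuple[int, ...]  # bytes per slot, one entry per pool in the group
    total: int                  # total slots in the group
    used: int                   # slots currently in use


def utilization(stats: list[PoolGroupStats]) -> float:
    denom = sum(sum(s.slot_size) * s.total for s in stats)
    if denom == 0:  # empty stats list, or every group has total == 0
        return 0.0
    return sum(sum(s.slot_size) * s.used for s in stats) / denom


print(utilization([]))                                           # -> 0.0
print(utilization([PoolGroupStats((16, 16), total=4, used=1)]))  # -> 0.25
```

Returning `0.0` for an empty or all-zero configuration is one reasonable convention; raising a `ValueError` would be another, depending on how callers interpret the ratio.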
Force-pushed 2efa101 to 2be8f4a
/bot run

PR_Github #32700 [ run ] triggered by Bot. Commit:
Force-pushed 2be8f4a to a4737c5
/bot run

PR_Github #32703 [ run ] triggered by Bot. Commit:

/bot help
GitHub Bot Help

Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.

run

Launch build/test pipelines. All previously running jobs will be killed.

kill

Kill all running builds associated with pull request.

skip

Skip testing for latest commit on pull request.

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
PR_Github #32703 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #32732 Bot args parsing error: usage: /bot [-h]

PR_Github #32733 [ run ] triggered by Bot. Commit:

PR_Github #32733 [ run ] completed with state
Force-pushed a4737c5 to 6551208
/bot run --disable-fail-fast

PR_Github #32854 [ run ] triggered by Bot. Commit:
Force-pushed 6551208 to 9638ea8
/bot run --disable-fail-fast --stage-list "A10-CPP-1, A10-TensorRT-3"

PR_Github #33096 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33116 [ run ] triggered by Bot. Commit:
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManagerV2Utils.cpp (Outdated; resolved)

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManagerV2Utils.cpp (Outdated; resolved)
/bot run --disable-fail-fast

PR_Github #33165 [ run ] triggered by Bot. Commit:

PR_Github #33165 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33241 [ run ] triggered by Bot. Commit:

PR_Github #33241 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33297 [ run ] triggered by Bot. Commit:

PR_Github #33297 [ run ] completed with state
tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py (Outdated; resolved)

tensorrt_llm/runtime/kv_cache_manager_v2/_core/_kv_cache_manager.py (Outdated; resolved)
Force-pushed 3712c52 to 654978e
/bot run --disable-fail-fast

PR_Github #33341 [ run ] triggered by Bot. Commit:
KVCacheManagerV2 is a new python-based implementation of the KV cache manager, featuring cleaner API, better abstraction and better code quality without the accumulated legacy. Signed-off-by: Yao Yao <lowsfer@users.noreply.github.com>
Force-pushed 654978e to 63cb381
/bot run --disable-fail-fast

PR_Github #33373 [ run ] triggered by Bot. Commit:

PR_Github #33373 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33418 [ run ] triggered by Bot. Commit:

PR_Github #33418 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #33440 [ run ] triggered by Bot. Commit:

PR_Github #33440 [ run ] completed with state
…NVIDIA#10736)" This reverts commit 6f07fa8.
KVCacheManagerV2 is a new python-based implementation of the KV cache manager, featuring cleaner API, better abstraction and better code quality without the accumulated legacy.
Summary by CodeRabbit
New Features
Build & Configuration