TALE: Token-Adaptive Low-Rank KVCache Approximation with Reconstruction Elimination

Abstract

KVCache, which stores key-value pairs for reuse, has been crucial for improving inference efficiency in large language models (LLMs). However, the growing memory demands of KVCache, especially with the recent trend toward longer input sequences, present a major challenge. In this work, we propose a token-adaptive low-rank approximation strategy for KVCache compression. By applying varying ranks based on token significance, our method compresses KVCache efficiently while retaining critical information. Moreover, we introduce a lazy approximation technique, which defers approximation until it is actually needed, alongside a reconstruction-free design that bypasses costly recomputation. Combined with multi-level quantization, this method reduces KVCache size by 9.1x on the Llama-3.1-8B model, with minimal performance degradation on complex tasks such as GSM8K. In addition, our custom attention implementation achieves up to 2x latency reduction compared to the conventional method in long-context scenarios. The code is publicly available.
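The core idea of applying varying ranks based on token significance can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the importance scores, rank values, and the two-tier split are all illustrative assumptions; the sketch only shows that truncated-SVD compression at a higher rank for significant tokens preserves them more faithfully than the low-rank tier used for the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 64                       # tokens x head dimension (illustrative sizes)
K = rng.standard_normal((n, d))      # stand-in for one head's key cache
importance = rng.random(n)           # stand-in for token-significance scores

def lowrank(block, r):
    """Rank-r truncated-SVD approximation of a (tokens x d) block."""
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    r = min(r, len(s))
    return (U[:, :r] * s[:r]) @ Vt[:r]

def token_adaptive(K, importance, r_hi=48, r_lo=8, top_frac=0.25):
    """Compress significant tokens at rank r_hi, the rest at r_lo.

    r_hi, r_lo, and top_frac are hypothetical parameters, not the
    paper's settings.
    """
    n = K.shape[0]
    k = max(1, int(top_frac * n))
    order = np.argsort(importance)
    top, rest = order[-k:], order[:-k]
    out = np.empty_like(K)
    out[top] = lowrank(K[top], r_hi)
    out[rest] = lowrank(K[rest], r_lo)
    return out

approx = token_adaptive(K, importance)
top = np.argsort(importance)[-64:]
rest = np.argsort(importance)[:-64]
err_top = np.linalg.norm(K[top] - approx[top]) / np.linalg.norm(K[top])
err_rest = np.linalg.norm(K[rest] - approx[rest]) / np.linalg.norm(K[rest])
print(err_top, err_rest)  # significant tokens reconstruct with lower error
```

Storing the SVD factors instead of the full blocks is what yields the memory savings; the reconstruction step shown here is exactly what the paper's reconstruction-free attention design avoids at inference time.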

Article at MIT Press. Presented at EMNLP 2025.