TALE: Token-Adaptive Low-Rank KVCache Approximation with Reconstruction Elimination
Abstract
KVCache, which stores key-value pairs for reuse, has been crucial for improving the inference efficiency of large language models (LLMs). However, the growing memory footprint of KVCache, especially with the recent trend toward longer input sequences, presents a major challenge. In this work, we propose a token-adaptive low-rank approximation strategy for KVCache compression. By assigning ranks according to token significance, our method compresses the KVCache efficiently while retaining critical information. We further introduce a lazy approximation technique, which defers approximation until it is needed, alongside a reconstruction-free design that bypasses costly recalculation. Combined with multi-level quantization, our method reduces KVCache size by 9.1x on the Llama-3.1-8B model with minimal performance degradation on complex tasks such as GSM8K. In addition, our custom attention implementation achieves up to a 2x latency reduction over the conventional method in long-context scenarios. The code is publicly available.
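To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of token-adaptive low-rank approximation: tokens judged more significant receive a higher-rank truncated-SVD approximation of their cache slice, while the rest are compressed more aggressively. The function names, the rank values, and the use of a generic per-token significance score are illustrative assumptions.

```python
import numpy as np

def lowrank_approx(mat, rank):
    # Best rank-`rank` approximation of `mat` via truncated SVD
    # (optimal in Frobenius norm by the Eckart-Young theorem).
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    rank = min(rank, len(s))
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def token_adaptive_compress(cache, scores, hi_rank=16, lo_rank=4, keep_frac=0.25):
    # cache:  (seq_len, head_dim) slice of the key (or value) cache.
    # scores: per-token significance, e.g. accumulated attention weight
    #         (hypothetical choice for this sketch).
    # The top `keep_frac` fraction of tokens by score is approximated
    # with rank `hi_rank`; the remainder with the cheaper `lo_rank`.
    seq_len = cache.shape[0]
    k = max(1, int(keep_frac * seq_len))
    mask = np.zeros(seq_len, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    out = np.empty_like(cache)
    out[mask] = lowrank_approx(cache[mask], hi_rank)
    out[~mask] = lowrank_approx(cache[~mask], lo_rank)
    return out
```

In practice one would store the low-rank factors rather than the reconstructed matrix, which is where the memory saving comes from; the sketch returns the dense approximation only to make the error behavior easy to inspect.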