TALE: Token-Adaptive Low-Rank KVCache Approximation with Reconstruction Elimination
Abstract
KVCache, which stores key-value pairs for reuse, has been crucial for improving the inference efficiency of large language models (LLMs). However, the growing memory footprint of KVCache, especially with the recent trend toward longer input sequences, presents a major challenge. In this work, we propose a token-adaptive low-rank approximation strategy for KVCache compression. By assigning ranks according to token significance, our method compresses the KVCache efficiently while retaining critical information. We further introduce a lazy approximation technique, which defers approximation until it is needed, alongside a reconstruction-free design that bypasses costly recalculation. Combined with multi-level quantization, our method reduces KVCache size by 9.1x on the Llama-3.1-8B model with minimal performance degradation on complex tasks such as GSM8K. In addition, our custom attention implementation achieves up to a 2x latency reduction over the conventional method in long-context scenarios. The code is publicly available.
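To make the core idea concrete, the following is a minimal sketch (not the paper's implementation) of token-adaptive low-rank approximation: tokens judged more significant receive a higher-rank truncated-SVD approximation of their cache slice, while the rest are compressed more aggressively. The function names, the rank values, and the use of a generic per-token significance score are illustrative assumptions.

```python
import numpy as np

def lowrank_approx(mat, rank):
    # Best rank-`rank` approximation of `mat` via truncated SVD
    # (optimal in Frobenius norm by the Eckart-Young theorem).
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    rank = min(rank, len(s))
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def token_adaptive_compress(cache, scores, hi_rank=16, lo_rank=4, keep_frac=0.25):
    # cache:  (seq_len, head_dim) slice of the key (or value) cache.
    # scores: per-token significance, e.g. accumulated attention weight
    #         (hypothetical choice for this sketch).
    # The top `keep_frac` fraction of tokens by score is approximated
    # with rank `hi_rank`; the remainder with the cheaper `lo_rank`.
    seq_len = cache.shape[0]
    k = max(1, int(keep_frac * seq_len))
    mask = np.zeros(seq_len, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    out = np.empty_like(cache)
    out[mask] = lowrank_approx(cache[mask], hi_rank)
    out[~mask] = lowrank_approx(cache[~mask], lo_rank)
    return out
```

In practice one would store the low-rank factors rather than the reconstructed matrix, which is where the memory saving comes from; the sketch returns the dense approximation only to make the error behavior easy to inspect.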