Large language models that rely on chain-of-thought (CoT) or self-reflection can crack tough reasoning tasks, but at the cost of very long outputs that bloat the key–value (KV) cache during inference. Traditional cache-compression schemes, tuned for long prompts, stumble on these generated traces, retaining only ≈60 % of the original accuracy when restricted to 10 % of the cache.
R-KV — Redundancy-aware KV-cache compression for Reasoning models — solves this by ranking tokens on-the-fly for both importance and non-redundancy, retaining only the informative, diverse ones.
| Metric | Full KV | R-KV (ours) |
|---|---|---|
| KV cache kept | 100 % | 10 % |
| Accuracy | 100 % | ≈100 % |
| Throughput ↑ | 1× | 6.6× |
| Memory saved ↓ | – | 90 % |
At 16 % cache, noise removal even nudges accuracy to 105 % of the full baseline.
- Up to 90 % KV-cache memory savings with zero, sometimes even negative, accuracy loss.
- Plug-and-play: a lightweight wrapper for any autoregressive LLM.
- Training-free: drop straight into inference or RL roll-outs; no finetuning required.
Chain-of-thought (CoT) and self-reflection unlock impressive reasoning, but they explode the KV cache: a single DeepSeek-R1-Distill-8B run on a tough math problem can produce an extremely long reasoning trace, and every generated token adds another entry to the cache.
Existing compression tools focus on long prompts and falter on long generations—often pruning the wrong tokens because redundant self-checks still attend heavily to themselves.
R-KV tackles redundant key/value (KV) tokens by compressing the KV cache on-the-fly while the model is decoding, keeping only tokens that are important and non-redundant.
| Part | What happens | Key idea |
|---|---|---|
| 1. Decoding-time KV staging | Newly generated tokens are written to a buffer B_buffer. | Separate buffer lets us decide what to keep after seeing a chunk of text. |
| 2. Importance scoring | Use attention weights from the last α observation tokens to score each candidate token. | High attention ⇒ token is critical for future predictions. |
| 3. Redundancy estimation | Compute cosine similarity between key vectors; older near-duplicates are marked redundant. | Keeps semantics while pruning repetition. |
| 4. Joint selection | Final score Z = λ·Importance − (1-λ)·Redundancy. Top-k tokens + observation tokens are retained in the budgeted cache. | One knob (λ) trades memory vs. quality. |
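The four steps above can be condensed into one scoring routine. Below is a minimal NumPy sketch of the joint selection rule Z = λ·Importance − (1−λ)·Redundancy; the function name, the way importance scores are passed in, and all sizes are our own illustrative choices, not the official implementation.

```python
import numpy as np

def rkv_select(keys, attn, budget, lam=0.5):
    """Pick which cached tokens to keep (illustrative sketch).

    keys   : (T, d) key vectors of the candidate tokens
    attn   : (T,) importance scores, e.g. attention mass from the
             last alpha observation tokens (assumed precomputed)
    budget : number of tokens to retain
    lam    : knob trading importance against non-redundancy
    """
    T = keys.shape[0]
    # Importance: normalize attention mass to [0, 1].
    imp = attn / (attn.max() + 1e-8)

    # Redundancy: max cosine similarity to any *later* token, so the
    # older copy of a near-duplicate is the one marked redundant.
    norm = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    sim = norm @ norm.T
    red = np.zeros(T)
    for i in range(T - 1):
        red[i] = sim[i, i + 1:].max()

    # Joint score: Z = lam * Importance - (1 - lam) * Redundancy.
    z = lam * imp - (1 - lam) * red
    return np.sort(np.argsort(z)[-budget:])  # top-k, in original order
```

With two near-duplicate high-attention tokens, the older duplicate scores high on redundancy and is dropped, while a pure-attention baseline would keep both.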
R-KV surpasses the baselines by a large margin on challenging math benchmarks and, surprisingly, even outperforms full KV.
R-KV keeps two small, fixed-size buffers, so memory usage remains constant, unlike FullKV whose memory grows linearly:
| Buffer | Purpose | Shape |
|---|---|---|
| KV budget | Retained tokens | b × B_budget × L × H × d |
| KV buffer | Fresh tokens | b × B_buffer × L × H × d |
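The constant-memory claim follows directly from the shapes above: both buffers have a fixed token dimension, so peak KV memory no longer scales with sequence length. A quick back-of-the-envelope check, with model dimensions that are our own illustrative picks (roughly an 8B-class model in fp16), not numbers from the paper:

```python
# Peak KV memory: FullKV vs R-KV's two fixed-size buffers.
def kv_bytes(tokens, layers=32, heads=8, head_dim=128, batch=1, bytes_per=2):
    # x2 accounts for separate key and value tensors (fp16 = 2 bytes each).
    return 2 * batch * tokens * layers * heads * head_dim * bytes_per

full = kv_bytes(32_000)                # FullKV grows with every decoded token
rkv = kv_bytes(3_200) + kv_bytes(128)  # B_budget + B_buffer stay constant
print(f"FullKV: {full/2**30:.2f} GiB, R-KV: {rkv/2**30:.2f} GiB")
# → FullKV: 3.91 GiB, R-KV: 0.41 GiB
```

At a 32k-token trace and a 10 % budget, the two fixed buffers use roughly a tenth of FullKV's memory, matching the ≈90 % savings in the table at the top.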
| Part | Complexity |
|---|---|
| Importance scoring | O(α × B_budget) |
| Redundancy scoring | O(B_budget²) |
| Attention (compressed) | O((B_budget + B_buffer) × B_buffer) |
| Attention (FullKV) | O(B_full × B_buffer) |
For long sequences (B_full ≫ B_budget), the tiny cache more than offsets scoring overhead.
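To make that trade-off concrete, here is the arithmetic with example sizes of our own choosing (B_budget = 1024, B_buffer = 128, B_full = 32 000, α = 32); counts are multiply-accumulates per head for the attention-score matrices, and the scoring cost is paid once per buffer flush rather than every step:

```python
# Per-flush scoring cost vs per-chunk attention cost (one head), using
# assumed sizes; none of these constants come from the paper.
B_budget, B_buffer, B_full = 1024, 128, 32_000
alpha = 32  # observation-window size (assumed)

scoring = alpha * B_budget + B_budget ** 2     # importance + redundancy
attn_rkv = (B_budget + B_buffer) * B_buffer    # attention over compressed cache
attn_full = B_full * B_buffer                  # attention over the full cache

print(scoring, attn_rkv, attn_full)
```

Even charging the full scoring cost against a single chunk, attention over the full 32k cache (≈4.1M MACs) costs more than R-KV's scoring plus compressed attention combined (≈1.2M MACs), and the gap widens as B_full keeps growing.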
The figure below shows which tokens are picked by R-KV and the pure-attention baseline SnapKV at the same decoding step.
Grey = not selected | Light orange → Dark red = selected tokens (deeper red = chosen by more attention heads)
By combining attention strength with redundancy filtering, R-KV retains the important context and removes noise, successfully completing the task.
If you find R-KV useful in your research, please cite us:
```bibtex
@article{cai2025r,
  title   = {R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration},
  author  = {Cai, Zefan and Xiao, Wen and Sun, Hanshi and Luo, Cheng and Zhang, Yikai and Wan, Ke and Li, Yucheng and Zhou, Yeyang and Chang, Li-Wen and Gu, Jiuxiang and others},
  journal = {arXiv preprint arXiv:2505.24133},
  year    = {2025}
}
```