Rectified Sparse Attention (ReSA)

📝 TL;DR

Rectified Sparse Attention (ReSA) improves sparse decoding by periodically refreshing the KV cache, achieving near-lossless quality and up to 2.42x speedup at 256K context length.

📝 Abstract

Efficient long-sequence generation is a critical challenge for Large Language Models. While recent sparse decoding methods improve efficiency, they suffer from KV cache misalignment, where approximation errors accumulate and degrade generation quality. In this work, we propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. By refreshing the KV cache at fixed intervals using a dense forward pass, ReSA bounds error accumulation and preserves alignment with the pretraining distribution. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality with significantly improved efficiency. Notably, ReSA delivers up to 2.42x end-to-end speedup when decoding at a 256K sequence length, making it a practical solution for scalable long-context inference.

Method Overview

[Figure: ReSA Overview]

[Figure: Block Selection]

[Figure: Kernel]
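
The control flow described in the abstract (sparse decoding with a dense refresh of the KV cache at fixed intervals) can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: `resa_decode`, `sparse_step`, `dense_rectify`, and `rectify_every` are hypothetical names, and the sketch assumes that rectification re-encodes the tokens generated since the last refresh. The stand-in hooks only record when a refresh would happen.

```python
from typing import Callable, List


def resa_decode(
    prompt: List[int],
    sparse_step: Callable[[List[int]], int],
    dense_rectify: Callable[[List[int], int], None],
    max_new_tokens: int,
    rectify_every: int,
) -> List[int]:
    """Toy ReSA-style loop: sparse decoding with periodic dense rectification.

    sparse_step(tokens) returns the next token id (block-sparse attention in
    a real model). dense_rectify(tokens, n) stands for a dense forward pass
    that refreshes the KV cache entries of the last n tokens. Both hooks are
    hypothetical placeholders.
    """
    if rectify_every < 1:
        raise ValueError("rectify_every must be >= 1")
    tokens = list(prompt)
    unrectified = 0  # tokens whose KV entries came from sparse decoding
    for _ in range(max_new_tokens):
        tokens.append(sparse_step(tokens))
        unrectified += 1
        if unrectified == rectify_every:
            # Refresh the cache so approximation error cannot keep growing.
            dense_rectify(tokens, unrectified)
            unrectified = 0
    return tokens


# Stand-in hooks that only record what would happen on a real model.
rectified_at: List[int] = []


def fake_sparse_step(tokens: List[int]) -> int:
    return len(tokens)  # placeholder "next token"


def fake_dense_rectify(tokens: List[int], n: int) -> None:
    rectified_at.append(len(tokens))  # sequence length at refresh time


out = resa_decode([0, 1, 2], fake_sparse_step, fake_dense_rectify,
                  max_new_tokens=10, rectify_every=4)
print(rectified_at)  # [7, 11]
```

With a 3-token prompt and a refresh every 4 generated tokens, the refresh fires at sequence lengths 7 and 11, so no cache entry is ever more than `rectify_every` sparse steps old. That bound on staleness is the property the method relies on.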

Performance Comparison on Math Reasoning Tasks

| Setting | Gaokao2023En | Minerva | OlympiadBench | AIME24 | AMC23 | Avg |
|---|---|---|---|---|---|---|
| **R1-Qwen-Distill 1.5B** | | | | | | |
| Dense | 71.6 | 28.7 | 40.8 | 27.4 | 65.6 | 46.82 |
| Sparse | 67.9 | 29.0 | 38.7 | 21.3 | 60.6 | 43.50 |
| ReSA | 71.8 | 28.1 | 39.5 | 23.0 | 65.4 | 45.56 |
| Avg Length | 4915.8 | 6390.8 | 8991.6 | 12126.4 | 7866.4 | 8058.2 |
| **R1-Qwen-Distill 7B** | | | | | | |
| Dense | 73.8 | 40.4 | 52.3 | 48.1 | 89.0 | 60.72 |
| Sparse | 72.9 | 38.1 | 48.4 | 46.1 | 83.1 | 57.72 |
| ReSA | 73.5 | 39.7 | 52.3 | 51.1 | 86.0 | 60.52 |
| Avg Length | 2889.9 | 4018.7 | 7520.0 | 10474.5 | 5732.2 | 6127.1 |

Table: Performance comparison on math reasoning tasks. ReSA maintains near-lossless performance, whereas sparse decoding alone degrades as decoding progresses.
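
To make "near-lossless" concrete, the short script below recomputes the Avg column from the per-benchmark scores in the table and reports how far Sparse and ReSA fall below Dense. The numbers are copied from the table; the variable names are only for illustration.

```python
# Per-benchmark scores copied from the table:
# Gaokao2023En, Minerva, OlympiadBench, AIME24, AMC23.
scores = {
    "R1-Qwen-Distill 1.5B": {
        "Dense": [71.6, 28.7, 40.8, 27.4, 65.6],
        "Sparse": [67.9, 29.0, 38.7, 21.3, 60.6],
        "ReSA": [71.8, 28.1, 39.5, 23.0, 65.4],
    },
    "R1-Qwen-Distill 7B": {
        "Dense": [73.8, 40.4, 52.3, 48.1, 89.0],
        "Sparse": [72.9, 38.1, 48.4, 46.1, 83.1],
        "ReSA": [73.5, 39.7, 52.3, 51.1, 86.0],
    },
}


def mean(values):
    return sum(values) / len(values)


# Gap to the dense baseline, in average accuracy points.
gaps = {}
for model, rows in scores.items():
    dense_avg = mean(rows["Dense"])
    for setting in ("Sparse", "ReSA"):
        gap = round(dense_avg - mean(rows[setting]), 2)
        gaps[(model, setting)] = gap
        print(f"{model:22s} {setting:6s} avg={mean(rows[setting]):.2f} gap={gap:.2f}")
```

On these figures, sparse decoding alone loses about 3.3 and 3.0 average points against dense decoding for the 1.5B and 7B models, while ReSA loses about 1.3 and 0.2 points.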

🚀 Getting Started

Usage

Pretrained Model Preparation

Download the pretrained model to /path/to/pretrained/. For example:

huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --local-dir /path/to/pretrained/ --local-dir-use-symlinks False

Math Evaluation

Evaluate on math data with:

bash scripts/local_eval_math.sh

In the script, replace /path/to/pretrained/ with your pretrained model path and /path/to/output/ with the directory where evaluation results should be saved. Pass in the ReSA configuration using save_feature.

Collect Math Evaluation Results

Install the needed packages:

bash scripts/setup_math_eval.sh

Collect math evaluation results from the output file:

bash scripts/math_eval_result.sh /path/to/output/ file_name

Replace file_name with the name of your math result file, for example DeepSeek-R1-Distill-Qwen-1.5B_resa_0.1_32_local.jsonl.
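
Before running the collection script, it can help to check that the result file is non-empty and well formed. The helper below is a generic sketch that assumes only what the .jsonl extension suggests, namely one JSON object per line. The record schema is not documented here, so the code makes no assumptions about field names; `load_jsonl` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read a JSON Lines file, skipping blank lines."""
    records: List[Dict[str, Any]] = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Example usage (paths are placeholders, as in the commands above):
# records = load_jsonl("/path/to/output/file_name")
# print(len(records), "records; keys:", sorted(records[0]) if records else [])
```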

📚 References

(To be completed)

Acknowledgements

This project incorporates code from Qwen2.5-Math. We appreciate the contributions of the original authors.