They claim up to 24x the throughput (measured in requests handled per second) compared to Hugging Face's Transformers library.
How?
Inference is bottlenecked by memory, most notably by the KV cache. They point out two key properties of the KV cache:

- It's very large (see the back-of-the-envelope calculation after this list).
- It's dynamic: its size depends on sequence length, which varies from request to request. Existing systems waste 60-80% of this memory through fragmentation and over-reservation.
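To get a feel for "very large", here's a rough calculation for a 13B-parameter model. The layer count and hidden size below are assumptions based on typical LLaMA-13B dimensions, not numbers from their post:

```python
# Back-of-the-envelope KV cache size, assuming LLaMA-13B-like
# dimensions (hypothetical numbers for illustration).
n_layers = 40          # transformer layers
hidden_size = 5120     # model dimension
bytes_per_param = 2    # fp16

# Each token stores one K and one V vector per layer.
kv_bytes_per_token = 2 * n_layers * hidden_size * bytes_per_param
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")   # ~800 KiB

# A single 2048-token sequence:
print(f"{kv_bytes_per_token * 2048 / 2**30:.1f} GiB")     # ~1.6 GiB
```

So a handful of long sequences can eat gigabytes of GPU memory before any weights are counted, which is why wasting 60-80% of that space hurts so much.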
PagedAttention is an alternative approach to managing the KV cache, inspired by virtual memory with its pages and blocks. By allocating space dynamically in fixed-size blocks, only about 4% of memory is wasted, instead of the aforementioned 60-80%.
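A minimal sketch of the idea, not vLLM's actual implementation: each sequence keeps a block table mapping logical token positions to fixed-size physical blocks, and a new block is allocated only when the previous one fills up. The class, names, and block size here are all hypothetical:

```python
# Toy sketch of the paged-KV-cache idea: logical token positions map
# to fixed-size physical blocks via a per-sequence block table.
BLOCK_SIZE = 16  # tokens per block (hypothetical)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) where token `pos` of sequence
        `seq_id` stores its K/V vectors, allocating a block on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(table):   # previous block is full
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

cache = PagedKVCache(num_physical_blocks=1024)
for pos in range(20):   # 20 tokens only touch 2 physical blocks
    block, offset = cache.append_token(seq_id=0, pos=pos)
```

Because nothing is reserved up front, internal fragmentation is bounded by at most one partially filled block per sequence, which is where the small waste figure comes from.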
For further details, refer to their website and GitHub.