They claim up to 24x the throughput (measured in requests handled per second) compared to Hugging Face's Transformers library.
How?
Inference is bottlenecked by memory, most notably by the KV cache. They point out two key properties of the KV cache:

- It's very large (see the back-of-the-envelope calculation after this list).
- It's dynamic: its size depends on sequence length, which varies from request to request. Existing systems waste 60-80% of this memory through fragmentation and over-reservation.
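To get a feel for "very large", here's a rough calculation for a 13B-parameter model. The layer count and hidden size below are assumptions based on typical LLaMA-13B dimensions, not numbers from their post:

```python
# Back-of-the-envelope KV cache size, assuming LLaMA-13B-like
# dimensions (hypothetical numbers for illustration).
n_layers = 40          # transformer layers
hidden_size = 5120     # model dimension
bytes_per_param = 2    # fp16

# Each token stores one K and one V vector per layer.
kv_bytes_per_token = 2 * n_layers * hidden_size * bytes_per_param
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")   # ~800 KiB

# A single 2048-token sequence:
print(f"{kv_bytes_per_token * 2048 / 2**30:.1f} GiB")     # ~1.6 GiB
```

So a handful of long sequences can eat gigabytes of GPU memory before any weights are counted, which is why wasting 60-80% of that space hurts so much.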
PagedAttention is an alternative approach to managing the KV cache, inspired by virtual memory with its pages and blocks. By allocating space dynamically in fixed-size blocks, only about 4% of memory is wasted, instead of the aforementioned 60-80%.
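A minimal sketch of the idea, not vLLM's actual implementation: each sequence keeps a block table mapping logical token positions to fixed-size physical blocks, and a new block is allocated only when the previous one fills up. The class, names, and block size here are all hypothetical:

```python
# Toy sketch of the paged-KV-cache idea: logical token positions map
# to fixed-size physical blocks via a per-sequence block table.
BLOCK_SIZE = 16  # tokens per block (hypothetical)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids

    def append_token(self, seq_id: int, pos: int) -> tuple[int, int]:
        """Return (physical_block, offset) where token `pos` of sequence
        `seq_id` stores its K/V vectors, allocating a block on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(table):   # previous block is full
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

cache = PagedKVCache(num_physical_blocks=1024)
for pos in range(20):   # 20 tokens only touch 2 physical blocks
    block, offset = cache.append_token(seq_id=0, pos=pos)
```

Because nothing is reserved up front, internal fragmentation is bounded by at most one partially filled block per sequence, which is where the small waste figure comes from.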
For further details, refer to their website and GitHub.