Hydragen: High-Throughput LLM Inference with Shared Prefixes
Abstract
Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end LLM throughput by up to 32x against competitive baselines, with speedup growing with the batch size and shared prefix length. Hydragen also enables the use of very long shared contexts: with a high batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.
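To make the decomposition concrete, here is a minimal PyTorch sketch of a single decoding step with a shared prefix: attention over the shared prefix batches all queries into one matrix-matrix product against the prefix KV cache (stored once), attention over each sequence's unique suffix is computed as usual, and the two partial results are recombined exactly via their log-sum-exps. The function and tensor names are illustrative assumptions, not the paper's reference implementation, which fuses these steps into optimized kernels.

```python
import torch

def merge_partial_attention(out_a, lse_a, out_b, lse_b):
    # Exactly combine two partial softmax-attention results, each given as a
    # block-local softmax output plus the log-sum-exp of that block's scores.
    lse = torch.logaddexp(lse_a, lse_b)
    return (torch.exp(lse_a - lse).unsqueeze(-1) * out_a
            + torch.exp(lse_b - lse).unsqueeze(-1) * out_b)

def decode_step_shared_prefix(q, prefix_k, prefix_v, suffix_k, suffix_v):
    """One decoding step of attention for a batch sharing a single prefix.

    q:        (b, d)     one query per sequence (decoding)
    prefix_k: (p, d)     shared prefix keys, stored once for the whole batch
    prefix_v: (p, d)     shared prefix values
    suffix_k: (b, s, d)  per-sequence suffix keys
    suffix_v: (b, s, d)  per-sequence suffix values
    """
    scale = q.shape[-1] ** -0.5

    # Prefix attention: queries from all b sequences hit the same K/V, so b
    # matrix-vector products collapse into one matrix-matrix product and the
    # prefix KV cache is read from memory once instead of b times.
    pre_scores = (q @ prefix_k.T) * scale                       # (b, p)
    pre_lse = torch.logsumexp(pre_scores, dim=-1)               # (b,)
    pre_out = torch.softmax(pre_scores, dim=-1) @ prefix_v      # (b, d)

    # Suffix attention: ordinary per-sequence attention over the unique KV.
    suf_scores = torch.einsum("bd,bsd->bs", q, suffix_k) * scale
    suf_lse = torch.logsumexp(suf_scores, dim=-1)
    suf_out = torch.einsum("bs,bsd->bd", torch.softmax(suf_scores, dim=-1), suffix_v)

    # Recombining with log-sum-exp rescaling reproduces exact attention over
    # the concatenated (prefix + suffix) keys and values.
    return merge_partial_attention(pre_out, pre_lse, suf_out, suf_lse)
```

The same merge rule can be applied to more than two blocks, which is what makes the tree-structured sharing patterns mentioned at the end of the abstract possible.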
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Efficient LLM inference solution on Intel GPU (2023)
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (2024)
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (2024)
- Gated Linear Attention Transformers with Hardware-Efficient Training (2023)
- SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification (2023)
Wait. Wait. Wait. Forget faster batches. Doesn't this effectively give you unlimited context length?
During inference we have a KV cache, which can be stored in CPU/disk/parallel-universe, and we have Q=1 (duh, autoregression).
Then, using the decomposition formula, we can compute SDPA piece by piece, never pulling more of the KV cache onto the GPU than we can chew.
Unless I borked the napkin math, it checks out: it's straightforward to compute SDPA(Q=1, K1||K2||K3) without ever storing all of K1||K2||K3 in GPU memory.
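Roughly, the by-parts version would look something like this: a sketch of the standard log-sum-exp merge (essentially online softmax) with made-up names, assuming Q=1 and KV chunks pulled onto the GPU one at a time; it's not Hydragen's actual kernel.

```python
import torch

def sdpa_single_query_chunked(q, kv_chunks):
    """SDPA for a single query where the KV cache is streamed in chunks
    (K1||K2||K3, ...) that live off-GPU and are copied over one at a time,
    so the full cache never has to fit in GPU memory at once.

    q:         (d,) query already on the GPU
    kv_chunks: iterable of (k_chunk, v_chunk) pairs, each (n_i, d), e.g. on CPU
    """
    scale = q.shape[-1] ** -0.5
    acc = torch.zeros_like(q)              # running (renormalized) output
    lse = q.new_tensor(float("-inf"))      # running log-sum-exp of scores

    for k_cpu, v_cpu in kv_chunks:
        k = k_cpu.to(q.device, non_blocking=True)
        v = v_cpu.to(q.device, non_blocking=True)

        scores = (k @ q) * scale                       # (n_i,)
        chunk_lse = torch.logsumexp(scores, dim=0)
        chunk_out = torch.softmax(scores, dim=0) @ v   # (d,)

        # Rescale the running result and the new chunk so that, after the last
        # chunk, acc equals exact softmax attention over all keys concatenated.
        new_lse = torch.logaddexp(lse, chunk_lse)
        acc = (torch.exp(lse - new_lse) * acc
               + torch.exp(chunk_lse - new_lse) * chunk_out)
        lse = new_lse

    return acc
```

Of course, decoding speed is then limited by how fast the chunks can be moved onto the GPU, so this buys memory capacity rather than the throughput gains the paper is after.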
Yeah, I haven't seen this decomposition before, and it looks like incorporating V directly into the decomposition lets them avoid materializing even the (d x n) attention scores.
It sounds like for the batched prefix prefill to work efficiently, your requests all have to come in at roughly the same time, right? E.g. if you set a batch size of 16, then to get the faster batched inference you need all 16 requests to be ready to go (or you have to wait for enough requests to fill the batch).