New preprint out with colleagues from MIT and IBM Research
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (2405.12981)
We introduce a simple mechanism for sharing keys and values across layers, reducing the memory needed for the KV cache during inference!
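To give a feel for the idea, here is a minimal sketch (not the paper's implementation) of cross-layer KV sharing with a sharing factor of 2: every second layer reuses the keys and values produced by the layer below it instead of projecting its own, so only half the layers contribute entries to the KV cache. The class name `CLASelfAttention` and the `owns_kv` flag are illustrative choices, not names from the paper.

```python
# Hypothetical sketch of cross-layer KV sharing (sharing factor 2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLASelfAttention(nn.Module):
    """Self-attention that either projects fresh K/V or reuses a shared pair."""
    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.owns_kv = owns_kv  # only K/V-owning layers add to the KV cache
        self.q_proj = nn.Linear(d_model, d_model)
        if owns_kv:
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.owns_kv:
            k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        else:
            # Reuse K/V from the layer below: no new cache entries needed here.
            k, v = shared_kv
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = attn.transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out), (k, v)

# Stack layers so every second layer borrows K/V from the one before it.
layers = nn.ModuleList(
    CLASelfAttention(d_model=64, n_heads=4, owns_kv=(i % 2 == 0))
    for i in range(4)
)
x, kv = torch.randn(2, 8, 64), None
for layer in layers:
    x, kv = layer(x, shared_kv=kv)
```

Under these assumptions, a 4-layer stack caches K/V for only 2 layers, roughly halving KV-cache memory at inference time.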