KV Cache Quantization - what is the default precision?

#2
by deepmage121 - opened

Hi! This might be a generic question better suited elsewhere, but I wanted to confirm here --
When I use this model with vLLM (pulled directly from Hugging Face), what is the default precision of the KV cache -- float16? And for a w8a8 model, for example, would the KV cache precision depend on the activation precision?

Neural Magic org

Hi. This is definitely a good question. Our quantized models do not use KV cache quantization at the moment, unless explicitly noted otherwise. For w8a8 models, the activations are quantized only when executing the operations with quantized weights (q_proj, k_proj, v_proj, out_proj, gate_proj, up_proj, down_proj).
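In other words, the KV cache stays in the model's activation dtype (e.g. float16/bfloat16) unless you override it. Here is a minimal sketch of how the KV cache dtype can be controlled in vLLM; the model id is illustrative:

```python
from vllm import LLM, SamplingParams

# Default: kv_cache_dtype="auto" keeps the KV cache in the model's dtype
# (e.g. float16/bfloat16), even for w8a8 checkpoints -- no KV cache quantization.
llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a8",  # illustrative model id
    kv_cache_dtype="auto",
)

# To opt in to KV cache quantization explicitly, request an FP8 cache:
# llm = LLM(model="...", kv_cache_dtype="fp8")

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

Note that FP8 KV cache support depends on your hardware and vLLM version; check the vLLM docs for the options available to you.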

alexmarques changed discussion status to closed
