KV Cache Quantization - what is the default precision?

#2
by deepmage121 - opened

Hi! This might be a generic question better suited elsewhere, but I wanted to confirm here --
When I use this model with vLLM (pulled directly from Hugging Face), what is the default precision of the KV cache -- float16? And for a w8a8 model, for example, would the KV cache precision depend on the activation precision?

Neural Magic org

Hi. This is definitely a good question. Our quantized models do not use KV cache quantization at the moment, unless explicitly noted otherwise. For w8a8 models, the activations are quantized only when executing the operations with quantized weights (q_proj, k_proj, v_proj, out_proj, gate_proj, up_proj, down_proj).
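In other words, the KV cache stays in the model's activation dtype (e.g. float16/bfloat16) unless you override it. Here is a minimal sketch of how the KV cache dtype can be controlled in vLLM; the model id is illustrative:

```python
from vllm import LLM, SamplingParams

# Default: kv_cache_dtype="auto" keeps the KV cache in the model's dtype
# (e.g. float16/bfloat16), even for w8a8 checkpoints -- no KV cache quantization.
llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a8",  # illustrative model id
    kv_cache_dtype="auto",
)

# To opt in to KV cache quantization explicitly, request an FP8 cache:
# llm = LLM(model="...", kv_cache_dtype="fp8")

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

Note that FP8 KV cache support depends on your hardware and vLLM version; check the vLLM docs for the options available to you.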

alexmarques changed discussion status to closed
