KV Cache Quantization - what is the default precision
#2
by deepmage121 - opened
Hi! This might be a generic question better suited for elsewhere, but I wanted to confirm:
When I use this model with vLLM (pulled directly from Hugging Face), what will be the default precision of the KV cache -- float16? And for a w8a8 model, for example, would the KV cache precision depend on the activation precision?
Hi. This is definitely a good question. Our quantized models do not use KV cache quantization at the moment, unless noted otherwise explicitly, so the KV cache stays in the model's default dtype. For w8a8 models, the activations are only quantized when executing the operations that involve quantized weights (q_proj, k_proj, v_proj, out_proj, gate_proj, up_proj, down_proj); the KV cache is not affected by the activation precision.
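For reference, a minimal sketch of how the KV cache dtype can be set explicitly in vLLM. The model name below is a placeholder for whichever w8a8 checkpoint you are loading; `kv_cache_dtype="auto"` is vLLM's default and inherits the model's dtype:

```python
from vllm import LLM, SamplingParams

# kv_cache_dtype="auto" (the vLLM default) keeps the KV cache in the
# model's dtype, e.g. float16 for a typical w8a8 checkpoint.
# Passing "fp8" instead would opt in to FP8 KV cache quantization.
llm = LLM(
    model="neuralmagic/Meta-Llama-3-8B-Instruct-quantized.w8a8",  # placeholder model name
    kv_cache_dtype="auto",
)

outputs = llm.generate(
    ["What precision is the KV cache stored in?"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```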
alexmarques changed discussion status to closed