Any plans to use MQA (multi-query attention) or GQA (grouped-query attention) in the future?
This model uses MHA (multi-head attention, i.e. `num_attention_heads == num_key_value_heads`), unlike Llama, which uses GQA.
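For reference, here is a minimal sketch of how one could check this from a checkpoint's config with `transformers` (the model id below is just a placeholder, and some configs may not expose `num_key_value_heads` at all):

```python
# Sketch: inspect a config to see whether the model uses MHA, GQA, or MQA.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("some-org/some-model")  # placeholder id
if config.num_key_value_heads == config.num_attention_heads:
    print("MHA: every query head has its own K/V head")
elif config.num_key_value_heads == 1:
    print("MQA: all query heads share a single K/V head")
else:
    print("GQA: query heads share K/V heads in groups")
```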
The problem with MHA is that the KV cache is very large, because KV-cache size is proportional to `num_key_value_heads`.
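As a rough illustration (the layer count, head dimension, and context length below are placeholders, not this model's actual values), the KV-cache footprint scales linearly with the number of KV heads:

```python
def kv_cache_bytes(num_layers, num_key_value_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    # Two tensors (K and V) per layer, each of shape
    # (batch_size, num_key_value_heads, seq_len, head_dim).
    return (2 * num_layers * num_key_value_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Hypothetical 32-layer model, head_dim=128, 4096-token context, fp16:
mha = kv_cache_bytes(32, 32, 128, 4096)  # 32 KV heads (MHA)
gqa = kv_cache_bytes(32, 8, 128, 4096)   # 8 KV heads (GQA)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
# -> MHA: 2.0 GiB, GQA: 0.5 GiB
```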
Therefore, do you have any plans to use MQA or GQA for future model releases? Thanks!