Gemma-7B-it GGUF

This is a quantized version of the google/gemma-7b-it model using llama.cpp.

This model card corresponds to the 7B base version of the Gemma model. You can also visit the model card of the 2B base model, 2B instruct model, and 7B base model.

Model Page: Gemma

Terms of Use: Terms

⚡ Quants

q2_k: Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.
q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q3_k_s: Uses Q3_K for all tensors
q4_0: Original quant method, 4-bit.
q4_1: Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.
q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
q4_k_s: Uses Q4_K for all tensors
q5_0: Higher accuracy, higher resource usage and slower inference.
q5_1: Even higher accuracy, resource usage and slower inference.
q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
q5_k_s: Uses Q5_K for all tensors
q6_k: Uses Q8_K for all tensors
q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.

This model can be used with the latest version of llama.cpp and LM Studio >0.2.16.