---
license: gemma
language:
- en
tags:
- conversational
quantized_by: qnixsynapse
---
# Llamacpp Quantizations of the official GGUF of gemma-2-9b-it from the Kaggle repo
Using llama.cpp [PR 8156](https://github.com/ggerganov/llama.cpp/pull/8156) for quantization.
Original model: https://huggingface.co/google/gemma-2-9b-it
## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```
Then, you can target the specific file you want:
```
huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "<desired model file name>" --local-dir ./
```
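For example, to fetch just the Q4_K_S quant (the exact filename pattern here is an assumption; check the repo's file listing):

```
huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "*Q4_K_S*.gguf" --local-dir ./
```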
Or you can download the files directly from the repo page.
## Prompt format
The prompt format is the same as Gemma v1; however, it is not included in the GGUF file. The metadata can be edited later with a gguf script to add a new `chat_template` key (see the sketch below).
```
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
```
The model should stop at either `<eos>` or `<end_of_turn>`. If it doesn't, the stop tokens need to be added to the GGUF metadata.
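A minimal sketch of both edits using the helper scripts that ship in llama.cpp's `gguf-py/scripts/` directory. The script names, flags, and the `<end_of_turn>` token id used here are assumptions based on recent llama.cpp checkouts, so verify them with `--help` and `gguf_dump.py` before running:

```
# Write a copy of the model with the chat template stored under the
# tokenizer.chat_template key. Filenames are placeholders.
python llama.cpp/gguf-py/scripts/gguf_new_metadata.py \
  --chat-template "$(cat gemma_template.jinja)" \
  gemma-2-9b-it-Q4_K_S.gguf gemma-2-9b-it-Q4_K_S-fixed.gguf

# Point the EOS id at <end_of_turn> so generation stops there
# (107 is assumed to be <end_of_turn> in the Gemma tokenizer).
python llama.cpp/gguf-py/scripts/gguf_set_metadata.py \
  gemma-2-9b-it-Q4_K_S-fixed.gguf tokenizer.ggml.eos_token_id 107
```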
## Quants
Currently only two quants are available:
| Quant | Size |
| --- | --- |
| Q4_K_S | 5.5 GB |
| Q3_K_M | 4.8 GB |
If Q4_K_S causes out-of-memory errors when offloading all layers to the GPU, consider decreasing the batch size or using Q3_K_M instead.

Minimum VRAM needed: 8 GB
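For reference, here is a sketch of a llama.cpp invocation that offloads all layers while lowering the batch size; the binary name `llama-cli` and the batch value of 256 are assumptions (older builds use `./main`, and the right batch size depends on your GPU):

```
# -ngl 99 offloads all layers to the GPU; -b 256 shrinks the batch to cut VRAM use.
# -e makes llama.cpp interpret the \n escapes in the prompt.
./llama-cli -m gemma-2-9b-it-Q4_K_S.gguf -ngl 99 -b 256 -e \
  -p "<bos><start_of_turn>user\nHello!<end_of_turn>\n<start_of_turn>model\n"
```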