|
--- |
|
license: gemma |
|
language: |
|
- en |
|
tags: |
|
- conversational |
|
quantized_by: qnixsynapse |
|
--- |
|
|
|
## Llama.cpp quantizations of the official gemma-2-9b-it GGUF from the Kaggle repo
|
Using <a href="https://github.com/ggerganov/llama.cpp/">llama.cpp</a> PR <a href="https://github.com/ggerganov/llama.cpp/pull/8156">8156</a> for quantization. |
|
|
|
Original model: https://huggingface.co/google/gemma-2-9b-it |
|
|
|
|
|
## Downloading using huggingface-cli |
|
|
|
First, make sure you have huggingface-cli installed:
|
|
|
```
pip install -U "huggingface_hub[cli]"
```
|
|
|
Then, you can target the specific file you want: |
|
|
|
```
huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "<desired model file name>" --local-dir ./
```
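
For example, to fetch just the Q4_K_S file (the exact file name below is a guess; check the repo's file listing for the real one):

```
huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "gemma-2-9b-it-Q4_K_S.gguf" --local-dir ./
```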
|
|
|
Or you can download the files directly from the repository page.
|
|
|
|
|
## Prompt format |
|
|
|
The prompt format is the same as Gemma v1's; however, it is not included in the GGUF file. It can be added later with the gguf scripts by setting a new `chat_template` metadata key.
|
|
|
```
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
```
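
To try the template from the command line, something like the following should work with a recent llama.cpp build (the binary name `llama-cli`, the model file name, and the flag values are assumptions to adapt to your setup; llama.cpp normally inserts `<bos>` on its own, so it is left out of the prompt):

```
# -e expands the \n escapes in the prompt, -ngl offloads layers to the GPU,
# -r makes generation stop at the end-of-turn marker
./llama-cli -m gemma-2-9b-it-Q4_K_S.gguf -ngl 99 -e \
  -p "<start_of_turn>user\nWhy is the sky blue?<end_of_turn>\n<start_of_turn>model\n" \
  -r "<end_of_turn>"
```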
|
|
|
The model should stop at either `<eos>` or `<end_of_turn>`. If it doesn't, the stop tokens need to be added to the GGUF metadata.
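
If you want to bake the template and stop token into the file itself, one option is the `gguf_new_metadata.py` script shipped under `gguf-py/scripts` in the llama.cpp repo. This is only a sketch: the option names below are from memory, so verify them with `--help` before running:

```
# Writes a patched copy of the model with a chat_template key and an end-of-turn special token.
# Option names may differ between llama.cpp versions; check:
#   python gguf-py/scripts/gguf_new_metadata.py --help
python gguf-py/scripts/gguf_new_metadata.py \
  gemma-2-9b-it-Q4_K_S.gguf gemma-2-9b-it-Q4_K_S-patched.gguf \
  --chat-template "$(cat gemma_chat_template.jinja)" \
  --special-token eot "<end_of_turn>"
```

Here `gemma_chat_template.jinja` is a hypothetical file containing the Jinja chat template you want to embed.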
|
|
|
## Quants |
|
Currently only two quants are available: |
|
| quant | size |
|-------|-------|
| Q4_K_S | 5.5GB |
| Q3_K_M | 4.8GB |
|
|
|
If Q4_K_S causes OOM when offloading all the layers to the GPU, consider decreasing the batch size or using Q3_K_M.
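
For example, with llama-cli you can keep all layers on the GPU while shrinking the batch sizes to reduce VRAM spikes (the file name and numbers are just illustrative starting points):

```
# -ngl 99 offloads all layers; -b / -ub lower the logical and physical batch sizes
./llama-cli -m gemma-2-9b-it-Q4_K_S.gguf -ngl 99 -c 4096 -b 512 -ub 128
```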
|
|
|
Minimum VRAM needed: 8GB |