license: gemma
language:
  - en
tags:
  - conversational
quantized_by: qnixsynapse

llama.cpp quantizations of the official GGUF of gemma-2-9b-it from the Kaggle repo

Using llama.cpp PR 8156 for quantization.

Original model: https://huggingface.co/google/gemma-2-9b-it

Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

pip install -U "huggingface_hub[cli]"

Then, you can target the specific file you want:

huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "<desired model file name>" --local-dir ./

Alternatively, you can download the file directly.
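
The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package installed above and uses its `hf_hub_download` function; the helper name `download_quant` is hypothetical, and the file name is left for you to supply, exactly as in the CLI command.

```python
REPO_ID = "qnixsynapse/Gemma-V2-9B-Instruct-GGUF"


def download_quant(filename: str, local_dir: str = "./") -> str:
    """Download one GGUF file from the repo and return its local path."""
    # Imported lazily so this module still loads if huggingface_hub is missing.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=local_dir)


if __name__ == "__main__":
    import sys

    # Pass the desired model file name as the first command-line argument.
    if len(sys.argv) > 1:
        print(download_quant(sys.argv[1]))
```

Calling `download_quant("<desired model file name>")` should place the file in the current directory, like `--local-dir ./` does on the command line.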

Prompt format

The prompt format is the same as Gemma v1; however, it is not included in the GGUF file. The file can be edited later with the gguf script to add a new chat_template key.

<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model

The model should stop at either <eos> or <end_of_turn>. If it doesn't, the stop tokens need to be added to the GGUF metadata.
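
Until a chat template and stop tokens are present in the file, both can be handled on the client side. The following is a minimal sketch, not part of this repo: `build_prompt` and `truncate_at_stop` are hypothetical helper names. Some runtimes prepend `<bos>` on their own, so the `include_bos` switch is an assumption you may need to flip to avoid a doubled token.

```python
STOP_STRINGS = ("<eos>", "<end_of_turn>")


def build_prompt(prompt: str, include_bos: bool = True) -> str:
    """Wrap a single user message in the Gemma turn format shown above."""
    # Assumption: set include_bos=False if your runtime already adds <bos>.
    bos = "<bos>" if include_bos else ""
    return f"{bos}<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"


def truncate_at_stop(text: str, stops=STOP_STRINGS) -> str:
    """Cut generated text at the earliest stop string, if any is present."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


if __name__ == "__main__":
    print(build_prompt("Why is the sky blue?"))
    print(truncate_at_stop("Because of scattering.<end_of_turn>\nextra text"))
```

Truncating after generation only hides the overrun; passing the same strings as stop sequences to your runtime, or fixing the metadata as described above, also saves the wasted tokens.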

Quants

Currently only two quants are available:

Quant     Size
Q4_K_S    5.5 GB
Q3_K_M    4.8 GB

If Q4_K_S causes an OOM error when offloading all the layers to the GPU, consider decreasing the batch size or using Q3_K_M.

Minimum VRAM needed: 8GB
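
As a rough way to choose between the two files, the sketch below picks the largest quant whose file size plus some headroom fits in free VRAM. The file sizes come from the table above; the `headroom_gb` value (room for the KV cache and compute buffers) is a made-up placeholder, not a measured figure, and `pick_quant` is a hypothetical helper.

```python
# File sizes in GB, taken from the table above.
QUANT_SIZES_GB = {"Q4_K_S": 5.5, "Q3_K_M": 4.8}


def pick_quant(free_vram_gb: float, headroom_gb: float = 2.0):
    """Return the largest quant that fits in free VRAM, or None if none fits.

    headroom_gb is an assumed allowance for the KV cache and buffers; the
    real requirement depends on context length and batch size.
    """
    # Try the largest file first so the best-fitting quant wins.
    for name, size_gb in sorted(QUANT_SIZES_GB.items(), key=lambda kv: -kv[1]):
        if size_gb + headroom_gb <= free_vram_gb:
            return name
    return None


if __name__ == "__main__":
    print(pick_quant(8.0))
```

If the result is `None`, offloading fewer layers to the GPU or lowering the batch size are the remaining options.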