qnixsynapse/Gemma-V2-9B-Instruct-GGUF

	---
	license: gemma
	language:
	- en
	tags:
	- conversational
	quantized_by: qnixsynapse
	---

	## llama.cpp quantizations of the official gemma-2-9b-it GGUF from the Kaggle repo
	Quantized using <a href="https://github.com/ggerganov/llama.cpp/">llama.cpp</a> PR <a href="https://github.com/ggerganov/llama.cpp/pull/8156">8156</a>.

	Original model: https://huggingface.co/google/gemma-2-9b-it


	## Downloading using huggingface-cli

	First, make sure you have huggingface-cli installed:

	```
	pip install -U "huggingface_hub[cli]"
	```

	Then, you can target the specific file you want:

	```
	huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "<desired model file name>" --local-dir ./
	```

	Alternatively, you can download the file directly from this repository.
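The same single-file download can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package installed above; `download_quant` is a hypothetical helper name, and the file name is left as a parameter because it must match a file actually listed in the repository.

```python
REPO_ID = "qnixsynapse/Gemma-V2-9B-Instruct-GGUF"


def download_quant(filename: str, local_dir: str = "./") -> str:
    """Download one GGUF file from the repo and return its local path.

    `filename` must be the exact name of a file listed in the repository;
    it is deliberately not hard-coded here.
    """
    # Imported lazily so this module loads even if the package is missing.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=filename, local_dir=local_dir)
```

Calling `download_quant("<desired model file name>")` fetches that one file into the current directory, much like the `--include` form of the CLI command.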


	## Prompt format

	The prompt format is the same as Gemma v1; however, it is not included in the GGUF file. The file's metadata can be edited later with the gguf script to add a new `chat_template` key.

	```
	<bos><start_of_turn>user
	{prompt}<end_of_turn>
	<start_of_turn>model

	```
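Since the GGUF file carries no chat template, the prompt has to be assembled by hand. The sketch below shows one way to do that; `build_gemma_prompt` and its `add_bos` flag are hypothetical names, and the multi-turn layout (repeating the same turn markers for `user` and `model`) is an assumption extrapolated from the single-turn format above.

```python
def build_gemma_prompt(messages: list[dict[str, str]], add_bos: bool = True) -> str:
    """Render chat messages into the Gemma turn format.

    Each message is {"role": "user" | "model", "content": "..."}.
    The result ends with an open model turn so generation continues there.
    """
    # Some runtimes prepend <bos> themselves; pass add_bos=False in that case.
    parts = ["<bos>"] if add_bos else []
    for message in messages:
        role = message["role"]
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    # Open the model turn without closing it.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)
```

For a single user message, the output matches the template shown above, with `{prompt}` replaced by the message content.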

	The model should stop at either `<eos>` or `<end_of_turn>`. If it doesn't, the stop tokens need to be added to the GGUF metadata.
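Until the metadata is fixed, a client-side workaround is to cut the generated text at the first stop marker. This is only a sketch: `truncate_at_stop` is a hypothetical helper, it assumes the runtime returns the special tokens as literal text, and it does not stop generation itself, so the model may still spend time producing tokens past the marker.

```python
STOP_STRINGS = ("<eos>", "<end_of_turn>")


def truncate_at_stop(text: str, stop_strings: tuple[str, ...] = STOP_STRINGS) -> str:
    """Return `text` cut at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        index = text.find(stop)
        if index != -1:
            # Keep the smallest index so the earliest marker wins.
            cut = min(cut, index)
    return text[:cut]
```

Where the runtime accepts a list of stop strings, passing the same two markers there is preferable, since that ends generation early instead of trimming afterwards.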

	## Quants
	Currently only two quants are available:

	| Quant | Size |
	|-------|------|
	| Q4_K_S | 5.5 GB |
	| Q3_K_M | 4.8 GB |

	If Q4_K_S causes out-of-memory (OOM) errors when offloading all layers to the GPU, consider decreasing the batch size or using Q3_K_M.

	Minimum VRAM needed: 8 GB