inference: false
language:
- en
tags:
- gemma
- text-generation-inference
pipeline_tag: text-generation
license: other
license_name: gemma-terms-of-use
license_link: https://ai.google.dev/gemma/terms
Google's Gemma-2b-it GGUF
These files are GGUF format model files for Google's Gemma-2b-it.
GGUF files are for CPU + GPU inference using llama.cpp and the libraries and UIs that support this format, such as text-generation-webui.
How to run in llama.cpp
I use the following command line; adjust it for your tastes and needs:
./main -t 2 -ngl 18 -m gemma-2b-it.q8_0.gguf -p '<start_of_turn>user\nWhat is love?\n<end_of_turn>\n<start_of_turn>model\n' --no-penalize-nl -e --color --temp 0.95 -c 1024 -n 512 --repeat_penalty 1.2 --top_p 0.95 --top_k 50
Change -t 2 to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use -t 8.
Change -ngl 18 to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
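The two adjustments above can be scripted. This is only a minimal sketch, assuming a Linux or macOS shell: the helper names physical_cores and build_cmd are hypothetical and not part of llama.cpp, and the script only prints the start of the command instead of running it. Append the prompt and sampling options from the command above before using it.

```shell
#!/bin/sh
# Hypothetical helper: print the number of physical CPU cores, falling back
# to the logical CPU count when it cannot be determined.
physical_cores() {
    n=""
    if command -v lscpu >/dev/null 2>&1; then
        # Linux: count unique (core, socket) pairs.
        n=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l | tr -d ' ')
    elif command -v sysctl >/dev/null 2>&1; then
        # macOS: ask the kernel for the physical core count.
        n=$(sysctl -n hw.physicalcpu 2>/dev/null)
    fi
    # Fall back to the logical CPU count, then to 1, if detection failed.
    case "$n" in
        ''|0|*[!0-9]*) n=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1) ;;
    esac
    printf '%s\n' "$n"
}

# Hypothetical helper: print the start of the ./main command.
# $1 = model file, $2 = layers to offload (omit or pass 0 for CPU only).
build_cmd() {
    model=$1
    ngl=${2:-0}
    cmd="./main -t $(physical_cores)"
    if [ "$ngl" -gt 0 ]; then
        # Only add -ngl when GPU offloading is requested.
        cmd="$cmd -ngl $ngl"
    fi
    printf '%s\n' "$cmd -m $model"
}

build_cmd gemma-2b-it.q8_0.gguf 18   # with GPU offloading
build_cmd gemma-2b-it.q8_0.gguf      # CPU only, no -ngl
```

The second call shows the CPU-only case, where -ngl is left out entirely rather than set to 0.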
If you want to have a chat-style conversation, replace the -p <PROMPT> argument with -i -ins. You can also use --interactive-first to start in interactive mode:
./main -t 2 -ngl 18 -m gemma-2b-it.q8_0.gguf --in-prefix '<start_of_turn>user\n' --in-suffix '<end_of_turn>\n<start_of_turn>model\n' -i -ins --no-penalize-nl -e --color --temp 0.95 -c 1024 -n 512 --repeat_penalty 1.2 --top_p 0.95 --top_k 50
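Both commands above wrap the user's text in the same turn markers. If you build prompts from a script, a small helper keeps that wrapping in one place. This is a minimal sketch; gemma_prompt is a hypothetical name, and the layout simply mirrors the -p string used earlier.

```shell
#!/bin/sh
# Hypothetical helper: wrap one user message in the turn markers used by the
# commands above. The \n sequences are kept as literal backslash-n on purpose,
# because the -e flag tells llama.cpp to process escapes in the prompt.
gemma_prompt() {
    printf '%s' "<start_of_turn>user\\n$1\\n<end_of_turn>\\n<start_of_turn>model\\n"
}

# Print the prompt for inspection (the trailing echo only adds a line break).
gemma_prompt 'What is love?'
echo
```

It can then be passed to llama.cpp with -e -p "$(gemma_prompt 'What is love?')" in place of the hand-written prompt string.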
Compatibility
I have uploaded both the original llama.cpp quant methods (q4_0, q4_1, q5_0, q5_1, q8_0) as well as the k-quant methods (q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K).
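If you switch between quants in a script, it helps to reject names that are not in the list above. This is a rough sketch under one loud assumption: the gemma-2b-it.<quant>.gguf filename pattern is extrapolated from gemma-2b-it.q8_0.gguf and may not match the uploaded files exactly, so check the file list first. The model_file helper is hypothetical.

```shell
#!/bin/sh
# Quant names copied from the list above.
QUANTS="q4_0 q4_1 q5_0 q5_1 q8_0 q2_K q3_K_S q3_K_M q3_K_L q4_K_S q4_K_M q5_K_S q6_K"

# Hypothetical helper: print a model filename for a quant name, or fail if the
# name is not in the list. The filename pattern is an assumption (see above).
model_file() {
    for q in $QUANTS; do
        if [ "$q" = "$1" ]; then
            printf 'gemma-2b-it.%s.gguf\n' "$q"
            return 0
        fi
    done
    printf 'unknown quant: %s\n' "$1" >&2
    return 1
}

model_file q4_K_M
```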
Please refer to llama.cpp and TheBloke's GGUF models for further explanation.
How to run in text-generation-webui
Further instructions here: text-generation-webui/docs/llama.cpp-models.md.
Thanks
Thanks to Google for providing checkpoints of the model.
Thanks to Georgi Gerganov and all of the awesome people in the AI community.