jartine committed
Commit 981f7a5
1 Parent(s): 045cfed

Update README.md

Files changed (1):
  1. README.md +116 -48
README.md CHANGED
@@ -25,83 +25,139 @@ Gemma v2 is a large language model released by Google on Jun 27th 2024.
 
  The model is packaged into executable weights, which we call
  [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
- easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD, and
- NetBSD for AMD64 and ARM64.
-
- ## License
-
- The llamafile software is open source and permissively licensed. However
- the weights embedded inside the llamafiles are governed by Google's
- Gemma License and Gemma Prohibited Use Policy. This is not an open
- source license. It's about as restrictive as it gets. There's a great
- many things you're not allowed to do with Gemma. The terms of the
- license and its list of unacceptable uses can be changed by Google at
- any time. Therefore we wouldn't recommend using these llamafiles for
- anything other than evaluating the quality of Google's engineering.
-
- See the [LICENSE](LICENSE) file for further details.
 
  ## Quickstart
 
- Running the following on a desktop OS will launch a tab in your web
- browser with a chatbot interface.
 
  ```
- wget https://huggingface.co/jartine/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.Q6_K.llamafile
  chmod +x gemma-2-9b-it.Q6_K.llamafile
  ./gemma-2-9b-it.Q6_K.llamafile
  ```
 
- You then need to fill out the prompt / history template (see below).
 
- This model has a max context window size of 8k tokens. By default, a
- context window size of 512 tokens is used. You may increase this to the
- maximum by passing the `-c 0` flag.
-
- On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
- the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
- driver needs to be installed. If the prebuilt DSOs should fail, the CUDA
- or ROCm SDKs may need to be installed, in which case llamafile builds a
- native module just for your system.
-
- For further information, please see the [llamafile
- README](https://github.com/mozilla-ocho/llamafile/).
 
  Having **trouble?** See the ["Gotchas"
- section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas)
  of the README.
 
- ## Prompting
 
- When using the browser GUI, you need to fill out the following fields.
 
- Prompt template (note: this is for chat; Gemma doesn't have a system role):
 
  ```
- {{history}}
- <start_of_turn>{{char}}
  ```
 
- History template:
 
  ```
- <start_of_turn>{{name}}
- {{message}}<end_of_turn>
  ```
 
- Here's an example of how to prompt Gemma v2 on the command line:
 
  ```
- ./gemma-2-9b-it.Q6_K.llamafile --special -p '<start_of_turn>user
- The Belobog Academy has discovered a new, invasive species of algae that can double itself in one day, and in 30 days fills a whole reservoir - contaminating the water supply. How many days would it take for the algae to fill half of the reservoir?<end_of_turn>
- <start_of_turn>model
- '
  ```
 
  ## About llamafile
 
- llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
- It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
  binaries that run on the stock installs of six OSes for both ARM64 and
  AMD64.
 
@@ -109,13 +165,25 @@ AMD64.
 
  This model works well with any quantization format. Q6\_K is the best
  choice overall here. We tested, with [our 27b Gemma2
- llamafiles](https://huggingface.co/jartine/gemma-2-27b-it-llamafile),
  that the llamafile implementation of Gemma2 is able to produce
  identical responses to the Gemma2 model that's hosted by Google on
  aistudio.google.com. Therefore we'd assume these 9b llamafiles are also
  faithful to Google's intentions. If you encounter any divergences, then
  try using the BF16 weights, which have the original fidelity.
 
  ---
 
  # Gemma 2 model card
 
 
  The model is packaged into executable weights, which we call
  [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
+ easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD 7.3,
+ and NetBSD for AMD64 and ARM64.
 
  ## Quickstart
 
+ To get started, you need both the Gemma weights and the llamafile
+ software. Both of them are included in a single file, which can be
+ downloaded and run as follows:
 
  ```
+ wget https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.Q6_K.llamafile
  chmod +x gemma-2-9b-it.Q6_K.llamafile
  ./gemma-2-9b-it.Q6_K.llamafile
  ```
 
+ The default mode of operation for these llamafiles is our new command
+ line chatbot interface.
 
+ ![Screenshot of Gemma 2b llamafile on MacOS](llamafile-gemma.png)
 
  Having **trouble?** See the ["Gotchas"
+ section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
  of the README.
 
+ ## Usage
 
+ By default, llamafile launches a chatbot in the terminal, and a server
+ in the background. The chatbot is mostly self-explanatory. You can type
+ `/help` for further details. See the [llamafile v0.8.15 release
+ notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
+ for documentation on our newest chatbot features.
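+
+ For example, to start that chatbot interface explicitly (a minimal
+ sketch; `--chat` simply selects the default mode described above):
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat
+ ```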
 
+ To instruct Gemma to do role playing, you can customize the system
+ prompt as follows:
 
  ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -p "you are mosaic's godzilla"
  ```
 
+ To view the man page, run:
 
  ```
+ ./gemma-2-9b-it.Q6_K.llamafile --help
  ```
 
+ To send a request to the OpenAI API compatible llamafile server, try:
 
  ```
+ curl http://localhost:8080/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "gemma-9b-it",
+ "messages": [{"role": "user", "content": "Say this is a test!"}],
+ "temperature": 0.0
+ }'
+ ```
+
+ If you don't want the chatbot and you only want to run the server:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
+ ```
+
+ An advanced CLI mode is provided that's useful for shell scripting. You
+ can use it by passing the `--cli` flag. For additional help on how it
+ may be used, pass the `--help` flag.
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
+ ```
+
+ When using this raw CLI mode, you then need to fill out Gemma's prompt
+ template yourself.
+
+ For further information, please see the [llamafile
+ README](https://github.com/mozilla-ocho/llamafile/).
+
+ ## Troubleshooting
+
+ Having **trouble?** See the ["Gotchas"
+ section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
+ of the README.
+
+ On Linux, the way to avoid run-detector errors is to install the APE
+ interpreter.
+
+ ```sh
+ sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
+ sudo chmod +x /usr/bin/ape
+ sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
+ sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
  ```
 
+ On Windows, there's a 4GB limit on executable sizes. This means you
+ should download the Q2\_K llamafile. For better quality, consider
+ instead downloading the official llamafile release binary from
+ <https://github.com/Mozilla-Ocho/llamafile/releases>, renaming it to
+ have the .exe file extension, and then saying:
+
+ ```
+ .\llamafile-0.8.15.exe -m gemma-2-9b-it.Q6_K.llamafile
+ ```
+
+ That will overcome the Windows 4GB file size limit, allowing you to
+ benefit from bigger, better models.
+
+ ## Context Window
+
+ This model has a max context window size of 8k tokens. By default, a
+ context window size of 8192 tokens is used. You may limit the context
+ window size by passing the `-c N` flag.
+
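+ For example, a sketch of capping the context window at 4096 tokens
+ (any N up to 8192 works):
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -c 4096
+ ```
+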
+ ## GPU Acceleration
+
+ On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
+ the system's NVIDIA or AMD GPU(s). On Windows, if you own an NVIDIA
+ GPU, only the graphics card driver needs to be installed. If you have
+ an AMD GPU on Windows, you should install the ROCm SDK v6.1 and then
+ pass the flags `--recompile --gpu amd` the first time you run your
+ llamafile.
+
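+ For example, a sketch of requesting full GPU offload (assumes the GPU
+ has enough memory for all of the model's layers):
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -ngl 999
+ ```
+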
+ On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
+ perform matrix multiplications. This is open source software, but it
+ doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
+ installed on your system, then you can pass the `--recompile` flag to
+ build a GGML CUDA library just for your system that uses cuBLAS. This
+ ensures you get maximum performance.
+
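+ As a sketch, the first run after installing the CUDA SDK might look
+ like this:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat --recompile -ngl 999
+ ```
+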
+ For further information, please see the [llamafile
+ README](https://github.com/mozilla-ocho/llamafile/).
+
  ## About llamafile
 
+ llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
+ uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
  binaries that run on the stock installs of six OSes for both ARM64 and
  AMD64.
 
@@ -109,13 +165,25 @@ AMD64.
 
  This model works well with any quantization format. Q6\_K is the best
  choice overall here. We tested, with [our 27b Gemma2
+ llamafiles](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile),
  that the llamafile implementation of Gemma2 is able to produce
  identical responses to the Gemma2 model that's hosted by Google on
  aistudio.google.com. Therefore we'd assume these 9b llamafiles are also
  faithful to Google's intentions. If you encounter any divergences, then
  try using the BF16 weights, which have the original fidelity.
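+
+ A sketch of fetching the BF16 weights (assuming the BF16 build follows
+ the same naming scheme as the quantized llamafiles in this repository):
+
+ ```
+ wget https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.BF16.llamafile
+ ```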
 
+ ## See Also
+
+ - <https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile>
+ - <https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile>
+
+ ## License
+
+ The llamafile software is open source and permissively licensed.
+ However, the weights embedded inside the llamafiles are governed by
+ Google's Gemma License and Gemma Prohibited Use Policy. See the
+ [LICENSE](LICENSE) file for further details.
+
  ---
 
  # Gemma 2 model card