jartine committed
Commit 981f7a5
1 Parent(s): 045cfed

Update README.md

Files changed (1):
  1. README.md +116 -48
README.md CHANGED
@@ -25,83 +25,139 @@ Gemma v2 is a large language model released by Google on Jun 27th 2024.
 
  The model is packaged into executable weights, which we call
  [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
- easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD, and
- NetBSD for AMD64 and ARM64.
-
- ## License
-
- The llamafile software is open source and permissively licensed. However
- the weights embedded inside the llamafiles are governed by Google's
- Gemma License and Gemma Prohibited Use Policy. This is not an open
- source license. It's about as restrictive as it gets. There's a great
- many things you're not allowed to do with Gemma. The terms of the
- license and its list of unacceptable uses can be changed by Google at
- any time. Therefore we wouldn't recommend using these llamafiles for
- anything other than evaluating the quality of Google's engineering.
-
- See the [LICENSE](LICENSE) file for further details.
 
  ## Quickstart
 
- Running the following on a desktop OS will launch a tab in your web
- browser with a chatbot interface.
 
  ```
- wget https://huggingface.co/jartine/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.Q6_K.llamafile
  chmod +x gemma-2-9b-it.Q6_K.llamafile
  ./gemma-2-9b-it.Q6_K.llamafile
  ```
 
- You then need to fill out the prompt / history template (see below).
 
- This model has a max context window size of 8k tokens. By default, a
- context window size of 512 tokens is used. You may increase this to the
- maximum by passing the `-c 0` flag.
-
- On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
- the system's NVIDIA or AMD GPU(s). On Windows, only the graphics card
- driver needs to be installed. If the prebuilt DSOs should fail, the CUDA
- or ROCm SDKs may need to be installed, in which case llamafile builds a
- native module just for your system.
-
- For further information, please see the [llamafile
- README](https://github.com/mozilla-ocho/llamafile/).
 
  Having **trouble?** See the ["Gotchas"
- section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas)
  of the README.
 
- ## Prompting
 
- When using the browser GUI, you need to fill out the following fields.
 
- Prompt template (note: this is for chat; Gemma doesn't have a system role):
 
  ```
- {{history}}
- <start_of_turn>{{char}}
  ```
 
- History template:
 
  ```
- <start_of_turn>{{name}}
- {{message}}<end_of_turn>
  ```
 
- Here's an example of how to prompt Gemma v2 on the command line:
 
  ```
- ./gemma-2-9b-it.Q6_K.llamafile --special -p '<start_of_turn>user
- The Belobog Academy has discovered a new, invasive species of algae that can double itself in one day, and in 30 days fills a whole reservoir - contaminating the water supply. How many days would it take for the algae to fill half of the reservoir?<end_of_turn>
- <start_of_turn>model
- '
  ```
 
  ## About llamafile
 
- llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
- It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
  binaries that run on the stock installs of six OSes for both ARM64 and
  AMD64.
 
@@ -109,13 +165,25 @@ AMD64.
 
  This model works well with any quantization format. Q6\_K is the best
  choice overall here. We tested, with [our 27b Gemma2
- llamafiles](https://huggingface.co/jartine/gemma-2-27b-it-llamafile),
  that the llamafile implementation of Gemma2 is able to produce
  identical responses to the Gemma2 model that's hosted by Google on
  aistudio.google.com. Therefore we'd assume these 9b llamafiles are also
  faithful to Google's intentions. If you encounter any divergences, then
  try using the BF16 weights, which have the original fidelity.
 
  ---
 
  # Gemma 2 model card
 
 
  The model is packaged into executable weights, which we call
  [llamafiles](https://github.com/Mozilla-Ocho/llamafile). This makes it
+ easy to use the model on Linux, MacOS, Windows, FreeBSD, OpenBSD 7.3,
+ and NetBSD for AMD64 and ARM64.
 
  ## Quickstart
 
+ To get started, you need both the Gemma weights and the llamafile
+ software. Both of them are included in a single file, which can be
+ downloaded and run as follows:
 
  ```
+ wget https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.Q6_K.llamafile
  chmod +x gemma-2-9b-it.Q6_K.llamafile
  ./gemma-2-9b-it.Q6_K.llamafile
  ```
 
+ The default mode of operation for these llamafiles is our new command
+ line chatbot interface.
 
+ ![Screenshot of Gemma 2b llamafile on MacOS](llamafile-gemma.png)
 
  Having **trouble?** See the ["Gotchas"
+ section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
  of the README.
 
+ ## Usage
 
+ By default, llamafile launches a chatbot in the terminal, and a server
+ in the background. The chatbot is mostly self-explanatory. You can type
+ `/help` for further details. See the [llamafile v0.8.15 release
+ notes](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.15)
+ for documentation on our newest chatbot features.
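+
+ For example, to start that chatbot interface explicitly (a minimal
+ sketch; `--chat` simply selects the default mode described above):
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat
+ ```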
 
+ To instruct Gemma to do role playing, you can customize the system
+ prompt as follows:
 
  ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -p "you are mosaic's godzilla"
  ```
 
+ To view the man page, run:
 
  ```
+ ./gemma-2-9b-it.Q6_K.llamafile --help
  ```
 
+ To send a request to the OpenAI API compatible llamafile server, try:
 
  ```
+ curl http://localhost:8080/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -d '{
+ "model": "gemma-9b-it",
+ "messages": [{"role": "user", "content": "Say this is a test!"}],
+ "temperature": 0.0
+ }'
+ ```
+
+ If you don't want the chatbot and you only want to run the server:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --server --nobrowser --host 0.0.0.0
+ ```
+
+ An advanced CLI mode is provided that's useful for shell scripting. You
+ can use it by passing the `--cli` flag. For additional help on how it
+ may be used, pass the `--help` flag.
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --cli -p 'four score and seven' --log-disable
+ ```
+
+ When using this raw CLI mode, you then need to fill out Gemma's prompt
+ template yourself.
+
+ For further information, please see the [llamafile
+ README](https://github.com/mozilla-ocho/llamafile/).
+
+ ## Troubleshooting
+
+ Having **trouble?** See the ["Gotchas"
+ section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
+ of the README.
+
+ On Linux, the way to avoid run-detector errors is to install the APE
+ interpreter.
+
+ ```sh
+ sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
+ sudo chmod +x /usr/bin/ape
+ sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
+ sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
  ```
 
+ On Windows, there's a 4GB limit on executable sizes. This means you
+ should download the Q2\_K llamafile. For better quality, consider
+ instead downloading the official llamafile release binary from
+ <https://github.com/Mozilla-Ocho/llamafile/releases>, renaming it to
+ have the .exe file extension, and then saying:
+
+ ```
+ .\llamafile-0.8.15.exe -m gemma-2-9b-it.Q6_K.llamafile
+ ```
+
+ That will overcome the Windows 4GB file size limit, allowing you to
+ benefit from bigger, better models.
+
+ ## Context Window
+
+ This model has a max context window size of 8k tokens. By default, a
+ context window size of 8192 tokens is used. You may limit the context
+ window size by passing the `-c N` flag.
+
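+ For example, a sketch of capping the context window at 4096 tokens
+ (any N up to 8192 works):
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -c 4096
+ ```
+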
+ ## GPU Acceleration
+
+ On GPUs with sufficient RAM, the `-ngl 999` flag may be passed to use
+ the system's NVIDIA or AMD GPU(s). On Windows, if you own an NVIDIA
+ GPU, only the graphics card driver needs to be installed. If you have
+ an AMD GPU on Windows, you should install the ROCm SDK v6.1 and then
+ pass the flags `--recompile --gpu amd` the first time you run your
+ llamafile.
+
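+ For example, a sketch of requesting full GPU offload (assumes the GPU
+ has enough memory for all of the model's layers):
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat -ngl 999
+ ```
+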
+ On NVIDIA GPUs, by default, the prebuilt tinyBLAS library is used to
+ perform matrix multiplications. This is open source software, but it
+ doesn't go as fast as closed source cuBLAS. If you have the CUDA SDK
+ installed on your system, then you can pass the `--recompile` flag to
+ build a GGML CUDA library just for your system that uses cuBLAS. This
+ ensures you get maximum performance.
+
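+ As a sketch, the first run after installing the CUDA SDK might look
+ like this:
+
+ ```
+ ./gemma-2-9b-it.Q6_K.llamafile --chat --recompile -ngl 999
+ ```
+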
+ For further information, please see the [llamafile
+ README](https://github.com/mozilla-ocho/llamafile/).
+
  ## About llamafile
 
+ llamafile is a new format introduced by Mozilla on Nov 20th 2023. It
+ uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
  binaries that run on the stock installs of six OSes for both ARM64 and
  AMD64.
 
@@ -109,13 +165,25 @@ AMD64.
 
  This model works well with any quantization format. Q6\_K is the best
  choice overall here. We tested, with [our 27b Gemma2
+ llamafiles](https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile),
  that the llamafile implementation of Gemma2 is able to produce
  identical responses to the Gemma2 model that's hosted by Google on
  aistudio.google.com. Therefore we'd assume these 9b llamafiles are also
  faithful to Google's intentions. If you encounter any divergences, then
  try using the BF16 weights, which have the original fidelity.
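+
+ A sketch of fetching the BF16 weights (assuming the BF16 build follows
+ the same naming scheme as the quantized llamafiles in this repository):
+
+ ```
+ wget https://huggingface.co/Mozilla/gemma-2-9b-it-llamafile/resolve/main/gemma-2-9b-it.BF16.llamafile
+ ```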
 
+ ## See Also
+
+ - <https://huggingface.co/Mozilla/gemma-2-2b-it-llamafile>
+ - <https://huggingface.co/Mozilla/gemma-2-27b-it-llamafile>
+
+ ## License
+
+ The llamafile software is open source and permissively licensed.
+ However, the weights embedded inside the llamafiles are governed by
+ Google's Gemma License and Gemma Prohibited Use Policy. See the
+ [LICENSE](LICENSE) file for further details.
+
  ---
 
  # Gemma 2 model card