Text Generation · Transformers · GGUF · Safetensors · mistral · gemma · quantized (2-, 3-, 4-, 5-, 6-, and 8-bit precision)
Inference Endpoints · has_space · text-generation-inference
Commit 9b6d791 · committed by mikkelgravgaard · Parent(s): 978b352

Docs: fix example filenames
Example filenames had `-GGUF` in them, which doesn't match the actual filenames, so copy/pasted examples didn't work.
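As a quick way to verify the corrected filename, here is a minimal sketch (not part of this commit) of the same download using the `huggingface_hub` Python API instead of the CLI, assuming `huggingface-hub` is installed and the file listed in the fixed README exists in the repo:

```python
# Minimal sketch: download the corrected filename programmatically.
# Assumes `pip install huggingface-hub`; repo and filename are taken from the fixed README examples.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="MaziyarPanahi/gemma-7b-GGUF",
    filename="gemma-7b.Q4_K_M.gguf",  # corrected name, without the extra "-GGUF"
    local_dir=".",
)
print(model_path)
```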
README.md CHANGED
````diff
@@ -103,7 +103,7 @@ The following clients/libraries will automatically download models for you, prov
 
 ### In `text-generation-webui`
 
-Under Download Model, you can enter the model repo: [MaziyarPanahi/gemma-7b-GGUF](https://huggingface.co/MaziyarPanahi/gemma-7b-GGUF) and below it, a specific filename to download, such as: gemma-7b-GGUF.Q4_K_M.gguf.
+Under Download Model, you can enter the model repo: [MaziyarPanahi/gemma-7b-GGUF](https://huggingface.co/MaziyarPanahi/gemma-7b-GGUF) and below it, a specific filename to download, such as: gemma-7b.Q4_K_M.gguf.
 
 Then click Download.
 
@@ -118,7 +118,7 @@ pip3 install huggingface-hub
 Then you can download any individual model file to the current directory, at high speed, with a command like this:
 
 ```shell
-huggingface-cli download MaziyarPanahi/gemma-7b-GGUF gemma-7b-GGUF.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
+huggingface-cli download MaziyarPanahi/gemma-7b-GGUF gemma-7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
 ```
 </details>
 <details>
@@ -141,7 +141,7 @@ pip3 install hf_transfer
 And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:
 
 ```shell
-HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download MaziyarPanahi/gemma-7b-GGUF gemma-7b-GGUF.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
+HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download MaziyarPanahi/gemma-7b-GGUF gemma-7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
 ```
 
 Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.
@@ -152,7 +152,7 @@ Windows Command Line users: You can set the environment variable by running `set
 Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.
 
 ```shell
-./main -ngl 35 -m gemma-7b-GGUF.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system
+./main -ngl 35 -m gemma-7b.Q4_K_M.gguf --color -c 32768 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<|im_start|>system
 {system_message}<|im_end|>
 <|im_start|>user
 {prompt}<|im_end|>
@@ -209,7 +209,7 @@ from llama_cpp import Llama
 
 # Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
 llm = Llama(
-  model_path="./gemma-7b-GGUF.Q4_K_M.gguf",  # Download the model file first
+  model_path="./gemma-7b.Q4_K_M.gguf",  # Download the model file first
   n_ctx=32768,  # The max sequence length to use - note that longer sequence lengths require much more resources
   n_threads=8,  # The number of CPU threads to use, tailor to your system and the resulting performance
   n_gpu_layers=35  # The number of layers to offload to GPU, if you have GPU acceleration available
@@ -229,7 +229,7 @@ output = llm(
 
 # Chat Completion API
 
-llm = Llama(model_path="./gemma-7b-GGUF.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
+llm = Llama(model_path="./gemma-7b.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
 llm.create_chat_completion(
     messages = [
         {"role": "system", "content": "You are a story writing assistant."},
````
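For context, a minimal end-to-end sketch combining the corrected `model_path` with the Chat Completion API shown in the last two hunks. It assumes `llama-cpp-python` is installed and `gemma-7b.Q4_K_M.gguf` has already been downloaded to the current directory; `chat_format="llama-2"` is carried over from the README and may need adjusting for this model.

```python
# Minimal sketch of the corrected llama-cpp-python usage from the README diff above.
# Assumes `pip install llama-cpp-python` and that gemma-7b.Q4_K_M.gguf is in the current directory.
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-7b.Q4_K_M.gguf",  # corrected filename, without "-GGUF"
    n_ctx=32768,       # max sequence length; longer contexts need much more memory
    n_threads=8,       # CPU threads, tune for your machine
    n_gpu_layers=35,   # layers to offload to GPU; set to 0 for CPU-only
    chat_format="llama-2",  # set the chat format appropriate for the model
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {"role": "user", "content": "Write a short story about llamas."},
    ]
)
print(response["choices"][0]["message"]["content"])
```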