TheBloke committed on
Commit cc120e5
1 Parent(s): 09a030b

Upload README.md

Files changed (1)
  1. README.md +53 -21
README.md CHANGED
@@ -7,7 +7,7 @@ license: apache-2.0
 model_creator: YeungNLP
 model_name: Firefly Mixtral 8X7B
 model_type: mixtral
- prompt_template: '{prompt}
+ prompt_template: '[INST] {prompt} [/INST]

   '
 quantized_by: TheBloke
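The new `prompt_template` above is the Mistral `[INST] ... [/INST]` format. As a minimal illustration of how it is filled in (the same pattern appears in the Python examples later in this diff):

```python
# Fill the Mistral-style prompt template with a user prompt.
prompt = "Tell me about AI"
prompt_template = f'''[INST] {prompt} [/INST]
'''
print(prompt_template)  # -> [INST] Tell me about AI [/INST]
```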
@@ -40,27 +40,23 @@ quantized_by: TheBloke

 This repo contains GPTQ model files for [YeungNLP's Firefly Mixtral 8X7B](https://huggingface.co/YeungNLP/firefly-mixtral-8x7b).

- Mixtral GPTQs currently require:
- * Transformers 4.36.0 or later
- * either, AutoGPTQ 0.6 compiled from source, or
- * Transformers 4.37.0.dev0 compiled from Github with: `pip3 install git+https://github.com/huggingface/transformers`
-
 Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

 <!-- description end -->
 <!-- repositories-available start -->
 ## Repositories available

+ * [AWQ model(s) for GPU inference.](https://huggingface.co/TheBloke/firefly-mixtral-8x7b-AWQ)
 * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/firefly-mixtral-8x7b-GPTQ)
 * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/firefly-mixtral-8x7b-GGUF)
 * [YeungNLP's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/YeungNLP/firefly-mixtral-8x7b)
 <!-- repositories-available end -->

 <!-- prompt-template start -->
- ## Prompt template: None
+ ## Prompt template: Mistral

 ```
- {prompt}
+ [INST] {prompt} [/INST]

 ```

@@ -73,8 +69,14 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for

 GPTQ models are currently supported on Linux (NVidia/AMD) and Windows (NVidia only). macOS users: please use GGUF models.

- Mixtral GPTQs currently have special requirements - see Description above.
+ These GPTQ models are known to work in the following inference servers/webuis.
+
+ - [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
+ - [KoboldAI United](https://github.com/henk717/koboldai)
+ - [LoLLMS Web UI](https://github.com/ParisNeo/lollms-webui)
+ - [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)

+ This may not be a complete list; if you know of others, please let me know!
 <!-- README_GPTQ.md-compatible clients end -->

 <!-- README_GPTQ.md-provided-files start -->
@@ -182,12 +184,6 @@ Note that using Git with HF repos is strongly discouraged. It will be much slowe
 <!-- README_GPTQ.md-text-generation-webui start -->
 ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

- **NOTE**: Requires:
-
- * Transformers 4.36.0, or Transformers 4.37.0.dev0 from Github
- * Either AutoGPTQ 0.6 compiled from source and `Loader: AutoGPTQ`,
- * or, `Loader: Transformers`, if you installed Transformers from Github: `pip3 install git+https://github.com/huggingface/transformers`
-
 Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

 It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
@@ -214,18 +210,50 @@ It is strongly recommended to use the text-generation-webui one-click-installers
 <!-- README_GPTQ.md-use-from-tgi start -->
 ## Serving this model from Text Generation Inference (TGI)

- Not currently supported for Mixtral models.
+ It's recommended to use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggingface/text-generation-inference:1.1.0`
+
+ Example Docker parameters:

+ ```shell
+ --model-id TheBloke/firefly-mixtral-8x7b-GPTQ --port 3000 --quantize gptq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
+ ```
+
+ Example Python code for interfacing with TGI (requires huggingface-hub 0.17.0 or later):
+
+ ```shell
+ pip3 install huggingface-hub
+ ```
+
+ ```python
+ from huggingface_hub import InferenceClient
+
+ endpoint_url = "https://your-endpoint-url-here"
+
+ prompt = "Tell me about AI"
+ prompt_template=f'''[INST] {prompt} [/INST]
+ '''
+
+ client = InferenceClient(endpoint_url)
+ response = client.text_generation(prompt,
+     max_new_tokens=128,
+     do_sample=True,
+     temperature=0.7,
+     top_p=0.95,
+     top_k=40,
+     repetition_penalty=1.1)
+
+ print(f"Model output: {response}")
+ ```
 <!-- README_GPTQ.md-use-from-tgi end -->
 <!-- README_GPTQ.md-use-from-python start -->
 ## Python code example: inference from this GPTQ model

 ### Install the necessary packages

- Requires: Transformers 4.37.0.dev0 from Github, Optimum 1.16.0 or later, and AutoGPTQ 0.5.1 or later.
+ Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

 ```shell
- pip3 install --upgrade "git+https://github.com/huggingface/transformers" optimum
+ pip3 install --upgrade transformers optimum
 # If using PyTorch 2.1 + CUDA 12.x:
 pip3 install --upgrade auto-gptq
 # or, if using PyTorch 2.1 + CUDA 11.x:
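The TGI hunk above lists the container image and launcher parameters but not a complete launch command. A minimal `docker run` sketch built from those parameters follows; the GPU flag, shared-memory size and `./data` cache mount are assumptions to adapt to your own setup:

```shell
# Illustrative launch of the official TGI 1.1.0 container with the parameters listed above.
# --gpus all, --shm-size 1g and the ./data volume are assumptions; adjust to your hardware.
docker run --gpus all --shm-size 1g -p 3000:3000 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id TheBloke/firefly-mixtral-8x7b-GPTQ --port 3000 --quantize gptq \
    --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
```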
@@ -238,7 +266,8 @@ If you are using PyTorch 2.0, you will need to install AutoGPTQ from source. Lik
 pip3 uninstall -y auto-gptq
 git clone https://github.com/PanQiWei/AutoGPTQ
 cd AutoGPTQ
- DISABLE_QIGEN=1 pip3 install .
+ git checkout v0.5.1
+ pip3 install .
 ```

 ### Example Python code
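Before running the example, it can be worth confirming that the installed versions satisfy the requirements stated above (Transformers 4.33.0+, Optimum 1.12.0+ and AutoGPTQ 0.4.2+ in the new text). An optional check:

```shell
# Print the name and version of each required package to verify the minimum versions.
pip3 show transformers optimum auto-gptq | grep -E '^(Name|Version)'
```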
@@ -258,7 +287,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

 prompt = "Write a story about llamas"
 system_message = "You are a story writing assistant"
- prompt_template=f'''{prompt}
+ prompt_template=f'''[INST] {prompt} [/INST]
 '''

 print("\n\n*** Generate:")
@@ -289,8 +318,11 @@ print(pipe(prompt_template)[0]['generated_text'])
 <!-- README_GPTQ.md-compatibility start -->
 ## Compatibility

- The files provided are tested to work with AutoGPTQ 0.6 (compiled from source) and Transformers 4.37.0 (installed from Github).
+ The files provided are tested to work with Transformers. For non-Mistral models, AutoGPTQ can also be used directly.
+
+ [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama architecture models (including Mistral, Yi, DeepSeek, SOLAR, etc) in 4-bit. Please see the Provided Files table above for per-file compatibility.

+ For a list of clients/servers, please see "Known compatible clients / servers", above.
 <!-- README_GPTQ.md-compatibility end -->

 <!-- footer start -->
 