bofenghuang committed on
Commit
e6b5ed2
·
1 Parent(s): 0c6e02d

Update README.md

Files changed (1)
  1. README.md +55 -11
README.md CHANGED
@@ -28,18 +28,9 @@ All previous versions are accessible through branches.
28
  - **V1.0**: Trained on 420K chat data.
29
  - **V2.0**: Trained on 520K data. Check out our [release blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) for more details.
30
 
31
-
32
- ## Quantized Models
33
-
34
- The quantized versions of this model are generously provided by [TheBloke](https://huggingface.co/TheBloke)!
35
-
36
- - AWQ: [TheBloke/Vigogne-2-7B-Chat-AWQ](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-AWQ)
37
- - GPTQ: [TheBloke/Vigogne-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GPTQ)
38
- - GGUF: [TheBloke/Vigogne-2-7B-Chat-GGUF](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GGUF)
39
-
40
  ## Prompt Template
41
 
42
- We utilized prefix tokens `<user>` and `<assistant>` to distinguish between user and assistant utterances.
43
 
44
  You can apply this formatting using the [chat template](https://huggingface.co/docs/transformers/main/chat_templating) through the `apply_chat_template()` method.
45
 
@@ -73,6 +64,18 @@ You will get
73
 
74
  ## Usage
75
 
76
  ```python
77
  from typing import Dict, List, Optional
78
  import torch
@@ -139,10 +142,51 @@ response, history = chat("Quand il peut dépasser le lapin ?", history=history)
139
  response, history = chat("Écris une histoire imaginative qui met en scène une compétition de course entre un escargot et un lapin.", history=history)
140
  ```
141
 
142
- You can also utilize the Google Colab Notebook below for inferring with the Vigogne chat models.
143
 
144
  <a href="https://colab.research.google.com/github/bofenghuang/vigogne/blob/main/notebooks/infer_chat.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
145
 
146
  ## Limitations
147
 
148
  Vigogne is still under development, and there are many limitations that have to be addressed. Please note that it is possible that the model generates harmful or biased content, incorrect information or generally unhelpful answers.
 
28
  - **V1.0**: Trained on 420K chat data.
29
  - **V2.0**: Trained on 520K data. Check out our [release blog](https://github.com/bofenghuang/vigogne/blob/main/blogs/2023-08-17-vigogne-chat-v2_0.md) for more details.
30
 
31
  ## Prompt Template
32
 
33
+ We utilized prefix tokens `<user>:` and `<assistant>:` to distinguish between user and assistant utterances.
34
 
35
  You can apply this formatting using the [chat template](https://huggingface.co/docs/transformers/main/chat_templating) through the `apply_chat_template()` method.
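As a rough illustration of the prefix-token idea, here is a minimal sketch that formats a conversation by hand. The `ROLE_PREFIXES` mapping and the `format_conversation` helper are hypothetical names, and the exact prefix strings, separators, and any system prompt are assumptions; the chat template shipped with the tokenizer remains the authoritative source.

```python
from typing import Dict, List

# Hypothetical prefixes for illustration only; the tokenizer's chat
# template defines the exact strings the model was trained with.
ROLE_PREFIXES = {"user": "<user>:", "assistant": "<assistant>:"}


def format_conversation(messages: List[Dict[str, str]]) -> str:
    """Prefix each utterance with its role token and open an assistant turn."""
    lines = [f"{ROLE_PREFIXES[m['role']]} {m['content']}" for m in messages]
    # End with the bare assistant prefix so generation continues from there.
    lines.append(ROLE_PREFIXES["assistant"])
    return "\n".join(lines)


prompt = format_conversation([{"role": "user", "content": "Bonjour !"}])
print(prompt)
```

In practice `apply_chat_template()` is preferable, since it keeps the formatting in sync with the released tokenizer.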
36
 
 
64
 
65
  ## Usage
66
 
67
+ ### Inference using the quantized versions
68
+
69
+ The quantized versions of this model are generously provided by [TheBloke](https://huggingface.co/TheBloke)!
70
+
71
+ - AWQ for GPU inference: [TheBloke/Vigogne-2-7B-Chat-AWQ](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-AWQ)
72
+ - GPTQ for GPU inference: [TheBloke/Vigogne-2-7B-Chat-GPTQ](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GPTQ)
73
+ - GGUF for CPU+GPU inference: [TheBloke/Vigogne-2-7B-Chat-GGUF](https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GGUF)
74
+
75
+ These versions facilitate testing and development with various popular frameworks, including [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), [vLLM](https://github.com/vllm-project/vllm), [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ), [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa), [llama.cpp](https://github.com/ggerganov/llama.cpp), [text-generation-webui](https://github.com/oobabooga/text-generation-webui), and more.
76
+
77
+ ### Inference using the unquantized model with 🤗 Transformers
78
+
79
  ```python
80
  from typing import Dict, List, Optional
81
  import torch
 
142
  response, history = chat("Écris une histoire imaginative qui met en scène une compétition de course entre un escargot et un lapin.", history=history)
143
  ```
144
 
145
+ You can also use the Google Colab Notebook provided below.
146
 
147
  <a href="https://colab.research.google.com/github/bofenghuang/vigogne/blob/main/notebooks/infer_chat.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
148
 
149
+ ### Inference using the unquantized model with vLLM
150
+
151
+ Set up an OpenAI-compatible server with the following command:
152
+
153
+ ```bash
154
+ # Install vLLM
155
+ # This may take 5-10 minutes.
156
+ # pip install vllm
157
+
158
+ # Start server for Vigogne-Chat models
159
+ python -m vllm.entrypoints.openai.api_server --model bofenghuang/vigogne-2-7b-chat
160
+
161
+ # List models
162
+ # curl http://localhost:8000/v1/models
163
+ ```
164
+
165
+ Query the model using the `openai` Python package. The snippet below uses the module-level interface of `openai` versions before 1.0.
166
+
167
+ ```python
168
+ import openai
169
+
170
+ # Modify OpenAI's API key and API base to use vLLM's API server.
171
+ openai.api_key = "EMPTY"
172
+ openai.api_base = "http://localhost:8000/v1"
173
+
174
+ # First model
175
+ models = openai.Model.list()
176
+ model = models["data"][0]["id"]
177
+
178
+ # Chat completion API
179
+ chat_completion = openai.ChatCompletion.create(
180
+     model=model,
181
+     messages=[
182
+         {"role": "user", "content": "Parle-moi de toi-même."},
183
+     ],
184
+     max_tokens=1024,
185
+     temperature=0.7,
186
+ )
187
+ print("Chat completion results:", chat_completion)
188
+ ```
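If the `openai` package is unavailable, the same chat completion request can be sent with the Python standard library alone. The sketch below is not part of the README: `build_chat_request` and `send` are hypothetical helpers, and it assumes the server from the command above is listening on `localhost:8000` and serves the model under the name passed to `--model`.

```python
import json
import urllib.request
from typing import Dict, List


def build_chat_request(
    model: str,
    messages: List[Dict[str, str]],
    base_url: str = "http://localhost:8000/v1",  # assumed server address
    max_tokens: int = 1024,
    temperature: float = 0.7,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request(
    "bofenghuang/vigogne-2-7b-chat",  # assumed to match the --model argument
    [{"role": "user", "content": "Parle-moi de toi-même."}],
)
print(request.full_url)
# With the server running, the request can be sent with:
# reply = send(request)
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.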
189
+
190
  ## Limitations
191
 
192
  Vigogne is still under development, and there are many limitations that have to be addressed. Please note that it is possible that the model generates harmful or biased content, incorrect information or generally unhelpful answers.