reach-vb HF staff alvarobartt HF staff committed on
Commit
457485d
1 Parent(s): ef70fdf

Update README.md (#6)

- Update README.md (04190a408c2ac08b4ff8381402f04a43a101a7be)


Co-authored-by: Alvaro Bartolome <alvarobartt@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +132 -1
README.md CHANGED
@@ -116,7 +116,138 @@ The AutoAWQ script has been adapted from [`AutoAWQ/examples/generate.py`](https:
 
  ### 🤗 Text Generation Inference (TGI)
 
- Coming soon!
+ To run the `text-generation-launcher` with Llama 3.1 8B Instruct AWQ in INT4 with Marlin kernels for optimized inference speed, you need Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and the `huggingface_hub` Python package, which is used to log in to the Hugging Face Hub.
+
+ ```bash
+ pip install -q --upgrade huggingface_hub
+ huggingface-cli login
+ ```
+
+ Then run the TGI v2.2.0 (or higher) Docker container as follows:
+
+ ```bash
+ docker run --gpus all --shm-size 1g -ti -p 8080:80 \
+   -v hf_cache:/data \
+   -e MODEL_ID=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
+   -e QUANTIZE=awq \
+   -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
+   -e MAX_INPUT_LENGTH=4000 \
+   -e MAX_TOTAL_TOKENS=4096 \
+   ghcr.io/huggingface/text-generation-inference:2.2.0
+ ```
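The two limits passed to the container interact: TGI documents `MAX_TOTAL_TOKENS` as the budget for prompt tokens plus generated tokens, so with the values above a 4000-token prompt leaves room for only 96 new tokens. A minimal sketch of that arithmetic follows; the helper `max_new_tokens_allowed` is hypothetical and not part of TGI, and the defaults simply mirror the `docker run` command.

```python
def max_new_tokens_allowed(
    prompt_tokens: int,
    max_input_length: int = 4000,   # mirrors MAX_INPUT_LENGTH above
    max_total_tokens: int = 4096,   # mirrors MAX_TOTAL_TOKENS above
) -> int:
    """Return how many tokens can still be generated for a prompt of this size.

    Assumes the server rejects prompts longer than `max_input_length` and
    requires prompt + generated tokens to fit within `max_total_tokens`.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt of {prompt_tokens} tokens exceeds the input limit of {max_input_length}"
        )
    return max_total_tokens - prompt_tokens


print(max_new_tokens_allowed(4000))  # 96
print(max_new_tokens_allowed(1000))  # 3096
```

Under these assumptions, the `max_tokens` value of 128 used in the requests of this section fits only when the prompt is at most 3968 tokens long.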
+
+ > [!NOTE]
+ > TGI exposes several endpoints; for the full list, see the [TGI OpenAPI Specification](https://huggingface.github.io/text-generation-inference/#/).
+
+ To send a request to the deployed TGI endpoint that is compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:
+
+ ```bash
+ curl 0.0.0.0:8080/v1/chat/completions \
+     -X POST \
+     -H 'Content-Type: application/json' \
+     -d '{
+         "model": "tgi",
+         "messages": [
+             {
+                 "role": "system",
+                 "content": "You are a helpful assistant."
+             },
+             {
+                 "role": "user",
+                 "content": "What is Deep Learning?"
+             }
+         ],
+         "max_tokens": 128
+     }'
+ ```
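The same request can be built with the Python standard library alone, which may help when neither client package is installed. This is a sketch rather than the README's method: `build_chat_request` and `extract_reply` are hypothetical helper names, and the reply is read from `choices[0].message.content` as defined by the OpenAI chat completions schema.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model="tgi", max_tokens=128):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from a decoded chat completion response."""
    return response_body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://0.0.0.0:8080",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Deep Learning?"},
    ],
)

# Sending the request requires the container above to be running:
# with urllib.request.urlopen(request) as response:
#     print(extract_reply(json.load(response)))
```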
+
+ Or programmatically via the `huggingface_hub` Python client as follows:
+
+ ```python
+ import os
+ from huggingface_hub import InferenceClient
+
+ client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))
+
+ chat_completion = client.chat.completions.create(
+     model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "What is Deep Learning?"},
+     ],
+     max_tokens=128,
+ )
+ ```
+
+ Alternatively, the OpenAI Python client can also be used (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:
+
+ ```python
+ import os
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("OPENAI_API_KEY", "-"))
+
+ chat_completion = client.chat.completions.create(
+     model="tgi",
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "What is Deep Learning?"},
+     ],
+     max_tokens=128,
+ )
+ ```
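The snippets above wait for the full completion. The OpenAI client also accepts `stream=True` on `client.chat.completions.create`, in which case it yields chunks whose text lives in `choices[0].delta.content`. The sketch below shows one way to reassemble such a stream; `collect_stream` and `fake_chunk` are hypothetical helpers, and the chunks are stand-in objects so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # the delta content can be None or empty
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in (hypothetical data) shaped like a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


print(collect_stream([fake_chunk("Deep "), fake_chunk("Learning"), fake_chunk(None)]))
# prints "Deep Learning"

# With a running server, the real stream would come from:
# stream = client.chat.completions.create(
#     model="tgi", messages=[...], max_tokens=128, stream=True
# )
# print(collect_stream(stream))
```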
+
+ ### vLLM
+
+ To run vLLM with Llama 3.1 8B Instruct AWQ in INT4, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and run the latest vLLM Docker container as follows:
+
+ ```bash
+ docker run --runtime nvidia --gpus all --ipc=host -p 8000:8000 \
+   -v hf_cache:/root/.cache/huggingface \
+   vllm/vllm-openai:latest \
+   --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
+   --max-model-len 4096
+ ```
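The container needs some time to download and load the weights before it accepts requests. A small polling helper can wait for that; the sketch below assumes the server answers `GET /health` with HTTP 200 once it is ready, and the names `http_probe` and `wait_until_ready` as well as the timeout values are illustrative choices.

```python
import http.client
import time
import urllib.request


def http_probe(url):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts and HTTP error codes all mean "not ready".
        return False


def wait_until_ready(url, probe=http_probe, timeout=300.0, interval=2.0, sleep=time.sleep):
    """Poll `probe(url)` until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# With the container above running:
# wait_until_ready("http://0.0.0.0:8000/health")
```

The `probe` and `sleep` parameters are injectable so the loop can be exercised without a server or real delays.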
+
+ To send a request to the deployed vLLM endpoint that is compatible with the [OpenAI OpenAPI specification](https://github.com/openai/openai-openapi), i.e. `/v1/chat/completions`:
+
+ ```bash
+ curl 0.0.0.0:8000/v1/chat/completions \
+     -X POST \
+     -H 'Content-Type: application/json' \
+     -d '{
+         "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
+         "messages": [
+             {
+                 "role": "system",
+                 "content": "You are a helpful assistant."
+             },
+             {
+                 "role": "user",
+                 "content": "What is Deep Learning?"
+             }
+         ],
+         "max_tokens": 128
+     }'
+ ```
+
+ Or programmatically via the `openai` Python client (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:
+
+ ```python
+ import os
+ from openai import OpenAI
+
+ client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key=os.getenv("VLLM_API_KEY", "-"))
+
+ chat_completion = client.chat.completions.create(
+     model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
+     messages=[
+         {"role": "system", "content": "You are a helpful assistant."},
+         {"role": "user", "content": "What is Deep Learning?"},
+     ],
+     max_tokens=128,
+ )
+ ```
 
  ## Quantization Reproduction
 