---
license: llama3.3
---

The original [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) model, quantized to 4-bit with AutoAWQ following the instructions [here](https://docs.vllm.ai/en/latest/quantization/auto_awq.html). The quantization script:

```
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'meta-llama/Llama-3.3-70B-Instruct'
quant_path = 'Llama-3.3-70B-Instruct-AWQ-4bit'

# 4-bit weights, zero-point quantization, group size 128, GEMM kernels
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
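
To sanity-check the quantized checkpoint before serving it, the weights can be loaded back with AutoAWQ's `from_quantized` (a minimal sketch, assuming a GPU with enough memory for the roughly 35 GB of 4-bit weights; the prompt is arbitrary):

```
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = 'Llama-3.3-70B-Instruct-AWQ-4bit'

# Load the quantized weights; fuse_layers enables fused modules for faster inference
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Generate a short completion as a smoke test
inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```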


Serve with vLLM:
```
vllm serve lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
--swap-space 16 \
--disable-log-requests \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2
```
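
Once the server is up it exposes an OpenAI-compatible API, by default on port 8000. A minimal query (the prompt and `max_tokens` value are arbitrary):

```
import requests

# Chat completion request against the local vLLM server
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 32,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```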


Benchmark with vLLM's `benchmark_serving.py` script and the ShareGPT dataset.
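
The dataset file can be downloaded first (a sketch using `huggingface_hub`; the repo id assumed here is the one referenced in vLLM's benchmarking docs):

```
from huggingface_hub import hf_hub_download

# Fetch the ShareGPT split into the current directory
hf_hub_download(
    repo_id="anon8231489123/ShareGPT_Vicuna_unfiltered",
    repo_type="dataset",
    filename="ShareGPT_V3_unfiltered_cleaned_split.json",
    local_dir=".",
)
```

Then run the benchmark: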
```
python benchmark_serving.py \
--backend vllm \
--model lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
--tokenizer meta-llama/Meta-Llama-3-70B \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1000
```

Results:

```
============ Serving Benchmark Result ============
Successful requests:                     902       
Benchmark duration (s):                  128.07    
Total input tokens:                      177877    
Total generated tokens:                  182359    
Request throughput (req/s):              7.04      
Output token throughput (tok/s):         1423.85   
Total Token throughput (tok/s):          2812.71   
---------------Time to First Token----------------
Mean TTFT (ms):                          47225.59  
Median TTFT (ms):                        43313.95  
P99 TTFT (ms):                           105587.66 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          141.01    
Median TPOT (ms):                        148.94    
P99 TPOT (ms):                           174.16    
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.55    
Median ITL (ms):                         150.82    
P99 ITL (ms):                            344.50    
==================================================
```