---
license: llama3.3
---

The original [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) model, quantized to 4-bit with AutoAWQ. Follow the instructions [here](https://docs.vllm.ai/en/latest/quantization/auto_awq.html).

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'meta-llama/Llama-3.3-70B-Instruct'
quant_path = 'Llama-3.3-70B-Instruct-AWQ-4bit'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```
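As a rough sanity check on the size reduction implied by `w_bit: 4`, the weight footprint can be estimated from the parameter count and bit width (a back-of-the-envelope sketch; the ~70.6B parameter count is approximate, and this ignores group-wise scales/zero points and any non-quantized layers):

```python
def weight_gib(num_params: float, bits: int) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return num_params * bits / 8 / 2**30

params = 70.6e9  # ~70B parameters in Llama 3.3 70B (approximate)

fp16 = weight_gib(params, 16)  # ~131 GiB in half precision
awq4 = weight_gib(params, 4)   # ~33 GiB at w_bit=4

print(f"fp16: {fp16:.0f} GiB, AWQ 4-bit: {awq4:.0f} GiB")
```

The 4x reduction is what makes serving the 70B model on two GPUs (as in the `--tensor-parallel-size 2` command below) feasible.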

Serve with vLLM:
```bash
vllm serve lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
    --swap-space 16 \
    --disable-log-requests \
    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2
```
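Once up, the server exposes an OpenAI-compatible API (on `http://localhost:8000` by default, assuming you did not override the host or port). A minimal sketch of a chat-completions request payload:

```python
import json

# Chat-completions payload for the vLLM OpenAI-compatible server.
payload = {
    "model": "lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit",
    "messages": [
        {"role": "user", "content": "What is AWQ quantization?"},
    ],
    "max_tokens": 256,
}

# With the server running:
#   import requests
#   r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
#   print(r.json()["choices"][0]["message"]["content"])
print(json.dumps(payload, indent=2))
```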

Benchmark with vLLM's `benchmark_serving.py`:
```bash
python benchmark_serving.py \
    --backend vllm \
    --model lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
    --tokenizer meta-llama/Meta-Llama-3-70B \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000
```

```
============ Serving Benchmark Result ============
Successful requests:                     902
Benchmark duration (s):                  128.07
Total input tokens:                      177877
Total generated tokens:                  182359
Request throughput (req/s):              7.04
Output token throughput (tok/s):         1423.85
Total Token throughput (tok/s):          2812.71
---------------Time to First Token----------------
Mean TTFT (ms):                          47225.59
Median TTFT (ms):                        43313.95
P99 TTFT (ms):                           105587.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          141.01
Median TPOT (ms):                        148.94
P99 TPOT (ms):                           174.16
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.55
Median ITL (ms):                         150.82
P99 ITL (ms):                            344.50
==================================================
```
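The headline throughput figures follow directly from the totals above; a quick cross-check (small differences come from rounding the reported duration):

```python
# Totals reported by benchmark_serving.py above
duration_s = 128.07
successful_requests = 902
input_tokens = 177_877
output_tokens = 182_359

req_per_s = successful_requests / duration_s                    # ~7.04 req/s
out_tok_per_s = output_tokens / duration_s                      # ~1423.9 tok/s
total_tok_per_s = (input_tokens + output_tokens) / duration_s   # ~2812.8 tok/s

print(f"{req_per_s:.2f} req/s, {out_tok_per_s:.1f} out tok/s, {total_tok_per_s:.1f} total tok/s")
```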