---
license: llama3.3
---

The original [Llama 3.3 70B Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) model, quantized to 4-bit with AutoAWQ. Follow the instructions [here](https://docs.vllm.ai/en/latest/quantization/auto_awq.html).

```
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'meta-llama/Llama-3.3-70B-Instruct'
quant_path = 'Llama-3.3-70B-Instruct-AWQ-4bit'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

Serve with vLLM:

```
vllm serve lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
    --swap-space 16 \
    --disable-log-requests \
    --tokenizer meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 2
```

Benchmark:

```
python benchmark_serving.py \
    --backend vllm \
    --model lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
    --tokenizer meta-llama/Meta-Llama-3-70B \
    --dataset-name sharegpt \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000

============ Serving Benchmark Result ============
Successful requests:                     902
Benchmark duration (s):                  128.07
Total input tokens:                      177877
Total generated tokens:                  182359
Request throughput (req/s):              7.04
Output token throughput (tok/s):         1423.85
Total Token throughput (tok/s):          2812.71
---------------Time to First Token----------------
Mean TTFT (ms):                          47225.59
Median TTFT (ms):                        43313.95
P99 TTFT (ms):                           105587.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          141.01
Median TPOT (ms):                        148.94
P99 TPOT (ms):                           174.16
---------------Inter-token Latency----------------
Mean ITL (ms):                           131.55
Median ITL (ms):                         150.82
P99 ITL (ms):                            344.50
==================================================
```
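
Once the server from the `vllm serve` command above is running, it exposes vLLM's OpenAI-compatible API. A minimal sketch of querying it with the `openai` Python client, assuming the default host and port (8000) and no API key configured; the prompt is illustrative:

```
from openai import OpenAI

# Assumes the `vllm serve` command above is running locally on the default port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit",
    messages=[{"role": "user", "content": "Briefly explain AWQ quantization."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```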
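
For offline inference without a server, a minimal sketch using vLLM's Python API, assuming the same 2-GPU setup as the serve command; the prompt and sampling settings are illustrative, and for chat-style use you would apply the model's chat template first:

```
from vllm import LLM, SamplingParams

# Assumes 2 GPUs, matching --tensor-parallel-size 2 above.
llm = LLM(
    model="lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit",
    tokenizer="meta-llama/Llama-3.3-70B-Instruct",
    quantization="awq",
    tensor_parallel_size=2,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is activation-aware weight quantization?"], sampling_params)
print(outputs[0].outputs[0].text)
```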