AWQ 4bit Inference
We integrated AWQ into FastChat to provide efficient and accurate 4-bit LLM inference.
Install AWQ
Set up the environment (see the llm-awq repository for more details):
conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq
# cd /path/to/FastChat
pip install --upgrade pip # enable PEP 660 support
pip install -e . # install fastchat
git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e . # install awq package
cd awq/kernels
python setup.py install # install awq CUDA kernels
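After the install steps, a quick sanity check can confirm that the packages are visible to the active Python environment. The following is a minimal sketch, not part of FastChat or llm-awq: `check_modules` is a hypothetical helper, and `awq_inference_engine` is an assumed name for the compiled CUDA kernel extension (adjust it if your build registers a different module name).

```python
import importlib.util


def check_modules(names):
    """Return {module_name: bool} telling whether each top-level module can be located."""
    status = {}
    for name in names:
        try:
            # find_spec locates the module without importing it.
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            status[name] = False
    return status


if __name__ == "__main__":
    # "awq_inference_engine" is an ASSUMED name for the compiled kernel extension.
    expected = ["torch", "fastchat", "awq", "awq_inference_engine"]
    for name, found in check_modules(expected).items():
        print(f"{name:22s} {'found' if found else 'MISSING'}")
```

A module reported as MISSING usually means the corresponding `pip install -e .` or `python setup.py install` step ran in a different environment than the one currently activated.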
Chat with the CLI
# Download quantized model from huggingface
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
# Clone into models/ so the path matches --model-path below
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq models/vicuna-7b-v1.3-4bit-g128-awq
# You can specify which quantized model to use by setting --awq-ckpt
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128
Benchmark
With 4-bit weight quantization, AWQ lets larger language models fit within device memory limits and significantly accelerates token generation. All benchmarks use group_size 128.
Benchmark on NVIDIA RTX A6000:

| Model | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
| --- | --- | --- | --- | --- |
| vicuna-7b | 16 | 13543 | 26.06 | / |
| vicuna-7b | 4 | 5547 | 12.43 | 2.1x |
| llama2-7b-chat | 16 | 13543 | 27.14 | / |
| llama2-7b-chat | 4 | 5547 | 12.44 | 2.2x |
| vicuna-13b | 16 | 25647 | 44.91 | / |
| vicuna-13b | 4 | 9355 | 17.30 | 2.6x |
| llama2-13b-chat | 16 | 25647 | 47.28 | / |
| llama2-13b-chat | 4 | 9355 | 20.28 | 2.3x |

NVIDIA RTX 4090:

| Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
| --- | --- | --- | --- |
| vicuna-7b | 8.61 | 19.09 | 2.2x |
| llama2-7b-chat | 8.66 | 19.97 | 2.3x |
| vicuna-13b | 12.17 | OOM | / |
| llama2-13b-chat | 13.54 | OOM | / |

NVIDIA Jetson Orin:

| Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
| --- | --- | --- | --- |
| vicuna-7b | 65.34 | 93.12 | 1.4x |
| llama2-7b-chat | 75.11 | 104.71 | 1.4x |
| vicuna-13b | 115.40 | OOM | / |
| llama2-13b-chat | 136.81 | OOM | / |
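The speedup column is the ratio of FP16 latency to AWQ latency. The sketch below reworks the RTX A6000 figures reported above to show that arithmetic, and also derives tokens per second and the memory reduction factor. The helper names (`speedup`, `tokens_per_second`, `memory_reduction`) are illustrative and not part of FastChat or AWQ.

```python
# (model, fp16 ms/token, awq ms/token, fp16 max MiB, awq max MiB)
# Values copied from the RTX A6000 benchmark reported above.
A6000 = [
    ("vicuna-7b", 26.06, 12.43, 13543, 5547),
    ("llama2-7b-chat", 27.14, 12.44, 13543, 5547),
    ("vicuna-13b", 44.91, 17.30, 25647, 9355),
    ("llama2-13b-chat", 47.28, 20.28, 25647, 9355),
]


def speedup(fp16_ms, awq_ms):
    """Latency ratio: how many times faster AWQ generates a token."""
    return fp16_ms / awq_ms


def tokens_per_second(ms_per_token):
    """Convert per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


def memory_reduction(fp16_mib, awq_mib):
    """Ratio of FP16 peak memory to AWQ peak memory."""
    return fp16_mib / awq_mib


if __name__ == "__main__":
    for model, fp16_ms, awq_ms, fp16_mib, awq_mib in A6000:
        print(
            f"{model:16s} speedup {speedup(fp16_ms, awq_ms):.1f}x, "
            f"{tokens_per_second(awq_ms):.1f} tok/s with AWQ, "
            f"memory {memory_reduction(fp16_mib, awq_mib):.1f}x smaller"
        )
```

Note that the measured peak memory shrinks by less than the 4x implied by going from 16 to 4 bits per weight. A plausible reason, not stated in the benchmark itself, is that peak memory also includes activations, the KV cache, and any tensors left unquantized.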