AWQ 4-bit Inference

We integrated AWQ into FastChat to provide efficient and accurate 4-bit LLM inference.

Install AWQ

Set up the environment (see the installation instructions in the llm-awq repository for more details):

conda create -n fastchat-awq python=3.10 -y
conda activate fastchat-awq
# cd /path/to/FastChat
pip install --upgrade pip    # enable PEP 660 support
pip install -e .             # install fastchat

git clone https://github.com/mit-han-lab/llm-awq repositories/llm-awq
cd repositories/llm-awq
pip install -e .             # install awq package

cd awq/kernels
python setup.py install      # install awq CUDA kernels
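
After the steps above, a quick import check can confirm that the three installs are visible to the active conda environment. This is a minimal sketch, not part of FastChat or llm-awq: the helper `missing_modules` is hypothetical, and the module names `awq` and `awq_inference_engine` (the compiled CUDA kernel extension) are assumptions about what the installs expose.

```python
import importlib.util

# Assumed top-level module names: "fastchat" and "awq" come from the two
# `pip install -e .` steps; "awq_inference_engine" is assumed to be the
# name of the compiled CUDA kernel extension.
EXPECTED_MODULES = ["fastchat", "awq", "awq_inference_engine"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EXPECTED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All expected modules are importable.")
```

If a module is reported as not importable, rerunning the corresponding install step inside the `fastchat-awq` environment is the first thing to try.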

Chat with the CLI

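# Download the quantized model from Hugging Face
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
# Clone into models/ so the location matches --model-path below
git clone https://huggingface.co/mit-han-lab/vicuna-7b-v1.3-4bit-g128-awq models/vicuna-7b-v1.3-4bit-g128-awq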

# Optionally pass --awq-ckpt to point to a specific quantized checkpoint
python3 -m fastchat.serve.cli \
    --model-path models/vicuna-7b-v1.3-4bit-g128-awq \
    --awq-wbits 4 \
    --awq-groupsize 128
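
The `--awq-wbits 4` and `--awq-groupsize 128` flags describe how the checkpoint stores its weights: as 4-bit integer codes, with quantization parameters shared by each group of 128 weights. The sketch below illustrates only that storage idea, using plain round-to-nearest quantization in pure Python. It is not AWQ's activation-aware algorithm and not the llm-awq implementation; all function names are hypothetical, and storing the group minimum as the offset is a simplification chosen for readability.

```python
import random


def quantize_group(weights, n_bits=4):
    """Round-to-nearest quantization of one group; returns (codes, scale, offset)."""
    q_max = 2 ** n_bits - 1                      # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:                             # constant group: any scale works
        scale = 1.0
    codes = [min(q_max, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate float weights."""
    return [offset + c * scale for c in codes]


def quantize_row(row, n_bits=4, group_size=128):
    """Quantize a row of weights group by group."""
    return [
        quantize_group(row[i:i + group_size], n_bits)
        for i in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Concatenate the dequantized groups back into one row."""
    restored = []
    for codes, scale, offset in groups:
        restored.extend(dequantize_group(codes, scale, offset))
    return restored


# Illustrative data only: 256 random weights form two groups of 128.
random.seed(0)
example_row = [random.uniform(-1.0, 1.0) for _ in range(256)]
example_groups = quantize_row(example_row, n_bits=4, group_size=128)
example_restored = dequantize_row(example_groups)
max_error = max(abs(a - b) for a, b in zip(example_row, example_restored))
print(f"groups: {len(example_groups)}, max abs error: {max_error:.4f}")
```

A smaller group size gives each scale fewer weights to cover, which lowers the rounding error at the cost of storing more per-group parameters.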

Benchmark

  • With 4-bit weight quantization, AWQ lets larger language models fit within device memory limits and substantially accelerates token generation. All benchmarks use a group size of 128.

  • Benchmark on NVIDIA RTX A6000:

    | Model | Bits | Max Memory (MiB) | Speed (ms/token) | AWQ Speedup |
    |---|---|---|---|---|
    | vicuna-7b | 16 | 13543 | 26.06 | / |
    | vicuna-7b | 4 | 5547 | 12.43 | 2.1x |
    | llama2-7b-chat | 16 | 13543 | 27.14 | / |
    | llama2-7b-chat | 4 | 5547 | 12.44 | 2.2x |
    | vicuna-13b | 16 | 25647 | 44.91 | / |
    | vicuna-13b | 4 | 9355 | 17.30 | 2.6x |
    | llama2-13b-chat | 16 | 25647 | 47.28 | / |
    | llama2-13b-chat | 4 | 9355 | 20.28 | 2.3x |

  • NVIDIA RTX 4090:

    | Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
    |---|---|---|---|
    | vicuna-7b | 8.61 | 19.09 | 2.2x |
    | llama2-7b-chat | 8.66 | 19.97 | 2.3x |
    | vicuna-13b | 12.17 | OOM | / |
    | llama2-13b-chat | 13.54 | OOM | / |

  • NVIDIA Jetson Orin:

    | Model | AWQ 4bit Speed (ms/token) | FP16 Speed (ms/token) | AWQ Speedup |
    |---|---|---|---|
    | vicuna-7b | 65.34 | 93.12 | 1.4x |
    | llama2-7b-chat | 75.11 | 104.71 | 1.4x |
    | vicuna-13b | 115.40 | OOM | / |
    | llama2-13b-chat | 136.81 | OOM | / |
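
The speedup column is the FP16 latency divided by the AWQ latency for the same model. The sketch below recomputes it from the RTX A6000 numbers above, together with two derived figures that are sometimes easier to compare: throughput in tokens per second and the fraction of peak memory saved. The helper names are hypothetical; the input numbers are copied from the A6000 table.

```python
def speedup(fp16_ms_per_token, awq_ms_per_token):
    """How many times faster AWQ generates a token than FP16."""
    return fp16_ms_per_token / awq_ms_per_token


def tokens_per_second(ms_per_token):
    """Convert per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


def memory_saving(fp16_mib, awq_mib):
    """Fraction of peak memory saved by the 4-bit model."""
    return 1.0 - awq_mib / fp16_mib


# (model, FP16 memory MiB, AWQ memory MiB, FP16 ms/token, AWQ ms/token),
# copied from the RTX A6000 benchmark table.
A6000_ROWS = [
    ("vicuna-7b", 13543, 5547, 26.06, 12.43),
    ("llama2-7b-chat", 13543, 5547, 27.14, 12.44),
    ("vicuna-13b", 25647, 9355, 44.91, 17.30),
    ("llama2-13b-chat", 25647, 9355, 47.28, 20.28),
]

for model, fp16_mib, awq_mib, fp16_ms, awq_ms in A6000_ROWS:
    print(
        f"{model}: {speedup(fp16_ms, awq_ms):.1f}x faster, "
        f"{tokens_per_second(awq_ms):.1f} tokens/s with AWQ, "
        f"{memory_saving(fp16_mib, awq_mib):.0%} less peak memory"
    )
```

The same functions apply to the RTX 4090 and Jetson Orin tables for the 7B models; the 13B rows there have no FP16 baseline because the FP16 runs ran out of memory.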