license: apache-2.0
pipeline_tag: text-generation
LMDeploy supports LLM model inference of 4-bit weight, with the minimum requirement for NVIDIA graphics cards being sm80, such as A10, A100, Geforce 30/40 series.
Before proceeding with the inference of internlm2-chat-20b-4bits
, please ensure that lmdeploy is installed.
pip install 'lmdeploy>=0.0.11'
Inference
Please download internlm2-chat-20b-4bits
model as follows,
git-lfs install
git clone https://huggingface.co/internlm/internlm2-chat-20b-4bits
As demonstrated in the command below, you can interact with the AI assistant in the terminal
lmdeploy chat turbomind \
--model-path ./internlm2-chat-20b-4bits \
--model-name internlm2-chat-20b \
--model-format awq \
--group-size 128
Serve with gradio
If you wish to interact with the model via web UI, please initiate the gradio server as indicated below:
python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}
Subsequently, you can open the website http://{ip_addr}:{port}
in your browser and interact with the model.
Besides serving with gradio, there are two more serving methods. One is serving with Triton Inference Server (TIS), and the other is an OpenAI-like server named as api_server
.
Please refer to the user guide for detailed information if you are interested.
Inference Performance
LMDeploy provides scripts for benchmarking token throughput
and request throughput
.
token throughput
tests the speed of generating new tokens, given a specified number of prompt tokens and completion tokens, while request throughput
measures the number of requests processed per minute with real dialogue data.
We conducted benchmarks on internlm2-chat-20b-4bits
. And token_throughput
was measured by setting 256 prompt tokens and generating 512 tokens in response on A100-80G.
Note: The session_len
in workspace/triton_models/weights/config.ini
is changed to 2056
in our test.
batch | tensor parallel | prompt_tokens | completion_tokens | thr_per_proc(token/s) | rpm (req/min) | mem_per_proc(GB) |
---|---|---|---|---|---|---|
1 | 1 | 256 | 512 | 88.77 | - | 15.65 |
16 | 1 | 256 | 512 | 792.7 | 220.23 | 51.46 |
token throughput
Run the following command,
python benchmark/profile_generation.py \
--model-path ./workspace \
--concurrency 1 8 16 --prompt-tokens 256 512 512 1024 --completion-tokens 512 512 1024 1024
--dst-csv ./token_throughput.csv
You will find the token_throughput
metrics in ./token_throughput.csv
batch | prompt_tokens | completion_tokens | thr_per_proc(token/s) | thr_per_node(token/s) | rpm(req/min) | mem_per_proc(GB) | mem_per_gpu(GB) | mem_per_node(GB) |
---|---|---|---|---|---|---|---|---|
1 | 256 | 512 | 88.77 | 710.12 | - | 15.65 | 15.65 | 125.21 |
1 | 512 | 512 | 83.89 | 671.15 | - | 15.68 | 15.68 | 125.46 |
1 | 512 | 1024 | 80.19 | 641.5 | - | 15.68 | 15.68 | 125.46 |
1 | 1024 | 1024 | 72.34 | 578.74 | - | 15.75 | 15.75 | 125.96 |
1 | 1 | 2048 | 80.69 | 645.55 | - | 15.62 | 15.62 | 124.96 |
8 | 256 | 512 | 565.21 | 4521.67 | - | 32.37 | 32.37 | 258.96 |
8 | 512 | 512 | 489.04 | 3912.33 | - | 32.62 | 32.62 | 260.96 |
8 | 512 | 1024 | 467.23 | 3737.84 | - | 32.62 | 32.62 | 260.96 |
8 | 1024 | 1024 | 383.4 | 3067.19 | - | 33.06 | 33.06 | 264.46 |
8 | 1 | 2048 | 487.74 | 3901.93 | - | 32.12 | 32.12 | 256.96 |
16 | 256 | 512 | 792.7 | 6341.6 | - | 51.46 | 51.46 | 411.71 |
16 | 512 | 512 | 639.4 | 5115.17 | - | 51.93 | 51.93 | 415.46 |
16 | 512 | 1024 | 591.39 | 4731.09 | - | 51.93 | 51.93 | 415.46 |
16 | 1024 | 1024 | 449.11 | 3592.85 | - | 52.06 | 52.06 | 416.46 |
16 | 1 | 2048 | 620.5 | 4964.02 | - | 51 | 51 | 407.96 |
request throughput
LMDeploy uses ShareGPT dataset to test request throughput. Try the next commands, and you will get the rpm
(request per minute) metric.
# download the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
#
python profile_throughput.py \
ShareGPT_V3_unfiltered_cleaned_split.json \
./workspace \
--concurrency 16