whcao commited on
Commit
7a29a0c
·
1 Parent(s): 810142a

WIP: fix readme

Browse files
Files changed (1) hide show
  1. README.md +34 -84
README.md CHANGED
@@ -1,110 +1,60 @@
1
- ---
2
- license: apache-2.0
3
- pipeline_tag: text-generation
4
- ---
5
- <div align="center">
6
- <img src="https://raw.githubusercontent.com/InternLM/lmdeploy/0be9e7ab6fe9a066cfb0a09d0e0c8d2e28435e58/resources/lmdeploy-logo.svg" width="450"/>
7
- </div>
8
 
9
- [LMDeploy](https://github.com/InternLM/lmdeploy) supports LLM model inference of 4-bit weight, with the minimum requirement for NVIDIA graphics cards being sm80, such as A10, A100, Geforce 30/40 series.
10
 
11
- Before proceeding with the inference of `internlm2-chat-7b-4bits`, please ensure that lmdeploy is installed.
12
 
13
- ```shell
14
- pip install lmdeploy
15
- ```
16
 
17
- ## Inference
18
 
19
- Please download `internlm2-chat-7b-4bits` model as follows,
20
 
21
- ```shell
22
- git-lfs install
23
- git clone https://huggingface.co/internlm/internlm2-chat-7b-4bits
24
- ```
25
-
26
- As demonstrated in the command below, you can interact with the AI assistant in the terminal
27
 
28
  ```shell
29
- lmdeploy chat turbomind \
30
- --model-path ./internlm2-chat-7b-4bits \
31
- --model-name internlm2-chat-7b \
32
- --model-format awq \
33
- --group-size 128
34
  ```
35
 
36
- ## Serve with gradio
37
-
38
- If you wish to interact with the model via web UI, please initiate the gradio server as indicated below:
39
 
40
- ```shell
41
- python3 -m lmdeploy.serve.gradio.app ./workspace --server_name {ip_addr} --server_port {port}
42
- ```
43
 
44
- Subsequently, you can open the website `http://{ip_addr}:{port}` in your browser and interact with the model.
45
-
46
- Besides serving with gradio, there are two more serving methods. One is serving with Triton Inference Server (TIS), and the other is an OpenAI-like server named as `api_server`.
47
-
48
- Please refer to the [user guide](https://github.com/InternLM/lmdeploy#quick-start) for detailed information if you are interested.
49
 
 
 
50
 
51
- ## Inference Performance
52
 
53
- LMDeploy provides scripts for benchmarking `token throughput` and `request throughput`.
 
 
 
 
 
 
54
 
55
- `token throughput` tests the speed of generating new tokens, given a specified number of prompt tokens and completion tokens, while `request throughput` measures the number of requests processed per minute with real dialogue data.
56
 
57
- We conducted benchmarks on `internlm2-chat-7b-4bits`. And `token_throughput` was measured by setting 256 prompt tokens and generating 512 tokens in response on A100-80G.
58
 
59
- **Note**: The `session_len` in `workspace/triton_models/weights/config.ini` is changed to `2056` in our test.
60
 
 
61
 
62
- | batch | tensor parallel | prompt_tokens | completion_tokens | thr_per_proc(token/s) | rpm (req/min) | mem_per_proc(GB) |
63
- |-------|-----------------|---------------|-------------------|-----------------------|---------------|------------------|
64
- | 1 | 1 | 256 | 512 | 88.77 | - | 15.65 |
65
- | 16 | 1 | 256 | 512 | 792.7 | 220.23 | 51.46 |
66
 
67
- ### token throughput
 
 
68
 
69
- Run the following command,
70
 
71
  ```shell
72
- python benchmark/profile_generation.py \
73
- --model-path ./workspace \
74
- --concurrency 1 8 16 --prompt-tokens 256 512 512 1024 --completion-tokens 512 512 1024 1024
75
- --dst-csv ./token_throughput.csv
76
  ```
77
- You will find the `token_throughput` metrics in `./token_throughput.csv`
78
-
79
- | batch | prompt_tokens | completion_tokens | thr_per_proc(token/s) | thr_per_node(token/s) | rpm(req/min) | mem_per_proc(GB) | mem_per_gpu(GB) | mem_per_node(GB) |
80
- |-------|---------------|-------------------|-----------------------|-----------------------|--------------|------------------|-----------------|------------------|
81
- | 1 | 256 | 512 | 88.77 | 710.12 | - | 15.65 | 15.65 | 125.21 |
82
- | 1 | 512 | 512 | 83.89 | 671.15 | - | 15.68 | 15.68 | 125.46 |
83
- | 1 | 512 | 1024 | 80.19 | 641.5 | - | 15.68 | 15.68 | 125.46 |
84
- | 1 | 1024 | 1024 | 72.34 | 578.74 | - | 15.75 | 15.75 | 125.96 |
85
- | 1 | 1 | 2048 | 80.69 | 645.55 | - | 15.62 | 15.62 | 124.96 |
86
- | 8 | 256 | 512 | 565.21 | 4521.67 | - | 32.37 | 32.37 | 258.96 |
87
- | 8 | 512 | 512 | 489.04 | 3912.33 | - | 32.62 | 32.62 | 260.96 |
88
- | 8 | 512 | 1024 | 467.23 | 3737.84 | - | 32.62 | 32.62 | 260.96 |
89
- | 8 | 1024 | 1024 | 383.4 | 3067.19 | - | 33.06 | 33.06 | 264.46 |
90
- | 8 | 1 | 2048 | 487.74 | 3901.93 | - | 32.12 | 32.12 | 256.96 |
91
- | 16 | 256 | 512 | 792.7 | 6341.6 | - | 51.46 | 51.46 | 411.71 |
92
- | 16 | 512 | 512 | 639.4 | 5115.17 | - | 51.93 | 51.93 | 415.46 |
93
- | 16 | 512 | 1024 | 591.39 | 4731.09 | - | 51.93 | 51.93 | 415.46 |
94
- | 16 | 1024 | 1024 | 449.11 | 3592.85 | - | 52.06 | 52.06 | 416.46 |
95
- | 16 | 1 | 2048 | 620.5 | 4964.02 | - | 51 | 51 | 407.96 |
96
-
97
-
98
- ### request throughput
99
-
100
- LMDeploy uses ShareGPT dataset to test request throughput. Try the next commands, and you will get the `rpm` (request per minute) metric.
101
 
102
- ```
103
- # download the ShareGPT dataset
104
- wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
105
- #
106
- python profile_throughput.py \
107
- ShareGPT_V3_unfiltered_cleaned_split.json \
108
- ./workspace \
109
- --concurrency 16
110
- ```
 
1
+ # INT4 Weight-only Quantization and Deployment (W4A16)
 
 
 
 
 
 
2
 
3
+ LMDeploy adopts [AWQ](https://arxiv.org/abs/2306.00978) algorithm for 4bit weight-only quantization. By developed the high-performance cuda kernel, the 4bit quantized model inference achieves up to 2.4x faster than FP16.
4
 
5
+ LMDeploy supports the following NVIDIA GPU for W4A16 inference:
6
 
7
+ - Turing(sm75): 20 series, T4
 
 
8
 
9
+ - Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
10
 
11
+ - Ada Lovelace(sm90): 40 series
12
 
13
+ Before proceeding with the quantization and inference, please ensure that lmdeploy is installed.
 
 
 
 
 
14
 
15
  ```shell
16
+ pip install lmdeploy[all]
 
 
 
 
17
  ```
18
 
19
+ This article comprises the following sections:
 
 
20
 
21
+ <!-- toc -->
 
 
22
 
23
+ - [Inference](#inference)
24
+ - [Evaluation](#evaluation)
25
+ - [Service](#service)
 
 
26
 
27
+ <!-- tocstop -->
28
+ ## Inference
29
 
30
+ Trying the following codes, you can perform the batched offline inference with the quantized model:
31
 
32
+ ```python
33
+ from lmdeploy import pipeline, TurbomindEngineConfig
34
+ engine_config = TurbomindEngineConfig(model_format='awq')
35
+ pipe = pipeline("internlm/internlm2-chat-7b-4bits", engine_config)
36
+ response = pipe(["Hi, pls intro yourself", "Shanghai is"])
37
+ print(response)
38
+ ```
39
 
40
+ For more information about the pipeline parameters, please refer to [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md).
41
 
42
+ ## Evaluation
43
 
44
+ Please overview [this guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_turbomind.html) about model evaluation with LMDeploy.
45
 
46
+ ## Service
47
 
48
+ LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below are an example of service startup:
 
 
 
49
 
50
+ ```shell
51
+ lmdeploy serve api_server internlm/internlm2-chat-7b-4bits --backend turbomind --model-format awq
52
+ ```
53
 
54
+ The default port of `api_server` is `23333`. After the server is launched, you can communicate with server on terminal through `api_client`:
55
 
56
  ```shell
57
+ lmdeploy serve api_client http://0.0.0.0:23333
 
 
 
58
  ```
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
59
 
60
+ You can overview and try out `api_server` APIs online by swagger UI at `http://0.0.0.0:23333`, or you can also read the API specification from [here](../serving/restful_api.md).