File size: 2,475 Bytes
278e8bb 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c 8253e76 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 7a29a0c ad163e9 fd2432e |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 |
---
license: apache-2.0
pipeline_tag: text-generation
---
<div align="center">
<img src="https://raw.githubusercontent.com/InternLM/lmdeploy/0be9e7ab6fe9a066cfb0a09d0e0c8d2e28435e58/resources/lmdeploy-logo.svg" width="450"/>
</div>
# INT4 Weight-only Quantization and Deployment (W4A16)
LMDeploy adopts [AWQ](https://arxiv.org/abs/2306.00978) algorithm for 4bit weight-only quantization. By developed the high-performance cuda kernel, the 4bit quantized model inference achieves up to 2.4x faster than FP16.
LMDeploy supports the following NVIDIA GPU for W4A16 inference:
- Turing(sm75): 20 series, T4
- Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
- Ada Lovelace(sm90): 40 series
Before proceeding with the quantization and inference, please ensure that lmdeploy is installed.
```shell
pip install lmdeploy[all]
```
This article comprises the following sections:
<!-- toc -->
- [Inference](#inference)
- [Evaluation](#evaluation)
- [Service](#service)
<!-- tocstop -->
## Inference
Trying the following codes, you can perform the batched offline inference with the quantized model:
```python
from lmdeploy import pipeline, TurbomindEngineConfig
engine_config = TurbomindEngineConfig(model_format='awq')
pipe = pipeline("internlm/internlm2-chat-7b-4bits", backend_config=engine_config)
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
```
For more information about the pipeline parameters, please refer to [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/inference/pipeline.md).
## Evaluation
Please overview [this guide](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_turbomind.html) about model evaluation with LMDeploy.
## Service
LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below are an example of service startup:
```shell
lmdeploy serve api_server internlm/internlm2-chat-7b-4bits --backend turbomind --model-format awq
```
The default port of `api_server` is `23333`. After the server is launched, you can communicate with server on terminal through `api_client`:
```shell
lmdeploy serve api_client http://0.0.0.0:23333
```
You can overview and try out `api_server` APIs online by swagger UI at `http://0.0.0.0:23333`, or you can also read the API specification from [here](https://github.com/InternLM/lmdeploy/blob/main/docs/en/serving/restful_api.md).
|