Update README.md
README.md

---

## Model Recipe Details

This is an int4 model recipe, with group_size 128, for [facebook/opt-1.3b](https://huggingface.co/facebook/opt-1.3b), generated by [intel/auto-round](https://github.com/intel/auto-round).
Inference of this model is compatible with the AutoGPTQ kernel.
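
For example, on a CUDA GPU the exported checkpoint can be loaded through the standard `transformers` GPTQ integration. This is a minimal sketch, not part of the original card: it assumes `auto-gptq`, `optimum`, and `accelerate` are installed and that the model was exported in the `auto_gptq` format used by the quantization command below.

```python
# Sketch (assumption): transformers dispatches GPTQ checkpoints to the
# AutoGPTQ kernels when auto-gptq and optimum are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: the directory written by the quantization step below.
model_path = "./tmp_autoround/<model directory name>"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")  # device_map needs accelerate
tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```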

### Quantize the model

Here is a sample command to reproduce the model:

```bash
pip install auto-round

auto-round \
--model facebook/opt-1.3b \
--device 0 \
--group_size 128 \
--bits 4 \
--iters 1000 \
--nsamples 512 \
--format 'auto_gptq' \
--minmax_lr 2e-3 \
--disable_quanted_input \
--output_dir "./tmp_autoround"
```
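
The same recipe can also be run from Python. The sketch below is not part of the original card; it assumes a recent `auto-round` release whose `AutoRound` class accepts keyword arguments mirroring the CLI flags above.

```python
# Sketch (assumption): auto-round's Python API with arguments mirroring the CLI.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-1.3b"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# bits, group_size, iters, and nsamples correspond to the CLI flags above.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, iters=1000, nsamples=512)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_gptq")
```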

## How to use

### INT4 Inference with IPEX on Intel CPU

Install the latest [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch) and [Intel Neural Compressor](https://github.com/intel/neural-compressor):

```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install intel_extension_for_pytorch
pip install neural_compressor_pt
```

```python
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM

## note: use the quantized model directory name below
model_name_or_path = "./tmp_autoround/<model directory name>"
q_model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

prompt = "Once upon a time, a little girl"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
print(tokenizer.decode(q_model.generate(**tokenizer(prompt, return_tensors="pt").to(q_model.device), max_new_tokens=50)[0]))
## Once upon a time, a little girl was born. She was a beautiful little girl, with a beautiful smile. She was a little girl who loved to play. She was a little girl who loved to sing. She was a little girl who loved to dance.
```

### INT4 Inference on Intel Gaudi Accelerator

A Docker image with the Gaudi software stack is recommended; more details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/).

```python
import habana_frameworks.torch.core as htcore  # importing this registers the HPU backend
from neural_compressor.torch.quantization import load
from transformers import AutoTokenizer, AutoModelForCausalLM

## note: use the quantized model directory name below
model_name_or_path = "./tmp_autoround/<model directory name>"

model = load(
    model_name_or_path=model_name_or_path,
    format="huggingface",
    device="hpu"
)

prompt = "Once upon a time, a little girl"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
print(tokenizer.decode(model.generate(**tokenizer(prompt, return_tensors="pt").to("hpu"), max_new_tokens=50)[0]))
```

## Accuracy Result

Evaluated with [lm-eval-harness 0.4.2](https://github.com/EleutherAI/lm-evaluation-harness.git) installed from source.

| Metric | FP16   | INT4   |
| ------ | ------ | ------ |
| Avg.   | 0.4405 | 0.4315 |
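
A command along these lines could reproduce the evaluation. This is a sketch rather than the card's own command, and the task list is an illustrative assumption, since the card does not state which tasks the average covers.

```bash
# Sketch: install lm-eval-harness 0.4.2 from source, then evaluate the
# quantized checkpoint. The task list below is an assumption for illustration.
pip install "lm_eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@v0.4.2"

lm_eval --model hf \
    --model_args pretrained="./tmp_autoround/<model directory name>" \
    --tasks lambada_openai,hellaswag,piqa,winogrande \
    --batch_size 16
```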