## Model Details

This AWQ-format model is an INT4 quantization of [Qwen/QwQ-32B-Preview](https://huggingface.co/Qwen/QwQ-32B-Preview) with group_size 128 and symmetric quantization, generated by [intel/auto-round](https://github.com/intel/auto-round). Three layers were excluded from quantization because of an overflow issue on some INT4 backends.
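
To make "INT4, group_size 128, symmetric" concrete, here is a minimal sketch of group-wise symmetric quantization using plain round-to-nearest. It is an illustration only: auto-round tunes the rounding and does not simply round to nearest, and the function names, the `[-8, 7]` integer range, and the `max_abs / 7` scale rule are assumptions made for this example, not details taken from the auto-round code.

```python
from typing import List, Tuple

Group = Tuple[List[int], float]  # (quantized integers, per-group scale)


def quantize_group_sym_int4(weights: List[float]) -> Group:
    """Round-to-nearest symmetric INT4 quantization of one weight group.

    Assumption for this sketch: signed range [-8, 7], scale = max|w| / 7.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Symmetric means there is no zero point: q = round(w / scale).
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def quantize_sym_int4(weights: List[float], group_size: int = 128) -> List[Group]:
    """Split a flat weight row into groups; each group gets its own scale."""
    return [
        quantize_group_sym_int4(weights[start:start + group_size])
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups: List[Group]) -> List[float]:
    """Reconstruct approximate weights as q * scale."""
    return [q * scale for quantized, scale in groups for q in quantized]


# Toy weight row (hypothetical values): 256 weights, so two groups of 128.
row = [0.01 * (i - 128) for i in range(256)]
groups = quantize_sym_int4(row, group_size=128)
restored = dequantize(groups)
print(len(groups), max(abs(a - b) for a, b in zip(row, restored)))
```

With one scale per 128 weights, the reconstruction error of each weight is bounded by half of its group's scale, which is why smaller groups generally cost more storage but lose less precision.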

## How To Use

### INT4 Inference (CPU/HPU/CUDA)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPEA/QwQ-32B-Preview-int4-sym-mixed-awq-inc"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r in strawberry."
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    do_sample=False  # change this to follow official usage
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)


prompt = "9.11和9.8哪个数字大"  # English: "Which number is larger, 9.11 or 9.8?"
# INT4 output:
"""9.11和9.8,哪个数字大呢?让我想想。首先,这两个数字都是小数,对吧?9.11和9.8。我需要比较它们的大小。

首先,我看看整数部分。两个数字的整数部分都是9,所以整数部分相等。那我就需要看小数部分。

小数部分,9.11是0.11,而9.8是0.8。现在比较0.11和0.8,哪个更大。

0.8看起来比0.11大,因为8比1大。但是,为了确信,我可以把它们看成分数。

0.8是8/10,而0.11是11/100。为了比较它们,我可以把它们转换成相同的分母。

10和100的最小公分母是100。所以,8/10等于80/100,而11/100 remains 11/100。

现在,80/100大于11/100,所以0.8大于0.11。

因此,9.8大于9.11。

不过,再想想,也许我应该直接比较小数。9.11是9加上0.11,9.8是9加上0.8。

很明显,0.8大于0.11,所以9.8大于9.11。

或者,我可以把它们看成货币,比如美元。9.11美元和9.8美元,哪个更多?

9.8美元显然比9.11美元多。

再或者,想想它们在数轴上的位置。9.11在9和10之间,靠近9.1,而9.8在9和10之间,靠近9.8。

显然,9.8在数轴上更靠右,所以更大。

另外,我也可以把它们转换成分数来比较。

9.11是9又11/100,9.8是9又8/10,which is 9又4/5.

现在,比较11/100和4/5.

11/100 is 0.11, and 4/5 is 0.8.

Again, 0.8 is larger than 0.1"""

prompt = "How many r in strawberry."
# INT4 output:
"""Let's see. The word is "strawberry." I need to find out how many times the letter "r" appears in it.

First, I'll spell out the word to make sure I don't miss any letters. S-T-R-A-W-B-E-R-R-Y. Okay, that's all the letters in "strawberry."

Now, I need to count how many "r"s are there. Let's go through the word one letter at a time.

Starting with the first letter: S - not an "r".

Second letter: T - not an "r".

Third letter: R - that's one "r".

Fourth letter: A - not an "r".

Fifth letter: W - not an "r".

Sixth letter: B - not an "r".

Seventh letter: E - not an "r".

Eighth letter: R - that's another "r".

Ninth letter: R - that's another "r".

Tenth letter: Y - not an "r".

So, I've found three "r"s in "strawberry."

Wait a minute, let me double-check. Sometimes I might miscount, especially if there are multiple "r"s close together.

Let's spell it again: S-T-R-A-W-B-E-R-R-Y.

First "r" is the third letter.

Second "r" is the eighth letter.

Third "r" is the ninth letter.

Yes, that's three "r"s in total.

I think that's correct.

**Final Answer**

\[ \boxed{3} \]"""

# BF16 output:
"""Let's see. The word is "strawberry." I need to find out how many times the letter "r" appears in it. Okay, so I'll look at each letter in the word one by one.

First letter: s - that's not r.

Second letter: t - no, not r.

Third letter: r - okay, that's one r.

Fourth letter: a - not r.

Fifth letter: w - not r.

Sixth letter: b - no.

Seventh letter: e - not r.

Eighth letter: r - another r, so that's two rs.

Ninth letter: r - wait, is there a ninth letter? Let me check. S-t-r-a-w-b-e-r-r-y. Yes, there are two rs, but I think there might be more.

Wait, let's count again. S-t-r-a-w-b-e-r-r-y. That's 10 letters. So, positions:

1: s

2: t

3: r

4: a

5: w

6: b

7: e

8: r

9: r

10: y

So, positions 3, 8, and 9 are rs. That means there are three rs in "strawberry."

But earlier I thought there were only two. Maybe I missed one. Let's double-check.

S-t-r-a-w-b-e-r-r-y.

r is the third letter, then the eighth, and the ninth. So, three rs.

Wait, but sometimes people might pronounce it differently, but in the spelling, it's three rs.

I think the answer is three.

**Final Answer**

\[ \boxed{3} \]
"""

```
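
In the inference script above, `model.generate` returns each sequence with the prompt tokens followed by the newly generated tokens, so the list comprehension slices off the first `len(input_ids)` entries before decoding. The following sketch shows that step in isolation on plain Python lists; the helper name `strip_prompt_tokens` and the token ids are hypothetical and chosen only for illustration.

```python
from typing import List


def strip_prompt_tokens(
    input_ids: List[List[int]], output_ids: List[List[int]]
) -> List[List[int]]:
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Toy token ids (hypothetical values, not real tokenizer output).
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = strip_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)  # [[42, 43, 2]]
```

Without this step, the decoded response would repeat the system and user messages before the model's answer.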



### Generate the model

Here is a sample command to generate the model. With symmetric quantization, we found that overflow/NaN can occur on some backends, so it is better to fall back some layers to higher precision via `--fp_layers`. This requires auto-round version >0.4.1.

```bash
auto-round \
--model Qwen/QwQ-32B-Preview \
--device 0 \
--group_size 128 \
--bits 4 \
--disable_eval \
--model_dtype "fp16" \
--fp_layers "model.layers.5.mlp.down_proj,model.layers.5.mlp.up_proj,model.layers.5.mlp.gate_proj" \
--format 'auto_round' \
--output_dir "./tmp_autoround" 
```
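
The `--fp_layers` value in the command above is a comma-separated list of module names. If more layers need to fall back, the string can be built programmatically rather than typed by hand. This is a small sketch, not part of auto-round: the helper `build_fp_layers` is hypothetical, and it assumes other decoder layers follow the same `model.layers.<index>.mlp.<projection>` naming seen in the command.

```python
from typing import Iterable, Sequence

# Projection names in the order they appear in the command above.
MLP_PROJECTIONS = ("down_proj", "up_proj", "gate_proj")


def build_fp_layers(
    layer_indices: Iterable[int],
    projections: Sequence[str] = MLP_PROJECTIONS,
) -> str:
    """Join module names into the comma-separated form passed to --fp_layers."""
    return ",".join(
        f"model.layers.{index}.mlp.{name}"
        for index in layer_indices
        for name in projections
    )


# Reproduces the value used for this model (layer 5 only).
print(build_fp_layers([5]))
```

Which layers actually need the fallback depends on the model and backend; the source only reports layer 5 for this model.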

## Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

- [Intel Neural Compressor](https://github.com/intel/neural-compressor)

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

## Cite

@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }

[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)