add details to generate the model
Signed-off-by: wenhuach <wenhuach87@gmail.com>
- README.md +63 -11
- config.json +1 -1
README.md
CHANGED
@@ -4,31 +4,32 @@ datasets:
|
|
4 |
base_model:
|
5 |
- deepseek-ai/DeepSeek-V3
|
6 |
|
|
|
7 |
---
|
8 |
|
9 |
## Model Details
|
10 |
|
11 |
This model is an int4 model with group_size 128 and symmetric quantization of [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) generated by [intel/auto-round](https://github.com/intel/auto-round) algorithm.
|
12 |
|
13 |
-
**
|
14 |
|
15 |
-
Due to limited GPU resources, we have only tested a few prompts on a CPU backend with
|
16 |
|
17 |
Please follow the license of the original model.
|
18 |
|
19 |
## How To Use
|
20 |
|
21 |
-
### INT4 Inference on CPU with
|
22 |
-
|
23 |
|
24 |
**pip3 install auto-round** (this installs both intel-extension-for-pytorch and intel-extension-for-transformers). On Intel CPUs, intel-extension-for-pytorch is prioritized; on other CPUs, intel-extension-for-transformers is prioritized.
|
25 |
|
26 |
**To make sure qbits is used with intel-extension-for-transformers, please uninstall intel-extension-for-pytorch.**
|
27 |
|
28 |
-
intel-extension-for-transformers faster repacking, slower inference, higher accuracy
|
29 |
-
|
|
|
30 |
|
31 |
-
|
32 |
from auto_round import AutoRoundConfig  # must be imported for the autoround format
|
33 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
34 |
import torch
|
@@ -160,10 +161,11 @@ prompt = "There is a girl who likes adventure,"
|
|
160 |
prompt = "Please give a brief introduction of DeepSeek company."
|
161 |
##INT4:
|
162 |
"""DeepSeek Artificial Intelligence Co., Ltd. (referred to as "DeepSeek" or "深度求索") , founded in 2023, is a Chinese company dedicated to making AGI a reality"""
|
163 |
-
|
164 |
|
165 |
### INT4 Inference on CUDA (not tested; may need 8 x 80 GB GPUs)
|
166 |
-
|
|
|
167 |
|
168 |
````python
|
169 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
@@ -211,9 +213,59 @@ we have no enough resource to evaluate the model
|
|
211 |
|
212 |
### Generate the model
|
213 |
|
214 |
-
|
215 |
|
216 |
|
217 |
|
218 |
## Ethical Considerations and Limitations
|
219 |
|
@@ -237,4 +289,4 @@ The license on this model does not constitute legal advice. We are not responsib
|
|
237 |
|
238 |
@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }
|
239 |
|
240 |
-
[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)
|
|
|
4 |
base_model:
|
5 |
- deepseek-ai/DeepSeek-V3
|
6 |
|
7 |
+
|
8 |
---
|
9 |
|
10 |
## Model Details
|
11 |
|
12 |
This model is an int4 model with group_size 128 and symmetric quantization of [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) generated by [intel/auto-round](https://github.com/intel/auto-round) algorithm.
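To illustrate what "int4 with group_size 128 and symmetric quantization" means, here is a minimal sketch using plain round-to-nearest. It is not AutoRound's tuned rounding, and the function names, the divisor 7, and the clamp range [-8, 7] are assumptions chosen for illustration (one common convention among several).

```python
def quantize_group_sym_int4(weights):
    """Quantize one group of floats to signed 4-bit integers with a shared scale.

    Symmetric means there is no zero point: 0.0 always maps to integer 0.
    """
    max_abs = max(abs(w) for w in weights)
    # Assumption: map the largest magnitude to 7; other conventions use 8.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def quantize_sym_int4(weights, group_size=128):
    """Split a flat weight list into groups and quantize each independently."""
    groups = []
    for start in range(0, len(weights), group_size):
        groups.append(quantize_group_sym_int4(weights[start:start + group_size]))
    return groups


def dequantize(groups):
    """Reconstruct approximate floats from (int4 values, scale) pairs."""
    return [qi * scale for q, scale in groups for qi in q]


if __name__ == "__main__":
    weights = [i / 100 for i in range(-128, 128)]  # 256 values -> 2 groups
    groups = quantize_sym_int4(weights)
    restored = dequantize(groups)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(len(groups), "groups, worst reconstruction error:", worst)
```

Each group of 128 weights stores one scale, so a single outlier only degrades the precision of its own group rather than the whole tensor.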
|
13 |
|
14 |
+
**On CUDA devices, this model is prone to overflow caused by the INT4 kernel using the FP16 computation dtype. Additionally, loading the model in Transformers can be quite slow. Consider using an alternative serving framework capable of running INT4 models with the BF16 computation dtype.**
|
15 |
|
16 |
+
Due to limited GPU resources, we have only tested a few prompts on a CPU backend with intel-extension-for-transformers. If this model does not meet your performance expectations, you may explore another quantized model in AWQ format, generated via AutoRound with different hyperparameters. This alternative model will be uploaded soon.
|
17 |
|
18 |
Please follow the license of the original model.
|
19 |
|
20 |
## How To Use
|
21 |
|
22 |
+
### INT4 Inference on CPU with ITREX (Recommended)
|
|
|
23 |
|
24 |
**pip3 install auto-round** (this installs both intel-extension-for-pytorch and intel-extension-for-transformers). On Intel CPUs, intel-extension-for-pytorch is prioritized; on other CPUs, intel-extension-for-transformers is prioritized.
|
25 |
|
26 |
**To make sure qbits is used with intel-extension-for-transformers, please uninstall intel-extension-for-pytorch.**
|
27 |
|
28 |
+
intel-extension-for-transformers: faster repacking, slower inference, higher accuracy
|
29 |
+
|
30 |
+
intel-extension-for-pytorch: much slower repacking, faster inference, lower accuracy
|
31 |
|
32 |
+
~~~python
|
33 |
from auto_round import AutoRoundConfig  # must be imported for the autoround format
|
34 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
35 |
import torch
|
|
|
161 |
prompt = "Please give a brief introduction of DeepSeek company."
|
162 |
##INT4:
|
163 |
"""DeepSeek Artificial Intelligence Co., Ltd. (referred to as "DeepSeek" or "深度求索") , founded in 2023, is a Chinese company dedicated to making AGI a reality"""
|
164 |
+
~~~
|
165 |
|
166 |
### INT4 Inference on CUDA (not tested; may need 8 x 80 GB GPUs)
|
167 |
+
|
168 |
+
An INT4 kernel with the BF16 compute dtype is required.
|
169 |
|
170 |
````python
|
171 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
|
213 |
|
214 |
### Generate the model
|
215 |
|
216 |
+
**5 x 80 GB GPUs are needed (this could be optimized), plus 1.4 TB of CPU memory.**
|
217 |
+
|
218 |
+
When testing a few prompts, we found that the inputs and outputs of certain layers in this model are very large and can even exceed the FP16 range. We recommend excluding these layers from quantization (particularly the 'down_proj' in layer 60) and running them in BF16 precision instead. However, we have not done this for this INT4 model, because on CPU the compute dtype for INT4 is BF16 or FP32.
|
219 |
+
|
220 |
+
~~~python
|
221 |
+
model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
|
222 |
+
model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
|
223 |
+
model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
|
224 |
+
model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
|
225 |
+
model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
|
226 |
+
model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
|
227 |
+
model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
|
228 |
+
model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
|
229 |
+
|
230 |
+
~~~
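As a rough illustration of how the log above can be read, the sketch below flags entries whose recorded magnitude exceeds the largest finite FP16 value (65504). It assumes the two `tensor(...)` values on each line are peak input and output magnitudes, which the log itself does not state; the function and constant names are hypothetical.

```python
import re

FP16_MAX = 65504.0  # largest finite value representable in float16

# Log lines copied from the model card.
LOG = """\
model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
"""

LINE = re.compile(r"^(\S+) tensor\(([\d.]+)\) tensor\(([\d.]+)\)$")


def layers_exceeding_fp16(log_text):
    """Return sorted layer names with any recorded value above FP16_MAX."""
    flagged = set()
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # skip blank or unrecognized lines
        name = match.group(1)
        first, second = float(match.group(2)), float(match.group(3))
        if max(first, second) > FP16_MAX:
            flagged.add(name)
    return sorted(flagged)


if __name__ == "__main__":
    print(layers_exceeding_fp16(LOG))
```

Values that stay below the threshold here can still overflow in FP16 once they are accumulated inside a kernel, so this check is a lower bound on the problem rather than a guarantee of safety.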
|
231 |
+
|
232 |
+
|
233 |
+
|
234 |
+
**1. Add metadata to the BF16 model** https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16
|
235 |
+
|
236 |
+
~~~python
|
237 |
+
import safetensors
|
238 |
+
from safetensors.torch import save_file
|
239 |
+
|
240 |
+
|
241 |
+
for i in range(1, 164):  # shards are numbered 1..163
|
242 |
+
    idx_str = f"{i:05d}"  # zero-pad the shard index to five digits
|
243 |
+
    safetensors_path = f"model-{idx_str}-of-000163.safetensors"
|
244 |
+
    print(safetensors_path)
|
245 |
+
    tensors = dict()
|
246 |
+
    with safetensors.safe_open(safetensors_path, framework="pt") as f:
|
247 |
+
        for key in f.keys():
|
248 |
+
            tensors[key] = f.get_tensor(key)
|
249 |
+
    save_file(tensors, safetensors_path, metadata={'format': 'pt'})  # rewrite the shard in place with metadata
|
250 |
+
~~~
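After rewriting the shards, it can be useful to confirm that every file now carries the metadata. The sketch below reads only the safetensors header (an 8-byte little-endian length followed by a JSON object whose optional `__metadata__` key holds the metadata) using the standard library. The helper names and the `directory` argument are hypothetical, and the shard naming pattern is taken from the script above.

```python
import json
import struct
from pathlib import Path


def read_safetensors_metadata(path):
    """Return the `__metadata__` dict from a safetensors header, or {} if absent."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # header size in bytes
        header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})


def shards_missing_format(directory, total=163):
    """List shard file names whose metadata lacks format == 'pt'."""
    missing = []
    for i in range(1, total + 1):
        path = Path(directory) / f"model-{i:05d}-of-{total:06d}.safetensors"
        if read_safetensors_metadata(path).get("format") != "pt":
            missing.append(path.name)
    return missing
```

Calling `shards_missing_format("/models/DeepSeek-V3-bf16/")` should return an empty list once step 1 has completed for all shards; because only the header is parsed, the check does not load any tensor data.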
|
251 |
+
|
252 |
+
|
253 |
+
|
254 |
+
**2. Replace modeling_deepseek.py with the following file.** It mainly aligns devices and removes torch.no_grad, since AutoRound needs to do some tuning.
|
255 |
+
|
256 |
+
https://github.com/intel/auto-round/blob/deepseekv3/modeling_deepseek.py
|
257 |
+
|
258 |
+
|
259 |
+
|
260 |
+
**3. Tuning**
|
261 |
|
262 |
+
~~~
|
263 |
+
git clone https://github.com/intel/auto-round.git && cd auto-round && git checkout deepseekv3
|
264 |
+
~~~
|
265 |
|
266 |
+
```bash
|
267 |
+
python3 -m auto_round --model "/models/DeepSeek-V3-bf16/" --group_size 128 --format "auto_gptq" --iters 200 --devices 0,1,2,3,4 --nsamples 512 --batch_size 8 --seqlen 512 --low_gpu_mem_usage --output_dir "tmp_autoround" --disable_eval 2>&1 | tee -a seekv3.txt
|
268 |
+
```
|
269 |
|
270 |
## Ethical Considerations and Limitations
|
271 |
|
|
|
289 |
|
290 |
@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }
|
291 |
|
292 |
+
[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)
|
config.json
CHANGED
@@ -79,7 +79,7 @@
|
|
79 |
"tie_word_embeddings": false,
|
80 |
"topk_group": 4,
|
81 |
"topk_method": "noaux_tc",
|
82 |
-
"torch_dtype": "
|
83 |
"transformers_version": "4.47.0",
|
84 |
"use_cache": true,
|
85 |
"v_head_dim": 128,
|
|
|
79 |
"tie_word_embeddings": false,
|
80 |
"topk_group": 4,
|
81 |
"topk_method": "noaux_tc",
|
82 |
+
"torch_dtype": "bfloat16",
|
83 |
"transformers_version": "4.47.0",
|
84 |
"use_cache": true,
|
85 |
"v_head_dim": 128,
|