wenhuach committed 7c8c9c6 · 1 Parent(s): 8303562

add details to generate the model


Signed-off-by: wenhuach <wenhuach87@gmail.com>

Files changed (2)
  1. README.md +63 -11
  2. config.json +1 -1
README.md CHANGED
@@ -4,31 +4,32 @@ datasets:
4
  base_model:
5
  - deepseek-ai/DeepSeek-V3
6
 
 
7
  ---
8
 
9
  ## Model Details
10
 
11
  This model is an int4 model with group_size 128 and symmetric quantization of [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) generated by [intel/auto-round](https://github.com/intel/auto-round) algorithm.
12
 
13
- **Please note this model may introduce overflow issue cuased by FP16 kernel , which is tipically used on cuda device. Additionally, loading the model in Transformers can be quite slow. Consider using an alternative serving framework that could run int4 models with bf16 computing dtype.**
14
 
15
- Due to limited GPU resources, we have only tested a few prompts on a CPU backend with QBits. If this model does not meet your performance expectations, you may explore another quantized model in AWQ format, generated via AutoRound with different hyperparameters. This alternative model will be uploaded soon.
16
 
17
  Please follow the license of the original model.
18
 
19
  ## How To Use
20
 
21
- ### INT4 Inference on CPU with Qbits(Recommend)
22
-
23
 
24
  **pip3 install auto-round** (this installs both intel-extension-for-pytorch and intel-extension-for-transformers). On Intel CPUs, intel-extension-for-pytorch is prioritized; on other CPUs, intel-extension-for-transformers is prioritized.
25
 
26
  **To make sure QBits is used with intel-extension-for-transformers, uninstall intel-extension-for-pytorch.**
27
 
28
- intel-extension-for-transformers faster repacking, slower inference,higher accuracy
29
- intel-extension-for-pytorch much slower repacking, faster inferecne, lower accuracy
 
30
 
31
- ~~~python
32
  from auto_round import AutoRoundConfig ##must import for autoround format
33
  from transformers import AutoModelForCausalLM, AutoTokenizer
34
  import torch
@@ -160,10 +161,11 @@ prompt = "There is a girl who likes adventure,"
160
  prompt = "Please give a brief introduction of DeepSeek company."
161
  ##INT4:
162
  """DeepSeek Artificial Intelligence Co., Ltd. (referred to as "DeepSeek" or "深度求索") , founded in 2023, is a Chinese company dedicated to making AGI a reality"""
163
- ~~~
164
 
165
  ### INT4 Inference on CUDA (untested; may need 8x80G GPUs)
166
- BF16 int4 kernel is required.
 
167
 
168
  ````python
169
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -211,9 +213,59 @@ we have no enough resource to evaluate the model
211
 
212
  ### Generate the model
213
 
214
- need 200G GPU memory, details will be updated later
 
215
 
 
 
 
216
 
 
 
 
217
 
218
  ## Ethical Considerations and Limitations
219
 
@@ -237,4 +289,4 @@ The license on this model does not constitute legal advice. We are not responsib
237
 
238
  @article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }
239
 
240
- [arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)
 
4
  base_model:
5
  - deepseek-ai/DeepSeek-V3
6
 
7
+
8
  ---
9
 
10
  ## Model Details
11
 
12
  This model is an int4 model with group_size 128 and symmetric quantization of [deepseek-ai/DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) generated by [intel/auto-round](https://github.com/intel/auto-round) algorithm.
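To make the terms concrete, here is a minimal sketch of symmetric INT4 quantization with a group size of 128. It uses plain round-to-nearest, not AutoRound's tuned rounding, and the scale convention (`max_abs / 7`) and all function names are assumptions chosen for illustration; the actual scale computation and weight packing in this checkpoint may differ.

```python
def quantize_group_sym_int4(weights):
    """Round-to-nearest symmetric INT4 quantization of one weight group.

    Symmetric means there is no zero point: a single scale maps the
    group onto the signed 4-bit range [-8, 7].
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # assumed scale convention
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map INT4 codes back to approximate floating-point weights."""
    return [v * scale for v in q]


def quantize_row(row, group_size=128):
    """Quantize a row of weights group by group; each group gets its own scale."""
    return [
        quantize_group_sym_int4(row[i:i + group_size])
        for i in range(0, len(row), group_size)
    ]


# Toy row of 256 weights in [-1, 1]: it splits into two groups of 128.
row = [((i * 37) % 101 - 50) / 50.0 for i in range(256)]
groups = quantize_row(row)
print(len(groups))  # 2
```

A smaller group size means more scales to store but a tighter fit per group, which is the trade-off behind the `group_size` setting.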
13
 
14
+ **On CUDA devices, this model is prone to overflow caused by the INT4 kernel using the FP16 computation dtype. Additionally, loading the model in Transformers can be quite slow. Consider using an alternative serving framework capable of running INT4 models with the BF16 computation dtype.**
15
 
16
+ Due to limited GPU resources, we have only tested a few prompts on a CPU backend with intel-extension-for-transformers. If this model does not meet your performance expectations, you may explore another quantized model in AWQ format, generated via AutoRound with different hyperparameters. This alternative model will be uploaded soon.
17
 
18
  Please follow the license of the original model.
19
 
20
  ## How To Use
21
 
22
+ ### INT4 Inference on CPU with ITREX (Recommended)
 
23
 
24
  **pip3 install auto-round** (this installs both intel-extension-for-pytorch and intel-extension-for-transformers). On Intel CPUs, intel-extension-for-pytorch is prioritized; on other CPUs, intel-extension-for-transformers is prioritized.
25
 
26
  **To make sure QBits is used with intel-extension-for-transformers, uninstall intel-extension-for-pytorch.**
27
 
28
+ intel-extension-for-transformers: faster repacking, slower inference, higher accuracy
29
+
30
+ intel-extension-for-pytorch: much slower repacking, faster inference, lower accuracy
31
 
32
+ ~~python
33
  from auto_round import AutoRoundConfig ##must import for autoround format
34
  from transformers import AutoModelForCausalLM, AutoTokenizer
35
  import torch
 
161
  prompt = "Please give a brief introduction of DeepSeek company."
162
  ##INT4:
163
  """DeepSeek Artificial Intelligence Co., Ltd. (referred to as "DeepSeek" or "深度求索") , founded in 2023, is a Chinese company dedicated to making AGI a reality"""
164
+ ~~
165
 
166
  ### INT4 Inference on CUDA (untested; may need 8x80G GPUs)
167
+
168
+ An INT4 kernel with the BF16 compute dtype is required.
169
 
170
  ````python
171
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
213
 
214
  ### Generate the model
215
 
216
+ **5 x 80G GPUs are needed (this could be optimized), along with 1.4T of CPU memory.**
217
+
218
+ When tested with a few prompts, the inputs and outputs of certain layers in this model are very large and can even exceed the FP16 range. We recommend excluding these layers from quantization, particularly the 'down_proj' in layer 60, and running them in BF16 precision instead. However, we have not done this for this INT4 model, because on CPU the compute dtype for INT4 is BF16 or FP32.
219
+
220
+ ~~~
+ model.layers.60.mlp.experts.150.down_proj tensor(1144.) tensor(2122.9451)
+ model.layers.60.mlp.experts.231.down_proj tensor(25856.) tensor(12827.9980)
+ model.layers.60.mlp.shared_experts.down_proj tensor(1880.) tensor(3156.7344)
+ model.layers.60.mlp.experts.81.down_proj tensor(4416.) tensor(6124.6846)
+ model.layers.60.mlp.experts.92.down_proj tensor(107520.) tensor(50486.0781)
+ model.layers.59.mlp.experts.138.down_proj tensor(1568.) tensor(190.8769)
+ model.layers.60.mlp.experts.81.down_proj tensor(7360.) tensor(10024.4531)
+ model.layers.60.mlp.experts.92.down_proj tensor(116224.) tensor(55192.4180)
+ ~~~
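As a quick sanity check of the FP16 claim, the sketch below compares the logged magnitudes with the largest finite FP16 value (65504) and with the BF16 maximum. The numbers are copied from the log above; treating the two `tensor(...)` columns simply as two magnitudes per layer is an assumption, since the log does not label them.

```python
FP16_MAX = 65504.0                    # largest finite IEEE 754 half-precision value
BF16_MAX = (2 - 2 ** -7) * 2.0 ** 127  # largest finite bfloat16 value, about 3.39e38

# (layer name, first logged value, second logged value), copied from the log above.
logged = [
    ("model.layers.60.mlp.experts.150.down_proj", 1144.0, 2122.9451),
    ("model.layers.60.mlp.experts.231.down_proj", 25856.0, 12827.9980),
    ("model.layers.60.mlp.shared_experts.down_proj", 1880.0, 3156.7344),
    ("model.layers.60.mlp.experts.81.down_proj", 4416.0, 6124.6846),
    ("model.layers.60.mlp.experts.92.down_proj", 107520.0, 50486.0781),
    ("model.layers.59.mlp.experts.138.down_proj", 1568.0, 190.8769),
    ("model.layers.60.mlp.experts.81.down_proj", 7360.0, 10024.4531),
    ("model.layers.60.mlp.experts.92.down_proj", 116224.0, 55192.4180),
]


def exceeds(values, limit):
    """Return True if any magnitude in `values` is above `limit`."""
    return any(abs(v) > limit for v in values)


# Layers whose logged magnitudes cannot be represented as finite FP16 values.
fp16_overflow = sorted({name for name, *vals in logged if exceeds(vals, FP16_MAX)})
# The same check against the much wider BF16 range.
bf16_overflow = sorted({name for name, *vals in logged if exceeds(vals, BF16_MAX)})

for name in fp16_overflow:
    print("exceeds FP16 range:", name)
print("layers exceeding BF16 range:", len(bf16_overflow))
```

BF16 keeps the 8-bit exponent of FP32, so it trades mantissa precision for range, which is why these values fit in BF16 but not in FP16.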
231
+
232
+
233
+
234
+ **1. Add metadata to the BF16 model** https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16
235
+
236
+ ~~~python
+ from safetensors import safe_open
+ from safetensors.torch import save_file
+
+ # Rewrite each of the 163 shards in place, adding {"format": "pt"} metadata.
+ for i in range(1, 164):
+     safetensors_path = f"model-{i:05d}-of-000163.safetensors"
+     print(safetensors_path)
+     tensors = {}
+     # Load every tensor of the shard into memory before overwriting the file.
+     with safe_open(safetensors_path, framework="pt") as f:
+         for key in f.keys():
+             tensors[key] = f.get_tensor(key)
+     save_file(tensors, safetensors_path, metadata={"format": "pt"})
+ ~~~
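To confirm that a shard was rewritten as intended, one option is to read the metadata back from the file header. The sketch below relies only on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header whose optional `__metadata__` key holds the metadata); the helper name `read_safetensors_metadata` is hypothetical and not part of any library.

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the `__metadata__` dict from a safetensors header ({} if absent).

    Only the header is read, so this stays cheap even for multi-gigabyte shards.
    """
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})
```

After step 1 has run, calling it on a shard such as `model-00001-of-000163.safetensors` would be expected to return a dict with `format` set to `pt`.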
251
+
252
+
253
+
254
+ **2. Replace modeling_deepseek.py with the following file.** The changes mainly align tensor devices and remove torch.no_grad, since AutoRound needs to do some tuning.
255
+
256
+ https://github.com/intel/auto-round/blob/deepseekv3/modeling_deepseek.py
257
+
258
+
259
+
260
+ **3. Tuning**
261
 
262
+ ```bash
+ git clone https://github.com/intel/auto-round.git && cd auto-round && git checkout deepseekv3
+ ```
265
 
266
+ ```bash
+ python3 -m auto_round \
+     --model "/models/DeepSeek-V3-bf16/" \
+     --group_size 128 \
+     --format "auto_gptq" \
+     --iters 200 \
+     --devices 0,1,2,3,4 \
+     --nsamples 512 \
+     --batch_size 8 \
+     --seqlen 512 \
+     --low_gpu_mem_usage \
+     --output_dir "tmp_autoround" \
+     --disable_eval 2>&1 | tee -a seekv3.txt
+ ```
269
 
270
  ## Ethical Considerations and Limitations
271
 
 
289
 
290
  @article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }
291
 
292
+ [arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)
config.json CHANGED
@@ -79,7 +79,7 @@
79
  "tie_word_embeddings": false,
80
  "topk_group": 4,
81
  "topk_method": "noaux_tc",
82
- "torch_dtype": "float16",
83
  "transformers_version": "4.47.0",
84
  "use_cache": true,
85
  "v_head_dim": 128,
 
79
  "tie_word_embeddings": false,
80
  "topk_group": 4,
81
  "topk_method": "noaux_tc",
82
+ "torch_dtype": "bfloat16",
83
  "transformers_version": "4.47.0",
84
  "use_cache": true,
85
  "v_head_dim": 128,