luofuli committed on
Commit
be9443d
1 Parent(s): d52a9ad

Update README.md

Files changed (1)
  1. README.md +12 -4
README.md CHANGED
@@ -132,7 +132,15 @@ DeepSeek-V2 adopts innovative architectures to guarantee economical training and
  <img width="90%" src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/architecture.png?raw=true" />
  </p>
 
- ## 6. How to run locally
+ DeepSeek-V2-Lite has 27 layers and a hidden dimension of 2048. It also employs MLA and has 16 attention heads, each with a head dimension of 128. Its KV compression dimension is 512, but, unlike DeepSeek-V2, it does not compress the queries. For the decoupled queries and keys, the per-head dimension is 64. DeepSeek-V2-Lite also employs DeepSeekMoE, and all FFNs except for the first layer are replaced with MoE layers. Each MoE layer consists of 2 shared experts and 64 routed experts, where the intermediate hidden dimension of each expert is 1408. Among the routed experts, 6 experts are activated for each token. Under this configuration, DeepSeek-V2-Lite comprises 15.7B total parameters, of which 2.4B are activated for each token.
+
+
+ ## 6. Training Details
+ DeepSeek-V2-Lite is also trained from scratch on the same pre-training corpus as DeepSeek-V2, which is not polluted by any SFT data. It uses the AdamW optimizer with hyper-parameters set to $\beta_1=0.9$, $\beta_2=0.95$, and $\mathrm{weight\_decay}=0.1$. The learning rate is scheduled using a warmup-and-step-decay strategy: it first increases linearly from 0 to the maximum value during the first 2K steps, is multiplied by 0.316 after about 80% of the training tokens, and is multiplied by 0.316 again after about 90% of the training tokens. The maximum learning rate is set to $4.2 \times 10^{-4}$, and the gradient clipping norm is set to 1.0. We do not employ the batch size scheduling strategy; DeepSeek-V2-Lite is trained with a constant batch size of 4608 sequences. During pre-training, we set the maximum sequence length to 4K and train DeepSeek-V2-Lite on 5.7T tokens. We leverage pipeline parallelism to deploy different layers of the model on different devices, but for each layer, all experts are deployed on the same device. Therefore, we only employ a small expert-level balance loss with $\alpha_{1}=0.001$, and do not employ a device-level balance loss or a communication balance loss. After pre-training, we also perform long-context extension and SFT on DeepSeek-V2-Lite to obtain a chat model called DeepSeek-V2-Lite Chat.
+
+
+
+ ## 7. How to run locally
 
  **To utilize DeepSeek-V2-Lite in BF16 format for inference, 40GB*1 GPU is required.**
  ### Inference with Huggingface's Transformers
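The architecture description in the hunk above can be condensed into a small set of shape hyper-parameters. The sketch below is only an illustrative summary in Python: the values are the ones quoted in the text, while the key names are assumptions and need not match the released `config.json`.

```python
# Illustrative summary of the DeepSeek-V2-Lite shapes described above.
# Key names are hypothetical; only the numeric values come from the text.
DEEPSEEK_V2_LITE_ARCH = {
    "num_layers": 27,
    "hidden_size": 2048,
    "num_attention_heads": 16,            # MLA attention heads
    "head_dim": 128,
    "kv_compression_dim": 512,            # queries are not compressed, unlike DeepSeek-V2
    "decoupled_qk_head_dim": 64,          # per-head dim of the decoupled queries and keys
    "dense_ffn_layers": 1,                # only the first layer keeps a dense FFN
    "shared_experts_per_moe_layer": 2,
    "routed_experts_per_moe_layer": 64,
    "expert_intermediate_size": 1408,
    "activated_routed_experts_per_token": 6,
    "total_params": "15.7B",
    "activated_params_per_token": "2.4B",
}
```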
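The warmup-and-step-decay schedule from the training details can likewise be written out as a small function. This is a sketch of the stated hyper-parameters (2K linear warmup steps, a multiplicative decay of 0.316 at roughly 80% and again at 90% of training, peak learning rate $4.2 \times 10^{-4}$), not the actual training code; because the batch size is constant, fractions of steps stand in for fractions of tokens.

```python
def lite_lr(step: int, total_steps: int,
            max_lr: float = 4.2e-4, warmup_steps: int = 2000) -> float:
    """Sketch of the warmup-and-step-decay schedule described above."""
    if step < warmup_steps:
        # Linear warmup from 0 to the maximum learning rate over the first 2K steps.
        return max_lr * step / warmup_steps
    lr = max_lr
    if step >= 0.8 * total_steps:   # after about 80% of training
        lr *= 0.316
    if step >= 0.9 * total_steps:   # after about 90% of training
        lr *= 0.316
    return lr

# Example: halfway through warmup the rate is half the peak value.
assert abs(lite_lr(1000, 1_000_000) - 2.1e-4) < 1e-12
```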
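As a rough companion to the BF16 note above (15.7B parameters in BF16 occupy roughly 31GB of weights, hence the single 40GB GPU), here is a minimal Transformers loading sketch. The model id and the `trust_remote_code=True` flag are assumptions about how the checkpoint is published; this is not the repository's official example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2-Lite"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # BF16 weights, about 2 bytes per parameter
    device_map="auto",            # place the model on the available GPU
    trust_remote_code=True,       # assumed to be needed for the custom architecture
)

inputs = tokenizer("An attention function can be described as", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```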
 
@@ -242,10 +250,10 @@ llm = ChatOpenAI(
  temperature=0.85,
  max_tokens=8000)
  ```
- ## 7. License
+ ## 8. License
  This code repository is licensed under [the MIT License](LICENSE-CODE). The use of DeepSeek-V2 Base/Chat models is subject to [the Model License](LICENSE-MODEL). DeepSeek-V2 series (including Base and Chat) supports commercial use.
 
- ## 8. Citation
+ ## 9. Citation
  ```
  @misc{deepseekv2,
      title={DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model},
 
@@ -257,5 +265,5 @@ This code repository is licensed under [the MIT License](LICENSE-CODE). The use
  }
  ```
 
- ## 9. Contact
+ ## 10. Contact
  If you have any questions, please raise an issue or contact us at [service@deepseek.com](service@deepseek.com).