rpand002 committed on
Commit e16343c • 1 Parent(s): 3e0c59c

Update README.md

Files changed (1)
  1. README.md +15 -15
README.md CHANGED
@@ -64,21 +64,21 @@ Granite-3.1-3B-A800M-Base is based on a decoder-only sparse Mixture of Experts (
 
 | Model | 2B Dense | 8B Dense | 1B MoE | 3B MoE |
 | :-------- | :--------| :--------| :-------- | :--------|
- | Embedding size | 2048 | 4096 | **1024** | 1536 |
- | Number of layers | 40 | 40 | **24** | 32 |
- | Attention head size | 64 | 128 | **64** | 64 |
- | Number of attention heads | 32 | 32 | **16** | 24 |
- | Number of KV heads | 8 | 8 | **8** | 8 |
- | MLP hidden size | 8192 | 12800 | **512** | 512 |
- | MLP activation | SwiGLU | SwiGLU | **SwiGLU** | SwiGLU |
- | Number of experts | — | — | **32** | 40 |
- | MoE TopK | — | — | **8** | 8 |
- | Initialization std | 0.1 | 0.1 | **0.1** | 0.1 |
- | Sequence length | 128K | 128k | **128k** | 128k |
- | Position embedding | RoPE | RoPE | **RoPE** | RoPE |
- | # Parameters | 2.5B | 8.1B | **1.3B** | 3.3B |
- | # Active parameters | 2.5B | 8.1B | **400M** | 800M |
- | # Training tokens | 12T | 12T | **10T** | 10T |
+ | Embedding size | 2048 | 4096 | 1024 | **1536** |
+ | Number of layers | 40 | 40 | 24 | **32** |
+ | Attention head size | 64 | 128 | 64 | **64** |
+ | Number of attention heads | 32 | 32 | 16 | **24** |
+ | Number of KV heads | 8 | 8 | 8 | **8** |
+ | MLP hidden size | 8192 | 12800 | 512 | **512** |
+ | MLP activation | SwiGLU | SwiGLU | SwiGLU | **SwiGLU** |
+ | Number of experts | — | — | 32 | **40** |
+ | MoE TopK | — | — | 8 | **8** |
+ | Initialization std | 0.1 | 0.1 | 0.1 | **0.1** |
+ | Sequence length | 128K | 128k | 128K | **128K** |
+ | Position embedding | RoPE | RoPE | RoPE | **RoPE** |
+ | # Parameters | 2.5B | 8.1B | 1.3B | **3.3B** |
+ | # Active parameters | 2.5B | 8.1B | 400M | **800M** |
+ | # Training tokens | 12T | 12T | 10T | **10T** |
 
 **Training Data:**
 This model is trained on a mix of open source and proprietary data following a two-stage training strategy.
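
The bolded 3B MoE column above can be sanity-checked against the published model configuration. The sketch below is a minimal, untested example; the repo id `ibm-granite/granite-3.1-3b-a800m-base` and the GraniteMoe-style config attribute names are assumptions, not confirmed by this commit, so adjust them if the actual config differs.

```python
# Minimal sketch: cross-check the README table against the published config.
# Assumed: repo id "ibm-granite/granite-3.1-3b-a800m-base" and Mixtral/GraniteMoe-style
# attribute names (hidden_size, num_local_experts, num_experts_per_tok, ...).
from transformers import AutoConfig, AutoModelForCausalLM

repo_id = "ibm-granite/granite-3.1-3b-a800m-base"
config = AutoConfig.from_pretrained(repo_id)

print(config.hidden_size)           # embedding size -> 1536 per the table
print(config.num_hidden_layers)     # -> 32
print(config.num_attention_heads)   # -> 24
print(config.num_key_value_heads)   # -> 8
print(config.num_local_experts)     # number of experts -> 40
print(config.num_experts_per_tok)   # MoE TopK -> 8

# Total parameter count (~3.3B per the table); active parameters per token
# are lower (~800M) because only the TopK experts run in each MoE layer.
model = AutoModelForCausalLM.from_pretrained(repo_id)
total = sum(p.numel() for p in model.parameters())
print(f"total parameters: {total / 1e9:.2f}B")
```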