Update README.md
README.md CHANGED
@@ -64,21 +64,21 @@ Granite-3.1-3B-A800M-Base is based on a decoder-only sparse Mixture of Experts (
 
 | Model | 2B Dense | 8B Dense | 1B MoE | 3B MoE |
 | :-------- | :--------| :--------| :-------- | :--------|
-| Embedding size | 2048 | 4096 |
-| Number of layers | 40 | 40 |
-| Attention head size | 64 | 128 |
-| Number of attention heads | 32 | 32 |
-| Number of KV heads | 8 | 8 |
-| MLP hidden size | 8192 | 12800 |
-| MLP activation | SwiGLU | SwiGLU |
-| Number of experts | — | — |
-| MoE TopK | — | — |
-| Initialization std | 0.1 | 0.1 |
-| Sequence length | 128K | 128k |
-| Position embedding | RoPE | RoPE |
-| # Parameters | 2.5B | 8.1B |
-| # Active parameters | 2.5B | 8.1B |
-| # Training tokens | 12T | 12T |
+| Embedding size | 2048 | 4096 | 1024 | **1536** |
+| Number of layers | 40 | 40 | 24 | **32** |
+| Attention head size | 64 | 128 | 64 | **64** |
+| Number of attention heads | 32 | 32 | 16 | **24** |
+| Number of KV heads | 8 | 8 | 8 | **8** |
+| MLP hidden size | 8192 | 12800 | 512 | **512** |
+| MLP activation | SwiGLU | SwiGLU | SwiGLU | **SwiGLU** |
+| Number of experts | — | — | 32 | **40** |
+| MoE TopK | — | — | 8 | **8** |
+| Initialization std | 0.1 | 0.1 | 0.1 | **0.1** |
+| Sequence length | 128K | 128k | 128K | **128K** |
+| Position embedding | RoPE | RoPE | RoPE | **RoPE** |
+| # Parameters | 2.5B | 8.1B | 1.3B | **3.3B** |
+| # Active parameters | 2.5B | 8.1B | 400M | **800M** |
+| # Training tokens | 12T | 12T | 10T | **10T** |
 
 **Training Data:**
 This model is trained on a mix of open source and proprietary data following a two-stage training strategy.
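
The new MoE columns connect the "Number of experts", "MoE TopK", and "# Active parameters" rows: each token is routed to only the top-k experts (8 of 40 for the 3B MoE), so only a fraction of the expert weights participate in any given forward pass, which is why 3.3B total parameters correspond to roughly 800M active ones. The snippet below is a minimal, illustrative top-k routing block using the 3B MoE hyperparameters from the table (embedding size 1536, 40 SwiGLU experts with hidden size 512, top-8 routing); it is not the Granite implementation, and the class and parameter names are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Illustrative sparse MoE block: each token uses only top_k of n_experts SwiGLU experts."""

    def __init__(self, hidden_size=1536, expert_hidden=512, n_experts=40, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, n_experts, bias=False)
        # One SwiGLU expert = gate, up, and down projections.
        self.gate = nn.ModuleList([nn.Linear(hidden_size, expert_hidden, bias=False) for _ in range(n_experts)])
        self.up   = nn.ModuleList([nn.Linear(hidden_size, expert_hidden, bias=False) for _ in range(n_experts)])
        self.down = nn.ModuleList([nn.Linear(expert_hidden, hidden_size, bias=False) for _ in range(n_experts)])

    def forward(self, x):                        # x: (num_tokens, hidden_size)
        scores = self.router(x)                  # (num_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts only
        out = torch.zeros_like(x)
        for t in range(x.size(0)):               # slow token-by-token loop, kept simple for clarity
            for slot in range(self.top_k):
                e = idx[t, slot].item()
                h = F.silu(self.gate[e](x[t])) * self.up[e](x[t])   # SwiGLU activation
                out[t] += weights[t, slot] * self.down[e](h)
        return out

block = ToyTopKMoE()
print(block(torch.randn(4, 1536)).shape)         # torch.Size([4, 1536])
```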