Update README.md
README.md CHANGED
@@ -64,21 +64,21 @@ Granite-3.1-3B-A800M-Base is based on a decoder-only sparse Mixture of Experts (
 
 | Model | 2B Dense | 8B Dense | 1B MoE | 3B MoE |
 | :-------- | :--------| :--------| :-------- | :--------|
-| Embedding size | 2048 | 4096 |
-| Number of layers | 40 | 40 |
-| Attention head size | 64 | 128 |
-| Number of attention heads | 32 | 32 |
-| Number of KV heads | 8 | 8 |
-| MLP hidden size | 8192 | 12800 |
-| MLP activation | SwiGLU | SwiGLU |
-| Number of experts | — | — |
-| MoE TopK | — | — |
-| Initialization std | 0.1 | 0.1 |
-| Sequence length | 128K | 128k |
-| Position embedding | RoPE | RoPE |
-| # Parameters | 2.5B | 8.1B |
-| # Active parameters | 2.5B | 8.1B |
-| # Training tokens | 12T | 12T |
+| Embedding size | 2048 | 4096 | 1024 | **1536** |
+| Number of layers | 40 | 40 | 24 | **32** |
+| Attention head size | 64 | 128 | 64 | **64** |
+| Number of attention heads | 32 | 32 | 16 | **24** |
+| Number of KV heads | 8 | 8 | 8 | **8** |
+| MLP hidden size | 8192 | 12800 | 512 | **512** |
+| MLP activation | SwiGLU | SwiGLU | SwiGLU | **SwiGLU** |
+| Number of experts | — | — | 32 | **40** |
+| MoE TopK | — | — | 8 | **8** |
+| Initialization std | 0.1 | 0.1 | 0.1 | **0.1** |
+| Sequence length | 128K | 128k | 128K | **128K** |
+| Position embedding | RoPE | RoPE | RoPE | **RoPE** |
+| # Parameters | 2.5B | 8.1B | 1.3B | **3.3B** |
+| # Active parameters | 2.5B | 8.1B | 400M | **800M** |
+| # Training tokens | 12T | 12T | 10T | **10T** |
 
 **Training Data:**
 This model is trained on a mix of open source and proprietary data following a two-stage training strategy.
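
The new MoE columns connect the "Number of experts", "MoE TopK", and "# Active parameters" rows: each token is routed to only the top-k experts (8 of 40 for the 3B MoE), so only a fraction of the expert weights participate in any given forward pass, which is why 3.3B total parameters correspond to roughly 800M active ones. The snippet below is a minimal, illustrative top-k routing block using the 3B MoE hyperparameters from the table (embedding size 1536, 40 SwiGLU experts with hidden size 512, top-8 routing); it is not the Granite implementation, and the class and parameter names are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Illustrative sparse MoE block: each token uses only top_k of n_experts SwiGLU experts."""

    def __init__(self, hidden_size=1536, expert_hidden=512, n_experts=40, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, n_experts, bias=False)
        # One SwiGLU expert = gate, up, and down projections.
        self.gate = nn.ModuleList([nn.Linear(hidden_size, expert_hidden, bias=False) for _ in range(n_experts)])
        self.up   = nn.ModuleList([nn.Linear(hidden_size, expert_hidden, bias=False) for _ in range(n_experts)])
        self.down = nn.ModuleList([nn.Linear(expert_hidden, hidden_size, bias=False) for _ in range(n_experts)])

    def forward(self, x):                        # x: (num_tokens, hidden_size)
        scores = self.router(x)                  # (num_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts only
        out = torch.zeros_like(x)
        for t in range(x.size(0)):               # slow token-by-token loop, kept simple for clarity
            for slot in range(self.top_k):
                e = idx[t, slot].item()
                h = F.silu(self.gate[e](x[t])) * self.up[e](x[t])   # SwiGLU activation
                out[t] += weights[t, slot] * self.down[e](h)
        return out

block = ToyTopKMoE()
print(block(torch.randn(4, 1536)).shape)         # torch.Size([4, 1536])
```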