rpand002 committed
Commit ead3bbd
1 Parent(s): 1963a23

Update README.md

Files changed (1)
  1. README.md +24 -23
README.md CHANGED
@@ -7,13 +7,13 @@ tags:
  - language
  - granite-3.1
  base_model:
- - ibm-granite/granite-3.1-1b-a400m-base
+ - ibm-granite/Granite-3.1-1B-A400M-base
  ---
 
  # Granite-3.1-1B-A400M-Instruct
 
  **Model Summary:**
- Granite-3.1-1B-A400M-Instruct is an 1B parameter model finetuned from *Granite-3.1-1B-A400M-Base* using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.
+ Granite-3.1-1B-A400M-Instruct is a 1B parameter long-context instruct model finetuned from Granite-3.1-1B-A400M-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.
 
  - **Developers:** Granite Team, IBM
  - **GitHub Repository:** [ibm-granite/granite-3.1-language-models](https://github.com/ibm-granite/granite-3.1-language-models)
@@ -37,6 +37,7 @@ The model is designed to respond to general instructions and can be used to buil
  * Code related tasks
  * Function-calling tasks
  * Multilingual dialog use cases
+ * Long-context tasks including long document/meeting summarization, long document QA, etc.
 
  **Generation:**
  This is a simple example of how to use Granite-3.1-1B-A400M-Instruct model.
@@ -76,29 +77,29 @@ output = tokenizer.batch_decode(output)
  print(output)
  ```
 
- **Model Architecture:**
- Granite-3.1-1B-A400M-Instruct is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. Core components of this architecture are: Fine-grained Experts, Dropless Token Routing, and Load Balancing Loss.
-
- | Model | 2B Dense | 8B Dense | 1B MoE | 3B MoE |
- | :-------- | :--------| :--------| :-------- |:-------- |
- | Embedding size | 2048 | 4096 | **1024** | 1536 |
- | Number of layers | 40 | 40 | **24** | 32 |
- | Attention head size | 64 | 128 | **64** | 64 |
- | Number of attention heads | 32 | 32 | **16** | 24 |
- | Number of KV heads | 8 | 8 | **8** | 8 |
- | MLP hidden size | 8192 | 12800 | **512** | 512 |
- | MLP activation | SwiGLU | SwiGLU | **SwiGLU** | SwiGLU |
- | Number of experts | — | — | **32** | 40 |
- | MoE TopK | — | — | **8** | 8 |
- | Initialization std | 0.1 | 0.1 | **0.1** | 0.1 |
- | Sequence length | 128K | 128K | **128K** | 128K |
- | Position embedding | RoPE | RoPE | **RoPE** | RoPE |
- | # Parameters | 2.5B | 8.1B | **1.3B** | 3.3B |
- | # Active parameters | 2.5B | 8.1B | **400M** | 800M |
- | # Training tokens | 12T | 12T | **10T** | 10T |
+ **Model Architecture:**
+ Granite-3.1-1B-A400M-Instruct is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. Core components of this architecture are: Fine-grained Experts, Dropless Token Routing, and Load Balancing Loss.
+
+ | Model | 2B Dense | 8B Dense | 1B MoE | 3B MoE |
+ | :-------- | :--------| :-------- | :------| :------|
+ | Embedding size | 2048 | 4096 | **1024** | 1536 |
+ | Number of layers | 40 | 40 | **24** | 32 |
+ | Attention head size | 64 | 128 | **64** | 64 |
+ | Number of attention heads | 32 | 32 | **16** | 24 |
+ | Number of KV heads | 8 | 8 | **8** | 8 |
+ | MLP hidden size | 8192 | 12800 | **512** | 512 |
+ | MLP activation | SwiGLU | SwiGLU | **SwiGLU** | SwiGLU |
+ | Number of experts | — | — | **32** | 40 |
+ | MoE TopK | — | — | **8** | 8 |
+ | Initialization std | 0.1 | 0.1 | **0.1** | 0.1 |
+ | Sequence length | 128K | 128K | **128K** | 128K |
+ | Position embedding | RoPE | RoPE | **RoPE** | RoPE |
+ | # Parameters | 2.5B | 8.1B | **1.3B** | 3.3B |
+ | # Active parameters | 2.5B | 8.1B | **400M** | 800M |
+ | # Training tokens | 12T | 12T | **10T** | 10T |
 
  **Training Data:**
- Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the [Granite Technical Report]() and [Accompanying Author List]().
+ Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the [Granite 3.0 Technical Report](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/paper.pdf), [Granite 3.1 Technical Report (coming soon)](https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d), and [Accompanying Author List](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/author-ack.pdf).
 
  **Infrastructure:**
  We train Granite 3.1 Language Models using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
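
For reference, the **Generation:** snippet that the diff only shows the tail of (`output = tokenizer.batch_decode(output)` and `print(output)` appear as context lines above) follows the standard `transformers` chat-template pattern. The sketch below is a minimal reconstruction under that assumption, not the card's verbatim code; the model id, prompt, and generation settings are illustrative.

```python
# Minimal sketch of running Granite-3.1-1B-A400M-Instruct with transformers.
# Assumed API: AutoTokenizer/AutoModelForCausalLM + apply_chat_template;
# the model id, prompt, and max_new_tokens are illustrative, not from the diff.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-1b-a400m-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
model.eval()

# Build a single-turn chat and render it with the model's chat template.
chat = [{"role": "user", "content": "List one IBM Research laboratory located in the United States."}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Tokenize, generate, and decode; the last two lines mirror the context
# lines visible in the diff above.
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
output = model.generate(**input_tokens, max_new_tokens=100)
output = tokenizer.batch_decode(output)
print(output)
```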
 
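The architecture table's 1B MoE column (embedding size 1024, MLP hidden size 512, 32 experts, MoE TopK 8) describes top-k expert routing: each token is sent to its 8 highest-scoring experts out of 32, which is why only ~400M of the 1.3B parameters are active per token. The sketch below illustrates that routing pattern only, sized from those table values; it is not the Granite implementation and omits the fine-grained experts, dropless token routing, and load-balancing loss named in the card.

```python
# Illustrative top-k Mixture-of-Experts routing, sized from the 1B MoE column
# above (hidden size 1024, expert MLP hidden size 512, 32 experts, top-k = 8).
# Teaching sketch only, NOT the Granite code: plain SiLU expert MLPs, no
# dropless routing, no load-balancing loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, hidden_size=1024, expert_hidden=512, num_experts=32, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, expert_hidden),
                nn.SiLU(),
                nn.Linear(expert_hidden, hidden_size),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (num_tokens, hidden_size)
        scores = self.router(x)                  # (num_tokens, num_experts)
        gate, idx = scores.topk(self.top_k, -1)  # each token keeps its top-k experts
        gate = F.softmax(gate, dim=-1)           # renormalize gates over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gate[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 1024)                    # 4 tokens with embedding size 1024
print(TopKMoE()(tokens).shape)                   # torch.Size([4, 1024])
```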