Update README.md
README.md (CHANGED)
Previous version of the changed sections; removed lines are prefixed with `-` (several appear truncated in the source diff):

````diff
@@ -7,13 +7,13 @@ tags:
 - language
 - granite-3.1
 base_model:
-- ibm-granite/
 ---
 
 # Granite-3.1-1B-A400M-Instruct
 
 **Model Summary:**
-Granite-3.1-1B-A400M-Instruct is
 
 - **Developers:** Granite Team, IBM
 - **GitHub Repository:** [ibm-granite/granite-3.1-language-models](https://github.com/ibm-granite/granite-3.1-language-models)
@@ -37,6 +37,7 @@ The model is designed to respond to general instructions and can be used to buil
 * Code related tasks
 * Function-calling tasks
 * Multilingual dialog use cases
 
 **Generation:**
 This is a simple example of how to use Granite-3.1-1B-A400M-Instruct model.
@@ -76,29 +77,29 @@ output = tokenizer.batch_decode(output)
 print(output)
 ```
 
-**Model Architecture:**
-Granite-3.1-1B-A400M-Instruct is based on a decoder-only
-
-| Model
-| :--------
-| Embedding size
-| Number of layers
-| Attention head size
-| Number of attention heads
-| Number of KV heads
-| MLP hidden size
-| MLP activation
-| Number of experts
-| MoE TopK
-| Initialization std
-| Sequence length
-| Position embedding
-| # Parameters
-| # Active parameters
-| # Training tokens
 
 **Training Data:**
-Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the [Granite Technical Report]() and [Accompanying Author List]().
 
 **Infrastructure:**
 We train Granite 3.1 Language Models using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
````
Updated version of the changed sections; added lines are prefixed with `+`:

````diff
 - language
 - granite-3.1
 base_model:
+- ibm-granite/Granite-3.1-1B-A400M-base
 ---
 
 # Granite-3.1-1B-A400M-Instruct
 
 **Model Summary:**
+Granite-3.1-1B-A400M-Instruct is a 1B parameter long-context instruct model finetuned from Granite-3.1-1B-A400M-Base using a combination of open source instruction datasets with permissive license and internally collected synthetic datasets tailored for solving long context problems. This model is developed using a diverse set of techniques with a structured chat format, including supervised finetuning, model alignment using reinforcement learning, and model merging.
 
 - **Developers:** Granite Team, IBM
 - **GitHub Repository:** [ibm-granite/granite-3.1-language-models](https://github.com/ibm-granite/granite-3.1-language-models)
````
````diff
 * Code related tasks
 * Function-calling tasks
 * Multilingual dialog use cases
+* Long-context tasks including long document/meeting summarization, long document QA, etc.
 
 **Generation:**
 This is a simple example of how to use Granite-3.1-1B-A400M-Instruct model.
````
````diff
 print(output)
 ```
````
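The hunk above shows only the tail of the card's generation snippet. As a point of reference, here is a minimal sketch of a typical `transformers` chat-style generation call for this model; the prompt, dtype, and generation settings are illustrative assumptions rather than the exact code from the card.

```python
# Minimal sketch of a chat-style generation call for the instruct model.
# Prompt, dtype, and generation settings are illustrative assumptions,
# not the exact snippet from the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3.1-1b-a400m-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",           # place the model on the available device(s)
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference
)
model.eval()

# Build the structured chat prompt expected by the instruct model.
chat = [{"role": "user", "content": "List one fact about IBM Granite models."}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

# Tokenize, generate, and decode, mirroring the tail shown in the hunk above.
input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**input_tokens, max_new_tokens=100)
output = tokenizer.batch_decode(output)
print(output)
```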
````diff
 
+**Model Architecture:**
+Granite-3.1-1B-A400M-Instruct is based on a decoder-only sparse Mixture of Experts (MoE) transformer architecture. Core components of this architecture are: GQA and RoPE, MLP with SwiGLU, RMSNorm, and shared input/output embeddings.
````
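The MoE layers use top-k routing (32 experts with MoE TopK 8 for this model, per the table that follows). The toy sketch below illustrates the routing idea for readers unfamiliar with it; it is a generic illustration, not Granite's actual routing code, and the tensor sizes are taken from the table or assumed.

```python
# Toy illustration of top-k expert routing (the "MoE TopK" idea), NOT the
# actual Granite implementation. Sizes come from the table or are assumed.
import torch
import torch.nn.functional as F

hidden_size = 1024  # embedding size of the 1B MoE model (from the table)
num_experts = 32    # number of experts (from the table)
top_k = 8           # MoE TopK (from the table)

# A linear router scores every expert for every token.
router = torch.nn.Linear(hidden_size, num_experts, bias=False)

tokens = torch.randn(4, hidden_size)      # 4 example token states
scores = router(tokens)                   # (4, num_experts) router logits
top_scores, top_idx = scores.topk(top_k, dim=-1)
weights = F.softmax(top_scores, dim=-1)   # mixing weights over the chosen experts

# Each token is processed only by its top_k experts and the expert outputs are
# combined with these weights; the remaining experts stay idle for that token,
# which is why only a fraction of the total parameters is active per token.
print(top_idx[0], weights[0])
```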
````diff
+
+| Model | 2B Dense | 8B Dense | 1B MoE | 3B MoE |
+| :-------- | :-------- | :-------- | :------ | :------ |
+| Embedding size | 2048 | 4096 | **1024** | 1536 |
+| Number of layers | 40 | 40 | **24** | 32 |
+| Attention head size | 64 | 128 | **64** | 64 |
+| Number of attention heads | 32 | 32 | **16** | 24 |
+| Number of KV heads | 8 | 8 | **8** | 8 |
+| MLP hidden size | 8192 | 12800 | **512** | 512 |
+| MLP activation | SwiGLU | SwiGLU | **SwiGLU** | SwiGLU |
+| Number of experts | – | – | **32** | 40 |
+| MoE TopK | – | – | **8** | 8 |
+| Initialization std | 0.1 | 0.1 | **0.1** | 0.1 |
+| Sequence length | 128K | 128K | **128K** | 128K |
+| Position embedding | RoPE | RoPE | **RoPE** | RoPE |
+| # Parameters | 2.5B | 8.1B | **1.3B** | 3.3B |
+| # Active parameters | 2.5B | 8.1B | **400M** | 800M |
+| # Training tokens | 12T | 12T | **10T** | 10T |
````
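As a rough sanity check on the highlighted 1B MoE column, the sketch below estimates total and active parameters from the table values. The vocabulary size and the treatment of norms and biases are assumptions, so only ballpark agreement with the reported 1.3B total / 400M active parameters is expected.

```python
# Back-of-the-envelope parameter estimate for the 1B MoE column.
# Vocabulary size and omission of norms/biases are assumptions.
embedding_size = 1024
num_layers = 24
head_size = 64
num_heads = 16
num_kv_heads = 8
mlp_hidden = 512       # per-expert SwiGLU hidden size (assumed to be per expert)
num_experts = 32
top_k = 8
vocab_size = 49_152    # assumption; not listed in the table

attn = embedding_size * (num_heads * head_size)           # Q projection
attn += 2 * embedding_size * (num_kv_heads * head_size)   # K and V projections
attn += (num_heads * head_size) * embedding_size          # output projection

expert = 3 * embedding_size * mlp_hidden                  # SwiGLU: gate, up, down
router = embedding_size * num_experts

per_layer_total = attn + num_experts * expert + router
per_layer_active = attn + top_k * expert + router

embeddings = vocab_size * embedding_size                  # shared input/output embeddings

total = num_layers * per_layer_total + embeddings
active = num_layers * per_layer_active + embeddings

print(f"total  ~{total / 1e9:.2f}B parameters")   # ~1.33B
print(f"active ~{active / 1e9:.2f}B parameters")  # ~0.43B
```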
````diff
 
 **Training Data:**
+Overall, our SFT data is largely comprised of three key sources: (1) publicly available datasets with permissive license, (2) internal synthetic data targeting specific capabilities including long-context tasks, and (3) very small amounts of human-curated data. A detailed attribution of datasets can be found in the [Granite 3.0 Technical Report](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/paper.pdf), [Granite 3.1 Technical Report (coming soon)](https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d), and [Accompanying Author List](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/author-ack.pdf).
 
 **Infrastructure:**
 We train Granite 3.1 Language Models using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.
````