Update README.md
README.md
# Model Card for Zamba2-1.2B
Zamba2-1.2B is a hybrid model composed of state-space ([Mamba](https://github.com/state-spaces/mamba)) and transformer blocks. It broadly follows the [Zamba architecture](https://arxiv.org/abs/2405.16712), which consists of a Mamba backbone alternating with shared transformer blocks (see the diagram in [Model Details](#model-details)). Zamba2-1.2B possesses three major improvements over Zamba1:
1.) Mamba1 blocks have been replaced with Mamba2 blocks.
[...]
Zamba2-1.2B differs from our [2.7B model](https://huggingface.co/Zyphra/Zamba2-2.7B) in three ways:
1.) We have added rotary position embeddings.
2.) We use a single shared transformer block (instead of two blocks that we alternate between).
3.) We have added LoRA projectors to the attention blocks (instead of just a LoRA on the MLP block).
We found that, while hybrid SSM-transformer models are perfectly capable of performing well without position embeddings, adding rotary embeddings to the shared attention block slightly improved performance. Second, we utilize a single attention block (instead of alternating between two independent transformer blocks) because this enables a higher flop count for the model at a given parameter budget, and at smaller scales this becomes more important than the slightly faster latency.
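To make the weight-sharing and LoRA idea above concrete, here is a minimal, self-contained PyTorch sketch. It is not Zamba2's actual implementation: class and variable names such as `SharedBlock` and `TinyHybridSketch` are illustrative, and a plain placeholder layer stands in for the Mamba2 backbone. It shows one transformer block whose base weights are reused at every shared position, with small per-invocation LoRA adapters on the attention and MLP paths and rotary embeddings applied inside attention:

```python
# Conceptual sketch only -- not Zamba2's code. A placeholder layer stands in
# for the Mamba2 backbone; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRA(nn.Module):
    """Low-rank adapter: x -> up(down(x)), added on top of a shared linear path."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op

    def forward(self, x):
        return self.up(self.down(x))


def rotary(x, base: float = 10000.0):
    """Apply rotary position embeddings to a (batch, heads, seq, head_dim) tensor."""
    b, h, t, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, device=x.device, dtype=torch.float32) / half))
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SharedBlock(nn.Module):
    """One attention + MLP block whose base weights are reused across depth."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, attn_lora: LoRA, mlp_lora: LoRA):
        b, t, d = x.shape
        h = self.norm1(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = q + attn_lora(h)  # per-invocation LoRA on the attention path (difference 3)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        q, k = rotary(q), rotary(k)  # rotary embeddings in the shared block (difference 1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.out(attn.transpose(1, 2).reshape(b, t, d))
        h = self.norm2(x)
        x = x + self.mlp(h) + mlp_lora(h)  # per-invocation LoRA on the MLP path
        return x


class TinyHybridSketch(nn.Module):
    """Placeholder backbone layers stand in for Mamba2 blocks; one SharedBlock is reused (difference 2)."""

    def __init__(self, dim=256, n_heads=8, n_layers=6):
        super().__init__()
        self.backbone = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.shared = SharedBlock(dim, n_heads)  # a single block, not two alternating ones
        self.attn_loras = nn.ModuleList([LoRA(dim) for _ in range(n_layers)])
        self.mlp_loras = nn.ModuleList([LoRA(dim) for _ in range(n_layers)])

    def forward(self, x):
        for i, layer in enumerate(self.backbone):
            x = x + torch.tanh(layer(x))  # placeholder for a Mamba2 block
            x = self.shared(x, self.attn_loras[i], self.mlp_loras[i])
        return x


if __name__ == "__main__":
    out = TinyHybridSketch()(torch.randn(2, 16, 256))
    print(out.shape)  # torch.Size([2, 16, 256])
```

In this toy setup the parameter cost of depth specialization is only the small LoRA adapters, while the shared block contributes its full flop count at every invocation, which is the trade-off described above.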
Zamba2-1.2B uses the Mistral v0.1 tokenizer and was pre-trained on 3T tokens of text and code data sourced from open web datasets, including [Zyda](https://arxiv.org/abs/2406.01981). Subsequently, in a second phase, Zamba2-1.2B was annealed on a mixture of 100B high-quality tokens.
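For reference, loading the model for generation with Hugging Face `transformers` might look like the sketch below. This is illustrative rather than official usage: it assumes an environment whose `transformers` build includes Zamba2 support, a CUDA device, and the checkpoint id `Zyphra/Zamba2-1.2B`.

```python
# Illustrative sketch only: assumes a transformers install with Zamba2 support,
# a CUDA device, and the checkpoint id "Zyphra/Zamba2-1.2B".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B")
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba2-1.2B", torch_dtype=torch.bfloat16
).to("cuda")

inputs = tokenizer("A funny prompt would be ", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```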