qanthony-z (qanthony) committed
Commit 09b4c5b (parent: eee85e1)

Update README.md (#1)


- Update README.md (d47e404aa9394f2b9cd4b7fa51c7cb2ab8026dc2)


Co-authored-by: Quentin Anthony <qanthony@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
```diff
@@ -4,7 +4,7 @@ license: apache-2.0
 
 # Model Card for Zamba2-1.2B
 
-Zamba2-1.2B is a hybrid model composed of state-space and transformer blocks. It broadly follows the [Zamba architecture](https://arxiv.org/abs/2405.16712) which consists of a Mamba backbone alternating with shared transformer blocks (see diagram in [Model Details](#model-details)). Zamba2-1.2B possesses three major improvements over Zamba1:
+Zamba2-1.2B is a hybrid model composed of state-space ([Mamba](https://github.com/state-spaces/mamba)) and transformer blocks. It broadly follows the [Zamba architecture](https://arxiv.org/abs/2405.16712) which consists of a Mamba backbone alternating with shared transformer blocks (see diagram in [Model Details](#model-details)). Zamba2-1.2B possesses three major improvements over Zamba1:
 
 1.) Mamba1 blocks have been replaced with Mamba2 blocks.
 
@@ -14,13 +14,13 @@ Zamba2-1.2B is a hybrid model composed of state-space and transformer blocks. It
 
 Zamba2-1.2B differs from our [2.7B model](https://huggingface.co/Zyphra/Zamba2-2.7B) in three ways:
 
-1.) Rotary position embeddings
+1.) We have added rotary position embeddings
 
-2.) No alternating shared transformer blocks
+2.) A single shared transformer block (instead of two that we alternate between)
 
-3.) Added LoRA projectors to attention layers
+3.) Added LoRA projectors to attention blocks (instead of just a LoRA on the MLP block)
 
-We found that while hybrid SSM-transformer models are perfectly capable of performing well without position embeddings, adding rotary embeddings to the shared attention block slightly improved performance. Secondly, we utilize a single attention block instead of alternating because this enables a higher flop count for the model at a given parameter budget and at smaller scales this becomes more important than the slightly faster latency.
+We found that while hybrid SSM-transformer models are perfectly capable of performing well without position embeddings, adding rotary embeddings to the shared attention block slightly improved performance. Secondly, we utilize a single attention block (instead of alternating between two independent transformer blocks) because this enables a higher flop count for the model at a given parameter budget and at smaller scales this becomes more important than the slightly faster latency.
 
 Zamba2-1.2B uses the Mistral v0.1 tokenizer and was pre-trained on 3T tokens of text and code data sourced from open web-datasets, including [Zyda](https://arxiv.org/abs/2406.01981). Subsequently, in a second phase, Zamba2-1.2B was annealed on a mixture of 100B high-quality tokens.
 
```
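The updated card's central architectural point is the single shared transformer block: one set of attention and MLP weights is reused at several depths of the Mamba2 backbone, with small per-depth LoRA projectors on both the attention and MLP paths providing specialization. The sketch below is a minimal, hypothetical PyTorch illustration of that weight-sharing pattern, not the Zyphra implementation: the dimensions and sharing interval are invented, rotary embeddings are omitted, and a GRU stands in for the real Mamba2 SSM blocks.

```python
import torch
import torch.nn as nn


class LoRA(nn.Module):
    """Low-rank adapter added on top of the shared block's output (rank << dim)."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # adapter starts as a no-op

    def forward(self, x):
        return self.up(self.down(x))


class SharedTransformerBlock(nn.Module):
    """One set of attention + MLP weights, invoked at several depths."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, attn_lora, mlp_lora):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out + attn_lora(h)       # per-depth LoRA on the attention path
        h = self.norm2(x)
        return x + self.mlp(h) + mlp_lora(h)  # per-depth LoRA on the MLP path


class HybridBackbone(nn.Module):
    """SSM backbone with a single shared transformer block interleaved."""
    def __init__(self, dim=512, n_heads=8, n_ssm_blocks=12, share_every=3):
        super().__init__()
        # GRUs are only placeholders for Mamba2 SSM blocks in this sketch.
        self.ssm_blocks = nn.ModuleList(
            nn.GRU(dim, dim, batch_first=True) for _ in range(n_ssm_blocks)
        )
        self.shared_block = SharedTransformerBlock(dim, n_heads)
        self.share_every = share_every
        n_calls = n_ssm_blocks // share_every
        # A distinct, tiny LoRA pair for each invocation of the shared block.
        self.attn_loras = nn.ModuleList(LoRA(dim) for _ in range(n_calls))
        self.mlp_loras = nn.ModuleList(LoRA(dim) for _ in range(n_calls))

    def forward(self, x):
        call = 0
        for i, ssm in enumerate(self.ssm_blocks):
            x = x + ssm(x)[0]
            if (i + 1) % self.share_every == 0:
                x = self.shared_block(x, self.attn_loras[call], self.mlp_loras[call])
                call += 1
        return x


if __name__ == "__main__":
    model = HybridBackbone()
    hidden = torch.randn(2, 16, 512)  # (batch, sequence, dim)
    print(model(hidden).shape)        # torch.Size([2, 16, 512])
```

The shared weights are counted once in the parameter budget but executed once per invocation, which is the higher-flops-per-parameter trade-off the updated card describes; the per-invocation LoRA pairs add only a small number of extra parameters.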
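For completeness, a hedged quick-start in the style of a typical Hugging Face model card. The repository id `Zyphra/Zamba2-1.2B` is inferred from the card (only the 2.7B sibling is linked explicitly), and loading may require Zyphra's `transformers_zamba2` fork rather than a stock `transformers` release; treat this as a sketch, not the official usage snippet.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Repository id assumed from the model card; adjust if the actual id differs.
repo_id = "Zyphra/Zamba2-1.2B"

tokenizer = AutoTokenizer.from_pretrained(repo_id)  # Mistral v0.1 tokenizer per the card
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

prompt = "The three main improvements of Zamba2 over Zamba1 are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```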