Update README.md
README.md
# Model Card for Zamba2-1.2B
Zamba2-1.2B is a hybrid model composed of state-space ([Mamba](https://github.com/state-spaces/mamba)) and transformer blocks. It broadly follows the [Zamba architecture](https://arxiv.org/abs/2405.16712), which consists of a Mamba backbone alternating with shared transformer blocks (see the diagram in [Model Details](#model-details)). Zamba2-1.2B possesses three major improvements over Zamba1:
1.) Mamba1 blocks have been replaced with Mamba2 blocks.
[...]
Zamba2-1.2B differs from our [2.7B model](https://huggingface.co/Zyphra/Zamba2-2.7B) in three ways:
1.) We have added rotary position embeddings.
2.) We use a single shared transformer block (instead of two blocks that we alternate between).
3.) We have added LoRA projectors to the attention blocks (instead of just a LoRA on the MLP block).
We found that, while hybrid SSM-transformer models are perfectly capable of performing well without position embeddings, adding rotary embeddings to the shared attention block slightly improved performance. Second, we utilize a single attention block (instead of alternating between two independent transformer blocks) because this enables a higher flop count for the model at a given parameter budget, and at smaller scales this becomes more important than the slightly faster latency.
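To make the weight-sharing and LoRA idea above concrete, here is a minimal, self-contained PyTorch sketch. It is not Zamba2's actual implementation: class and variable names such as `SharedBlock` and `TinyHybridSketch` are illustrative, and a plain placeholder layer stands in for the Mamba2 backbone. It shows one transformer block whose base weights are reused at every shared position, with small per-invocation LoRA adapters on the attention and MLP paths and rotary embeddings applied inside attention:

```python
# Conceptual sketch only -- not Zamba2's code. A placeholder layer stands in
# for the Mamba2 backbone; names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRA(nn.Module):
    """Low-rank adapter: x -> up(down(x)), added on top of a shared linear path."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op

    def forward(self, x):
        return self.up(self.down(x))


def rotary(x, base: float = 10000.0):
    """Apply rotary position embeddings to a (batch, heads, seq, head_dim) tensor."""
    b, h, t, d = x.shape
    half = d // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, device=x.device, dtype=torch.float32) / half))
    angles = torch.arange(t, device=x.device, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class SharedBlock(nn.Module):
    """One attention + MLP block whose base weights are reused across depth."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, attn_lora: LoRA, mlp_lora: LoRA):
        b, t, d = x.shape
        h = self.norm1(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = q + attn_lora(h)  # per-invocation LoRA on the attention path (difference 3)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        q, k = rotary(q), rotary(k)  # rotary embeddings in the shared block (difference 1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.out(attn.transpose(1, 2).reshape(b, t, d))
        h = self.norm2(x)
        x = x + self.mlp(h) + mlp_lora(h)  # per-invocation LoRA on the MLP path
        return x


class TinyHybridSketch(nn.Module):
    """Placeholder backbone layers stand in for Mamba2 blocks; one SharedBlock is reused (difference 2)."""

    def __init__(self, dim=256, n_heads=8, n_layers=6):
        super().__init__()
        self.backbone = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.shared = SharedBlock(dim, n_heads)  # a single block, not two alternating ones
        self.attn_loras = nn.ModuleList([LoRA(dim) for _ in range(n_layers)])
        self.mlp_loras = nn.ModuleList([LoRA(dim) for _ in range(n_layers)])

    def forward(self, x):
        for i, layer in enumerate(self.backbone):
            x = x + torch.tanh(layer(x))  # placeholder for a Mamba2 block
            x = self.shared(x, self.attn_loras[i], self.mlp_loras[i])
        return x


if __name__ == "__main__":
    out = TinyHybridSketch()(torch.randn(2, 16, 256))
    print(out.shape)  # torch.Size([2, 16, 256])
```

In this toy setup the parameter cost of depth specialization is only the small LoRA adapters, while the shared block contributes its full flop count at every invocation, which is the trade-off described above.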
Zamba2-1.2B uses the Mistral v0.1 tokenizer and was pre-trained on 3T tokens of text and code data sourced from open web datasets, including [Zyda](https://arxiv.org/abs/2406.01981). Subsequently, in a second phase, Zamba2-1.2B was annealed on a mixture of 100B high-quality tokens.
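For reference, loading the model for generation with Hugging Face `transformers` might look like the sketch below. This is illustrative rather than official usage: it assumes an environment whose `transformers` build includes Zamba2 support, a CUDA device, and the checkpoint id `Zyphra/Zamba2-1.2B`.

```python
# Illustrative sketch only: assumes a transformers install with Zamba2 support,
# a CUDA device, and the checkpoint id "Zyphra/Zamba2-1.2B".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B")
model = AutoModelForCausalLM.from_pretrained(
    "Zyphra/Zamba2-1.2B", torch_dtype=torch.bfloat16
).to("cuda")

inputs = tokenizer("A funny prompt would be ", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```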