BerenMillidge committed
Commit 46788a4 • 1 Parent(s): c9f031a
Update README.md

README.md CHANGED
@@ -77,7 +77,7 @@ Zamba2-7B-Instruct's high performance, strong instruction-following and reasonin

 ## Model Details

-Zamba2-7B-Instruct utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba2 layers interleaved with one or more shared attention layers
+Zamba2-7B-Instruct utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba2 layers interleaved with one or more shared attention layers. This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.

 <center>
 <img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/XrEIEBxd0fqIgh3LyArAV.png" width="300" alt="Zamba architecture">
@@ -87,7 +87,7 @@ Zamba2-7B-Instruct utilizes and extends our original Zamba hybrid SSM-attention

 Our Zamba2-7B instruct features an experimental long-context mode which extends the context from 4k to 16k context. This was achieved by adjusting the rotation frequency of the rotary position embeddings.

-In Needle-In-A-Haystack tests, we
+In Needle-In-A-Haystack tests, we observe that Zamba2-7B-Instruct finds the needle with an extremely high success rate up to and slightly beyond 16k context with performance falling off sharply at about 18k context. In future versions we aim to extend this context length significantly.


 <center>
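The updated Model Details paragraph describes three mechanisms: a shared attention block reused at multiple depths, concatenation of the original token embeddings to that block's input, and per-invocation LoRA projections on the shared MLP. The sketch below is a minimal PyTorch illustration of those three mechanisms only, not the actual Zamba2 implementation: it omits the Mamba2 backbone, normalization, and residual wiring, and every name in it (`SharedAttentionBlock`, `n_invocations`, `lora_rank`, ...) is hypothetical.

```python
import torch
import torch.nn as nn


class SharedAttentionBlock(nn.Module):
    """Hypothetical sketch: one attention + MLP block whose weights are reused
    at several depths of the network ("invocations"). Its input is the current
    hidden state concatenated with the original token embeddings, and each
    invocation owns a small LoRA correction to the shared MLP input projection."""

    def __init__(self, d_model: int, n_heads: int, n_invocations: int, lora_rank: int = 8):
        super().__init__()
        d_in = 2 * d_model  # hidden state concatenated with original embeddings
        self.attn = nn.MultiheadAttention(d_in, n_heads, batch_first=True)
        # Shared MLP weights, common to every invocation of the block.
        self.mlp_in = nn.Linear(d_in, 4 * d_model)
        self.mlp_out = nn.Linear(4 * d_model, d_model)
        # Per-invocation low-rank (LoRA) factors: cheap, position-specific specialization.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(lora_rank, d_in) * 0.01) for _ in range(n_invocations)]
        )
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(4 * d_model, lora_rank)) for _ in range(n_invocations)]
        )

    def forward(self, hidden: torch.Tensor, embeds: torch.Tensor, invocation: int) -> torch.Tensor:
        x = torch.cat([hidden, embeds], dim=-1)            # concatenate original embeddings
        attn_out, _ = self.attn(x, x, x)                   # shared self-attention
        delta = self.lora_B[invocation] @ self.lora_A[invocation]  # low-rank update, (4*d_model, d_in)
        h = self.mlp_in(attn_out) + attn_out @ delta.T     # shared MLP input projection + per-invocation LoRA
        return self.mlp_out(nn.functional.gelu(h))         # project back to the backbone width


block = SharedAttentionBlock(d_model=64, n_heads=4, n_invocations=3)
hidden = torch.randn(2, 16, 64)   # (batch, seq, d_model) hidden state from the backbone
embeds = torch.randn(2, 16, 64)   # original token embeddings, kept around for every shared block
print(block(hidden, embeds, invocation=0).shape)  # torch.Size([2, 16, 64])
```

The point of the construction, as the paragraph notes, is that the attention and MLP weights are counted once no matter how many times the block appears, while each invocation's LoRA factors add only a small number of extra parameters.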
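The long-context paragraph attributes the 4k-to-16k extension to adjusting the rotation frequency of the rotary position embeddings, without stating the exact scheme or scale factor. The sketch below therefore only illustrates one common way such an adjustment works: linearly scaling the RoPE inverse frequencies (position interpolation) by an assumed factor of 4, so that 16k positions span the angular range the model saw at 4k. Raising the RoPE base is another common variant; all function names and numbers here are illustrative assumptions.

```python
import torch


def rope_inverse_frequencies(head_dim: int, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for each rotated pair of dimensions,
    # optionally slowed down by `scale` (an assumed adjustment, not Zamba2's exact one).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    return inv_freq / scale


def rotation_angles(seq_len: int, head_dim: int, scale: float = 1.0) -> torch.Tensor:
    # Angle applied to each (position, frequency) pair: shape (seq_len, head_dim // 2).
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, rope_inverse_frequencies(head_dim, scale=scale))


orig = rotation_angles(4096, head_dim=128)                  # angles the model saw at 4k context
extended = rotation_angles(16384, head_dim=128, scale=4.0)  # 4x-slowed rotation covering 16k positions
# With the frequencies slowed 4x, position 16380 rotates by exactly the angles
# that position 4095 produced originally, so long positions stay in-distribution.
print(torch.allclose(extended[4 * 4095], orig[4095]))       # True
```

In Hugging Face model configs this kind of adjustment typically surfaces as a `rope_theta` or `rope_scaling` entry; the specific knob used for Zamba2-7B-Instruct is not stated in this diff.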