BerenMillidge committed on
Commit 46788a4
1 Parent(s): c9f031a

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -77,7 +77,7 @@ Zamba2-7B-Instruct's high performance, strong instruction-following and reasonin
 
 ## Model Details
 
-Zamba2-7B-Instruct utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba2 layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
+Zamba2-7B-Instruct utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba2 layers interleaved with one or more shared attention layers. This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
 
 <center>
 <img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/XrEIEBxd0fqIgh3LyArAV.png" width="300" alt="Zamba architecture">
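
For illustration only, the sketch below restates the structure described in the paragraph changed in the hunk above: a backbone of sequence-mixing layers with a single shared attention+MLP block reused at several depths, the original token embeddings concatenated onto the shared block's input, and small per-call-site LoRA adapters on the shared MLP. Everything here is an assumption made for readability (including the GRU stand-in for the Mamba2 layers and all dimensions); it is not the actual Zamba2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedBlock(nn.Module):
    """One attention + MLP block whose weights are reused at every call site."""

    def __init__(self, d_model: int, n_heads: int, n_call_sites: int, lora_rank: int = 8):
        super().__init__()
        # The block sees [hidden state ; original embeddings], hence 2 * d_model.
        self.attn = nn.MultiheadAttention(2 * d_model, n_heads, batch_first=True)
        self.mlp_in = nn.Linear(2 * d_model, 4 * d_model)
        self.mlp_out = nn.Linear(4 * d_model, d_model)
        # One cheap LoRA pair per call site, so each reuse of the shared MLP can
        # specialize slightly to its own depth at little parameter cost.
        self.lora_a = nn.ModuleList([nn.Linear(2 * d_model, lora_rank, bias=False) for _ in range(n_call_sites)])
        self.lora_b = nn.ModuleList([nn.Linear(lora_rank, 4 * d_model, bias=False) for _ in range(n_call_sites)])

    def forward(self, hidden: torch.Tensor, embeddings: torch.Tensor, call_site: int) -> torch.Tensor:
        x = torch.cat([hidden, embeddings], dim=-1)   # concatenate the original embeddings
        attn_out, _ = self.attn(x, x, x)
        h = self.mlp_in(attn_out) + self.lora_b[call_site](self.lora_a[call_site](attn_out))
        return hidden + self.mlp_out(F.gelu(h))


class HybridBackbone(nn.Module):
    """Backbone of sequence mixers with the shared block interleaved every few layers.
    GRUs stand in for Mamba2 layers purely to keep the sketch short and runnable."""

    def __init__(self, vocab_size: int = 32000, d_model: int = 256, n_layers: int = 12, share_every: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.mixers = nn.ModuleList([nn.GRU(d_model, d_model, batch_first=True) for _ in range(n_layers)])
        self.shared = SharedBlock(d_model, n_heads=4, n_call_sites=n_layers // share_every)
        self.share_every = share_every

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        emb = self.embed(input_ids)
        h, call_site = emb, 0
        for i, mixer in enumerate(self.mixers):
            h = h + mixer(h)[0]                       # sequence-mixing backbone layer
            if (i + 1) % self.share_every == 0:       # periodically apply the shared block
                h = self.shared(h, emb, call_site)
                call_site += 1
        return h


if __name__ == "__main__":
    model = HybridBackbone()
    tokens = torch.randint(0, 32000, (1, 16))
    print(model(tokens).shape)  # torch.Size([1, 16, 256])
```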
 
@@ -87,7 +87,7 @@ Zamba2-7B-Instruct utilizes and extends our original Zamba hybrid SSM-attention
 
 Our Zamba2-7B instruct features an experimental long-context mode which extends the context from 4k to 16k context. This was achieved by adjusting the rotation frequency of the rotary position embeddings.
 
-In Needle-In-A-Haystack tests, we can observe that Zamba2-7B-Instruct can find the needle with an extremely high success rate up to and slightly beyond 16k context with performance falling off sharply at about 18k context. In future versions we aim to extend this context length significantly.
+In Needle-In-A-Haystack tests, we observe that Zamba2-7B-Instruct finds the needle with an extremely high success rate up to and slightly beyond 16k context with performance falling off sharply at about 18k context. In future versions we aim to extend this context length significantly.
 
 
 <center>
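
The hunk above attributes the 4k-to-16k extension to adjusting the rotation frequency of the rotary position embeddings. As a hedged illustration of that general idea only (assuming an NTK-style rescaling of the RoPE base; the actual adjustment and values used for Zamba2-7B-Instruct are not stated here), the snippet below shows how raising the base slows the per-position rotation, so the slowest feature pair reaches roughly the same angle at 16k as it did at 4k with the original base.

```python
import torch


def rope_angles(seq_len: int, dim: int, base: float) -> torch.Tensor:
    """Rotation angles used by rotary embeddings: position * base**(-2i/dim)."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, inv_freq)           # shape (seq_len, dim // 2)


dim, scale = 128, 4                                    # 4k -> 16k is a 4x extension
old_base = 10_000.0                                    # a common default, assumed here
new_base = old_base * scale ** (dim / (dim - 2))       # NTK-style base rescaling

short = rope_angles(4_096, dim, old_base)              # original context length
long = rope_angles(16_384, dim, new_base)              # extended context length

# With the rescaled base, the slowest-rotating feature pair reaches roughly the same
# angle at position ~16k as it did at position ~4k with the original base, so the
# model sees familiar rotation ranges despite the longer sequence.
print(short[-1, -1].item(), long[-1, -1].item())
```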