jacobfulano committed
Commit: c66f045
Parent: c721a25

Update README.md

Files changed (1): README.md (+13 -4)
README.md CHANGED
@@ -6,14 +6,23 @@ language:
 - en
 ---
 
-# MosaicBERT base model
-Our goal in developing MosaicBERT was to greatly reduce pretraining time.
+# MosaicBERT-Base model
+MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining. MosaicBERT-Base achieves higher pretraining and finetuning accuracy
+
+### Model Date
+
+March 2023
+
+## Documentation
+* Blog post
+* Github (mosaicml/examples repo)
 
 ## Model description
 
 In order to build MosaicBERT, we adopted architectural choices from the recent transformer literature.
-These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), training in an unpadded manner,
-low precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).
+These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409),
+and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202). In addition, we remove padding inside the transformer block,
+and apply LayerNorm with low precision.
 
 ### Modifications to the Attention Mechanism
 1. **FlashAttention**: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer