jacobfulano committed
Commit: c66f045
Parent: c721a25

Update README.md

Files changed (1): README.md (+13 -4)
README.md CHANGED
@@ -6,14 +6,23 @@ language:
 - en
 ---
 
-# MosaicBERT base model
-Our goal in developing MosaicBERT was to greatly reduce pretraining time.
+# MosaicBERT-Base model
+MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining. MosaicBERT-Base achieves higher pretraining and finetuning accuracy
+
+### Model Date
+
+March 2023
+
+## Documentation
+* Blog post
+* Github (mosaicml/examples repo)
 
 ## Model description
 
 In order to build MosaicBERT, we adopted architectural choices from the recent transformer literature.
-These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), training in an unpadded manner,
-low precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).
+These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409),
+and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202). In addition, we remove padding inside the transformer block,
+and apply LayerNorm with low precision.
 
 ### Modifications to the Attention Mechanism
 1. **FlashAttention**: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer