jacobfulano committed
Commit 29c1999
1 Parent(s): 24512df

Update README.md

Files changed (1): README.md (+41, -1)
README.md CHANGED
@@ -4,4 +4,44 @@ datasets:
- c4
language:
- en
---

# MosaicBERT base model

Our goal in developing MosaicBERT was to greatly reduce pretraining time.

## Model description

To build MosaicBERT, we adopted architectural choices from the recent transformer literature. These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), training in an unpadded manner, low precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).

1. Modifications to the Attention Mechanism
FlashAttention: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer reduces the number of read/write operations between the GPU HBM (high bandwidth memory, i.e. long-term memory) and the GPU SRAM (i.e. short-term memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by [Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI's Triton library](https://github.com/openai/triton).
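
As a rough illustration of how an ALiBi-style bias can be combined with a memory-efficient attention kernel, the sketch below uses PyTorch's built-in `scaled_dot_product_attention` as a stand-in for the Triton FlashAttention module linked above; the helper names, tensor shapes, and the symmetric bias for a bidirectional encoder are illustrative assumptions rather than MosaicBERT's exact implementation.

```python
# Hedged sketch only: hypothetical shapes and helper names; PyTorch's built-in
# scaled_dot_product_attention stands in for the Triton FlashAttention module above.
import torch
import torch.nn.functional as F


def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric head-specific slopes from the ALiBi paper (assumes n_heads is a power of 2).
    ratio = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([ratio ** (i + 1) for i in range(n_heads)])


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Additive bias that grows linearly with query-key distance, one slope per head.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()
    return -distance[None, :, :] * alibi_slopes(n_heads)[:, None, None]


def attention_with_alibi(q, k, v):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    bias = alibi_bias(q.shape[1], q.shape[2]).to(dtype=q.dtype, device=q.device)
    # scaled_dot_product_attention dispatches to a fused / memory-efficient kernel
    # when it can; the ALiBi bias is simply added to the attention scores.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=bias)


q = k = v = torch.randn(2, 8, 128, 64)  # toy tensors
print(attention_with_alibi(q, k, v).shape)  # torch.Size([2, 8, 128, 64])
```

Because the bias depends only on token distance, it can be recomputed for any sequence length, which is part of ALiBi's appeal for extrapolating beyond the training context.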
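
The Gated Linear Units cited above are typically applied in the transformer feed-forward layers. As a hedged sketch (hypothetical class name and hidden sizes, not the exact MosaicBERT module), a GeGLU-style block looks roughly like this:

```python
# Illustrative GeGLU feed-forward block (Shazeer 2020); sizes and names are hypothetical.
import torch
import torch.nn as nn


class GeGLUFeedForward(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        # One projection produces both the "value" and the "gate" halves.
        self.wi = nn.Linear(d_model, 2 * d_hidden)
        self.wo = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.wi(x).chunk(2, dim=-1)
        return self.wo(value * self.act(gate))


ffn = GeGLUFeedForward()
print(ffn(torch.randn(2, 128, 768)).shape)  # torch.Size([2, 128, 768])
```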

## How to use

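
As a hedged sketch, a MosaicBERT-style checkpoint on the Hugging Face Hub can be loaded for masked-token prediction with the `transformers` library. The checkpoint id, the need for `trust_remote_code=True` (custom modeling code), and reuse of the `bert-base-uncased` tokenizer are assumptions about how this model is published, not guarantees from this README.

```python
# Hedged usage sketch: checkpoint id, trust_remote_code, and tokenizer are assumptions.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained(
    "mosaicml/mosaic-bert-base",  # hypothetical checkpoint id
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("MosaicBERT was pretrained on the [MASK] dataset."))
```

The pipeline returns the top-scoring candidate tokens for the `[MASK]` position.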

## Training data

MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text in which some tokens are hidden, and it must predict those masked tokens. MosaicBERT is trained on the English [“Colossal Clean Crawled Corpus” (C4) dataset](https://github.com/allenai/allennlp/discussions/5056), which contains roughly 365 million curated text documents scraped from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining corpora such as English Wikipedia and BooksCorpus.
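
For a quick look at the corpus, the English configuration of C4 can be streamed from the Hugging Face Hub. The `allenai/c4` dataset id and the streaming inspection below are a hedged sketch for exploring the data, not the pipeline actually used to pretrain MosaicBERT.

```python
# Hedged sketch: stream a few English C4 documents for inspection. The "allenai/c4"
# dataset id is an assumption; this is not the MosaicBERT pretraining pipeline.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, doc in enumerate(c4):
    print(doc["text"][:200].replace("\n", " "))
    if i == 2:  # peek at the first three documents only
        break
```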

## Training procedure

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|      |             |      |      |       |      |       |      |      |         |

## Intended uses & limitations