jacobfulano committed
Commit 29c1999
1 Parent(s): 24512df

Update README.md

Files changed (1): README.md (+41, -1)
README.md CHANGED
@@ -4,4 +4,44 @@ datasets:
- c4
language:
- en
---

# MosaicBERT base model

Our goal in developing MosaicBERT was to greatly reduce pretraining time.

## Model description

To build MosaicBERT, we adopted architectural choices from the recent transformer literature. These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), training in an unpadded manner, low precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).

1. Modifications to the Attention Mechanism
FlashAttention: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer reduces the number of read/write operations between the GPU HBM (high bandwidth memory, i.e. long-term memory) and the GPU SRAM (i.e. short-term memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by [Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI's Triton library](https://github.com/openai/triton).
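
As a rough illustration of how an ALiBi-style bias can be combined with a memory-efficient attention kernel, the sketch below uses PyTorch's built-in `scaled_dot_product_attention` as a stand-in for the Triton FlashAttention module linked above; the helper names, tensor shapes, and the symmetric bias for a bidirectional encoder are illustrative assumptions rather than MosaicBERT's exact implementation.

```python
# Hedged sketch only: hypothetical shapes and helper names; PyTorch's built-in
# scaled_dot_product_attention stands in for the Triton FlashAttention module above.
import torch
import torch.nn.functional as F


def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Geometric head-specific slopes from the ALiBi paper (assumes n_heads is a power of 2).
    ratio = 2.0 ** (-8.0 / n_heads)
    return torch.tensor([ratio ** (i + 1) for i in range(n_heads)])


def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    # Additive bias that grows linearly with query-key distance, one slope per head.
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()
    return -distance[None, :, :] * alibi_slopes(n_heads)[:, None, None]


def attention_with_alibi(q, k, v):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    bias = alibi_bias(q.shape[1], q.shape[2]).to(dtype=q.dtype, device=q.device)
    # scaled_dot_product_attention dispatches to a fused / memory-efficient kernel
    # when it can; the ALiBi bias is simply added to the attention scores.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=bias)


q = k = v = torch.randn(2, 8, 128, 64)  # toy tensors
print(attention_with_alibi(q, k, v).shape)  # torch.Size([2, 8, 128, 64])
```

Because the bias depends only on token distance, it can be recomputed for any sequence length, which is part of ALiBi's appeal for extrapolating beyond the training context.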
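
The Gated Linear Units cited above are typically applied in the transformer feed-forward layers. As a hedged sketch (hypothetical class name and hidden sizes, not the exact MosaicBERT module), a GeGLU-style block looks roughly like this:

```python
# Illustrative GeGLU feed-forward block (Shazeer 2020); sizes and names are hypothetical.
import torch
import torch.nn as nn


class GeGLUFeedForward(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 3072):
        super().__init__()
        # One projection produces both the "value" and the "gate" halves.
        self.wi = nn.Linear(d_model, 2 * d_hidden)
        self.wo = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.wi(x).chunk(2, dim=-1)
        return self.wo(value * self.act(gate))


ffn = GeGLUFeedForward()
print(ffn(torch.randn(2, 128, 768)).shape)  # torch.Size([2, 128, 768])
```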

## How to use

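
As a hedged sketch, a MosaicBERT-style checkpoint on the Hugging Face Hub can be loaded for masked-token prediction with the `transformers` library. The checkpoint id, the need for `trust_remote_code=True` (custom modeling code), and reuse of the `bert-base-uncased` tokenizer are assumptions about how this model is published, not guarantees from this README.

```python
# Hedged usage sketch: checkpoint id, trust_remote_code, and tokenizer are assumptions.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model = AutoModelForMaskedLM.from_pretrained(
    "mosaicml/mosaic-bert-base",  # hypothetical checkpoint id
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("MosaicBERT was pretrained on the [MASK] dataset."))
```

The pipeline returns the top-scoring candidate tokens for the `[MASK]` position.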

## Training data

MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text in which some tokens are hidden, and it must predict those masked tokens. MosaicBERT is trained on the English [“Colossal Clean Crawled Corpus” (C4) dataset](https://github.com/allenai/allennlp/discussions/5056), which contains roughly 365 million curated text documents scraped from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining corpora such as English Wikipedia and BooksCorpus.
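
For a quick look at the corpus, the English configuration of C4 can be streamed from the Hugging Face Hub. The `allenai/c4` dataset id and the streaming inspection below are a hedged sketch for exploring the data, not the pipeline actually used to pretrain MosaicBERT.

```python
# Hedged sketch: stream a few English C4 documents for inspection. The "allenai/c4"
# dataset id is an assumption; this is not the MosaicBERT pretraining pipeline.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, doc in enumerate(c4):
    print(doc["text"][:200].replace("\n", " "))
    if i == 2:  # peek at the first three documents only
        break
```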

## Training procedure

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|      |             |      |      |       |      |       |      |      |         |

## Intended uses & limitations