---
license: apache-2.0
datasets:
- c4
language:
- en
---

# MosaicBERT base model

Our goal in developing MosaicBERT was to greatly reduce pretraining time.

## Model description

To build MosaicBERT, we adopted architectural choices from the recent transformer literature. These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), training in an unpadded manner, low precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).

1. Modifications to the Attention Mechanism

   FlashAttention: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer reduces the number of read/write operations between the GPU HBM (high bandwidth memory, i.e. long-term memory) and the GPU SRAM (i.e. short-term memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by [Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI's Triton library](https://github.com/openai/triton).

# How to use

## Training data

MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict these masked tokens.

MosaicBERT is trained on the English ["Colossal, Cleaned, Common Crawl" C4 dataset](https://github.com/allenai/allennlp/discussions/5056), which contains roughly 365 million curated text documents scraped from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining corpora like English Wikipedia and BooksCorpus.

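The masking step of the MLM objective described in this section can be illustrated with a minimal sketch. This is not the MosaicBERT training code: the function name `mask_tokens`, the `[MASK]` placeholder string, the whitespace tokenization, and the masking probabilities used here are all assumptions made for illustration. Real pipelines operate on subword token IDs, and BERT-style recipes typically also replace some selected tokens with random tokens or leave them unchanged.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder token string, assumed for illustration


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide a random subset of tokens and return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a loss would be computed only on masked positions.
    The default mask_prob is an assumption, not a value from this card.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(token)       # the model must predict this token
        else:
            masked.append(token)       # token stays visible
            labels.append(None)        # no prediction target here
    return masked, labels


# Whitespace splitting stands in for a real subword tokenizer.
tokens = "the model has to predict these masked tokens".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

During pretraining, the model would receive the masked sequence as input and be scored only at the positions where `labels` is not `None`.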
## Training procedure

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|      |             |      |      |       |      |       |      |      |         |

## Intended uses & limitations