---
license: apache-2.0
datasets:
- c4
language:
- en
---

# MosaicBERT base model
Our goal in developing MosaicBERT was to greatly reduce pretraining time.

## Model description

To build MosaicBERT, we adopted architectural choices from the recent transformer literature.
These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), unpadded training,
low-precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).

1. Modifications to the Attention Mechanism

   FlashAttention: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer
   reduces the number of read/write operations between the GPU HBM (high-bandwidth memory, the large but slower main memory) and the GPU SRAM
   (the small, fast on-chip memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by
   [Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI’s Triton library](https://github.com/openai/triton).
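
To make the memory-traffic idea concrete, here is a minimal pure-Python sketch of the tiling and running-softmax trick that FlashAttention builds on. It is an illustration only, not the Hazy Research module or a Triton kernel: the function names `naive_attention` and `tiled_attention` and the `block_size` parameter are hypothetical, and a real kernel also tiles the queries and runs fused on the GPU. The point it shows is that keys and values can be visited one block at a time while still producing exact softmax attention.

```python
import math
import random


def naive_attention(q, k, v):
    """Reference attention: builds the full row of scores for every query."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[d] for w, vj in zip(weights, v)) / total
            for d in range(len(v[0]))
        ])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Same result, but keys/values are processed one block at a time.

    Per query we keep only a running max, a running normalizer, and an
    unnormalized output accumulator, so the full score row is never stored.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dim_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * dim_v
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale everything accumulated under the previous max.
            correction = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * correction + sum(weights)
            acc = [
                a * correction + sum(w * vj[d] for w, vj in zip(weights, v_blk))
                for d, a in enumerate(acc)
            ]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for illustration: 3 queries, 5 keys/values, width 4.
q, k, v = rand_matrix(3, 4), rand_matrix(5, 4), rand_matrix(5, 4)
reference = naive_attention(q, k, v)
tiled = tiled_attention(q, k, v, block_size=2)
max_diff = max(abs(a - b) for r, t in zip(reference, tiled) for a, b in zip(r, t))
print("largest difference between naive and tiled outputs:", max_diff)
```

The two functions agree up to floating-point rounding for any block size. On a GPU, the benefit comes from keeping each block in SRAM while it is processed rather than writing the full attention matrix to HBM.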


## How to use

## Training data

MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of 
text with some tokens hidden, and it has to predict these masked tokens. MosaicBERT is trained on 
the English [Colossal Clean Crawled Corpus (C4)](https://github.com/allenai/allennlp/discussions/5056), which contains roughly 365 million curated text documents scraped
from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining
corpora like English Wikipedia and BooksCorpus.
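
As a rough illustration of the MLM objective described above, the sketch below hides a random subset of tokens and records the originals as prediction targets. Several things here are assumptions made for illustration and are not taken from the MosaicBERT training setup: the helper name `mask_tokens`, whitespace tokenization (a real pipeline uses a subword tokenizer), the `[MASK]` placeholder string, and the `mask_prob` value.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string, assumed for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM example.

    labels[i] holds the original token where position i was hidden,
    and None where the token was left visible (no loss at that position).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK_TOKEN
            labels[i] = tok
    return masked, labels


# Whitespace splitting stands in for a real tokenizer here.
tokens = "the model has to predict the hidden tokens from context".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
print(labels)
```

During pretraining, the model sees the masked sequence and is scored only on the positions where `labels` is not `None`. The original BERT recipe additionally replaces some selected tokens with random tokens or leaves them unchanged; that refinement is omitted here for brevity.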

## Training procedure

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results:

GLUE test results:

| Task | MNLI-(m/mm) | QQP  | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE  | Average |
|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
|      |    |  |  |   |  |   |  |  |     |

## Intended uses & limitations