	---
	license: apache-2.0
	datasets:
	- c4
	language:
	- en
	---

	# MosaicBERT base model
	Our goal in developing MosaicBERT was to greatly reduce pretraining time.

	## Model description

	To build MosaicBERT, we adopted architectural choices from the recent transformer literature.
	These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), unpadded training,
	low-precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).

	### Modifications to the Attention Mechanism
	1. FlashAttention: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer
	reduces the number of read/write operations between the GPU HBM (high-bandwidth memory, the larger but slower off-chip memory) and the GPU SRAM
	(the smaller but faster on-chip memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by
	[Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI’s Triton library](https://github.com/openai/triton).

	2. Attention with Linear Biases (ALiBi): In most BERT models, the positions of tokens in a sequence are encoded with a position embedding layer;
	this embedding allows subsequent layers to keep track of the order of tokens in a sequence. ALiBi eliminates position embeddings and
	instead conveys this information using a bias matrix in the attention operation. It modifies the attention mechanism such that nearby
	tokens strongly attend to one another [[Press et al. 2021]](https://arxiv.org/abs/2108.12409). In addition to improving the performance of the final model, ALiBi helps the
	model to handle sequences longer than it saw during training. Details on our ALiBi implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/d14a7c94a0f805f56a7c865802082bf6d8ac8903/examples/bert/src/bert_layers.py#L425).

	3. Unpadding: Standard NLP practice is to combine text sequences of different lengths into a batch, and pad the sequences with empty
	tokens so that all sequence lengths are the same. During training, however, this can lead to many superfluous operations on those
	padding tokens. In MosaicBERT, we take a different approach: we concatenate all the examples in a minibatch into a single sequence
	of batch size 1. Results from NVIDIA and others have shown that this approach leads to speed improvements during training, since
	operations are not performed on padding tokens (see for example [Zeng et al. 2022](https://arxiv.org/pdf/2208.08124.pdf)).
	Details on our “unpadding” implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/main/examples/bert/src/bert_padding.py).

	4. Low Precision LayerNorm: This small tweak forces LayerNorm modules to run in float16 or bfloat16 precision instead of float32, improving utilization.
	The method is described [in the MosaicML documentation here](https://docs.mosaicml.com/en/v0.12.1/method_cards/low_precision_layernorm.html).
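
The ALiBi bias described in item 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the MosaicBERT implementation: the function names are hypothetical, the slope schedule is the geometric one from Press et al. for a power-of-two head count, and the symmetric `|i - j|` distance is an assumption suited to a bidirectional encoder.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Geometric per-head slopes for a power-of-two number of heads."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    # Head h (1-indexed) gets slope 2^(-8h / n_heads).
    return [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]


def alibi_bias(n_heads: int, seq_len: int) -> list[list[list[float]]]:
    """Return bias[h][i][j] = -slope_h * |i - j|.

    The bias is added to the attention scores before the softmax, so
    distant token pairs are penalized more than nearby ones.
    """
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(n_heads)
    ]


bias = alibi_bias(n_heads=8, seq_len=4)
print(bias[0][0])  # penalties from token 0 to tokens 0..3 for the first head
```

Because the bias depends only on the distance between positions, the same function can build a bias matrix for any `seq_len`, which is what lets a model be evaluated on sequences longer than those seen during training.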
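
The unpadding idea from item 3 can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the helper names `unpad` and `repad`, the use of `0` as the padding id, and the returned bookkeeping (flat indices plus cumulative sequence lengths) are choices made for this example and are not taken from the MosaicBERT source.

```python
def unpad(batch, attention_mask):
    """Concatenate the real (non-padding) tokens of a padded batch.

    Returns the flat token list, the flat position of each kept token in the
    padded batch, and cumulative sequence lengths marking example boundaries.
    """
    seq_len = len(batch[0])
    tokens, indices, cu_seqlens = [], [], [0]
    for b, (row, mask) in enumerate(zip(batch, attention_mask)):
        for t, (tok, keep) in enumerate(zip(row, mask)):
            if keep:
                tokens.append(tok)
                indices.append(b * seq_len + t)
        cu_seqlens.append(len(tokens))
    return tokens, indices, cu_seqlens


def repad(tokens, indices, batch_size, seq_len, pad_value=0):
    """Scatter flat tokens back into a padded (batch_size x seq_len) layout."""
    flat = [pad_value] * (batch_size * seq_len)
    for tok, idx in zip(tokens, indices):
        flat[idx] = tok
    return [flat[b * seq_len:(b + 1) * seq_len] for b in range(batch_size)]


batch = [[101, 7, 102, 0, 0], [101, 8, 9, 10, 102]]
mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
tokens, indices, cu_seqlens = unpad(batch, mask)
print(tokens)      # 8 real tokens instead of 10 padded slots
print(cu_seqlens)  # boundaries between the two examples
```

The cumulative lengths let an attention kernel restrict each token to its own example even though all examples share one flat sequence.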

	### Modifications to the Feedforward Layers

	5. Gated Linear Units (GLU): We used Gated Linear Units for the feedforward sublayer of the transformer. GLUs were first proposed in 2016 [[Dauphin et al. 2016]](https://arxiv.org/abs/1612.08083),
	and incorporate an extra learnable matrix that “gates” the outputs of the feedforward layer. More recent work has shown that
	GLUs can improve model quality in transformers [[Shazeer, 2020](https://arxiv.org/abs/2002.05202), [Narang et al. 2021](https://arxiv.org/pdf/2102.11972.pdf)]. We used the GeLU (Gaussian Error Linear Unit)
	activation function with GLU, which is sometimes referred to as GeGLU. The GeLU activation function is a smooth, fully differentiable
	approximation to ReLU; we found that this led to a nominal improvement over ReLU. More details on our implementation of GLU can be found here.
	The extra gating matrix in a GLU adds parameters to the model; we chose to accept these
	additional parameters in our BERT-Base model because doing so leads to a Pareto improvement across all timescales (which is not true of all larger
	models such as BERT-Large). While BERT-Base has 110 million parameters, MosaicBERT-Base has 137 million parameters. Note that
	MosaicBERT-Base trains faster than BERT-Base despite having more parameters.
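
The GeGLU feedforward layer and its parameter overhead can be sketched as below. This is a rough illustration, not the MosaicBERT code: the function names are hypothetical, biases are omitted, and the dimensions used for the count (hidden size 768, feedforward width 3072, 12 layers) are the standard BERT-Base values, assumed here to carry over unchanged.

```python
import math


def gelu(x: float) -> float:
    """Exact GeLU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def geglu_ffn(x, w_gate, w_up, w_out):
    """GeGLU feedforward: (GeLU(x W_gate) * (x W_up)) W_out, without biases."""
    gate = [gelu(v) for v in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_out)


def ffn_weight_count(d_model: int, d_ff: int, gated: bool) -> int:
    """Weights in one feedforward sublayer (biases ignored)."""
    input_matrices = 2 if gated else 1  # the gate adds one input projection
    return (input_matrices + 1) * d_model * d_ff


# Assumed BERT-Base dimensions; not stated in the text above.
d_model, d_ff, n_layers = 768, 3072, 12
extra = n_layers * (
    ffn_weight_count(d_model, d_ff, gated=True)
    - ffn_weight_count(d_model, d_ff, gated=False)
)
print(f"extra weights from gating: {extra / 1e6:.1f}M")
```

Under these assumptions the gating matrices account for roughly 28 million extra weights, which is in line with the gap between the 110 million and 137 million parameter counts quoted above; the estimate ignores biases and any other architectural differences.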


	# How to use

	## Training data

	MosaicBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of
	text with some tokens hidden, and it has to predict these masked tokens. MosaicBERT is trained on
	the English [“Colossal, Cleaned, Common Crawl” C4 dataset](https://github.com/allenai/allennlp/discussions/5056), which contains roughly 365 million curated text documents scraped
	from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining
	corpora like English Wikipedia and BooksCorpus.
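
The masked language modeling objective can be illustrated with a small sketch. It is deliberately simplified and makes several assumptions: the `mask_tokens` helper and its `mask_prob` parameter are hypothetical, every selected token is replaced by `[MASK]` (the original BERT recipe also substitutes random or unchanged tokens some of the time), and the masking rate in the example is arbitrary because the rate used for MosaicBERT is not stated in this section.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob, rng):
    """Hide a random subset of tokens and record what the model must predict.

    Returns (inputs, labels). labels[i] holds the original token at masked
    positions and None elsewhere, so the loss is computed only where a token
    was hidden.
    """
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


rng = random.Random(0)  # seeded so the example is repeatable
tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3, rng=rng)
print(inputs)
print(labels)
```

During pretraining the model sees `inputs` and is scored on how well it recovers the non-`None` entries of `labels`.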

	## Training procedure

	## Evaluation results

	When fine-tuned on downstream tasks, this model achieves the following results:

	GLUE test results:

	| Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
	|:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
	| | | | | | | | | | |

	## Intended uses & limitations