jacobfulano committed
Commit
1dc825e
1 Parent(s): c8eb665

Update README.md

Files changed (1)
  1. README.md +7 -4
README.md CHANGED
@@ -8,7 +8,8 @@ language:
 
  # MosaicBERT-Base model
  MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining.
- MosaicBERT-Base achieves higher pretraining and finetuning accuracy than [bert-base-uncased](https://huggingface.co/bert-base-uncased).
+ MosaicBERT trains faster and achieves higher pretraining and finetuning accuracy when benchmarked against
+ Hugging Face's [bert-base-uncased](https://huggingface.co/bert-base-uncased).
 
  ### Model Date
 
@@ -16,15 +17,17 @@ March 2023
 
  ## Documentation
  * Blog post
- * Github (mosaicml/examples repo)
+ * [Github (mosaicml/examples/bert repo)](https://github.com/mosaicml/examples/tree/main/examples/bert)
 
  # How to use
 
+ We recommend using the code in the [mosaicml/examples/bert repo](https://github.com/mosaicml/examples/tree/main/examples/bert) for pretraining and finetuning this model.
+
  ```python
  from transformers import AutoModelForMaskedLM
  mlm = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base', use_auth_token=<your token>, trust_remote_code=True)
  ```
- The tokenizer for this model is the Hugging Face `bert-base-uncased` tokenizer.
+ The tokenizer for this model is simply the Hugging Face `bert-base-uncased` tokenizer.
 
  ```python
  from transformers import BertTokenizer
@@ -93,7 +96,7 @@ for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, w
 
  2. **Higher Masking Ratio for the Masked Language Modeling Objective**: We used the standard Masked Language Modeling (MLM) pretraining objective.
  While the original BERT paper also included a Next Sentence Prediction (NSP) task in the pretraining objective,
- subsequent papers have shown this to be unnecessary [Liu et al. 2019](https://arxiv.org/abs/1907.11692). For Hugging Face BERT-Base, we used the standard 15% masking ratio.
+ subsequent papers have shown this to be unnecessary [Liu et al. 2019](https://arxiv.org/abs/1907.11692).
  However, we found that a 30% masking ratio led to slight accuracy improvements in both pretraining MLM and downstream GLUE performance.
  We therefore included this simple change as part of our MosaicBERT training recipe. Recent studies have also found that this simple
  change can lead to downstream improvements [Wettig et al. 2022](https://arxiv.org/abs/2202.08005).
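The `AutoModelForMaskedLM` snippet in the README above is cut off at the tokenizer import by the diff window. Purely as a hedged, illustrative sketch (not part of this commit), here is how the checkpoint and the `bert-base-uncased` tokenizer named in the diff can be combined for masked-word prediction; the example sentence and the token placeholder are assumptions for illustration.

```python
# Illustrative sketch: load the checkpoint and tokenizer named in the README diff above
# and predict a masked token. The prompt and token placeholder are illustrative only.
import torch
from transformers import AutoModelForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
mlm = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base',
    use_auth_token='<your token>',  # your Hugging Face access token
    trust_remote_code=True,         # the model ships custom modeling code
)

inputs = tokenizer('The capital of France is [MASK].', return_tensors='pt')
with torch.no_grad():
    logits = mlm(**inputs).logits

# Decode the highest-scoring token at the [MASK] position.
mask_pos = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```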
 
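The 30% masking ratio described in the recipe is a training-time setting; the MosaicML examples repo sets it in its own training configuration. As a hedged illustration only, using the stock Hugging Face collator rather than the MosaicML code path, the same ratio looks like this:

```python
# Illustration of a 30% MLM masking ratio with the standard Hugging Face collator.
# Standard BERT pretraining uses mlm_probability=0.15; the MosaicBERT recipe raises it to 0.30.
from transformers import BertTokenizer, DataCollatorForLanguageModeling

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

batch = collator([tokenizer('MosaicBERT masks thirty percent of input tokens.')])
print(batch['input_ids'])  # ~30% of non-special tokens masked (mostly replaced by [MASK])
print(batch['labels'])     # original ids at masked positions, -100 elsewhere
```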