jacobfulano committed
Commit 2885f1f
1 Parent(s): c66f045

Update README.md

Files changed (1)
  1. README.md +32 -2
README.md CHANGED
@@ -7,7 +7,8 @@ language:
---

# MosaicBERT-Base model
- MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining. MosaicBERT-Base achieves higher pretraining and finetuning accuracy
+ MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining.
+ MosaicBERT-Base achieves higher pretraining and finetuning accuracy than [bert-base-uncased](https://huggingface.co/bert-base-uncased).

### Model Date

@@ -69,7 +70,36 @@ the English [“Colossal, Cleaned, Common Crawl” C4 dataset](https://github.co
from the internet (equivalent to 156 billion tokens). We used this more modern dataset in place of traditional BERT pretraining
corpora like English Wikipedia and BooksCorpus.

- ## Training procedure
+ ## Pretraining Optimizations
+
+ Many of the pretraining optimizations below were informed by our [BERT results for the MLPerf v2.1 speed benchmark](https://www.mosaicml.com/blog/mlperf-nlp-nov2022).
+
+ 1. MosaicML Streaming Dataset
+ As part of our efficiency pipeline, we converted the C4 dataset to [MosaicML’s StreamingDataset format](https://www.mosaicml.com/blog/mosaicml-streamingdataset) and used this
+ for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 128; this covers 78.6% of C4.
+
+ 2. Higher Masking Ratio for the Masked Language Modeling Objective
+ We used the standard Masked Language Modeling (MLM) pretraining objective.
+ While the original BERT paper also included a Next Sentence Prediction (NSP) task in the pretraining objective,
+ subsequent papers have shown this to be unnecessary [Liu et al. 2019](https://arxiv.org/abs/1907.11692). For Hugging Face BERT-Base, we used the standard 15% masking ratio.
+ However, we found that a 30% masking ratio led to slight accuracy improvements in both pretraining MLM and downstream GLUE performance.
+ We therefore included this simple change as part of our MosaicBERT training recipe. Recent studies have also found that this simple
+ change can lead to downstream improvements [Wettig et al. 2022](https://arxiv.org/abs/2202.08005).
+
+ 3. Bfloat16 Precision
+ We use [bf16 (bfloat16) mixed precision training](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) for all the models, where a matrix multiplication layer uses bf16
+ for the multiplication and 32-bit IEEE floating point for gradient accumulation. We found this to be more stable than using float16 mixed precision.
+
+ 4. Vocab Size as a Multiple of 64
+ We increased the vocab size to be a multiple of 8 as well as 64 (i.e. from 30,522 to 30,528).
+ This small constraint is something of [a magic trick among ML practitioners](https://twitter.com/karpathy/status/1621578354024677377), and leads to a throughput speedup.
+
+ 5. Hyperparameters
+ For all models, we use Decoupled AdamW with Beta1=0.9 and Beta2=0.98, and a weight decay value of 1.0e-5. The learning rate schedule begins with a warmup to a maximum learning rate of 5.0e-4 followed by a linear decay to zero. Warmup lasted for 6% of the full training duration. Global batch size was set to 4096 with a microbatch size of 128, so full pretraining consisted of 70,000 batches. We set the maximum sequence length during pretraining to 128, and we used the standard embedding dimension of 768. These hyperparameters were the same for MosaicBERT-Base and the baseline BERT-Base.
+ For the baseline BERT, we applied the standard 0.1 dropout to both the attention and feedforward layers of the transformer block. For MosaicBERT, however, we applied 0.1 dropout to the feedforward layers but no dropout to the FlashAttention module, as this was not possible with the OpenAI Triton implementation.
+ Full configuration details for pretraining MosaicBERT-Base can be found in the configuration YAMLs [in the mosaicml/examples repo](https://github.com/mosaicml/examples/tree/main/bert/yamls/main).
+

## Evaluation results
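
The optimizations added in this commit translate into fairly small code changes. The first, converting C4 to MosaicML's StreamingDataset format, is conceptually a one-pass rewrite of the raw text into MDS shards. Below is a minimal sketch of what that conversion can look like with the `streaming` library's `MDSWriter`; the output path, the single `text` column, and the use of the Hugging Face `datasets` loader are illustrative assumptions rather than the exact pipeline behind this commit.

```python
from datasets import load_dataset   # Hugging Face datasets
from streaming import MDSWriter     # pip install mosaicml-streaming

# Stream the English C4 training split and rewrite it as MDS shards that
# StreamingDataset can later read locally or from object storage.
c4 = load_dataset("c4", "en", split="train", streaming=True)

columns = {"text": "str"}  # one raw-text column; tokenization happens later
with MDSWriter(out="./c4-mds/train", columns=columns, compression="zstd") as writer:
    for sample in c4:
        writer.write({"text": sample["text"]})

# Reading it back for training:
# from streaming import StreamingDataset
# train_ds = StreamingDataset(local="./c4-mds/train", shuffle=True)
```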
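The 30% masking ratio is a one-number change in most MLM data pipelines. As a sketch only, here is how it can be expressed with Hugging Face's `DataCollatorForLanguageModeling`; the recipe's own data collator is configured in the training code linked above, so treat this as an equivalent illustration rather than the exact implementation.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# 30% masking for MosaicBERT; the BERT-Base baseline keeps the standard 15%.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

batch = collator([tokenizer("MosaicBERT is optimized for fast pretraining.")])
# batch["labels"] is -100 everywhere except the ~30% of positions selected for masking.
```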
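The bf16 mixed-precision setup can be illustrated with plain PyTorch autocast. The actual runs used MosaicML's training stack rather than this hand-rolled loop, so the snippet below is a sketch of the same idea: matrix multiplications run in bfloat16 while the parameters and optimizer state stay in 32-bit floating point, and no loss scaling is required.

```python
import torch

model = torch.nn.Linear(768, 768).cuda()   # stand-in for a transformer layer
optimizer = torch.optim.AdamW(model.parameters(), lr=5.0e-4)

x = torch.randn(16, 768, device="cuda")

# Matmuls inside autocast execute in bfloat16; parameters and optimizer
# state stay in float32. Unlike float16 mixed precision, bf16 keeps the
# full float32 exponent range, so no loss scaling is needed.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()

loss.backward()     # gradients are accumulated in float32
optimizer.step()
```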
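Padding the vocabulary from 30,522 to 30,528 is just rounding up to the next multiple of 64 (which is automatically a multiple of 8 as well). A tiny helper, named here purely for illustration, makes the arithmetic explicit.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round a vocabulary size up to the nearest multiple, e.g. 30,522 -> 30,528."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert pad_vocab_size(30_522) == 30_528   # the change described in the diff
assert pad_vocab_size(30_528) == 30_528   # already aligned, left unchanged

# With a Hugging Face model, the padded size would typically be applied via
# model.resize_token_embeddings(pad_vocab_size(len(tokenizer))).
```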
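The optimizer and schedule in the hyperparameters item can be restated as a short plain-PyTorch sketch: `torch.optim.AdamW` applies decoupled weight decay, and a `LambdaLR` reproduces the 6% linear warmup to a peak learning rate of 5.0e-4 followed by linear decay to zero over the 70,000-step budget. The real configuration lives in the YAMLs linked in the diff; the model stand-in and function names below are hypothetical.

```python
import torch

model = torch.nn.Linear(768, 768)     # stand-in; the recipe trains MosaicBERT-Base

max_steps = 70_000                    # 286,720,000 samples / 4096 global batch size
warmup_steps = int(0.06 * max_steps)  # warmup for 6% of the full training duration

# AdamW in PyTorch applies decoupled weight decay, matching the
# Decoupled AdamW settings described in the diff.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=5.0e-4, betas=(0.9, 0.98), weight_decay=1.0e-5
)

def warmup_then_linear_decay(step: int) -> float:
    """Scale factor on the peak LR: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_then_linear_decay)
```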
105