hybrid-BERTchen-v0.1
A MosaicBERT model efficiently pretrained on the German part of CulturaX. We have released our Paper and Code.
Motivation
Encoder-only models perform well on a variety of tasks. However, their efficient pretraining and language adaptation remain underexplored. This study presents a method for training efficient, state-of-the-art German encoder-only models. Our research highlights the inefficiency of BERT models, in particular due to the plateau effect, and shows how architectural improvements such as the MosaicBERT architecture and curriculum learning approaches can combat it. We show the importance of an in-domain tokenizer and investigate different pretraining sequence lengths and datasets. BERTchen beats the previous best model, GottBERT, on GermanQuAD, increasing the F1 score from 55.14 to 95.1 and the exact match from 73.06 to 91.9. Our research provides a foundation for training efficient encoder-only models in different languages.
Model description
BERTchen follows the MosaicBERT architecture and utilizes FlashAttention 2. It was pretrained for 4 hours on a single A100 40GB GPU.
The tokenizer is taken from prior work on efficient German pretraining: paper and code.
Only the masked language modeling objective is used, which makes the [CLS] token redundant; it is therefore excluded from the tokenizer. As pretraining data, a random subset of the CulturaX dataset is used.
Training procedure
BERTchen was pretrained using the MosaicBERT hyperparameters (which can be found in the paper and here), except for the training goal, which we set to 2,500 to better estimate the number of steps the model will take. In addition, we use a batch size of 1024 with a sequence length of 512, as we found this to work better. All training configs can be found here.
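The settings above can be summarized in a small config sketch. This is illustrative only: the key names are hypothetical and do not match the actual config files linked above.

```python
# Hypothetical summary of the pretraining setup described in this card.
# Key names are illustrative, not the real config schema.
pretrain_config = {
    "batch_size": 1024,          # stated above
    "max_seq_len": 512,          # stated above
    "training_goal": 2_500,      # replaces MosaicBERT's default to estimate step count
    "objective": "mlm",          # masked language modeling only
    "hardware": "1x A100 40GB",  # 4-hour budget
}

print(pretrain_config["batch_size"] * pretrain_config["max_seq_len"])  # tokens per batch
```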
Sequence length scheduling
BERT is known to plateau during pretraining: the loss does not decrease over a long stretch of steps. A previously proposed solution doubles the sequence length from 64 to 512 over the course of training (see this paper). We improve on this by changing the sequence length dynamically whenever the validation loss stagnates. This early-stopping-like mechanism allows for dynamic adaptation to training influences.
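The mechanism can be sketched as an early-stopping-style monitor that advances the sequence length instead of halting training. This is a minimal sketch, assuming a 64-to-512 doubling ladder and a patience/min-delta plateau criterion; the actual implementation and thresholds are in the released code.

```python
# Sketch: plateau-triggered sequence length schedule.
# Assumptions: the doubling ladder 64 -> 128 -> 256 -> 512 and a
# patience-based stagnation test; both are illustrative.
SEQ_LENGTHS = [64, 128, 256, 512]

class SeqLenScheduler:
    def __init__(self, patience=3, min_delta=0.01):
        self.patience = patience      # evaluations without improvement before switching
        self.min_delta = min_delta    # minimum loss drop that counts as improvement
        self.best = float("inf")
        self.bad = 0
        self.idx = 0

    def step(self, val_loss):
        """Call after each validation; returns the sequence length to train with."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad = 0
        else:
            self.bad += 1
            if self.bad >= self.patience and self.idx < len(SEQ_LENGTHS) - 1:
                self.idx += 1          # validation loss stagnated: grow the sequence length
                self.bad = 0
                self.best = float("inf")
        return SEQ_LENGTHS[self.idx]
```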
Masking scheduling
On top of the sequence length schedule, the masking rate for masked language modeling starts at 0.3 and drops by 0.05 at each sequence length switch, ending at 0.15. This is motivated by a paper that investigates the optimal masking rate.
Evaluation results
After finetuning on GermanQuAD, GermEval 2017 (Task B), and GerMS-Detect Subtask 1 from GermEval 2024, we get the following results:
| Task | GermanQuAD (F1/EM) | GermEval 2017 B | GermEval 2024 Subtask 1 (majority vote) |
|---|---|---|---|
| BERTchen | 95.5/92.2 | 0.961 | 0.906 |
Efficiency
With MosaicBERT and FlashAttention 2, we increase throughput from BERT's 190,000 tokens per second to about 250,000 tokens per second and achieve an MFU of 65.87% (see the paper for more details and calculations).
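As a rough sanity check on that MFU figure: with the standard estimate of ~6 FLOPs per parameter per training token, assuming a MosaicBERT-Base-sized model of about 137M parameters and the A100's 312 TFLOP/s BF16 peak (both assumptions, not stated in this card), 250,000 tokens per second lands close to the reported value.

```python
# Back-of-the-envelope MFU check.
# Assumptions (not from this card): ~137M parameters, ~6 FLOPs per
# parameter per token for training, 312 TFLOP/s A100 BF16 peak.
params = 137e6
tokens_per_second = 250_000
peak_flops = 312e12

model_flops_per_second = 6 * params * tokens_per_second
mfu = model_flops_per_second / peak_flops
print(f"{mfu:.2%}")  # roughly 66% under these assumptions
```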
Model variations
For the creation of BERTchen, we tested different datasets and training setups. Two notable variants are:
BERTchen-v0.1
Normal pretraining on the CulturaX dataset. This serves as a baseline for this model.
BERTchen-v0.1-C4
Same pretraining as BERTchen-v0.1, just on the C4 dataset.