---
license: mit
datasets:
- uonlp/CulturaX
language:
- de
library_name: transformers
pipeline_tag: fill-mask
---

# hybrid-BERTchen-v0.1

A [MosaicBERT](https://huggingface.co/mosaicml/mosaic-bert-base) model efficiently pretrained on the German part of [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX). We have released our [Paper](https://github.com/FSadrieh/BERTchen/blob/main/BERTchen_paper.pdf) and [Code](https://github.com/FSadrieh/BERTchen).

## Motivation

Encoder-only models perform well in a variety of tasks. However, their efficient pretraining and language adaptation remain underexplored. This study presents a method for training efficient, state-of-the-art German encoder-only models. Our research highlights the inefficiency of BERT models, in particular due to the plateau effect, and shows how architectural improvements such as the MosaicBERT architecture and curriculum learning approaches can combat it. We show the importance of an in-domain tokenizer and investigate different pretraining sequence lengths and datasets. BERTchen can beat the previous best model GottBERT on GermanQuAD, increasing the F1 score from 55.14 to 95.1 and the exact match from 73.06 to 91.9. Our research provides a foundation for training efficient encoder-only models in different languages.

## Model description

BERTchen follows the architecture of MosaicBERT (introduced in [this paper](https://arxiv.org/abs/2312.17482)) and utilizes [FlashAttention 2](https://arxiv.org/abs/2307.08691). It is pretrained for 4 hours on one A100 40GB GPU. The tokenizer is taken from prior work on efficient German pretraining: [paper](https://openreview.net/forum?id=VYfJaHeVod) and [code](https://github.com/konstantinjdobler/tight-budget-llm-adaptation). Only the masked language modeling objective is used, which makes the `[CLS]` token redundant, so it is excluded from the tokenizer. As pretraining data, a random subset of the CulturaX dataset (introduced in [this paper](https://arxiv.org/abs/2309.09400)) is used.

## Training procedure

BERTchen was pretrained using the MosaicBERT hyperparameters (which can be found in the [paper](https://arxiv.org/abs/2312.17482) and [here](https://github.com/mosaicml/examples/blob/main/examples/benchmarks/bert/yamls/main/mosaic-bert-base-uncased.yaml)), except for the training goal, which we set to 2,500 to better estimate the number of steps the model will make. In addition, we use a batch size of 1024 with a sequence length of 512, as we found this to work better. All training configs can be found [here](https://github.com/FSadrieh/BERTchen/tree/main/cfgs).

### Sequence length scheduling

BERT is known to hit plateaus during pretraining, where the loss does not decrease for a large number of steps. One proposed solution is to double the sequence length from 64 to 512 during training (see this [paper](https://aclanthology.org/2021.ranlp-1.112/)). We improve on this by changing the sequence length dynamically whenever the validation loss stagnates. This early-stopping-like mechanism allows the schedule to adapt to how training actually progresses.

### Masking scheduling

On top of the sequence length schedule, the masking rate for masked language modeling starts at 0.3 and drops by 0.05 at each sequence length switch, ending at 0.15. This is motivated by a [paper](https://aclanthology.org/2024.eacl-short.42.pdf) that investigates the optimal masking rate.
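
The two schedules above can be sketched together as a small plateau-triggered curriculum. This is only an illustration and not the code from our repository: the class name `PlateauCurriculum`, the `patience` and `min_delta` parameters, the intermediate lengths 128 and 256 (our reading of "doubling from 64 to 512"), and the choice to reset the loss tracker after a switch are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PlateauCurriculum:
    """Advance sequence length and masking rate when validation loss stagnates."""

    seq_lengths: tuple = (64, 128, 256, 512)  # assumed doubling schedule
    start_mask_rate: float = 0.30             # masking rate of the first stage
    mask_step: float = 0.05                   # drop per sequence length switch
    patience: int = 3         # hypothetical: evaluations without improvement
    min_delta: float = 0.01   # hypothetical: smallest improvement that counts

    stage: int = 0
    best_loss: float = float("inf")
    bad_evals: int = 0

    @property
    def seq_length(self) -> int:
        return self.seq_lengths[self.stage]

    @property
    def mask_rate(self) -> float:
        # Rounded to hide floating-point noise such as 0.19999999999999998.
        return round(self.start_mask_rate - self.mask_step * self.stage, 2)

    def update(self, val_loss: float) -> bool:
        """Record one validation loss; return True if the stage advanced."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_evals = 0
            return False
        self.bad_evals += 1
        on_last_stage = self.stage == len(self.seq_lengths) - 1
        if self.bad_evals < self.patience or on_last_stage:
            return False
        # Plateau detected: switch stage and reset the tracker, since losses
        # at a new sequence length and masking rate are not comparable.
        self.stage += 1
        self.best_loss = float("inf")
        self.bad_evals = 0
        return True


# Made-up validation losses, purely to show the switching behaviour.
curriculum = PlateauCurriculum(patience=2)
for loss in [5.0, 4.0, 3.995, 3.995, 3.5, 3.5, 3.5]:
    switched = curriculum.update(loss)
    print(loss, curriculum.seq_length, curriculum.mask_rate, switched)
```

With four sequence lengths there are three switches, which is what takes the masking rate from 0.3 down to 0.15 in steps of 0.05.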

## Evaluation results

After finetuning on [GermanQuAD](https://huggingface.co/datasets/deepset/germanquad), [GermEval 2017 B](https://sites.google.com/view/germeval2017-absa/home) and [GerMS-Detect subtask 1 from GermEval 2024](https://ofai.github.io/GermEval2024-GerMS/subtask1.html), we get the following results:

| Task | GermanQuAD (F1/EM) | GermEval 2017 B | GermEval 2024 Subtask 1 as majority vote |
|:----:|:-----------:|:----:|:----:|
| Score | 95.5/92.2 | 0.961 | 0.906 |

## Efficiency

With MosaicBERT and FlashAttention 2, we increase the throughput from 190,000 tokens per second for BERT to about 250,000 tokens per second and achieve an MFU of 65.87% (see the paper for more details and calculations).

## Model variations

For the creation of BERTchen we tested different datasets and training setups. Two notable variants are:

- [`BERTchen-v0.1`](https://huggingface.co/frederic-sadrieh/BERTchen-v0.1) Normal pretraining on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset. This serves as a baseline for this model.
- [`BERTchen-v0.1-C4`](https://huggingface.co/frederic-sadrieh/BERTchen-v0.1-C4) Same pretraining as BERTchen-v0.1, but on the [C4](https://huggingface.co/datasets/allenai/c4) dataset.