---
license: mit
datasets:
- allenai/c4
language:
- de
library_name: transformers
pipeline_tag: fill-mask
---

# BERTchen-v0.1-C4

Efficiently pretrained [MosaicBERT](https://huggingface.co/mosaicml/mosaic-bert-base) model on German [C4](https://huggingface.co/datasets/allenai/c4).
We have released our [paper](https://github.com/FSadrieh/BERTchen/blob/main/BERTchen_paper.pdf) and [code](https://github.com/FSadrieh/BERTchen).

## Motivation

Encoder-only models perform well in a variety of tasks. However, their efficient pretraining and language adaptation remain underexplored. This study presents a method for training efficient, state-of-the-art German encoder-only models. Our research highlights the inefficiency of BERT models, in particular due to the plateau effect, and shows how architectural improvements such as the MosaicBERT architecture and curriculum learning approaches can combat it. We show the importance of an in-domain tokenizer and investigate different pretraining sequence lengths and datasets. BERTchen can beat the previous best model GottBERT on GermanQuAD, increasing the F1 score from 55.14 to 95.1 and the exact match from 73.06 to 91.9. Our research provides a foundation for training efficient encoder-only models in different languages.

## Model description

BERTchen follows the architecture of MosaicBERT (introduced in [this paper](https://arxiv.org/abs/2312.17482)) and utilizes [FlashAttention 2](https://arxiv.org/abs/2307.08691). It was pretrained for 4 hours on a single A100 40GB GPU.

The tokenizer is taken from prior work on efficient German pretraining: [paper](https://openreview.net/forum?id=VYfJaHeVod) and [code](https://github.com/konstantinjdobler/tight-budget-llm-adaptation). Only the masked language modeling objective is used, which makes the `[CLS]` token redundant, so it is excluded from the tokenizer. As pretraining data, a random subset of the German C4 dataset (introduced in [this paper](https://arxiv.org/abs/1910.10683)) is used.

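## Usage sketch

Since the model is tagged for the `fill-mask` pipeline, a query could look roughly like the sketch below. Several things here are assumptions rather than statements from this card: the repository id `frederic-sadrieh/BERTchen-v0.1-C4` is inferred from the naming of the sibling variants, `trust_remote_code=True` is assumed to be needed because MosaicBERT ships custom modeling code, and the helper names `build_masked_sentence` and `top_predictions` are hypothetical. Depending on the environment, a GPU and additional dependencies may be required.

```python
from typing import List

# ASSUMPTION: repository id inferred from the sibling variants' naming scheme.
MODEL_ID = "frederic-sadrieh/BERTchen-v0.1-C4"


def build_masked_sentence(template: str, mask_token: str) -> str:
    """Replace the single '{mask}' placeholder with the tokenizer's mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


def top_predictions(template: str, top_k: int = 5) -> List[str]:
    """Return the top_k predicted tokens for the masked position (hypothetical helper)."""
    # Imported lazily so build_masked_sentence works without transformers installed.
    from transformers import pipeline

    # ASSUMPTION: custom MosaicBERT code has to be trusted to load the model.
    fill_mask = pipeline("fill-mask", model=MODEL_ID, trust_remote_code=True)

    # Read the mask token from the tokenizer instead of hard-coding it.
    sentence = build_masked_sentence(template, fill_mask.tokenizer.mask_token)
    return [prediction["token_str"] for prediction in fill_mask(sentence, top_k=top_k)]
```

Calling `top_predictions("Die Hauptstadt von Deutschland ist {mask}.")` downloads the weights on first use and returns the highest-scoring candidate tokens.
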
## Training procedure

BERTchen was pretrained using the MosaicBERT hyperparameters (which can be found in the [paper](https://arxiv.org/abs/2312.17482) and [here](https://github.com/mosaicml/examples/blob/main/examples/benchmarks/bert/yamls/main/mosaic-bert-base-uncased.yaml)), except for the training goal, which we set to 2,500 to better estimate the number of steps the model will make. In addition, we use a batch size of 1024 with a sequence length of 512, as we found this to work better. All training configs can be found [here](https://github.com/FSadrieh/BERTchen/tree/main/cfgs).

## Evaluation results

After finetuning on [GermanQuAD](https://huggingface.co/datasets/deepset/germanquad), [GermEval 2017 B](https://sites.google.com/view/germeval2017-absa/home) and [GerMS-Detect subtask 1 from GermEval 2024](https://ofai.github.io/GermEval2024-GerMS/subtask1.html), we get the following results:

| Task | GermanQuAD (F1/EM) | GermEval 2017 B | GermEval 2024 Subtask 1 (majority vote) |
|:----:|:-----------:|:----:|:----:|
| Score | 96.4/93.6 | 0.96 | 0.887 |

## Efficiency

With MosaicBERT and FlashAttention 2, we increase the throughput from 190,000 tokens per second for BERT to about 250,000 tokens per second and achieve a model FLOPs utilization (MFU) of 65.87% (see the paper for more details and calculations).

## Model variations

For the creation of BERTchen we tested different datasets and training setups. Two notable variants are:

- [`BERTchen-v0.1`](https://huggingface.co/frederic-sadrieh/BERTchen-v0.1): Same pretraining setup and hyperparameters, but trained on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset.
- [`hybrid_BERTchen-v0.1`](https://huggingface.co/frederic-sadrieh/hybrid_BERTchen-v0.1): Pretrained on [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) with our own hybrid approach that changes the sequence length during pretraining (see its model card or the paper for details).
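
## Throughput arithmetic

The figures quoted in the Training procedure and Efficiency sections can be related with a few lines of arithmetic. The sketch below is a back-of-the-envelope illustration, not the calculation from the paper. It assumes fully packed sequences (no padding), and the reference peak of 312 TFLOP/s (A100 dense BF16) is an assumption, since this card does not state which peak value the MFU is measured against. All function names are hypothetical.

```python
# Values taken from this model card.
BATCH_SIZE = 1024                # sequences per optimizer step
SEQ_LEN = 512                    # tokens per sequence
BERT_TOKENS_PER_S = 190_000      # baseline BERT throughput
BERTCHEN_TOKENS_PER_S = 250_000  # approximate BERTchen throughput
MFU = 0.6587                     # reported model FLOPs utilization

# ASSUMPTION: A100 dense BF16 peak; the card does not name the reference peak.
ASSUMED_PEAK_TFLOPS = 312.0


def tokens_per_step(batch_size: int, seq_len: int) -> int:
    """Tokens processed per optimizer step, assuming no padding."""
    return batch_size * seq_len


def relative_speedup(new_tokens_per_s: float, old_tokens_per_s: float) -> float:
    """Throughput ratio of the new setup over the baseline."""
    return new_tokens_per_s / old_tokens_per_s


def achieved_tflops(mfu: float, peak_tflops: float) -> float:
    """MFU is achieved FLOP/s divided by peak FLOP/s, so achieved = mfu * peak."""
    return mfu * peak_tflops


if __name__ == "__main__":
    print(f"tokens per step: {tokens_per_step(BATCH_SIZE, SEQ_LEN):,}")
    print(f"speedup over BERT: {relative_speedup(BERTCHEN_TOKENS_PER_S, BERT_TOKENS_PER_S):.2f}x")
    print(f"implied model TFLOP/s: {achieved_tflops(MFU, ASSUMED_PEAK_TFLOPS):.1f}")
```

With the values above, one optimizer step covers 524,288 tokens and the quoted throughput corresponds to a speedup of about 1.32x over BERT. The implied TFLOP/s figure only holds under the assumed peak value.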