frederic-sadrieh committed
Commit
c5afd3e
1 Parent(s): ef0d4ab
Files changed (1)
  1. README.md +13 -7
README.md CHANGED
@@ -11,26 +11,32 @@ pipeline_tag: fill-mask
  # BERTchen-v0.1-C4

  Efficiently pretrained [MosaicBERT](https://huggingface.co/mosaicml/mosaic-bert-base) model on German [C4](https://huggingface.co/datasets/allenai/c4).
- Paper and Code following soon.

  ## Model description

- BERTchen follows the architecture of a MosaicBERT model (introduced [in](https://arxiv.org/abs/2312.17482)) and utilizes [FlashAttention 2](https://arxiv.org/abs/2307.08691). It is pretrained for 4 hours on one A100 40GB GPU.

  Only the masked language modeling objective is used, making the [CLS] token redundant, which is excluded from the tokenizer. As pretraining data, a random subset of the German C4 dataset (introduced [in](https://arxiv.org/abs/1910.10683)) is used.

- The tokenizer is taken from other efficient German pretraining work: [paper](https://openreview.net/forum?id=VYfJaHeVod) and [code](https://github.com/konstantinjdobler/tight-budget-llm-adaptation)
-
  ## Training procedure
- BERTchen was pretrained using the MosaicBERT hyper-parameters (Which can be found in the [paper](https://arxiv.org/abs/2312.17482) and [here](https://github.com/mosaicml/examples/blob/main/examples/benchmarks/bert/yamls/main/mosaic-bert-base-uncased.yaml)). We changed the training-goal to 2500 to better reflect the steps achievable by the model in the constrained time. In addition, we used a batch size of 1024, with a sequence length of 512 as we found this to work better. After 4 hours the training is cut and the checkpoint saved.

  ## Evaluation results
  | Task | Germanquad (F1/EM) | Germeval 2017 B | Germeval 2024 Subtask 1 as majority vote |
  |:----:|:-----------:|:----:|:----:|
  | | 96.4/93.6 | 0.96 | 0.887 |

  ## Model variations
  For the creation of BERTchen we tested different datasets and training setups. Two notable variants are:

- - [`BERTchen-v0.1`](https://huggingface.co/frederic-sadrieh/BERTchen-v0.1) Same pre-training just on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset.
- - [`hybrid_BERTchen-v0.1`](https://huggingface.co/frederic-sadrieh/hybrid_BERTchen-v0.1) Pre-trained on [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) with own hybrid sequence length changing approach (For more information see model card or paper)
  # BERTchen-v0.1-C4

  Efficiently pretrained [MosaicBERT](https://huggingface.co/mosaicml/mosaic-bert-base) model on German [C4](https://huggingface.co/datasets/allenai/c4).
+ We have released our [Paper](https://github.com/FSadrieh/BERTchen/blob/main/BERTchen_paper.pdf) and [Code](https://github.com/FSadrieh/BERTchen).
+
+ ## Motivation
+ Encoder-only models perform well in a variety of tasks. However, their efficient pretraining and language adaptation remain underexplored. This study presents a method for training efficient, state-of-the-art German encoder-only models. Our research highlights the inefficiency of BERT models, in particular due to the plateau effect, and shows how improvements such as the MosaicBERT architecture and curriculum learning approaches can counter it. We show the importance of an in-domain tokenizer and investigate different pretraining sequence lengths and datasets. BERTchen beats the previous best model, GottBERT, on GermanQuAD, increasing the F1 score from 55.14 to 95.1 and the exact match from 73.06 to 91.9. Our research provides a foundation for training efficient encoder-only models in different languages.

  ## Model description
+ BERTchen follows the architecture of MosaicBERT (introduced in [this paper](https://arxiv.org/abs/2312.17482)) and utilizes [FlashAttention 2](https://arxiv.org/abs/2307.08691). It is pretrained for 4 hours on one A100 40GB GPU.

+ The tokenizer is taken from prior efficient German pretraining work: [paper](https://openreview.net/forum?id=VYfJaHeVod) and [code](https://github.com/konstantinjdobler/tight-budget-llm-adaptation).

  Only the masked language modeling objective is used, which makes the [CLS] token redundant; it is therefore excluded from the tokenizer. As pretraining data, a random subset of the German C4 dataset (introduced in [this paper](https://arxiv.org/abs/1910.10683)) is used.
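
A minimal fill-mask usage sketch in Python (the model ID is taken from this card; whether `trust_remote_code=True` is needed depends on how the checkpoint packages the MosaicBERT modeling code, so treat that flag as an assumption):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Model ID assumed from this model card; MosaicBERT checkpoints usually ship
# custom modeling code, hence trust_remote_code=True.
model_id = "frederic-sadrieh/BERTchen-v0.1-C4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# German fill-mask example; the mask token is read from the tokenizer instead
# of being hard-coded.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Berlin ist die {tokenizer.mask_token} von Deutschland."))
```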

  ## Training procedure
+ BERTchen was pretrained using the MosaicBERT hyperparameters (which can be found in the [paper](https://arxiv.org/abs/2312.17482) and [here](https://github.com/mosaicml/examples/blob/main/examples/benchmarks/bert/yamls/main/mosaic-bert-base-uncased.yaml)), except for the training goal, which we set to 2,500 to better match the number of steps the model can actually take within the constrained training time. In addition, we use a batch size of 1024 and a sequence length of 512, as we found this to work better. All training configs can be found [here](https://github.com/FSadrieh/BERTchen/tree/main/cfgs).
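
As an illustration of the setup described above (the key names are ours, not the YAML's; the linked configs remain the authoritative source), the deviations from the default MosaicBERT recipe can be summarized as:

```python
# Illustrative summary of the pretraining setup described above; the key names
# are ours and the linked YAML configs are authoritative.
bertchen_pretraining_setup = {
    "training_goal": 2_500,     # planned total training steps
    "global_batch_size": 1024,  # sequences per optimizer step
    "max_seq_len": 512,         # pretraining sequence length
    "time_budget_hours": 4,     # wall-clock cut-off on one A100 40GB GPU
}

# Each optimizer step therefore processes 1024 * 512 = 524,288 tokens.
tokens_per_step = bertchen_pretraining_setup["global_batch_size"] * bertchen_pretraining_setup["max_seq_len"]
print(f"{tokens_per_step:,} tokens per optimizer step")
```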

  ## Evaluation results
+ After finetuning on [GermanQuAD](https://huggingface.co/datasets/deepset/germanquad), [Germeval 2017 B](https://sites.google.com/view/germeval2017-absa/home) and [GerMS-Detect subtask 1 from Germeval 2024](https://ofai.github.io/GermEval2024-GerMS/subtask1.html), we get the following results:
  | Model | GermanQuAD (F1/EM) | Germeval 2017 B | Germeval 2024 Subtask 1 as majority vote |
  |:----:|:-----------:|:----:|:----:|
  | BERTchen-v0.1-C4 | 96.4/93.6 | 0.96 | 0.887 |
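
The GermanQuAD numbers are SQuAD-style F1 and exact-match scores. Below is a minimal sketch of how such scores can be computed with the `evaluate` library; it only illustrates the metric and input format and is not necessarily the evaluation pipeline used in the paper (loading the dataset this way assumes a reasonably recent `datasets` version):

```python
from datasets import load_dataset
import evaluate

# GermanQuAD follows the SQuAD format, so the standard SQuAD metric applies.
germanquad = load_dataset("deepset/germanquad", split="test")
squad_metric = evaluate.load("squad")

# Predictions would normally come from the finetuned QA model; a gold answer is
# used here as a dummy prediction to show the expected input format.
example = germanquad[0]
predictions = [{"id": str(example["id"]), "prediction_text": example["answers"]["text"][0]}]
references = [{"id": str(example["id"]), "answers": example["answers"]}]
print(squad_metric.compute(predictions=predictions, references=references))
# -> {'exact_match': 100.0, 'f1': 100.0}
```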

+ ## Efficiency
+ With MosaicBERT and FlashAttention 2, we increase the throughput from BERT's 190,000 tokens per second to about 250,000 tokens per second and achieve an MFU of 65.87% (see the paper for more details and the calculation).
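
As a back-of-the-envelope check (not the paper's exact derivation), the MFU can be estimated from the throughput, the common 6 × parameter-count FLOPs-per-token approximation for training, and the A100's peak BF16 throughput; the parameter count below is an assumption in the range of MosaicBERT-Base:

```python
# Back-of-the-envelope MFU estimate; see the paper for the exact calculation.
# Assumed values (not taken from this model card): ~137M parameters and the
# A100's 312 TFLOPS peak BF16 throughput.
tokens_per_second = 250_000        # measured pretraining throughput
num_params = 137e6                 # assumed, roughly MosaicBERT-Base sized
flops_per_token = 6 * num_params   # common forward + backward approximation
peak_flops = 312e12                # A100 BF16 peak

mfu = tokens_per_second * flops_per_token / peak_flops
print(f"MFU ≈ {mfu:.2%}")          # prints "MFU ≈ 65.87%"
```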
+
  ## Model variations
  For the creation of BERTchen we tested different datasets and training setups. Two notable variants are:
+ - [`BERTchen-v0.1`](https://huggingface.co/frederic-sadrieh/BERTchen-v0.1): same pretraining setup and hyperparameters, but trained on the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset.
+ - [`hybrid_BERTchen-v0.1`](https://huggingface.co/frederic-sadrieh/hybrid_BERTchen-v0.1): pretrained on [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) with our own hybrid sequence-length switching approach (see its model card or the paper for more information).