---
license: mit
datasets:
- allenai/c4
language:
- de
library_name: transformers
pipeline_tag: fill-mask
---
# BERTchen-v0.1

An efficiently pretrained MosaicBERT model on German C4. Paper and code to follow soon.
## Model description
BERTchen follows the architecture of a MosaicBERT model (introduced by Portes et al., 2023) and utilizes FlashAttention 2. It is pretrained for 4 hours on a single A100 40GB GPU.

Only the masked language modeling objective is used, which makes the [CLS] token redundant; it is therefore excluded from the tokenizer. As pretraining data, a random subset of the German portion of the C4 dataset (introduced by Raffel et al., 2020) is used.

The tokenizer is taken from other work on efficient German pretraining: paper and code
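For a quick check of the pretrained model, it can be loaded through the `transformers` fill-mask pipeline. The snippet below is a minimal sketch: the repository id is a placeholder, and `trust_remote_code=True` is assumed to be needed because MosaicBERT variants typically ship custom modeling code.

```python
from transformers import pipeline

# Placeholder repository id; replace with the actual BERTchen-v0.1 path on the Hugging Face Hub.
# trust_remote_code is assumed to be required for the custom MosaicBERT modeling code.
unmasker = pipeline(
    "fill-mask",
    model="<namespace>/BERTchen-v0.1",
    trust_remote_code=True,
)

# Query the tokenizer for its mask token instead of hard-coding "[MASK]".
mask = unmasker.tokenizer.mask_token
for prediction in unmasker(f"Die Hauptstadt von Deutschland ist {mask}."):
    print(prediction["token_str"], prediction["score"])
```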
## Training procedure
BERTchen was pretrained using the MosaicBERT hyperparameters (which can be found in the paper and here). We changed the training goal to 2500 steps to better reflect the number of steps the model can reach in the constrained time. In addition, we used a batch size of 1024 with a sequence length of 512, as we found this to work better. After 4 hours, training is cut off and the checkpoint is saved.
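As a quick reference, the settings described above are summarized in the sketch below as a Python dict. The key names are illustrative assumptions and do not correspond exactly to the fields of the MosaicBERT/Composer configuration.

```python
# Illustrative summary of the pretraining setup described above.
# Key names are assumptions, not the exact MosaicBERT/Composer config fields.
pretraining_config = {
    "objective": "masked-language-modeling",  # MLM only; [CLS] dropped from the tokenizer
    "global_train_batch_size": 1024,
    "max_seq_len": 512,
    "training_goal_steps": 2500,              # adjusted from the MosaicBERT default
    "wall_clock_budget": "4h",                # training is cut after 4 hours on one A100 40GB
    "dataset": "allenai/c4 (German, random subset)",
}
```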
## Evaluation results
| Task | GermanQuAD (F1/EM) | GermEval 2017 B | GermEval 2024 Subtask 1 (as majority vote) |
|---|---|---|---|
| Score | 96.4 / 93.6 | 0.96 | 0.887 |
## Model variations
For the creation of BERTchen, we tested different datasets and training setups. Two notable variants are:
- **BERTchen-v0.1**: the same pretraining setup, but on the CulturaX dataset.
- **hybrid_BERTchen-v0.1**: pretrained on CulturaX with our own hybrid sequence-length changing approach (for more information, see its model card or the paper).