
GeBERTa

GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM. The models range in size from 122M to 750M parameters.

Model details

The models follow the DeBERTa-v2 architecture and use SentencePiece tokenizers. The base and large models use a 50k token vocabulary, while the xlarge model uses a 128k token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps and have a maximum sequence length of 512 tokens.
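
As a minimal sketch of how a checkpoint with these properties might be loaded through the Hugging Face `transformers` auto classes: the helper name `embed_sentence` is hypothetical, and the repository ID is deliberately left as a parameter because this README does not spell it out. Only the 512-token limit is taken from the description above.

```python
MAX_SEQ_LEN = 512  # maximum sequence length stated in the model details


def embed_sentence(text: str, model_id: str):
    """Return the last hidden states for `text` (hypothetical helper).

    `model_id` is the Hub repository ID or a local checkpoint path and
    must be supplied by the caller.
    """
    # Imported lazily so the module can be imported without torch/transformers.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # Truncate to the maximum sequence length the models were trained with.
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=MAX_SEQ_LEN,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # Shape: (1, number_of_tokens, hidden_size)
    return outputs.last_hidden_state
```

For downstream tasks such as those in the benchmark, a task-specific head (for example `AutoModelForSequenceClassification` or `AutoModelForTokenClassification`) would typically replace `AutoModel`.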

Dataset

The pre-training dataset consists of documents from different domains:

| Domain | Dataset | Data Size | #Docs | #Tokens |
|---|---|---|---|---|
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
| Medical | Medicine Dissertations | 1.4GB | 14,496 | 295M |
| Medical | Pubmed abstracts (translated) | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III (translated) | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS (translated) | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books (translated) | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |
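
To make the domain mix easier to read, the token counts above can be aggregated per domain. The following is an illustrative sketch only: the figures are copied from the table (in billions of tokens), while the names `CORPUS`, `tokens_by_domain`, and `domain_shares` are hypothetical helpers, not part of any released code.

```python
from collections import defaultdict

# (domain, dataset, tokens in billions), copied from the dataset table
CORPUS = [
    ("Formal", "Wikipedia", 1.9),
    ("Formal", "News", 6.1),
    ("Formal", "GC4", 19.4),
    ("Informal", "Reddit 2019-2023 (GER)", 1.3),
    ("Informal", "Holiday Reviews", 0.428),
    ("Legal", "OpenLegalData: German cases and laws", 1.0),
    ("Medical", "Smaller public datasets", 0.050),
    ("Medical", "CC medical texts", 0.682),
    ("Medical", "Medicine Dissertations", 0.295),
    ("Medical", "Pubmed abstracts (translated)", 1.7),
    ("Medical", "MIMIC III (translated)", 0.695),
    ("Medical", "PMC-Patients-ReCDS (translated)", 0.414),
    ("Literature", "German Fiction", 0.243),
    ("Literature", "English books (translated)", 1.6),
]


def tokens_by_domain(rows):
    """Sum the token counts (in billions) for each domain."""
    totals = defaultdict(float)
    for domain, _dataset, tokens in rows:
        totals[domain] += tokens
    return dict(totals)


def domain_shares(rows):
    """Return each domain's fraction of the total token count."""
    totals = tokens_by_domain(rows)
    grand_total = sum(totals.values())
    return {domain: tokens / grand_total for domain, tokens in totals.items()}


if __name__ == "__main__":
    shares = domain_shares(CORPUS)
    for domain, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{domain:<10} {share:6.1%}")
```

Summing the rows this way reproduces the 35.8B total up to rounding, with the formal domain contributing the largest share of tokens.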

Benchmark

In a comprehensive benchmark, we evaluated existing German models and our own. The benchmark included a variety of task types, such as question answering, classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets. When the datasets provided training, development, and test sets, we used them accordingly.

Where such splits were not available, we randomly split the data into 80% for training, 10% for validation, and 10% for testing. The following table presents the F1 scores:

| Model | GE14 | GQuAD | GE18 | TS | GGP | GRAS1 | JS | DROC | Avg |
|---|---|---|---|---|---|---|---|---|---|
| GBERT-large | 88.48±0.23 | 81.51±0.84 | 54.37±1.65 | 73.60±0.61 | 79.17±0.14 | 69.28±0.80 | 76.32±4.42 | 90.29±0.15 | 76.63±0.63 |
| GELECTRA-large | 88.39±0.13 | 80.51±0.41 | 55.41±1.54 | 73.84±0.86 | 79.09±0.09 | 70.16±0.92 | 73.73±2.35 | 89.83±0.27 | 76.37±0.69 |
| GeBERTa-large | 88.84±0.18 | 82.52±0.59 | 53.76±1.86 | 75.32±0.53 | 78.35±0.08 | 70.02±1.34 | 82.16±2.36 | 90.39±0.24 | 77.67±0.69 |
| GeBERTa-xlarge | 89.04±0.26 | 85.05±0.63 | 55.80±1.42 | 76.25±0.704 | 76.71±0.08 | 67.92±1.00 | 82.42±4.70 | 90.63±0.21 | 77.98±0.62 |
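
For datasets without predefined splits, the random 80/10/10 partition described above could be sketched as follows. This is an assumption-laden illustration rather than the authors' actual procedure: the function name `split_80_10_10`, the fixed seed, and the use of Python's `random` module are all choices made here for reproducibility of the example.

```python
import random


def split_80_10_10(examples, seed=42):
    """Shuffle `examples` and split them into train/validation/test lists.

    The seed value is an arbitrary choice for this sketch; the README does
    not state which seed or tooling was used.
    """
    items = list(examples)
    rng = random.Random(seed)  # local RNG, leaves global random state untouched
    rng.shuffle(items)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_80_10_10(range(1000))
    print(len(train), len(val), len(test))
```

Using a dedicated `random.Random` instance keeps the split deterministic for a given seed, which matters when several models are compared on the same partition.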

Publication

@inproceedings{dada2023impact,
  title={On the Impact of Cross-Domain Data on German Language Models},
  author={Dada, Amin and Chen, Aokun and Peng, Cheng and Smith, Kaleb E and Idrissi-Yaghir, Ahmad and Seibold, Constantin Marc and Li, Jianning and Heiliger, Lars and Friedrich, Christoph M and Truhn, Daniel and others},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023}
}

The paper is available on arXiv: https://arxiv.org/abs/2310.07321

Contact

amin.dada@uk-essen.de