---
{}
---

# GeBERTa

GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM. The models range in size from 122M to 750M parameters.

## Model details

The models follow the DeBERTa-v2 architecture and use SentencePiece tokenizers. The base and large models use a 50k-token vocabulary, while the xlarge model uses a 128k-token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps and have a maximum sequence length of 512 tokens.

## Dataset

The pre-training dataset consists of documents from different domains:

| Domain | Dataset | Data Size | #Docs | #Tokens |
| -------- | ----------- | --------- | ------ | ------- |
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
| Medical | Medicine Dissertations | 1.4GB | 14,496 | 295M |
| Medical | Pubmed abstracts (translated) | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III (translated) | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS (translated) | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books (translated) | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |
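
The domain mix is easier to judge as shares of the total token count. The following is a minimal sketch that sums the rounded token figures from the table per domain. The names `CORPUS`, `tokens_by_domain`, and `domain_shares` are illustrative assumptions and not part of any GeBERTa tooling.

```python
from collections import defaultdict

# (domain, dataset, tokens) copied from the table above. Token counts are the
# rounded figures reported there (B = billions, M = millions).
CORPUS = [
    ("Formal", "Wikipedia", 1.9e9),
    ("Formal", "News", 6.1e9),
    ("Formal", "GC4", 19.4e9),
    ("Informal", "Reddit 2019-2023 (GER)", 1.3e9),
    ("Informal", "Holiday Reviews", 428e6),
    ("Legal", "OpenLegalData: German cases and laws", 1e9),
    ("Medical", "Smaller public datasets", 50e6),
    ("Medical", "CC medical texts", 682e6),
    ("Medical", "Medicine Dissertations", 295e6),
    ("Medical", "Pubmed abstracts (translated)", 1.7e9),
    ("Medical", "MIMIC III (translated)", 695e6),
    ("Medical", "PMC-Patients-ReCDS (translated)", 414e6),
    ("Literature", "German Fiction", 243e6),
    ("Literature", "English books (translated)", 1.6e9),
]


def tokens_by_domain(rows):
    """Sum the token counts of all datasets that share a domain."""
    totals = defaultdict(float)
    for domain, _dataset, tokens in rows:
        totals[domain] += tokens
    return dict(totals)


def domain_shares(rows):
    """Return each domain's fraction of the total token count."""
    totals = tokens_by_domain(rows)
    grand_total = sum(totals.values())
    return {domain: count / grand_total for domain, count in totals.items()}


if __name__ == "__main__":
    # Print domains from largest to smallest share.
    ranked = sorted(domain_shares(CORPUS).items(), key=lambda kv: -kv[1])
    for domain, share in ranked:
        print(f"{domain:<10} {share:6.1%}")
```

With the rounded figures from the table, the formal domain accounts for roughly three quarters of all tokens and the medical domain for roughly a tenth.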

## Benchmark

In a comprehensive benchmark, we evaluated existing German models alongside our own. The benchmark covers several task types, including question answering, classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets. Where a dataset provided training, development, and test sets, we used them as given. Where it did not, we randomly split the data into 80% for training, 10% for validation, and 10% for testing.
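
A random 80/10/10 split of this kind can be sketched as follows. This is an illustration only: the function name `split_80_10_10`, the default seed, and the choice to give any rounding remainder to the test set are assumptions, not the procedure used for the reported results.

```python
import random


def split_80_10_10(examples, seed=42):
    """Shuffle a copy of `examples` and split it into train/validation/test.

    The seed is an assumption made here for reproducibility.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # receives any rounding remainder
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_80_10_10(range(1000))
    print(len(train), len(val), len(test))
```

Fixing the seed makes the split repeatable, which matters when several models are compared on the same held-out data.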

The following table presents the F1 scores, with the best result in each column in bold:

| Model | [GE14](https://huggingface.co/datasets/germeval_14) | [GQuAD](https://huggingface.co/datasets/deepset/germanquad) | [GE18](https://huggingface.co/datasets/philschmid/germeval18) | TS | [GGP](https://github.com/JULIELab/GGPOnc) | GRAS<sup>1</sup> | [JS](https://github.com/JULIELab/jsyncc) | [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) | Avg |
|:---------------------:|:--------:|:----------:|:--------:|:--------:|:-------:|:------:|:--------:|:------:|:------:|
| [GBERT](https://huggingface.co/deepset/gbert-large)<sub>large</sub> | 88.48±0.23 | 81.51±0.84 | 54.37±1.65 | 73.60±0.61 | **79.17**±0.14 | 69.28±0.80 | 76.32±4.42 | 90.29±0.15 | 76.63±0.63 |
| [GELECTRA](https://huggingface.co/deepset/gelectra-large)<sub>large</sub> | 88.39±0.13 | 80.51±0.41 | 55.41±1.54 | 73.84±0.86 | 79.09±0.09 | **70.16**±0.92 | 73.73±2.35 | 89.83±0.27 | 76.37±0.69 |
| GeBERTa<sub>large</sub> | 88.84±0.18 | 82.52±0.59 | 53.76±1.86 | 75.32±0.53 | 78.35±0.08 | 70.02±1.34 | 82.16±2.36 | 90.39±0.24 | 77.67±0.69 |
| [GeBERTa](https://huggingface.co/ikim-uk-essen/geberta-xlarge)<sub>xlarge</sub> | **89.04**±0.26 | **85.05**±0.63 | **55.80**±1.42 | **76.25**±0.704 | 76.71±0.08 | 67.92±1.00 | **82.42**±4.70 | **90.63**±0.21 | **77.98**±0.62 |
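
The Avg column appears to be the unweighted mean of the eight task scores. The table does not state this explicitly, so treat it as an assumption. The sketch below recomputes that mean from the per-task values in the table and compares it with the reported figure; the dictionary keys and the helper name `macro_average` are illustrative.

```python
from statistics import mean

# Per-task F1 scores copied from the table above, in column order:
# GE14, GQuAD, GE18, TS, GGP, GRAS, JS, DROC
SCORES = {
    "GBERT-large": [88.48, 81.51, 54.37, 73.60, 79.17, 69.28, 76.32, 90.29],
    "GELECTRA-large": [88.39, 80.51, 55.41, 73.84, 79.09, 70.16, 73.73, 89.83],
    "GeBERTa-large": [88.84, 82.52, 53.76, 75.32, 78.35, 70.02, 82.16, 90.39],
    "GeBERTa-xlarge": [89.04, 85.05, 55.80, 76.25, 76.71, 67.92, 82.42, 90.63],
}

# Values from the Avg column of the same table.
REPORTED_AVG = {
    "GBERT-large": 76.63,
    "GELECTRA-large": 76.37,
    "GeBERTa-large": 77.67,
    "GeBERTa-xlarge": 77.98,
}


def macro_average(task_scores):
    """Unweighted mean over tasks (assumed definition of the Avg column)."""
    return mean(task_scores)


if __name__ == "__main__":
    for model, task_scores in SCORES.items():
        recomputed = macro_average(task_scores)
        print(f"{model:<15} recomputed={recomputed:.2f} "
              f"reported={REPORTED_AVG[model]:.2f}")
```

Under this assumption the recomputed means agree with the reported averages to two decimal places.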

## Publication

```bibtex
@inproceedings{dada2023impact,
  title={On the Impact of Cross-Domain Data on German Language Models},
  author={Dada, Amin and Chen, Aokun and Peng, Cheng and Smith, Kaleb E and Idrissi-Yaghir, Ahmad and Seibold, Constantin Marc and Li, Jianning and Heiliger, Lars and Friedrich, Christoph M and Truhn, Daniel and others},
  booktitle={The 2023 Conference on Empirical Methods in Natural Language Processing},
  year={2023}
}
```

Paper on arXiv: https://arxiv.org/abs/2310.07321

## Contact

<amin.dada@uk-essen.de>