---
license: apache-2.0
language:
- de
tags:
- historical
- german
- teams
datasets:
- biglam/europeana_newspapers
- storytracer/German-PD-Newspapers
---
# Zeitungs-LM

The Zeitungs-LM is a language model pretrained on historical German newspapers. Technically, the model is an ELECTRA model that was pretrained with the TEAMS (Training ELECTRA Augmented with Multi-word Selection) approach.
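As a quick-start sketch, the model can be loaded with 🤗 Transformers to extract contextual embeddings. Note that the model id below is a placeholder assumption; use the actual Hub id of this repository:

```python
from transformers import AutoTokenizer, AutoModel

# NOTE: "stefan-it/zeitungs-lm-v1" is a placeholder id — replace it with the
# actual Hub id of the Zeitungs-LM checkpoint.
model_name = "stefan-it/zeitungs-lm-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a historical German sentence and extract contextual embeddings.
sentence = "Die Zeitung berichtete gestern über das Ereignis."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```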
## Datasets
Version 1 of the Zeitungs-LM was pretrained on the following publicly available datasets:

- [biglam/europeana_newspapers](https://huggingface.co/datasets/biglam/europeana_newspapers)
- [storytracer/German-PD-Newspapers](https://huggingface.co/datasets/storytracer/German-PD-Newspapers)

In total, the pretraining corpus has a size of 133GB.
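Both corpora are hosted on the Hugging Face Hub. Here is a minimal sketch for loading them with the 🤗 Datasets library; split names and default configurations are assumptions, so please check the respective dataset cards:

```python
from datasets import load_dataset

# Stream the corpora from the Hub — streaming avoids downloading the full
# ~133GB pretraining corpus at once. Split names are assumptions.
europeana = load_dataset("biglam/europeana_newspapers", split="train", streaming=True)
pd_newspapers = load_dataset("storytracer/German-PD-Newspapers", split="train", streaming=True)

# Peek at one example from each corpus.
print(next(iter(europeana)))
print(next(iter(pd_newspapers)))
```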
## Benchmarks (Named Entity Recognition)
We compare our Zeitungs-LM directly with the German Europeana BERT model (Zeitungs-LM is intended as its successor) on various downstream tasks from the hmBench repository, which focuses on Named Entity Recognition. Additionally, we use two datasets (ONB and LFT) from the "A Named Entity Recognition Shootout for German" paper.

We report the micro F1-score averaged over 5 runs with different seeds and use the best hyper-parameter configuration on the development set of each dataset to report the final test score.
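To illustrate what one such fine-tuning run looks like, here is a minimal sketch using the Flair library. This is not the exact hmBench setup: the data folder, column format, model id, and hyper-parameters are placeholder assumptions.

```python
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Hypothetical data folder in CoNLL-style column format; replace with an
# actual NER dataset from the benchmark.
corpus: Corpus = ColumnCorpus("data/hisgermaner", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

# Use Zeitungs-LM as fine-tunable transformer embeddings (placeholder id).
embeddings = TransformerWordEmbeddings("stefan-it/zeitungs-lm-v1", fine_tune=True)

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
)

# In the benchmark, this run is repeated with 5 different seeds and the
# micro F1-scores are averaged.
trainer = ModelTrainer(tagger, corpus)
trainer.fine_tune(
    "resources/taggers/ner",
    learning_rate=5e-5,
    mini_batch_size=16,
    max_epochs=10,
)
```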
### Development Set
The results on the development set can be seen in the following table:
| Model \ Dataset | LFT   | ONB   | HisGermaNER | HIPE-2020 | NewsEye | AjMC  | Avg.  |
|-----------------|-------|-------|-------------|-----------|---------|-------|-------|
| Europeana BERT  | 79.22 | 88.20 | 81.41       | 80.92     | 41.65   | 87.91 | 76.55 |
| Zeitungs-LM v1  | 79.39 | 88.53 | 83.10       | 81.55     | 44.53   | 89.71 | 77.80 |
On average, our Zeitungs-LM yields a performance boost of 1.25 percentage points over the German Europeana BERT model on the development sets.
### Test Set
The final results on the test set can be seen here:
| Model \ Dataset | LFT   | ONB   | HisGermaNER | HIPE-2020 | NewsEye | AjMC  | Avg.  |
|-----------------|-------|-------|-------------|-----------|---------|-------|-------|
| Europeana BERT  | 80.43 | 84.39 | 83.21       | 77.49     | 42.96   | 90.52 | 76.50 |
| Zeitungs-LM v1  | 80.35 | 87.28 | 84.92       | 79.91     | 47.16   | 92.76 | 78.73 |
On the test sets, our Zeitungs-LM outperforms the German Europeana BERT model by a clear margin of 2.23 percentage points on average.
## Changelog
- 02.10.2024: Initial version of the model. More details are coming very soon!
## Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️
Made from Bavarian Oberland with ❤️ and 🥨.