metadata
title: README
emoji: 🏃
colorFrom: indigo
colorTo: blue
sdk: static
pinned: false
hmByT5 - Language Models
Historical Multilingual and Monolingual ByT5 Models. The following languages are currently covered:
- English (British Library Corpus - Books)
- German (Europeana Newspaper)
- French (Europeana Newspaper)
- Finnish (Europeana Newspaper)
- Swedish (Europeana Newspaper)
- Dutch (Delpher Corpus)
- Norwegian (NCC)
More details can be found in our GitHub repository.
Leaderboard
We evaluate our pretrained language models on various datasets from HIPE-2020, HIPE-2022 and Europeana. The following table gives an overview of the datasets used.
Language | Dataset | Additional Dataset |
---|---|---|
English | AjMC | - |
German | AjMC | - |
French | AjMC | ICDAR-Europeana |
Finnish | NewsEye | - |
Swedish | NewsEye | - |
Dutch | ICDAR-Europeana | - |
Current best models:
Model | English AjMC | German AjMC | French AjMC | Finnish NewsEye | Swedish NewsEye | Dutch ICDAR | French ICDAR | Avg. |
---|---|---|---|---|---|---|---|---|
hmbyt5/byt5-small-english | 85.65 ± 1.21 | 87.27 ± 0.50 | 84.44 ± 0.79 | | | | | |
hmbyt5-preliminary/byt5-small-english-german | 85.74 ± 0.72 | 87.45 ± 0.67 | 84.23 ± 0.65 | | | | | |
hmbyt5-preliminary/byt5-small-english-german-french | 85.61 ± 0.96 | 87.24 ± 0.76 | 84.39 ± 0.68 | | | | | |
hmbyt5-preliminary/byt5-small-english-german-french-finnish | 85.30 ± 1.14 | 87.37 ± 0.53 | 84.12 ± 0.42 | | | | | |
hmbyt5-preliminary/byt5-small-english-german-french-finnish-swedish | 85.40 ± 0.78 | 87.12 ± 0.19 | 84.41 ± 0.34 | | | | | |
hmbyt5-preliminary/byt5-small-english-german-french-finnish-swedish-dutch | 85.51 ± 0.68 | 87.58 ± 0.39 | 84.39 ± 0.83 | 55.46 ± 1.99 | 73.38 ± 2.45 | 84.80 ± 0.44 | 75.97 ± 0.55 | |
hmbyt5-preliminary/byt5-small-multilingual-4g | 83.49 ± 0.96 | 87.65 ± 0.63 | 84.16 ± 0.90 | | | | | |
hmbyt5-preliminary/byt5-small-multilingual-4g-2e | 83.86 ± 0.61 | 87.54 ± 0.19 | 84.29 ± 0.41 | | | | | |
hmbyt5-preliminary/byt5-small-multilingual-4g-3e | 83.49 ± 0.99 | 87.38 ± 0.53 | 84.30 ± 0.51 | | | | | |
hmbyt5-preliminary/byt5-small-historic-multilingual-flax | 83.28 ± 1.67 | 86.98 ± 0.71 | 83.49 ± 1.06 | 76.96 ± 1.58 | 78.80 ± 1.89 | 86.47 ± 0.79 | 77.43 ± 0.51 | |
hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax | 84.91 ± 0.86 | 88.02 ± 0.35 | 84.78 ± 0.75 | 77.77 ± 1.83 | 79.94 ± 0.60 | 86.85 ± 0.91 | 77.45 ± 0.54 | |
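
The Avg. column in the table above is left empty. The following is a minimal sketch of how it could be filled in, assuming (this is not stated in this README) that the average is the unweighted mean of the per-dataset scores in a row. The helper names `parse_score` and `row_average` are hypothetical and only serve this illustration.

```python
import re
from statistics import mean


def parse_score(cell):
    """Return the mean part of a 'mean ± std' cell, or None if the cell is empty."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*±", cell)
    return float(match.group(1)) if match else None


def row_average(row):
    """Unweighted mean over all filled score cells of one table row.

    Assumption: the Avg. column is a plain mean over datasets; the first
    cell (the model name) is skipped, and empty cells are ignored.
    """
    cells = row.split("|")
    scores = [s for s in (parse_score(c) for c in cells[1:]) if s is not None]
    return mean(scores) if scores else None


# Row copied from the table above (last model).
row = (
    "hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax | "
    "84.91 ± 0.86 | 88.02 ± 0.35 | 84.78 ± 0.75 | 77.77 ± 1.83 | "
    "79.94 ± 0.60 | 86.85 ± 0.91 | 77.45 ± 0.54 |"
)
print(f"{row_average(row):.2f}")
```

Rows with only three filled cells would be averaged over those three datasets alone, so such averages are not directly comparable to rows that cover all seven datasets.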
More recent results on more datasets can be found in the hmLeaderboard.
Acknowledgements
We thank Luisa März, Katharina Schmid and Erion Çano for their fruitful discussions about Historical Language Models.
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️