stefan-it committed on
Commit e457b87
1 Parent(s): 5b0085a

readme: add initial version

Files changed (1): README.md (+170 -1)

widget:
  - text: "Det vore [MASK] häller nödvändigt att be"
  - text: "Comme, à cette époque [MASK] était celle de la"
  - text: "In [MASK] an atmosphärischen Nahrungsmitteln"
---

# hmBERT: Historical Multilingual Language Models for Named Entity Recognition

More information about our hmBERT model can be found in our new paper:
["hmBERT: Historical Multilingual Language Models for Named Entity Recognition"](https://arxiv.org/abs/2205.15575).

## Languages

Our Historic Language Models Zoo supports the following languages, including their training data sources:

| Language | Training data | Size
| -------- | ------------- | ----
| German   | [Europeana](http://www.europeana-newspapers.eu/) | 13-28GB (filtered)
| French   | [Europeana](http://www.europeana-newspapers.eu/) | 11-31GB (filtered)
| English  | [British Library](https://data.bl.uk/digbks/db14.html) | 24GB (year filtered)
| Finnish  | [Europeana](http://www.europeana-newspapers.eu/) | 1.2GB
| Swedish  | [Europeana](http://www.europeana-newspapers.eu/) | 1.1GB

## Models

At the moment, the following models are available on the model hub:

| Model identifier | Model Hub link
| --------------------------------------------- | --------------------------------------------------------------------------
| `dbmdz/bert-base-historic-multilingual-cased` | [here](https://huggingface.co/dbmdz/bert-base-historic-multilingual-cased)
| `dbmdz/bert-base-historic-english-cased`      | [here](https://huggingface.co/dbmdz/bert-base-historic-english-cased)
| `dbmdz/bert-base-finnish-europeana-cased`     | [here](https://huggingface.co/dbmdz/bert-base-finnish-europeana-cased)
| `dbmdz/bert-base-swedish-europeana-cased`     | [here](https://huggingface.co/dbmdz/bert-base-swedish-europeana-cased)

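As a quick usage example, the multilingual checkpoint from the table above can be loaded with the Transformers library and queried through the fill-mask pipeline. The snippet below is only a minimal sketch (it is not part of the original training or evaluation code) and reuses one of the widget examples from the front matter of this model card.

```python
from transformers import pipeline

# Load the multilingual hmBERT checkpoint listed above.
fill_mask = pipeline(
    "fill-mask",
    model="dbmdz/bert-base-historic-multilingual-cased",
)

# One of the widget examples from the model card front matter.
for prediction in fill_mask("Comme, à cette époque [MASK] était celle de la"):
    print(prediction["token_str"], round(prediction["score"], 4))
```
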
# Corpora Stats

## German Europeana Corpus

We provide some statistics using different thresholds of OCR confidence, in order to shrink down the corpus size
and use less noisy data:

| OCR confidence | Size
| -------------- | ----
| **0.60**       | 28GB
| 0.65           | 18GB
| 0.70           | 13GB

For the final corpus we use an OCR confidence of 0.6 (28GB). The following plot shows the tokens per year distribution:

![German Europeana Corpus Stats](stats/figures/german_europeana_corpus_stats.png)

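The exact export format of the Europeana dumps is not described in this README, so the following is only a hypothetical sketch of what the confidence-based filtering could look like, assuming one JSON record per line with `ocr_confidence` and `text` fields (both field names are assumptions).

```python
import json

def filter_by_ocr_confidence(in_path: str, out_path: str, threshold: float = 0.60) -> None:
    """Keep only documents whose OCR confidence reaches the given threshold."""
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)  # assumed layout: {"ocr_confidence": float, "text": str}
            if record["ocr_confidence"] >= threshold:
                dst.write(record["text"] + "\n")

# e.g. threshold 0.60 for the final German corpus; the French section below uses 0.70.
```
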
## French Europeana Corpus

As for German, we use different OCR confidence thresholds:

| OCR confidence | Size
| -------------- | ----
| 0.60           | 31GB
| 0.65           | 27GB
| **0.70**       | 27GB
| 0.75           | 23GB
| 0.80           | 11GB

For the final corpus we use an OCR confidence of 0.7 (27GB). The following plot shows the tokens per year distribution:

![French Europeana Corpus Stats](stats/figures/french_europeana_corpus_stats.png)

## British Library Corpus

Metadata is taken from [here](https://data.bl.uk/digbks/DB21.html). Stats including year filtering:

| Years             | Size
| ----------------- | ----
| ALL               | 24GB
| >= 1800 && < 1900 | 24GB

We use the year-filtered variant. The following plot shows the tokens per year distribution:

![British Library Corpus Stats](stats/figures/bl_corpus_stats.png)

## Finnish Europeana Corpus

| OCR confidence | Size
| -------------- | ----
| 0.60           | 1.2GB

The following plot shows the tokens per year distribution:

![Finnish Europeana Corpus Stats](stats/figures/finnish_europeana_corpus_stats.png)

## Swedish Europeana Corpus

| OCR confidence | Size
| -------------- | ----
| 0.60           | 1.1GB

The following plot shows the tokens per year distribution:

![Swedish Europeana Corpus Stats](stats/figures/swedish_europeana_corpus_stats.png)

## All Corpora

The following plot shows the tokens per year distribution of the complete training corpus:

![All Corpora Stats](stats/figures/all_corpus_stats.png)

# Multilingual Vocab generation

For the first attempt, we use the first 10GB of each pretraining corpus. We upsample both Finnish and Swedish to ~10GB.
The following table shows the exact sizes used for generating the 32k and 64k subword vocabs:

| Language | Size
| -------- | ----
| German   | 10GB
| French   | 10GB
| English  | 10GB
| Finnish  | 9.5GB
| Swedish  | 9.7GB

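The exact tooling used to build these vocabs is not spelled out here, so the snippet below is only a sketch, assuming the Hugging Face `tokenizers` library and placeholder file names for the ~10GB per-language samples; it trains a cased WordPiece vocab of the stated size.

```python
import os

from tokenizers import BertWordPieceTokenizer

# Placeholder file names for the ~10GB per-language samples described above.
files = [
    "german_10GB.txt",
    "french_10GB.txt",
    "english_10GB.txt",
    "finnish_upsampled_10GB.txt",
    "swedish_upsampled_10GB.txt",
]

# Cased setup: no lowercasing, keep accents.
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)
tokenizer.train(files=files, vocab_size=32_000)  # use 64_000 for the larger vocab

os.makedirs("vocab-32k", exist_ok=True)
tokenizer.save_model("vocab-32k")
```
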
We then calculate the subword fertility rate and portion of `[UNK]`s over the following NER corpora:

| Language | NER corpora
| -------- | ------------------
| German   | CLEF-HIPE, NewsEye
| French   | CLEF-HIPE, NewsEye
| English  | CLEF-HIPE
| Finnish  | NewsEye
| Swedish  | NewsEye

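How exactly these two measures were computed is not shown here; the following is a hedged sketch of one common definition (subwords per whitespace-separated token, and the share of `[UNK]` subwords), using the 32k vocab from the previous step and a placeholder path for a NER corpus file.

```python
from transformers import BertTokenizer

# Candidate vocab from the previous step (path is a placeholder).
tokenizer = BertTokenizer("vocab-32k/vocab.txt", do_lower_case=False)

def fertility_and_unk_portion(path: str) -> tuple[float, float]:
    n_words = n_subwords = n_unk = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                pieces = tokenizer.tokenize(word)
                n_words += 1
                n_subwords += len(pieces)
                n_unk += pieces.count(tokenizer.unk_token)
    # Fertility: average number of subwords per word; UNK portion: share of [UNK] subwords.
    return n_subwords / n_words, n_unk / n_subwords

print(fertility_and_unk_portion("newseye_swedish_train.txt"))  # placeholder corpus file
```
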
Breakdown of subword fertility rate and unknown portion per language for the 32k vocab:

| Language | Subword fertility | Unknown portion
| -------- | ----------------- | ---------------
| German   | 1.43              | 0.0004
| French   | 1.25              | 0.0001
| English  | 1.25              | 0.0
| Finnish  | 1.69              | 0.0007
| Swedish  | 1.43              | 0.0

Breakdown of subword fertility rate and unknown portion per language for the 64k vocab:

| Language | Subword fertility | Unknown portion
| -------- | ----------------- | ---------------
| German   | 1.31              | 0.0004
| French   | 1.16              | 0.0001
| English  | 1.17              | 0.0
| Finnish  | 1.54              | 0.0007
| Swedish  | 1.32              | 0.0

# Final pretraining corpora

We upsample Swedish and Finnish to ~27GB. The final stats for all pretraining corpora can be seen here:

| Language | Size
| -------- | ----
| German   | 28GB
| French   | 27GB
| English  | 24GB
| Finnish  | 27GB
| Swedish  | 27GB

Total size is 130GB.

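The upsampling procedure itself is not described in this README; a naive way to bring the smaller corpora up to roughly the target size is to repeat them, as in the hypothetical sketch below (a real setup may also shuffle or interleave the repeated data).

```python
import shutil

def upsample_to_size(in_path: str, out_path: str, target_bytes: int) -> None:
    """Repeat a corpus file until the output reaches roughly the target size."""
    with open(out_path, "wb") as dst:
        while dst.tell() < target_bytes:
            with open(in_path, "rb") as src:
                shutil.copyfileobj(src, dst)

# e.g. bring Finnish and Swedish up to ~27GB (paths are placeholders):
upsample_to_size("finnish_europeana.txt", "finnish_upsampled_27GB.txt", 27 * 1024**3)
```
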
# Pretraining

Details about the pretraining are coming soon.

# Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as
TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️

Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download both cased and uncased models from their S3 storage 🤗