Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/severinsimmler/literary-german-bert/README.md
README.md
ADDED
@@ -0,0 +1,51 @@
---
language: de
thumbnail: kfold.png
---

# German BERT for literary texts

This German BERT is based on `bert-base-german-dbmdz-cased` and has been adapted to the domain of literary texts by fine-tuning it on the language modeling task with the [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1). Afterwards, the model was fine-tuned for named entity recognition on the [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) corpus, so you can use it to recognize protagonists in German novels.
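
As a usage sketch, the model can presumably be loaded through the `transformers` token-classification pipeline. The Hub id `severinsimmler/literary-german-bert` is inferred from the original path of this model card and is an assumption, as are the helper names `load_tagger` and `person_tokens` and the expectation that the pipeline reports the labels as `B-PER`, `I-PER` and `O`.

```python
from typing import Dict, List

# Assumption: the Hub id mirrors the original path of this model card.
MODEL_ID = "severinsimmler/literary-german-bert"


def load_tagger():
    """Return a token-classification pipeline (downloads weights on first call)."""
    # Imported lazily so that person_tokens() can be used without transformers.
    from transformers import pipeline

    return pipeline("ner", model=MODEL_ID, tokenizer=MODEL_ID)


def person_tokens(predictions: List[Dict]) -> List[str]:
    """Keep the word pieces tagged B-PER or I-PER, in order of appearance."""
    return [p["word"] for p in predictions if p["entity"] in ("B-PER", "I-PER")]
```

With `transformers` installed, passing the result of `load_tagger()("Effi Briest wartete auf Innstetten.")` to `person_tokens` would list the word pieces tagged as persons. The sentence is a made-up example, and the actual output depends on the model.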

# Stats

## Language modeling

The [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1) consists of 3,194 documents with 203,516,988 tokens and 1,520,855 types. The texts were published between the 18th and the 20th century:

![years](prosa-jahre.png)
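
To put the corpus figures in perspective, the following sketch derives two summary values from them: the type-token ratio and the mean document length. The function names are illustrative, and only the three counts come from this card.

```python
# Corpus figures quoted above.
DOCUMENTS = 3_194
TOKENS = 203_516_988
TYPES = 1_520_855


def type_token_ratio(types: int, tokens: int) -> float:
    """Share of distinct word forms among all running tokens."""
    return types / tokens


def mean_tokens_per_document(tokens: int, documents: int) -> float:
    """Average document length in tokens."""
    return tokens / documents


print(f"type-token ratio: {type_token_ratio(TYPES, TOKENS):.4f}")
print(f"tokens per document: {mean_tokens_per_document(TOKENS, DOCUMENTS):,.0f}")
```

This gives a type-token ratio of roughly 0.0075 and an average of about 63,700 tokens per document.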

### Results

After one epoch:

| Model           | Perplexity |
| --------------- | ---------- |
| Vanilla BERT    | 6.82       |
| Fine-tuned BERT | 4.98       |
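
The card does not state how perplexity was computed. A minimal sketch, assuming the common definition as the exponential of the mean cross-entropy loss (in nats), also shows the relative improvement implied by the two reported values. The function names are illustrative.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity as the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


# Perplexities reported in the table above.
VANILLA, FINE_TUNED = 6.82, 4.98
print(f"relative reduction: {relative_reduction(VANILLA, FINE_TUNED):.1%}")
```

Under this definition, fine-tuning lowers perplexity by about 27% relative to the vanilla model.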

## Named entity recognition

For named entity recognition, the provided model was fine-tuned for two epochs on 10,799 training sentences, validated on 547 sentences, and tested on 1,845 sentences, using three labels: `B-PER`, `I-PER` and `O`.
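
The three labels follow the usual BIO scheme: `B-PER` opens a person mention, `I-PER` continues it, and `O` marks everything else. The sketch below groups such tags into mention spans. The helper `person_spans` and its lenient handling of a stray `I-PER` are assumptions made for illustration, not part of the model or its training code.

```python
from typing import List, Tuple


def person_spans(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Group B-PER/I-PER tags into (start, end, text) spans; end is exclusive."""
    spans = []
    start = None  # index where the currently open span began, if any
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        if start is not None and tag != "I-PER":
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
        if tag == "B-PER" or (tag == "I-PER" and start is None):
            start = i  # lenient: a stray I-PER also opens a span
    return spans
```

For the tokens `["Effi", "Briest", "liebte", "Innstetten", "nicht"]` with the tags `["B-PER", "I-PER", "O", "B-PER", "O"]`, this yields the spans `(0, 2, "Effi Briest")` and `(3, 4, "Innstetten")`.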

### Results

| Dataset | Precision | Recall | F1   |
| ------- | --------- | ------ | ---- |
| Dev     | 96.4      | 87.3   | 91.6 |
| Test    | 92.8      | 94.9   | 93.8 |
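
F1 is the harmonic mean of precision and recall, so the reported scores can be checked against the other two columns. The sketch below does this for both rows; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) rows from the table above.
REPORTED = {"Dev": (96.4, 87.3, 91.6), "Test": (92.8, 94.9, 93.8)}

for name, (p, r, reported) in REPORTED.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.1f}, reported = {reported}")
```

Both computed values match the reported F1 scores after rounding to one decimal place.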

The model has also been evaluated using 10-fold cross-validation and compared with a classic conditional random field (CRF) baseline described in [Jannidis et al.](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf) (2015):

![kfold](kfold.png)
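
The card does not say how the ten folds were formed (for example by sentence or by novel, shuffled or not). As a generic illustration of k-fold cross-validation, the sketch below splits item indices into contiguous, near-equal folds; the function `k_fold_indices` is hypothetical and does not reproduce the actual evaluation setup.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    base, extra = divmod(n_items, k)  # the first `extra` folds get one more item
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_items))
        yield train, test
        start = stop
```

Each item lands in exactly one test fold, so every sentence is used for evaluation once and for training in the other nine rounds.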

# References

Markus Krug, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, Fotis Jannidis, [Description of a Corpus of Character References in German Novels](http://webdoc.sub.gwdg.de/pub/mon/dariah-de/dwp-2018-27.pdf), 2018.

Fotis Jannidis, Isabella Reger, Lukas Weimer, Markus Krug, Martin Toepfer, Frank Puppe, [Automatische Erkennung von Figuren in deutschsprachigen Romanen](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf), 2015.