Commit bf115d6 by mdermentzi (parent: a381619): Create model card

Files changed (1): README.md (+122 −0)
---
license: eupl-1.2
datasets:
- ehri-ner/ehri-ner-all
language:
- cs
- de
- en
- fr
- hu
- nl
- pl
- sk
- yi
metrics:
- name: f1
  type: f1
  value: 81.5
pipeline_tag: token-classification
tags:
- Holocaust
- EHRI
base_model: FacebookAI/xlm-roberta-large
---
# Model Card for xlm-roberta-large-ehri-ner-all

<!-- Provide a quick summary of what the model is/does. -->

The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. A tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material to relevant identifiers in domain-specific controlled vocabularies, thereby semantically enriching it and making it more discoverable. The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER) using EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts. The EHRI-NER dataset was built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5%.

### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Dermentzi, M. & Scheithauer, H.
- **Funded by:** European Commission call H2020-INFRAIA-2018–2020. Grant agreement ID 871111. DOI 10.3030/871111.
- **Language(s) (NLP):** Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish
- **License:** EUPL-1.2
- **Finetuned from model:** FacebookAI/xlm-roberta-large

<!-- ### Model Sources [optional]

Provide the basic links for the model.

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
-->

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
This model was developed for research purposes in the context of the EHRI-3 project. Specifically, the aim was to determine whether a single model can be trained to recognize entities across different document types and languages in Holocaust-related texts. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual Holocaust-related annotations is 81.5%. We argue that this score is sufficiently high to consider the next steps towards deploying this model, i.e., receiving more feedback from the EHRI community. Once we have a stable model that EHRI stakeholders are satisfied with, this model and its potential successors are intended to be used as part of an EHRI editorial pipeline: upon inputting text into a tool that supports our model, potential named entities within the text will be automatically pre-annotated, helping our intended users (i.e., researchers and professional archivists) detect them faster and link them to their associated entities in the custom EHRI controlled vocabularies and authority sets. This has the potential to facilitate metadata enrichment of descriptions in the EHRI Portal and enhance their discoverability. It would also make it easier for EHRI to develop new Online Editions and unlock new ways for archivists and researchers within the EHRI network to organize, analyze, and present their materials and research data in ways that would otherwise require a lot of manual work.

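The pre-annotation step described above can be sketched in plain Python. Everything here is illustrative: the vocabulary excerpt, the identifiers, and the function and class names are assumptions for the sake of the example, not EHRI's actual tooling or identifiers.

```python
# Illustrative sketch of an editorial pre-annotation step: entity spans
# predicted by a NER model are matched against a (hypothetical) excerpt of
# a controlled vocabulary so a human editor can confirm or reject the links.
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical vocabulary excerpt; real EHRI identifiers differ.
VOCABULARY = {
    ("CAMP", "auschwitz"): "ehri_camps-0001",
    ("GHETTO", "warsaw"): "ehri_ghettos-0002",
}

@dataclass
class PreAnnotation:
    label: str                # entity type predicted by the NER model
    text: str                 # surface form found in the document
    vocab_id: Optional[str]   # suggested controlled-vocabulary identifier, if any

def pre_annotate(spans: List[Tuple[str, str]]) -> List[PreAnnotation]:
    """Turn (label, surface text) spans into suggestions for a human editor.

    Spans with no vocabulary match are kept with vocab_id=None so the
    editor still sees them and can link them manually.
    """
    suggestions = []
    for label, text in spans:
        vocab_id = VOCABULARY.get((label, text.lower()))
        suggestions.append(PreAnnotation(label, text, vocab_id))
    return suggestions

suggestions = pre_annotate([("CAMP", "Auschwitz"), ("PERS", "Anne Frank")])
```

In this sketch the first span resolves to a vocabulary identifier while the second is surfaced without one, mirroring the intended workflow where the model pre-annotates and the editor confirms the links.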
## Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The dataset used to fine-tune this model stems from a series of manually annotated digital scholarly editions, the EHRI Online Editions. The original purpose of these editions was not to provide a dataset for training NER models, although we argue that they nevertheless constitute a high-quality resource that is suitable to be used in this way. However, users should still be mindful that our dataset repurposes a resource that was not built for this purpose.

The fine-tuned model occasionally misclassifies entity tokens as non-entity tokens, with I-GHETTO being the most frequently confused label. It also encounters challenges in extracting multi-token entities: inside (I-) tags such as I-CAMP, I-LOC, and I-ORG are sometimes confused with the beginning of an entity. Moreover, it tends to misclassify B-GHETTO and B-CAMP as B-LOC, which is not surprising given that they are semantically close.

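The B-/I- prefixes discussed above come from BIO-style tagging. The following stdlib sketch shows how a tag sequence decodes into entity spans, which makes clear why confusing an I- tag with a B- tag, or with O, splits or truncates a multi-token entity. The decoder is a generic illustration (with a common lenient choice for stray I- tags), not the EHRI tooling.

```python
def bio_decode(tokens, tags):
    """Group BIO tags (e.g. B-CAMP, I-CAMP, O) into (label, text) spans.

    A stray I- tag that does not continue a matching open span is treated
    as the start of a new entity, a common lenient-decoding choice.
    """
    spans, current = [], None  # current = (label, [tokens of the open span])
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current:                      # close any open span
                spans.append((current[0], " ".join(current[1])))
                current = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "B" or current is None or current[0] != label:
            if current:                      # B- tag (or label switch) closes the open span
                spans.append((current[0], " ".join(current[1])))
            current = (label, [token])
        else:                                # I- tag continuing the current entity
            current[1].append(token)
    if current:
        spans.append((current[0], " ".join(current[1])))
    return spans

# "Warsaw ghetto" decodes as one GHETTO span; "Auschwitz" as a CAMP span.
spans = bio_decode(
    ["The", "Warsaw", "ghetto", "and", "Auschwitz"],
    ["O", "B-GHETTO", "I-GHETTO", "O", "B-CAMP"],
)
```

If the model mislabels the I-GHETTO tag as O, the decoder emits only the truncated span ("GHETTO", "Warsaw"), which is exactly the multi-token failure mode described above.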
This model was envisioned to work as part of EHRI-related editorial and publishing pipelines and may not be suitable for the purposes of other users/organizations.

### Recommendations

For more information, we encourage potential users to read the paper accompanying this model:
Dermentzi, M., & Scheithauer, H. (2024, May 21). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. Proceedings of the LREC-COLING 2024 Workshop on Holocaust Testimonies as Language Resources. HTRes@LREC-COLING 2024, Turin, Italy.

## Citation

**BibTeX:**

    @inproceedings{dermentzi_repurposing_2024,
        address = {Turin, Italy},
        title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
        booktitle = {Proceedings of the {LREC}-{COLING} 2024 {Workshop} on {Holocaust} {Testimonies} as {Language} {Resources}},
        author = {Dermentzi, Maria and Scheithauer, Hugo},
        month = may,
        year = {2024},
        pubstate = {forthcoming},
    }

**APA:**

Dermentzi, M., & Scheithauer, H. (2024, May 21). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. Proceedings of the LREC-COLING 2024 Workshop on Holocaust Testimonies as Language Resources. HTRes@LREC-COLING 2024, Turin, Italy.