---
license: eupl-1.1
datasets:
- ehri-ner/ehri-ner-all
language:
- cs
- de
- en
- fr
- hu
- nl
- pl
- sk
- yi
metrics:
- name: f1
  type: f1
  value: 81.5
pipeline_tag: token-classification
tags:
- Holocaust
- EHRI
base_model: FacebookAI/xlm-roberta-large
---
# xlm-roberta-large-ehri-ner-all

The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. A tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions makes it easier to link more material to relevant identifiers in domain-specific controlled vocabularies, semantically enrich it, and make it more discoverable. The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER) using the EHRI-NER dataset, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts. EHRI-NER was built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. Despite the relatively small dataset, in a multilingual experiment setup the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5%.

### Model Description

- **Developed by:** Dermentzi, M. & Scheithauer, H.
- **Funded by:** European Commission call H2020-INFRAIA-2018–2020. Grant agreement ID 871111. DOI 10.3030/871111.
- **Language(s) (NLP):** Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish
- **License:** EUPL-1.2
- **Finetuned from model:** FacebookAI/xlm-roberta-large

<!-- ### Model Sources

- **Repository:** [More Information Needed]
- **Paper:** [More Information Needed]
-->

## Uses

This model was developed for research purposes in the context of the EHRI-3 project. Specifically, the aim was to determine whether a single model can be trained to recognize entities across different document types and languages in Holocaust-related texts. Our experiments show that despite the relatively small dataset, in a multilingual experiment setup the overall F1 score achieved by XLM-R fine-tuned on multilingual Holocaust-related annotations is 81.5%. We argue that this score is sufficiently high to consider the next steps towards deploying the model, i.e., gathering more feedback from the EHRI community.

Once we have a stable model that EHRI stakeholders are satisfied with, this model and its potential successors are intended to be used as part of an EHRI editorial pipeline: upon inputting text into a tool that supports the model, potential named entities within the text are automatically pre-annotated, helping our intended users (researchers and professional archivists) detect them faster and link them to the associated entities in the custom EHRI controlled vocabularies and authority sets. This has the potential to facilitate metadata enrichment of descriptions in the EHRI Portal and enhance their discoverability. It would also make it easier for EHRI to develop new Online Editions and unlock new ways for archivists and researchers within the EHRI network to organize, analyze, and present their materials and research data in ways that would otherwise require a lot of manual work.

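A minimal usage sketch with the Hugging Face `transformers` library. The repository id below is an assumption based on this card's model and dataset names, and the example sentence is illustrative; adjust both to your setup.

```python
# Sketch: run the fine-tuned model as a token-classification pipeline.
# The Hub id below is an assumption inferred from this card, not verified.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ehri-ner/xlm-roberta-large-ehri-ner-all",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

# Illustrative input; real inputs would be testimony or archival text.
for entity in ner("Many prisoners were deported from the Warsaw ghetto to Treblinka."):
    print(entity["entity_group"], entity["word"], f"{entity['score']:.2f}")
```

With `aggregation_strategy="simple"`, the pipeline returns one dict per detected entity (`entity_group`, `word`, `score`, character offsets) rather than one per token.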
## Limitations

The dataset used to fine-tune this model stems from a series of manually annotated digital scholarly editions, the EHRI Online Editions. The original purpose of these editions was not to provide a dataset for training NER models, although we argue that they nevertheless constitute a high-quality resource suitable for this use. Users should still be mindful that our dataset repurposes a resource that was not built for this purpose.

The fine-tuned model occasionally misclassifies entity tokens as non-entity tokens, with I-GHETTO being the most frequently confused label. It also occasionally struggles with multi-token entities, such as I-CAMP, I-LOC, and I-ORG, which are sometimes confused with the beginning of an entity. Moreover, it tends to misclassify B-GHETTO and B-CAMP as B-LOC, which is not surprising given that these classes are semantically close.

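The labels above follow the BIO tagging scheme: B- marks the first token of a mention (e.g., B-GHETTO), I- marks a continuation, and O marks non-entity tokens. A minimal sketch of how BIO tags are merged into entity spans; the helper function and example tokens are illustrative, not part of the model's code:

```python
# Illustrative sketch: merge BIO tags (e.g., B-CAMP/I-CAMP) into entity spans.
def merge_bio(tokens, labels):
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = (label[2:], [token])  # start a new span of this type
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)  # continue the open span
        else:  # "O", or an I- tag that does not continue the open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["She", "was", "held", "in", "the", "Warsaw", "ghetto"]
labels = ["O", "O", "O", "O", "O", "B-GHETTO", "I-GHETTO"]
print(merge_bio(tokens, labels))  # [('GHETTO', 'Warsaw ghetto')]
```

The confusions described above correspond to failure modes of this decoding: a missed I-GHETTO truncates a span, and an I- tag predicted without its B- counterpart starts no span at all.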
This model was envisioned to work as part of EHRI-related editorial and publishing pipelines and may not be suitable for the purposes of other users or organizations.

### Recommendations

For more information, we encourage potential users to read the paper accompanying this model:

Dermentzi, M., & Scheithauer, H. (2024, May 21). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. Proceedings of the LREC-COLING 2024 Workshop on Holocaust Testimonies as Language Resources. HTRes@LREC-COLING 2024, Turin, Italy.

## Citation

**BibTeX:**

```bibtex
@inproceedings{dermentzi_repurposing_2024,
  address = {Turin, Italy},
  title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
  booktitle = {Proceedings of the {LREC}-{COLING} 2024 {Workshop} on {Holocaust} {Testimonies} as {Language} {Resources}},
  author = {Dermentzi, Maria and Scheithauer, Hugo},
  month = may,
  year = {2024},
  pubstate = {forthcoming},
}
```

**APA:**

Dermentzi, M., & Scheithauer, H. (2024, May 21). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. Proceedings of the LREC-COLING 2024 Workshop on Holocaust Testimonies as Language Resources. HTRes@LREC-COLING 2024, Turin, Italy.