---
license: eupl-1.1
datasets:
- ehri-ner/ehri-ner-all
language:
- cs
- de
- en
- fr
- hu
- nl
- pl
- sk
- yi
metrics:
- name: f1
  type: f1
  value: 81.5
pipeline_tag: token-classification
tags:
- Holocaust
- EHRI
base_model: FacebookAI/xlm-roberta-large
---
# xlm-roberta-large-ehri-ner-all

The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. A tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions makes it easier to link more material to relevant identifiers in domain-specific controlled vocabularies, semantically enrich it, and make it more discoverable. The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER) using the EHRI-NER dataset, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts. EHRI-NER was built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. Despite the relatively small dataset, in a multilingual experiment setup the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5%.

### Model Description

- **Developed by:** Dermentzi, M. & Scheithauer, H.
- **Funded by:** European Commission call H2020-INFRAIA-2018–2020. Grant agreement ID 871111. DOI 10.3030/871111.
- **Language(s) (NLP):** Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish
- **License:** EUPL-1.2
- **Finetuned from model:** FacebookAI/xlm-roberta-large

<!-- ### Model Sources

- **Repository:** [More Information Needed]
- **Paper:** [More Information Needed]
-->

## Uses

This model was developed for research purposes in the context of the EHRI-3 project. Specifically, the aim was to determine whether a single model can be trained to recognize entities across different document types and languages in Holocaust-related texts. Our experiments show that despite the relatively small dataset, in a multilingual experiment setup the overall F1 score achieved by XLM-R fine-tuned on multilingual Holocaust-related annotations is 81.5%. We argue that this score is sufficiently high to consider the next steps towards deploying the model, i.e., gathering more feedback from the EHRI community.

Once we have a stable model that EHRI stakeholders are satisfied with, this model and its potential successors are intended to be used as part of an EHRI editorial pipeline: upon inputting text into a tool that supports the model, potential named entities within the text are automatically pre-annotated, helping our intended users (researchers and professional archivists) detect them faster and link them to the associated entities in the custom EHRI controlled vocabularies and authority sets. This has the potential to facilitate metadata enrichment of descriptions in the EHRI Portal and enhance their discoverability. It would also make it easier for EHRI to develop new Online Editions and unlock new ways for archivists and researchers within the EHRI network to organize, analyze, and present their materials and research data in ways that would otherwise require a lot of manual work.

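A minimal usage sketch with the Hugging Face `transformers` library. The repository id below is an assumption based on this card's model and dataset names, and the example sentence is illustrative; adjust both to your setup.

```python
# Sketch: run the fine-tuned model as a token-classification pipeline.
# The Hub id below is an assumption inferred from this card, not verified.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="ehri-ner/xlm-roberta-large-ehri-ner-all",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entities
)

# Illustrative input; real inputs would be testimony or archival text.
for entity in ner("Many prisoners were deported from the Warsaw ghetto to Treblinka."):
    print(entity["entity_group"], entity["word"], f"{entity['score']:.2f}")
```

With `aggregation_strategy="simple"`, the pipeline returns one dict per detected entity (`entity_group`, `word`, `score`, character offsets) rather than one per token.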
## Limitations

The dataset used to fine-tune this model stems from a series of manually annotated digital scholarly editions, the EHRI Online Editions. The original purpose of these editions was not to provide a dataset for training NER models, although we argue that they nevertheless constitute a high-quality resource suitable for this use. Users should still be mindful that our dataset repurposes a resource that was not built for this purpose.

The fine-tuned model occasionally misclassifies entity tokens as non-entity tokens, with I-GHETTO being the most frequently confused label. It also occasionally struggles with multi-token entities, such as I-CAMP, I-LOC, and I-ORG, which are sometimes confused with the beginning of an entity. Moreover, it tends to misclassify B-GHETTO and B-CAMP as B-LOC, which is not surprising given that these classes are semantically close.

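The labels above follow the BIO tagging scheme: B- marks the first token of a mention (e.g., B-GHETTO), I- marks a continuation, and O marks non-entity tokens. A minimal sketch of how BIO tags are merged into entity spans; the helper function and example tokens are illustrative, not part of the model's code:

```python
# Illustrative sketch: merge BIO tags (e.g., B-CAMP/I-CAMP) into entity spans.
def merge_bio(tokens, labels):
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                spans.append(current)
            current = (label[2:], [token])  # start a new span of this type
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)  # continue the open span
        else:  # "O", or an I- tag that does not continue the open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["She", "was", "held", "in", "the", "Warsaw", "ghetto"]
labels = ["O", "O", "O", "O", "O", "B-GHETTO", "I-GHETTO"]
print(merge_bio(tokens, labels))  # [('GHETTO', 'Warsaw ghetto')]
```

The confusions described above correspond to failure modes of this decoding: a missed I-GHETTO truncates a span, and an I- tag predicted without its B- counterpart starts no span at all.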
This model was envisioned to work as part of EHRI-related editorial and publishing pipelines and may not be suitable for the purposes of other users or organizations.

### Recommendations

For more information, we encourage potential users to read the paper accompanying this model:

Dermentzi, M., & Scheithauer, H. (2024, May 21). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. Proceedings of the LREC-COLING 2024 Workshop on Holocaust Testimonies as Language Resources. HTRes@LREC-COLING 2024, Turin, Italy.

## Citation

**BibTeX:**

```bibtex
@inproceedings{dermentzi_repurposing_2024,
  address = {Turin, Italy},
  title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
  booktitle = {Proceedings of the {LREC}-{COLING} 2024 {Workshop} on {Holocaust} {Testimonies} as {Language} {Resources}},
  author = {Dermentzi, Maria and Scheithauer, Hugo},
  month = may,
  year = {2024},
  pubstate = {forthcoming},
}
```

**APA:**

Dermentzi, M., & Scheithauer, H. (2024, May 21). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. Proceedings of the LREC-COLING 2024 Workshop on Holocaust Testimonies as Language Resources. HTRes@LREC-COLING 2024, Turin, Italy.