metadata
license: eupl-1.1
datasets:
  - ehri-ner/ehri-ner-all
language:
  - cs
  - de
  - en
  - fr
  - hu
  - nl
  - pl
  - sk
  - yi
metrics:
  - name: f1
    type: f1
    value: 81.5
pipeline_tag: token-classification
tags:
  - Holocaust
  - EHRI
base_model: FacebookAI/xlm-roberta-large

Model Card for xlm-roberta-large-ehri-ner-all

The European Holocaust Research Infrastructure (EHRI) supports Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. A tool that detects named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material to relevant identifiers in domain-specific controlled vocabularies, semantically enriching that material and making it more discoverable.

The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER) on the EHRI-NER dataset, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts. EHRI-NER was built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. Despite the relatively small size of the dataset, XLM-R fine-tuned on these multilingual annotations achieves an overall F1 score of 81.5% in our multilingual experiment setup.

Model Description

  • Developed by: Dermentzi, M. & Scheithauer, H.
  • Funded by: European Commission call H2020-INFRAIA-2018–2020. Grant agreement ID 871111. DOI 10.3030/871111.
  • Language(s) (NLP): Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish
  • License: EUPL-1.2
  • Finetuned from model: FacebookAI/xlm-roberta-large

Uses

This model was developed for research purposes in the context of the EHRI-3 project. The aim was to determine whether a single model can be trained to recognize entities across different document types and languages in Holocaust-related texts. Despite our relatively small dataset, XLM-R fine-tuned on multilingual Holocaust-related annotations achieves an overall F1 score of 81.5% in a multilingual experiment setup. We argue that this score is high enough to consider the next step towards deploying the model, namely gathering more feedback from the EHRI community.

Once we have a stable model that EHRI stakeholders are satisfied with, this model and its potential successors are intended to be used as part of an EHRI editorial pipeline. When text is entered into a tool that supports our model, potential named entities will be automatically pre-annotated, helping our intended users (researchers and professional archivists) detect them faster and link them to the corresponding entries in the custom EHRI controlled vocabularies and authority sets. This could facilitate metadata enrichment of descriptions in the EHRI Portal and enhance their discoverability. It would also make it easier for EHRI to develop new Online Editions and open up new ways for archivists and researchers within the EHRI network to organize, analyze, and present their materials and research data that would otherwise require a lot of manual work.
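As a rough illustration of the pre-annotation step described above, the sketch below wraps detected entities in inline markers for an editor to review. It assumes the model is loaded through the Hugging Face `transformers` token-classification pipeline with `aggregation_strategy="simple"`, whose results are dictionaries with `entity_group`, `start`, and `end` keys. The repository ID, the helper names (`load_tagger`, `pre_annotate`, `span`), the bracket markup, and the sample sentence are all assumptions made for illustration; they are not part of the EHRI editorial pipeline.

```python
from typing import Dict, List

# Assumed repository ID, inferred from the model and dataset names in this card.
ASSUMED_MODEL_ID = "ehri-ner/xlm-roberta-large-ehri-ner-all"


def load_tagger(model_id: str = ASSUMED_MODEL_ID):
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; only needed for inference

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge sub-word pieces into entity spans
    )


def pre_annotate(text: str, entities: List[Dict]) -> str:
    """Wrap each detected span as [surface|LABEL] so an editor can review it."""
    out = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip a span that overlaps one already emitted
        out.append(text[cursor:ent["start"]])
        out.append(f"[{text[ent['start']:ent['end']]}|{ent['entity_group']}]")
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)


def span(text: str, surface: str, label: str) -> Dict:
    """Hypothetical helper: build a pipeline-shaped span for the first match."""
    start = text.index(surface)
    return {"entity_group": label, "start": start, "end": start + len(surface)}


# Hand-written spans in the pipeline's output shape, so no download is needed here.
sample_text = "She was deported from Theresienstadt to Auschwitz."
sample_entities = [
    span(sample_text, "Theresienstadt", "GHETTO"),
    span(sample_text, "Auschwitz", "CAMP"),
]
annotated = pre_annotate(sample_text, sample_entities)
print(annotated)
```

With a real model, `load_tagger()(sample_text)` would supply the entity list instead of the hand-written spans. Keeping the markup step separate from inference means the same function can be reused if the model is later replaced by a successor.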

Limitations

The dataset used to fine-tune this model stems from a series of manually annotated digital scholarly editions, the EHRI Online Editions. These editions were not originally created to provide a dataset for training NER models, although we argue that they nevertheless constitute a high-quality resource that is suitable for this use. However, users should still be mindful that our dataset repurposes a resource that was not built for this purpose.

The fine-tuned model occasionally misclassifies entity tokens as non-entity tokens, with I-GHETTO being the label most often confused in this way. It also occasionally struggles with multi-token entities: continuation labels such as I-CAMP, I-LOC, and I-ORG are sometimes confused with the beginning of an entity. Moreover, it tends to misclassify B-GHETTO and B-CAMP as B-LOC, which is not surprising given that these categories are semantically close.
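Users who want to check for these confusions on their own data could tally mismatched gold and predicted tags. The sketch below is a minimal illustration that assumes token-aligned BIO tag sequences; the function name `confusion_counts` and the toy sequences are hypothetical and do not reproduce the authors' evaluation.

```python
from collections import Counter
from typing import List


def confusion_counts(gold: List[str], pred: List[str]) -> Counter:
    """Count (gold_tag, predicted_tag) pairs for every token where the two differ."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be token-aligned")
    return Counter((g, p) for g, p in zip(gold, pred) if g != p)


# Toy sequences, invented for illustration, covering the three error types above:
# an entity token tagged as non-entity, a continuation tagged as a beginning,
# and a GHETTO beginning tagged as LOC.
gold = ["O", "B-GHETTO", "I-GHETTO", "O", "B-CAMP", "I-CAMP", "O"]
pred = ["O", "B-LOC", "O", "O", "B-CAMP", "B-CAMP", "O"]

counts = confusion_counts(gold, pred)
for (gold_tag, pred_tag), n in counts.most_common():
    print(f"{gold_tag} -> {pred_tag}: {n}")
```

Sorting the tally by frequency gives a quick view of which label pairs deserve attention when reviewing pre-annotations.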

This model was envisioned to work as part of EHRI-related editorial and publishing pipelines and may not be suitable for the purposes of other users/organizations.

Recommendations

For more information, we encourage potential users to read the paper accompanying this model: Dermentzi, M., & Scheithauer, H. (2024, May 21). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. Proceedings of the LREC-COLING 2024 Workshop on Holocaust Testimonies as Language Resources. HTRes@LREC-COLING 2024, Turin, Italy.

Citation

BibTeX:

@inproceedings{dermentzi_repurposing_2024,
  address = {Turin, Italy},
  title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
  booktitle = {Proceedings of the {LREC}-{COLING} 2024 {Workshop} on {Holocaust} {Testimonies} as {Language} {Resources}},
  author = {Dermentzi, Maria and Scheithauer, Hugo},
  month = may,
  year = {2024},
  pubstate = {forthcoming},
}

APA: Dermentzi, M., & Scheithauer, H. (2024, May 21). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. Proceedings of the LREC-COLING 2024 Workshop on Holocaust Testimonies as Language Resources. HTRes@LREC-COLING 2024, Turin, Italy.