---
license: eupl-1.1
datasets:
- ehri-ner/ehri-ner-all
language:
- cs
- de
- en
- fr
- hu
- nl
- pl
- sk
- yi
metrics:
- name: f1
  type: f1
  value: 81.5
pipeline_tag: token-classification
tags:
- Holocaust
- EHRI
base_model: FacebookAI/xlm-roberta-large
---
|
# Model Card for ehri-ner/xlm-roberta-large-ehri-ner-all

<!-- Provide a quick summary of what the model is/does. -->
|
|
|
The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it and making it more discoverable. The xlm-roberta-large-ehri-ner-all model fine-tunes XLM-RoBERTa (XLM-R) for Holocaust-related Named Entity Recognition (NER) using EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for NER in Holocaust-related texts. The EHRI-NER dataset is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5%.
|
|
|
|
|
### Model Description

<!-- Provide a longer summary of what this model is. -->

- **Developed by:** Dermentzi, M. & Scheithauer, H.
- **Funded by:** European Commission, call H2020-INFRAIA-2018–2020, grant agreement ID 871111, DOI 10.3030/871111.
- **Language(s) (NLP):** The model was fine-tuned on cs, de, en, fr, hu, nl, pl, sk, and yi data, but it may also work for other languages thanks to the cross-lingual transfer capabilities of its multilingual base model (XLM-R).
|
- **License:** EUPL-1.1
- **Finetuned from model:** FacebookAI/xlm-roberta-large
|
|
|
|
|
|
## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
This model was developed for research purposes in the context of the EHRI-3 project. Specifically, the aim was to determine whether a single model can be trained to recognize entities across different document types and languages in Holocaust-related texts. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual Holocaust-related annotations is 81.5%. We argue that this score is sufficiently high to consider the next steps towards deploying this model, i.e., gathering more feedback from the EHRI community. Once we have a stable model that EHRI stakeholders are satisfied with, this model and its potential successors are intended to be used as part of an EHRI editorial pipeline: when text is entered into a tool that supports our model, potential named entities within it will be automatically pre-annotated, helping our intended users (i.e., researchers and professional archivists) detect them faster and link them to the corresponding entries in the custom EHRI controlled vocabularies and authority sets. This has the potential to facilitate metadata enrichment of descriptions in the EHRI Portal and enhance their discoverability. It would also make it easier for EHRI to develop new Online Editions and unlock new ways for archivists and researchers within the EHRI network to organize, analyze, and present their materials and research data in ways that would otherwise require a lot of manual work.
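As a usage sketch, the model can be loaded through the standard 🤗 Transformers token-classification pipeline. The example sentence is ours, and `aggregation_strategy="simple"` is one reasonable choice for merging sub-word pieces into entity spans:

```python
from transformers import pipeline

# Load the fine-tuned model from the Hugging Face Hub.
ner = pipeline(
    "token-classification",
    model="ehri-ner/xlm-roberta-large-ehri-ner-all",
    aggregation_strategy="simple",  # merge sub-word pieces into entity spans
)

# Each result carries the predicted label, a confidence score, and character offsets.
for entity in ner("Many Jews were deported from the Warsaw ghetto to Auschwitz in 1943."):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```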
|
|
|
## Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
|
The dataset used to fine-tune this model stems from a series of manually annotated digital scholarly editions, the EHRI Online Editions. The original purpose of these editions was not to provide a dataset for training NER models, although we argue that they nevertheless constitute a high-quality resource that is suitable to be used in this way. However, users should still be mindful that our dataset repurposes a resource that was not built for this purpose.
|
|
|
The fine-tuned model occasionally misclassifies entities as non-entity tokens, with I-GHETTO being the most frequently confused label. It also occasionally struggles to extract multi-token entities, such as I-CAMP, I-LOC, and I-ORG, which are sometimes confused with the beginning of an entity. Moreover, it tends to misclassify B-GHETTO and B-CAMP as B-LOC, which is not surprising given that they are semantically close.
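To see why these confusions matter downstream, recall the BIO tagging scheme the labels follow: a B- tag opens an entity span and subsequent I- tags of the same type extend it, so a continuation tag misread as a new B- tag splits one multi-token name into two entities. A minimal illustration (the grouping helper and example tags below are ours, not part of the model):

```python
def group_bio(tokens, tags):
    """Group (token, BIO-tag) pairs into (entity_type, text) spans."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(token)
        else:  # "O", or an I- tag that does not continue the open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(words)) for etype, words in spans]

tokens = ["Auschwitz", "Birkenau", "was", "liberated"]

# Correct tagging: one two-token CAMP entity.
print(group_bio(tokens, ["B-CAMP", "I-CAMP", "O", "O"]))
# → [('CAMP', 'Auschwitz Birkenau')]

# If the model mislabels the continuation as a new B-CAMP,
# the same name is split into two separate entities.
print(group_bio(tokens, ["B-CAMP", "B-CAMP", "O", "O"]))
# → [('CAMP', 'Auschwitz'), ('CAMP', 'Birkenau')]
```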
|
|
|
This model was envisioned to work as part of EHRI-related editorial and publishing pipelines and may not be suitable for the purposes of other users/organizations.
|
|
|
### Recommendations

For more information, we encourage potential users to read the paper accompanying this model:

Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. https://hal.science/hal-04547222
|
|
|
## Citation

**BibTeX:**

```bibtex
@inproceedings{dermentzi_repurposing_2024,
  address = {Torino, Italy},
  title = {Repurposing {Holocaust}-{Related} {Digital} {Scholarly} {Editions} to {Develop} {Multilingual} {Domain}-{Specific} {Named} {Entity} {Recognition} {Tools}},
  url = {https://hal.science/hal-04547222},
  abstract = {The European Holocaust Research Infrastructure (EHRI) aims to support Holocaust research by making information about dispersed Holocaust material accessible and interconnected through its services. Creating a tool capable of detecting named entities in texts such as Holocaust testimonies or archival descriptions would make it easier to link more material with relevant identifiers in domain-specific controlled vocabularies, semantically enriching it, and making it more discoverable. With this paper, we release EHRI-NER, a multilingual dataset (Czech, German, English, French, Hungarian, Dutch, Polish, Slovak, Yiddish) for Named Entity Recognition (NER) in Holocaust-related texts. EHRI-NER is built by aggregating all the annotated documents in the EHRI Online Editions and converting them to a format suitable for training NER models. We leverage this dataset to fine-tune the multilingual Transformer-based language model XLM-RoBERTa (XLM-R) to determine whether a single model can be trained to recognize entities across different document types and languages. The results of our experiments show that despite our relatively small dataset, in a multilingual experiment setup, the overall F1 score achieved by XLM-R fine-tuned on multilingual annotations is 81.5\%. We argue that this score is sufficiently high to consider the next steps towards deploying this model.},
  urldate = {2024-04-29},
  booktitle = {{LREC}-{COLING} 2024 - {Joint} {International} {Conference} on {Computational} {Linguistics}, {Language} {Resources} and {Evaluation}},
  publisher = {ELRA Language Resources Association (ELRA); International Committee on Computational Linguistics (ICCL)},
  author = {Dermentzi, Maria and Scheithauer, Hugo},
  month = may,
  year = {2024},
  keywords = {Digital Editions, Holocaust Testimonies, Multilingual, Named Entity Recognition, Transfer Learning, Transformers},
}
```
|
|
|
**APA:**

Dermentzi, M., & Scheithauer, H. (2024, May). Repurposing Holocaust-Related Digital Scholarly Editions to Develop Multilingual Domain-Specific Named Entity Recognition Tools. LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation. HTRes@LREC-COLING 2024, Torino, Italy. https://hal.science/hal-04547222