---

language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bm
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kg
- kk
- km
- kn
- ko
- ku
- ky
- la
- lg
- ln
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- th
- ti
- tl
- tn
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- zh


tags:
- retrieval
- entity-retrieval
- named-entity-disambiguation
- entity-disambiguation
- named-entity-linking
- entity-linking
- text2text-generation
---


# mGENRE 


The historical multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval) system as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on finetuned [mBART](https://arxiv.org/abs/2001.08210) architecture. 
GENRE performs retrieval generating the unique entity name conditioned on the input text using constrained beam search to only generate valid identifiers. 

This model was finetuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), composed of the following datasets.

| Dataset alias | README | Document type | Languages |  Suitable for | Project | License |
|---------|---------|---------------|-----------| ---------------|---------------| ---------------|
| ajmc       | [link](documentation/README-ajmc.md)  | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
| hipe2020   | [link](documentation/README-hipe2020.md)| historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020)| [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)|
| topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL |[Living with Machines](https://livingwithmachines.ac.uk/) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)|
| newseye    | [link](documentation/README-newseye.md)|  historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL |  [NewsEye](https://www.newseye.eu/) |  [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|
| sonar      | [link](documentation/README-sonar.md) | historical newspapers  | de | NERC-Coarse, EL |  [SoNAR](https://sonar.fh-potsdam.de/)  | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|


## BibTeX entry and citation info


## Usage

Here is an example of generation for Wikipedia page disambiguation with simulated OCR noise:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"

# Load the tokenizer and model from the specified pre-trained model name
# The model used here is "https://huggingface.co/impresso-project/nel-mgenre-multilingual"
nel_tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")

sentences = ["[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
             "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
             "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."]

nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME, 
                        tokenizer=nel_tokenizer, 
                        trust_remote_code=True,
                        device='cpu')
for sentence in sentences:
    print(sentence)
    linked_entity = nel_pipeline(sentence)
    print(linked_entity)
```

```
[{'surface': 'Un1ted Press', 'wkd_id': 'Q493845', 'wkpedia_pagename': 'United Press International', 'wkpedia_url': 'https://en.wikipedia.org/wiki/United_Press_International', 'type': 'UNK', 'confidence_nel': 55.89, 'lOffset': 7, 'rOffset': 21}]
[{'surface': 'Lon6on', 'wkd_id': 'Q84', 'wkpedia_pagename': 'London', 'wkpedia_url': 'https://de.wikipedia.org/wiki/London', 'type': 'UNK', 'confidence_nel': 99.99, 'lOffset': 10, 'rOffset': 18}]
[{'surface': 'AFP', 'wkd_id': 'Q40464', 'wkpedia_pagename': 'Agence France-Presse', 'wkpedia_url': 'https://fr.wikipedia.org/wiki/Agence_France-Presse', 'type': 'UNK', 'confidence_nel': 100.0, 'lOffset': 45, 'rOffset': 50}]
```

---
license: agpl-3.0
---