---
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bm
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kg
- kk
- km
- kn
- ko
- ku
- ky
- la
- lg
- ln
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- th
- ti
- tl
- tn
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- zh
tags:
- retrieval
- entity-retrieval
- named-entity-disambiguation
- entity-disambiguation
- named-entity-linking
- entity-linking
- text2text-generation
---
# mGENRE
This historical multilingual named entity linking (NEL) model is based on the mGENRE (multilingual Generative ENtity REtrieval) system presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE takes a sequence-to-sequence approach to entity retrieval (e.g., linking), built on a fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture.
mGENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search so that only valid identifiers can be generated.
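The constraint on beam search can be pictured as a prefix tree (trie) over the token sequences of all valid identifiers: at each decoding step, only tokens that extend a valid prefix are allowed. The following is a minimal sketch of that idea, not the model's actual implementation. The `Trie` class, the toy token ids, and the three identifiers are all hypothetical; the callback only mirrors the shape of the `prefix_allowed_tokens_fn` argument accepted by `generate` in `transformers`.

```python
from typing import Dict, Iterable, List, Sequence


class Trie:
    """Prefix tree over token-id sequences of valid entity identifiers."""

    def __init__(self, sequences: Iterable[Sequence[int]] = ()) -> None:
        self.root: Dict[int, dict] = {}
        for sequence in sequences:
            self.add(sequence)

    def add(self, sequence: Sequence[int]) -> None:
        node = self.root
        for token in sequence:
            node = node.setdefault(token, {})

    def allowed_next(self, prefix: Sequence[int]) -> List[int]:
        """Token ids that may follow `prefix`; empty if the prefix is invalid or complete."""
        node = self.root
        for token in prefix:
            if token not in node:
                return []
            node = node[token]
        return sorted(node)


# Toy vocabulary (hypothetical ids):
# 0 = end of sequence, 1 = "London", 2 = "Lyon", 3 = ">>", 4 = "en", 5 = "fr"
valid_identifiers = [
    [1, 3, 4, 0],  # "London >> en"
    [2, 3, 5, 0],  # "Lyon >> fr"
    [2, 3, 4, 0],  # "Lyon >> en"
]
trie = Trie(valid_identifiers)


def prefix_allowed_tokens_fn(batch_id: int, generated: Sequence[int]) -> List[int]:
    """Return the tokens the decoder may emit next, given what it has generated so far."""
    return trie.allowed_next(list(generated))


print(prefix_allowed_tokens_fn(0, []))      # [1, 2]
print(prefix_allowed_tokens_fn(0, [2, 3]))  # [4, 5]
print(prefix_allowed_tokens_fn(0, [1, 3]))  # [4]
```

In a real decoder the generated sequence usually begins with a start token that would have to be skipped before the trie lookup; the sketch leaves that out for brevity.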
This model was fine-tuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), which comprises the following datasets:
| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---------|---------|---------------|-----------| ---------------|---------------| ---------------|
| ajmc | [link](documentation/README-ajmc.md) | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
| hipe2020 | [link](documentation/README-hipe2020.md)| historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020)| [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)|
| topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL |[Living with Machines](https://livingwithmachines.ac.uk/) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)|
| newseye | [link](documentation/README-newseye.md)| historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | [NewsEye](https://www.newseye.eu/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|
| sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|
## BibTeX entry and citation info
## Usage
The example below links mentions wrapped in `[START]` and `[END]` markers to Wikipedia pages and Wikidata QIDs; the input sentences contain simulated OCR noise:
```python
from transformers import AutoTokenizer, pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"

# Load the tokenizer from the pre-trained model repository:
# https://huggingface.co/impresso-project/nel-mgenre-multilingual
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

# Each sentence marks the mention to link with [START] ... [END].
sentences = [
    "[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
    "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
    "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire.",
]

# "generic-nel" is not a built-in transformers task, so
# trust_remote_code=True is needed to run the pipeline code from the model repository.
nel_pipeline = pipeline(
    "generic-nel",
    model=NEL_MODEL_NAME,
    tokenizer=nel_tokenizer,
    trust_remote_code=True,
    device="cpu",
)

for sentence in sentences:
    print(sentence)
    linked_entity = nel_pipeline(sentence)
    print(linked_entity)
```
```
[{'title': 'United Press International', 'qid': 'Q493845', 'url': 'https://en.wikipedia.org/wiki/United_Press_International'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'Joseph Bradley Varnum', 'qid': 'Q1706673', 'url': 'https://en.wikipedia.org/wiki/Joseph_Bradley_Varnum'}, {'title': 'The Press', 'qid': 'Q2413590', 'url': 'https://en.wikipedia.org/wiki/The_Press'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}]
[{'title': 'London', 'qid': 'Q84', 'url': 'https://en.wikipedia.org/wiki/London'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'Lyon', 'qid': 'Q456', 'url': 'https://en.wikipedia.org/wiki/Lyon'}]
[{'title': 'Agence France-Presse', 'qid': 'Q40464', 'url': 'https://en.wikipedia.org/wiki/Agence_France-Presse'}, {'title': 'Agence France-Presse', 'qid': 'Q40464', 'url': 'https://en.wikipedia.org/wiki/Agence_France-Presse'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}]
```
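Two small helpers can make the input and output formats above easier to work with. This is a hedged sketch: `mark_mention` and `first_linked` are hypothetical names, not part of the model repository, and the sketch assumes that the first non-NIL entry in the returned list is the preferred candidate, which this card does not state explicitly.

```python
from typing import Dict, List, Optional

START, END = "[START]", "[END]"


def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in [START] ... [END], as in the example sentences."""
    if not 0 <= start < end <= len(text):
        raise ValueError("mention offsets out of range")
    return f"{text[:start]}{START} {text[start:end]} {END}{text[end:]}"


def first_linked(candidates: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return the first candidate whose QID is not 'NIL', or None if all are NIL."""
    for candidate in candidates:
        if candidate.get("qid") != "NIL":
            return candidate
    return None


sentence = mark_mention("In Lon6on, trotz d3r Zerstörung", 3, 9)
print(sentence)  # In [START] Lon6on [END], trotz d3r Zerstörung

# Candidates abridged from the second output line above (URLs omitted).
candidates = [
    {"title": "London", "qid": "Q84"},
    {"title": "NIL", "qid": "NIL"},
    {"title": "Lyon", "qid": "Q456"},
]
print(first_linked(candidates))  # {'title': 'London', 'qid': 'Q84'}
```

Returning `None` when every candidate is NIL keeps "no link found" distinct from a real entity, so callers can decide how to treat unlinked mentions.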
---
license: agpl-3.0
---