metadata

language:
  - multilingual
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bm
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - ha
  - he
  - hi
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - ja
  - jv
  - ka
  - kg
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lg
  - ln
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - qu
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - ss
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - ti
  - tl
  - tn
  - tr
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yo
  - zh
tags:
  - retrieval
  - entity-retrieval
  - named-entity-disambiguation
  - entity-disambiguation
  - named-entity-linking
  - entity-linking
  - text2text-generation

mGENRE

This historical multilingual named entity linking (NEL) model is based on the mGENRE (multilingual Generative ENtity REtrieval) system presented in Multilingual Autoregressive Entity Linking. mGENRE takes a sequence-to-sequence approach to entity retrieval (e.g., linking), built on a fine-tuned mBART architecture. It performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search so that only valid identifiers can be generated.
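To make the idea of constrained generation more concrete, here is a minimal sketch of a prefix tree (trie) over token-id sequences, which is one common way to restrict a decoder to valid identifiers. The class name `TokenTrie`, the toy token ids, and the `EOS` value are all hypothetical and chosen for illustration; the source does not say how this model implements its constraint.

```python
from typing import Dict, Iterable, List


class TokenTrie:
    """Prefix tree over token-id sequences of valid entity names (illustrative only)."""

    def __init__(self, sequences: Iterable[List[int]] = ()):
        self.root: Dict[int, dict] = {}
        for seq in sequences:
            self.add(seq)

    def add(self, seq: List[int]) -> None:
        # Walk down the tree, creating child nodes as needed.
        node = self.root
        for tok in seq:
            node = node.setdefault(tok, {})

    def allowed_next(self, prefix: List[int]) -> List[int]:
        # Return the token ids that can follow `prefix` and still
        # lead to a valid name; an unknown prefix allows nothing.
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)


# Toy data (assumption): token ids are made up, 0 stands in for end-of-sequence.
EOS = 0
valid_names = [
    [5, 7, EOS],
    [5, 8, 9, EOS],
    [6, 2, EOS],
]
trie = TokenTrie(valid_names)

print(trie.allowed_next([]))      # tokens that may start a name
print(trie.allowed_next([5]))     # continuations after token 5
print(trie.allowed_next([5, 7]))  # only EOS remains
```

In the `transformers` library, `generate` accepts a `prefix_allowed_tokens_fn` callback that receives the batch index and the tokens generated so far and returns the allowed next token ids, so a lookup such as `allowed_next` could be wired in there. Whether this model's pipeline does so is an assumption, not something the model card states.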

This model was fine-tuned on the HIPE-2022 dataset, which is composed of the following datasets:

Dataset alias	README	Document type	Languages	Suitable for	Project
ajmc	link	classical commentaries	de, fr, en	NERC-Coarse, NERC-Fine, EL	AjMC
hipe2020	link	historical newspapers	de, fr, en	NERC-Coarse, NERC-Fine, EL	CLEF-HIPE-2020
topres19th	link	historical newspapers	en	NERC-Coarse, EL	Living with Machines
newseye	link	historical newspapers	de, fi, fr, sv	NERC-Coarse, NERC-Fine, EL	NewsEye
sonar	link	historical newspapers	de	NERC-Coarse, EL	SoNAR

BibTeX entry and citation info

Usage

Here is an example of generation for Wikipedia page disambiguation with simulated OCR noise:

from transformers import AutoTokenizer, pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"

# Load the tokenizer for the pre-trained model hosted at
# https://huggingface.co/impresso-project/nel-mgenre-multilingual
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

# Each sentence marks the mention to link with [START] ... [END];
# the digits and odd characters simulate OCR noise.
sentences = ["[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
             "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
             "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."]

# "generic-nel" is a custom pipeline, so trust_remote_code=True is needed
# to load the pipeline code from the model repository.
nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME,
                        tokenizer=nel_tokenizer,
                        trust_remote_code=True,
                        device='cpu')

# Link the marked mention in each sentence and print the result
for sentence in sentences:
    print(sentence)
    linked_entity = nel_pipeline(sentence)
    print(linked_entity)

[{'surface': 'Un1ted Press', 'wkd_id': 'Q493845', 'wkpedia_pagename': 'United Press International', 'wkpedia_url': 'https://en.wikipedia.org/wiki/United_Press_International', 'type': 'UNK', 'confidence_nel': 55.89, 'lOffset': 7, 'rOffset': 21}]
[{'surface': 'Lon6on', 'wkd_id': 'Q84', 'wkpedia_pagename': 'London', 'wkpedia_url': 'https://de.wikipedia.org/wiki/London', 'type': 'UNK', 'confidence_nel': 99.99, 'lOffset': 10, 'rOffset': 18}]
[{'surface': 'AFP', 'wkd_id': 'Q40464', 'wkpedia_pagename': 'Agence France-Presse', 'wkpedia_url': 'https://fr.wikipedia.org/wiki/Agence_France-Presse', 'type': 'UNK', 'confidence_nel': 100.0, 'lOffset': 45, 'rOffset': 50}]
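The results above are plain lists of dictionaries, so they can be post-processed with ordinary Python. The sketch below keeps only links above a confidence threshold and builds a Wikidata URL from `wkd_id`. The helper names `keep_confident` and `wikidata_url` and the threshold of 90 are hypothetical choices for illustration; the sample data is abridged from the output shown above.

```python
from typing import Dict, List


def keep_confident(entities: List[Dict], min_confidence: float = 90.0) -> List[Dict]:
    """Keep entities whose 'confidence_nel' is at least `min_confidence`."""
    return [e for e in entities if e["confidence_nel"] >= min_confidence]


def wikidata_url(entity: Dict) -> str:
    """Build the Wikidata page URL from the entity's 'wkd_id' (a QID)."""
    return f"https://www.wikidata.org/wiki/{entity['wkd_id']}"


# Abridged copies of the three results printed above.
results = [
    {"surface": "Un1ted Press", "wkd_id": "Q493845", "confidence_nel": 55.89},
    {"surface": "Lon6on", "wkd_id": "Q84", "confidence_nel": 99.99},
    {"surface": "AFP", "wkd_id": "Q40464", "confidence_nel": 100.0},
]

# Print surface form, confidence, and Wikidata URL for the confident links.
for entity in keep_confident(results):
    print(entity["surface"], entity["confidence_nel"], wikidata_url(entity))
```

A suitable threshold depends on the downstream use; the value here is only a placeholder.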


license: agpl-3.0