language:
  - multilingual
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bm
  - bn
  - br
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fa
  - ff
  - fi
  - fr
  - fy
  - ga
  - gd
  - gl
  - gn
  - gu
  - ha
  - he
  - hi
  - hr
  - ht
  - hu
  - hy
  - id
  - ig
  - is
  - it
  - ja
  - jv
  - ka
  - kg
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - la
  - lg
  - ln
  - lo
  - lt
  - lv
  - mg
  - mk
  - ml
  - mn
  - mr
  - ms
  - my
  - ne
  - nl
  - 'no'
  - om
  - or
  - pa
  - pl
  - ps
  - pt
  - qu
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sq
  - sr
  - ss
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - ti
  - tl
  - tn
  - tr
  - uk
  - ur
  - uz
  - vi
  - wo
  - xh
  - yo
  - zh
tags:
  - retrieval
  - entity-retrieval
  - named-entity-disambiguation
  - entity-disambiguation
  - named-entity-linking
  - entity-linking
  - text2text-generation

mGENRE

This historical multilingual named entity linking (NEL) model is based on the mGENRE (multilingual Generative ENtity REtrieval) system presented in Multilingual Autoregressive Entity Linking. mGENRE takes a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned mBART architecture. It performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search so that only valid identifiers are generated.
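The idea of constraining generation to valid identifiers can be sketched with a prefix tree (trie) over the token sequences of known entity names. The sketch below is illustrative only: the `TokenTrie` class, the toy vocabulary, and the token ids are assumptions made for this example and are not taken from the mGENRE code base or the mBART vocabulary.

```python
from typing import Dict, Iterable, List


class TokenTrie:
    """Prefix tree over token-id sequences of valid entity identifiers."""

    def __init__(self, sequences: Iterable[List[int]] = ()):
        self.root: Dict[int, dict] = {}
        for sequence in sequences:
            self.add(sequence)

    def add(self, sequence: List[int]) -> None:
        """Insert one token-id sequence into the trie."""
        node = self.root
        for token in sequence:
            node = node.setdefault(token, {})

    def allowed_next(self, prefix: List[int]) -> List[int]:
        """Return the token ids that may follow `prefix`.

        The result is empty when the prefix is not in the trie
        or when the identifier is already complete.
        """
        node = self.root
        for token in prefix:
            if token not in node:
                return []
            node = node[token]
        return sorted(node)


# Toy vocabulary with hypothetical ids (NOT the real mBART vocabulary).
EOS = 2
vocab = {"London": 10, "Lyon": 11, "derry": 12, ">>": 13, "en": 14, "de": 15, "fr": 16}


def encode(tokens: List[str]) -> List[int]:
    """Map toy tokens to ids and append the end-of-sequence id."""
    return [vocab[token] for token in tokens] + [EOS]


trie = TokenTrie([
    encode(["London", ">>", "en"]),
    encode(["London", ">>", "de"]),
    encode(["London", "derry", ">>", "en"]),
    encode(["Lyon", ">>", "fr"]),
])

print(trie.allowed_next([]))          # tokens that can start an identifier
print(trie.allowed_next([10]))        # continuations after "London"
print(trie.allowed_next([10, 13]))    # language codes after "London >>"
```

At each decoding step, beam search would only be allowed to pick tokens returned by `allowed_next` for the tokens generated so far. In `transformers`, `generate` accepts a `prefix_allowed_tokens_fn(batch_id, input_ids)` callable for this purpose; wiring a trie into it would additionally require handling the decoder start tokens, which this sketch leaves out.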

This model was fine-tuned on the HIPE-2022 dataset, which is composed of the following datasets:

| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---|---|---|---|---|---|---|
| ajmc | link | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | AjMC | CC BY 4.0 |
| hipe2020 | link | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | CLEF-HIPE-2020 | CC BY-NC-SA 4.0 |
| topres19th | link | historical newspapers | en | NERC-Coarse, EL | Living with Machines | CC BY-NC-SA 4.0 |
| newseye | link | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | NewsEye | CC BY 4.0 |
| sonar | link | historical newspapers | de | NERC-Coarse, EL | SoNAR | CC BY 4.0 |

BibTeX entry and citation info

Usage

Here is an example of generation for Wikipedia page disambiguation:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the fine-tuned model; eval() disables dropout for inference.
tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-hipe-multilingual")
model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-hipe-multilingual").eval()

# The mention to link is wrapped in [START] ... [END] markers.
sentences = [
    "[START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids.",
    "In [START] London [END], trotz der Zerstörung, ist der Geist der Menschen ungebrochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
    "Les rapports des correspondants de la [START] AFP [END] mettent en lumière la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire.",
]

for sentence in sentences:
    # Beam search with 5 beams, returning all 5 hypotheses for the sentence.
    outputs = model.generate(
        **tokenizer([sentence], return_tensors="pt"),
        num_beams=5,
        num_return_sequences=5,
    )

    # Decode the generated ids into strings of the form "title >> language".
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

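This prints the top-5 predictions for each sentence, each in the form `title >> language`. Note that the call above uses standard beam search; no decoding constraint is passed to `generate`: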

['United Press International >> en ', 'The United Press International >> en ', 'United Press International >> de ', 'United Press >> en ', 'Associated Press >> en ']
['London >> de ', 'London >> de ', 'London >> de ', 'Stadt London >> de ', 'Londonderry >> de ']
['Agence France-Presse >> fr ', 'Agence France-Presse >> fr ', 'Agence France-Presse de la Presse écrite >> fr ', 'Agence France-Presse de la porte de Vincennes >> fr ', 'Agence France-Presse de la porte océanique >> fr ']
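Each prediction is a string of the form `title >> language`, and the same candidate can appear more than once in a beam list, as in the output above. A small post-processing step can turn these strings into structured pairs. The helper names `parse_prediction` and `unique_predictions` below are hypothetical and not part of the model or of `transformers`; this is a minimal sketch of one way to do it.

```python
from typing import List, Optional, Tuple


def parse_prediction(text: str) -> Optional[Tuple[str, str]]:
    """Split 'Title >> lang ' into (title, lang).

    Returns None when the '>>' separator or either part is missing.
    """
    title, separator, language = text.rpartition(">>")
    if not separator:
        return None
    title, language = title.strip(), language.strip()
    if not title or not language:
        return None
    return title, language


def unique_predictions(decoded: List[str]) -> List[Tuple[str, str]]:
    """Parse decoded beams and drop duplicates while keeping beam order."""
    seen = set()
    result = []
    for text in decoded:
        parsed = parse_prediction(text)
        if parsed is not None and parsed not in seen:
            seen.add(parsed)
            result.append(parsed)
    return result


# Decoded beams copied from the second example sentence above.
decoded = [
    "London >> de ",
    "London >> de ",
    "London >> de ",
    "Stadt London >> de ",
    "Londonderry >> de ",
]
print(unique_predictions(decoded))
# [('London', 'de'), ('Stadt London', 'de'), ('Londonderry', 'de')]
```

Splitting on the last `>>` keeps titles intact even if a title itself were to contain that character sequence.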

Example with simulated OCR noise:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-hipe-multilingual")
model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-hipe-multilingual").eval()

sentences = ["[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
             "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
             "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."]

for sentence in sentences:
    outputs = model.generate(
        **tokenizer([sentence], return_tensors="pt"),
        num_beams=5,
        num_return_sequences=5
    )
    
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

which outputs the following top-5 predictions:

['United Press International >> en ', 'Un1ted Press >> en ', 'Joseph Bradley Varnum >> en ', 'The Press >> en ', 'The Unused Press >> en ']
['London >> de ', 'Longbourne >> de ', 'Longbon >> de ', 'Longston >> de ', 'Lyon >> de ']
['Agence France-Presse >> fr ', 'Agence France-Presse >> fr ', 'Agence France-Presse de la Presse écrite >> fr ', 'Agence France-Presse de la porte de Vincennes >> fr ', 'Agence France-Presse de la porte océanique >> fr ']
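In both examples the mention to be linked is wrapped in `[START]` and `[END]` markers, separated from the mention by single spaces. When mentions come from an upstream NER step as character offsets, the input can be built programmatically. The function `mark_mention` below is a hypothetical helper written for this illustration, assuming offsets follow Python slice conventions (start inclusive, end exclusive).

```python
def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in the [START] ... [END] markers used in the examples."""
    if not 0 <= start < end <= len(text):
        raise ValueError(f"invalid mention span ({start}, {end}) for text of length {len(text)}")
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"


sentence = "In London, trotz der Zerstörung, ist der Geist der Menschen ungebrochen."
start = sentence.index("London")
marked = mark_mention(sentence, start, start + len("London"))
print(marked)
# In [START] London [END], trotz der Zerstörung, ist der Geist der Menschen ungebrochen.
```

The resulting string can be passed to the tokenizer in the same way as the sentences in the examples above.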

license: agpl-3.0