license: apache-2.0
tags:
- flair
- token-classification
- sequence-tagger-model
language: es
datasets:
- conll2003
- BSC-LT/NextProcurement-NER-Spanish-UTE-Company-annotated
widget:
- text: >-
    PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS
    HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente
    oferta:
- text: 'PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:'
Recognition of UTEs and company mentions in Flair
This is a model trained using Flair to recognise mentions of UTEs (Unión Temporal de Empresas) and companies in public tenders.
It is a fine-tuned version of the flair/ner-spanish-large model (retrained from scratch to include additional tags) and is based on document-level XLM-R embeddings and FLERT.
Demo: How to use in Flair
Requires: Flair (pip install flair)
from flair.data import Sentence
from flair.models import SequenceTagger
# load tagger
tagger = SequenceTagger.load("BSC-LT/NextProcurement-NER-Spanish-UTE-Company")
# make example sentence
sentence = Sentence("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:")
# predict NER tags
tagger.predict(sentence)
# print sentence
print(sentence)
# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
This yields the following output:
Sentence[24]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:" _ ["PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L."/UTE, "PODACESA-ECR"/UTE]
The following NER tags are found:
Span[0:14]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L." _ UTE (0.995)
Span[18:19]: "PODACESA-ECR" _ UTE (0.9955)
With the sentence "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:" as input, the output is:
Sentence[11]: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:" _ ["PODACESA OBRAS Y SERVICIOS, S.A"/SINGLE_COMPANY]
The following NER tags are found:
Span[0:6]: "PODACESA OBRAS Y SERVICIOS, S.A" _ SINGLE_COMPANY (1.0)
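For downstream use it is often handier to have the predictions grouped by tag than printed span by span. The following is a minimal sketch, not part of the original model card: `group_mentions` and `min_score` are hypothetical names, and the triples are copied by hand from the example outputs above so that the snippet runs without downloading the model.

```python
from collections import defaultdict


def group_mentions(spans, min_score=0.0):
    """Group (text, label, score) triples by label.

    Triples whose score is below min_score are dropped.
    """
    grouped = defaultdict(list)
    for text, label, score in spans:
        if score >= min_score:
            grouped[label].append(text)
    return dict(grouped)


# Triples copied from the example outputs above (illustrative data only).
# With a live tagger they could be built, assuming Flair's Span/Label API, as:
#   [(s.text, s.get_label("ner").value, s.get_label("ner").score)
#    for s in sentence.get_spans("ner")]
predicted = [
    ("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L.", "UTE", 0.995),
    ("PODACESA-ECR", "UTE", 0.9955),
    ("PODACESA OBRAS Y SERVICIOS, S.A", "SINGLE_COMPANY", 1.0),
]

for label, mentions in group_mentions(predicted).items():
    print(label, mentions)
```

Raising `min_score` keeps only the more confident mentions; a suitable threshold depends on the application and is not something the model card prescribes.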
Training: Script to train this model
The following Flair script was used to train this model (TODO: update):
import torch
# 1. get the corpus
from flair.datasets import CONLL_03_SPANISH
corpus = CONLL_03_SPANISH()
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# 4. initialize fine-tuneable transformer embeddings WITH document context
from flair.embeddings import TransformerWordEmbeddings
embeddings = TransformerWordEmbeddings(
model='xlm-roberta-large',
layers="-1",
subtoken_pooling="first",
fine_tune=True,
use_context=True,
)
# 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
from flair.models import SequenceTagger
tagger = SequenceTagger(
hidden_size=256,
embeddings=embeddings,
tag_dictionary=tag_dictionary,
tag_type='ner',
use_crf=False,
use_rnn=False,
reproject_embeddings=False,
)
# 6. initialize trainer with AdamW optimizer
from flair.trainers import ModelTrainer
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
# 7. run training with XLM parameters (20 epochs, small LR)
from torch.optim.lr_scheduler import OneCycleLR
trainer.train('resources/taggers/ner-spanish-large',
learning_rate=5.0e-6,
mini_batch_size=4,
mini_batch_chunk_size=1,
max_epochs=20,
scheduler=OneCycleLR,
embeddings_storage_mode='none',
weight_decay=0.,
)
Evaluation Results
Results:
- F-score (micro) 0.7431
- F-score (macro) 0.7429
- Accuracy 0.5944
By class:
                precision    recall  f1-score   support
UTE                0.7568    0.7887    0.7724        71
SINGLE_COMPANY     0.6538    0.7846    0.7133        65
micro avg          0.7039    0.7868    0.7431       136
macro avg          0.7053    0.7867    0.7429       136
weighted avg       0.7076    0.7868    0.7442       136
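As a sanity check on how the rows of this report relate to each other, the sketch below recomputes the F1 scores from the reported precision and recall, using F1 = 2PR / (P + R), and then derives the macro and support-weighted averages. It is an illustration only: the inputs are the rounded figures from the table, so the recomputed values should agree with the reported ones to about three decimal places rather than exactly. The names `f1`, `report` and `reported_f1` are introduced here for the example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class figures copied from the report above (already rounded to 4 decimals).
report = {
    "UTE": {"precision": 0.7568, "recall": 0.7887, "support": 71},
    "SINGLE_COMPANY": {"precision": 0.6538, "recall": 0.7846, "support": 65},
}
reported_f1 = {"UTE": 0.7724, "SINGLE_COMPANY": 0.7133}

per_class_f1 = {label: f1(row["precision"], row["recall"]) for label, row in report.items()}

# Macro average: unweighted mean over classes.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: mean over classes weighted by support.
total_support = sum(row["support"] for row in report.values())
weighted_f1 = sum(per_class_f1[label] * report[label]["support"] for label in report) / total_support

# Micro average: F1 of the pooled (micro) precision and recall from the table.
micro_f1 = f1(0.7039, 0.7868)

for label, value in per_class_f1.items():
    print(f"{label}: recomputed {value:.4f}, reported {reported_f1[label]:.4f}")
print(f"macro {macro_f1:.4f}, weighted {weighted_f1:.4f}, micro {micro_f1:.4f}")
```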
Additional information
Author
The Language Technologies Unit from Barcelona Supercomputing Center.
Contact
For further information, please send an email to langtech@bsc.es.
Copyright
Copyright (c) 2023 by the Language Technologies Unit, Barcelona Supercomputing Center.
License
Apache License, Version 2.0
Funding
This work has been promoted and financed by the European Commission Health and Digital Executive Agency, Connecting Europe Facility,
Grant Agreement Nº INEA/CEF/ICT/A2020/2373713,
Action Title Open Harmonized and Enriched Procurement Data Platform (nextProcurement),
Action number 2020-ES-IA-0255.
Disclaimer
The model published in this repository is intended for general-purpose use and is available to third parties under the permissive Apache License, Version 2.0.
Be aware that the model may have biases and/or any other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
In no event shall the owner and creator of the model (Barcelona Supercomputing Center) be liable for any results arising from the use made by third parties.