	---
	license: apache-2.0
	tags:
	- flair
	- token-classification
	- sequence-tagger-model
	language: es
	datasets:
	- conll2003
	- BSC-LT/NextProcurement-NER-Spanish-UTE-Company-annotated
	widget:
	- text: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:"
	- text: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:"
	---

	# Recognition of UTEs and company mentions in Flair

	This is a model trained using [Flair](https://github.com/flairNLP/flair/) to recognise mentions of UTEs (Unión Temporal de Empresas)
	and companies in public tenders.

	It is a fine-tune of the `flair/ner-spanish-large` model (retrained from scratch to include additional tags).

	The model is based on document-level XLM-R embeddings and [FLERT](https://arxiv.org/pdf/2011.06993v1.pdf/).
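
To illustrate what "document-level" means here, the sketch below pads a sentence with tokens from its neighbouring sentences before embedding, which is the core idea behind FLERT. It is a simplified illustration, not Flair's implementation: the helper name `add_context`, the whole-word token windows (FLERT counts subtokens), and the example document are all assumptions made for this example.

```python
from typing import List


def add_context(sentences: List[List[str]], index: int, context_size: int = 64) -> List[str]:
    """Return sentences[index] surrounded by up to `context_size` tokens on each side.

    Hypothetical helper: only the centre sentence would be tagged; the
    surrounding tokens only inform the embeddings of the centre tokens.
    """
    # Collect left context, walking backwards through earlier sentences.
    left: List[str] = []
    for sent in reversed(sentences[:index]):
        if len(left) >= context_size:
            break
        left = sent + left
    left = left[max(0, len(left) - context_size):]

    # Collect right context, walking forwards through later sentences.
    right: List[str] = []
    for sent in sentences[index + 1:]:
        if len(right) >= context_size:
            break
        right = right + sent
    right = right[:context_size]

    return left + sentences[index] + right


# Made-up, pre-tokenized three-sentence document (illustration only).
document = [
    ["PODACESA", "y", "ECR", "forman", "una", "UTE", "."],
    ["La", "UTE", "PODACESA-ECR", "realiza", "la", "siguiente", "oferta", ":"],
    ["La", "oferta", "es", "firme", "."],
]

windowed = add_context(document, index=1, context_size=3)
print(windowed)
```

In the training script further down, this behaviour is requested through `use_context=True` on the embeddings, so no manual windowing is needed when using Flair.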


	## Demo: How to use in Flair

	Requires: [Flair](https://github.com/flairNLP/flair/) (`pip install flair`)

	```python
	from flair.data import Sentence
	from flair.models import SequenceTagger
	# load tagger
	tagger = SequenceTagger.load("BSC-LT/NextProcurement-NER-Spanish-UTE-Company")
	# make example sentence
	sentence = Sentence("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:")
	# predict NER tags
	tagger.predict(sentence)
	# print sentence
	print(sentence)
	# print predicted NER spans
	print('The following NER tags are found:')
	# iterate over entities and print
	for entity in sentence.get_spans('ner'):
	    print(entity)
	```

	This yields the following output:
	```
	Sentence[24]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:" _ ["PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L."/UTE, "PODACESA-ECR"/UTE]
	The following NER tags are found:
	Span[0:14]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L." _ UTE (0.995)
	Span[18:19]: "PODACESA-ECR" _ UTE (0.9955)
	```

	For the sentence "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:", the output is:
	```
	Sentence[11]: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:" _ ["PODACESA OBRAS Y SERVICIOS, S.A"/SINGLE_COMPANY]
	The following NER tags are found:
	Span[0:6]: "PODACESA OBRAS Y SERVICIOS, S.A" _ SINGLE_COMPANY (1.0)
	```
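
For downstream processing it is often handier to collect the predicted mentions per tag than to print them. The following is a minimal sketch of that step. To keep it runnable without downloading the model, it uses a hypothetical stand-in class, `FakeSpan`, filled with the values from the two outputs above; the helper `group_mentions` and the `min_score` threshold are likewise assumptions for this example. With Flair itself, the result of `sentence.get_spans('ner')` could presumably be passed instead, assuming the spans expose `text`, `tag` and `score` attributes as they do in recent Flair releases.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a predicted span: surface text, tag, confidence."""
    text: str
    tag: str
    score: float


def group_mentions(spans: Iterable[FakeSpan], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group span texts by tag, keeping only spans scored at or above `min_score`."""
    grouped: Dict[str, List[str]] = {}
    for span in spans:
        if span.score >= min_score:
            grouped.setdefault(span.tag, []).append(span.text)
    return grouped


# Values copied from the two example outputs above (two separate sentences).
spans = [
    FakeSpan(
        "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L.",
        "UTE",
        0.995,
    ),
    FakeSpan("PODACESA-ECR", "UTE", 0.9955),
    FakeSpan("PODACESA OBRAS Y SERVICIOS, S.A", "SINGLE_COMPANY", 1.0),
]

mentions = group_mentions(spans, min_score=0.9)
for tag, texts in mentions.items():
    print(tag, texts)
```

The threshold is a design choice rather than a recommendation: raising it trades recall for precision, and a sensible value depends on the documents being processed.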


	## Training: Script to train this model

	The following Flair script was used to train this model (TODO: update):

	```python
	import torch
	# 1. get the corpus
	from flair.datasets import CONLL_03_SPANISH
	corpus = CONLL_03_SPANISH()
	# 2. what tag do we want to predict?
	tag_type = 'ner'
	# 3. make the tag dictionary from the corpus
	tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
	# 4. initialize fine-tuneable transformer embeddings WITH document context
	from flair.embeddings import TransformerWordEmbeddings
	embeddings = TransformerWordEmbeddings(
	    model='xlm-roberta-large',
	    layers="-1",
	    subtoken_pooling="first",
	    fine_tune=True,
	    use_context=True,
	)
	# 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
	from flair.models import SequenceTagger
	tagger = SequenceTagger(
	    hidden_size=256,
	    embeddings=embeddings,
	    tag_dictionary=tag_dictionary,
	    tag_type='ner',
	    use_crf=False,
	    use_rnn=False,
	    reproject_embeddings=False,
	)
	# 6. initialize trainer with AdamW optimizer
	from flair.trainers import ModelTrainer
	trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
	# 7. run training with XLM parameters (20 epochs, small LR)
	from torch.optim.lr_scheduler import OneCycleLR
	trainer.train(
	    'resources/taggers/ner-spanish-large',
	    learning_rate=5.0e-6,
	    mini_batch_size=4,
	    mini_batch_chunk_size=1,
	    max_epochs=20,
	    scheduler=OneCycleLR,
	    embeddings_storage_mode='none',
	    weight_decay=0.,
	)
	```

	## Evaluation Results

	```
	Results:
	- F-score (micro) 0.7431
	- F-score (macro) 0.7429
	- Accuracy 0.5944

	By class:
	                 precision    recall  f1-score   support

	           UTE      0.7568    0.7887    0.7724        71
	SINGLE_COMPANY      0.6538    0.7846    0.7133        65

	     micro avg      0.7039    0.7868    0.7431       136
	     macro avg      0.7053    0.7867    0.7429       136
	  weighted avg      0.7076    0.7868    0.7442       136
	```
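
As a reading aid for the report above, the sketch below recomputes the per-class, macro and weighted F1 scores from the reported precision, recall and support. The variable names are chosen for this example. Because the inputs are already rounded to four decimals, the recomputed values should agree with the report only to about three decimal places.

```python
from typing import Dict, Tuple

# (precision, recall, support) per class, copied from the report above.
per_class: Dict[str, Tuple[float, float, int]] = {
    "UTE": (0.7568, 0.7887, 71),
    "SINGLE_COMPANY": (0.6538, 0.7846, 65),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Per-class F1 from each class's own precision and recall.
class_f1 = {label: f1(p, r) for label, (p, r, _) in per_class.items()}

# Macro F1: unweighted mean over classes.
macro_f1 = sum(class_f1.values()) / len(class_f1)

# Weighted F1: mean over classes weighted by support.
total_support = sum(support for _, _, support in per_class.values())
weighted_f1 = sum(
    class_f1[label] * support for label, (_, _, support) in per_class.items()
) / total_support

for label, score in class_f1.items():
    print(f"{label}: {score:.3f}")
print(f"macro F1: {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

The macro average treats both classes equally, while the weighted average leans slightly towards `UTE` because it has more test instances (71 versus 65).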

	## Additional information

	### Author
	The Language Technologies Unit from Barcelona Supercomputing Center.

	### Contact
	For further information, please send an email to <langtech@bsc.es>.

	### Copyright
	Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

	### License
	[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

	### Funding
	This work has been promoted and financed by the European Commission Health and Digital Executive Agency, Connecting Europe Facility, Grant Agreement Nº INEA/CEF/ICT/A2020/2373713, Action Title Open Harmonized and Enriched Procurement Data Platform (nextProcurement), Action number 2020-ES-IA-0255.

	### Disclaimer
	<details>
	<summary>Click to expand</summary>

	The model published in this repository is intended for general-purpose use and is available to third parties under a permissive Apache License, Version 2.0.

	Be aware that the model may have biases and/or any other undesirable distortions.

	When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
	or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
	in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

	In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
	be liable for any results arising from the use made by third parties.

	</details>