---
license: apache-2.0
tags:
- flair
- token-classification
- sequence-tagger-model
language: es
datasets:
- conll2003
- BSC-LT/NextProcurement-NER-Spanish-UTE-Company-annotated
widget:
- text: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:"
- text: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:"
---
# Recognition of UTEs and company mentions in Flair
This model, trained with [Flair](https://github.com/flairNLP/flair/), recognises mentions of UTEs (Unión Temporal de Empresas, i.e. temporary joint ventures)
and companies in public tenders.
It is a fine-tune of the flair/ner-spanish-large model (retrained from scratch to include additional tags).
It is based on document-level XLM-R embeddings and [FLERT](https://arxiv.org/pdf/2011.06993v1.pdf/).
---
## Demo: How to use in Flair
Requires: **[Flair](https://github.com/flairNLP/flair/)** (`pip install flair`)
```python
from flair.data import Sentence
from flair.models import SequenceTagger
# load tagger
tagger = SequenceTagger.load("BSC-LT/NextProcurement-NER-Spanish-UTE-Company")
# make example sentence
sentence = Sentence("PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRÁULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:")
# predict NER tags
tagger.predict(sentence)
# print sentence
print(sentence)
# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)
```
This yields the following output:
```
Sentence[24]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L., constituidos en UTE PODACESA-ECR realizan la siguiente oferta:" → ["PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L."/UTE, "PODACESA-ECR"/UTE]
The following NER tags are found:
Span[0:14]: "PODACESA OBRAS Y SERVICIOS, S.A, y ECR INFRAESTRUCTURAS Y SERVICIOS HIDRAULICOS S.L." → UTE (0.995)
Span[18:19]: "PODACESA-ECR" → UTE (0.9955)
```
With the sentence "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:", the output is:
```
Sentence[11]: "PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:" → ["PODACESA OBRAS Y SERVICIOS, S.A"/SINGLE_COMPANY]
The following NER tags are found:
Span[0:6]: "PODACESA OBRAS Y SERVICIOS, S.A" → SINGLE_COMPANY (1.0)
```
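For downstream use it can be handier to have the predictions as plain Python data rather than printed spans. The following is a minimal sketch, not part of the original card: `spans_by_label`, `min_score` and `run_demo` are hypothetical names introduced here, and the code assumes a Flair version in which spans expose `get_label()` returning an object with `value` and `score`.

```python
from collections import defaultdict


def spans_by_label(sentence, label_type="ner", min_score=0.0):
    """Group predicted span texts by tag, keeping spans scored >= min_score.

    `sentence` is expected to be a flair.data.Sentence that has already
    been passed through tagger.predict().
    """
    grouped = defaultdict(list)
    for span in sentence.get_spans(label_type):
        label = span.get_label(label_type)
        if label.score >= min_score:
            grouped[label.value].append(span.text)
    return dict(grouped)


def run_demo():
    """Hypothetical helper: loads the tagger (downloads the model) and prints the grouping."""
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load("BSC-LT/NextProcurement-NER-Spanish-UTE-Company")
    sentence = Sentence("PODACESA OBRAS Y SERVICIOS, S.A realiza la siguiente oferta:")
    tagger.predict(sentence)
    print(spans_by_label(sentence, min_score=0.5))
```

Calling `run_demo()` requires Flair to be installed and fetches the model on first use. The `min_score` threshold of 0.5 is an arbitrary illustration, not a recommended value.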
---
## Training: Script to train this model
The following Flair script was used to train this model (**TODO: update**):
```python
import torch
# 1. get the corpus
from flair.datasets import CONLL_03_SPANISH
corpus = CONLL_03_SPANISH()
# 2. what tag do we want to predict?
tag_type = 'ner'
# 3. make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
# 4. initialize fine-tuneable transformer embeddings WITH document context
from flair.embeddings import TransformerWordEmbeddings
embeddings = TransformerWordEmbeddings(
    model='xlm-roberta-large',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=True,
)
# 5. initialize bare-bones sequence tagger (no CRF, no RNN, no reprojection)
from flair.models import SequenceTagger
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)
# 6. initialize trainer with AdamW optimizer
from flair.trainers import ModelTrainer
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
# 7. run training with XLM parameters (20 epochs, small LR)
from torch.optim.lr_scheduler import OneCycleLR
trainer.train('resources/taggers/ner-spanish-large',
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )
```
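The script above loads CoNLL-03 Spanish, so training on UTE and company annotations would need a different corpus. One common route in Flair is a column-format file (one token and one tag per line) read with `ColumnCorpus` and a column map such as `{0: 'text', 1: 'ner'}`. The sketch below shows how span annotations could be turned into such lines. It is an illustration only: the on-disk format of the annotated dataset is not described here, the BIO scheme is an assumption, `to_bio_lines` is a hypothetical helper, and the token list is an assumed tokenisation consistent with the `Sentence[11]` and `Span[0:6]` shown in the demo output.

```python
def to_bio_lines(tokens, spans):
    """Convert token-level spans to 'token tag' lines using BIO tags.

    tokens: list of token strings.
    spans:  list of (start, end_exclusive, label) over token indices,
            assumed not to overlap.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return [f"{token} {tag}" for token, tag in zip(tokens, tags)]


# Assumed tokenisation of the second demo sentence (11 tokens).
tokens = ["PODACESA", "OBRAS", "Y", "SERVICIOS", ",", "S.A",
          "realiza", "la", "siguiente", "oferta", ":"]
# One company mention covering tokens 0..5, as in Span[0:6].
lines = to_bio_lines(tokens, [(0, 6, "SINGLE_COMPANY")])
print("\n".join(lines))
```

In a column-format file, sentences are separated by a blank line, so each sentence's lines would be written out followed by an empty line.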
## Evaluation Results
```
Results:
- F-score (micro) 0.7431
- F-score (macro) 0.7429
- Accuracy 0.5944

By class:
                precision    recall  f1-score   support

           UTE     0.7568    0.7887    0.7724        71
SINGLE_COMPANY     0.6538    0.7846    0.7133        65

     micro avg     0.7039    0.7868    0.7431       136
     macro avg     0.7053    0.7867    0.7429       136
  weighted avg     0.7076    0.7868    0.7442       136
```
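As a reading aid, the per-class F1 values and the macro and weighted averages can be recomputed from the precision, recall and support columns. The sketch below uses the standard definitions (F1 as the harmonic mean of precision and recall, macro as the unweighted mean over classes, weighted as the support-weighted mean). Because it starts from the rounded figures reported above, its results agree with the reported ones only up to rounding in the last digit. The helper name `f1` and the variable names are introduced here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (precision, recall, support) as reported in the table above.
per_class = {
    "UTE": (0.7568, 0.7887, 71),
    "SINGLE_COMPANY": (0.6538, 0.7846, 65),
}

class_f1 = {name: f1(p, r) for name, (p, r, _) in per_class.items()}
total_support = sum(n for _, _, n in per_class.values())

# Macro average: every class counts equally.
macro_f1 = sum(class_f1.values()) / len(class_f1)
# Weighted average: classes count in proportion to their support.
weighted_f1 = sum(class_f1[name] * n for name, (_, _, n) in per_class.items()) / total_support

for name, value in class_f1.items():
    print(f"{name}: F1 = {value:.4f}")
print(f"macro F1 = {macro_f1:.4f}, weighted F1 = {weighted_f1:.4f}")
```

The micro average cannot be recovered exactly this way, since it is computed from pooled true-positive, false-positive and false-negative counts rather than from the per-class ratios.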
## Additional information
### Author
The Language Technologies Unit from Barcelona Supercomputing Center.
### Contact
For further information, please send an email to <langtech@bsc.es>.
### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.
### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
### Funding
This work has been promoted and financed by the European Commission Health and Digital Executive Agency, Connecting Europe Facility,
Grant Agreement Nº INEA/CEF/ICT/A2020/2373713,
Action Title Open Harmonized and Enriched Procurement Data Platform (nextProcurement),
Action number 2020-ES-IA-0255.
### Disclaimer
<details>
<summary>Click to expand</summary>
The model published in this repository is intended for general-purpose use and is available to third parties under the permissive Apache License, Version 2.0.
Be aware that the model may have biases and/or any other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it)
or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and,
in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
be liable for any results arising from the use made by third parties.
</details>