---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: token-classification
tags:
- NER
- token classification
- information extraction
- question answering
---

**UTC-T5-large** - universal token classifier

***🚀 Meet the first prompt-tuned universal token classification model 🚀***

This is a model based on [flan-T5-large](https://huggingface.co/google/flan-t5-large) that was trained on multiple token classification tasks, as well as tasks that can be reformulated as token classification. Such multi-task fine-tuning enables better generalization: even small models can be used for zero-shot named entity recognition and demonstrate good performance on reading comprehension tasks.

The model can be used for the following tasks:
* Named entity recognition (NER);
* Question answering;
* Relation extraction;
* Coreference resolution;
* Text cleaning;
* Summarization.

#### How to use

We recommend using the model with the transformers `ner` pipeline:

```python
from typing import Optional, Tuple, Union

import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoTokenizer, T5Config, T5EncoderModel, T5PreTrainedModel, pipeline
from transformers.modeling_outputs import TokenClassifierOutput


class T5EncoderForTokenClassification(T5PreTrainedModel):
    _tied_weights_keys = ["encoder.embed_tokens.weight"]

    def __init__(self, config: T5Config):
        super().__init__(config)
        self.transformer = T5EncoderModel(config)
        self.classification_head = torch.nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()

        self.model_parallel = False

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, TokenClassifierOutput]:
        r"""
        Encodes the input with the T5 encoder and classifies every token.
        Returns a `TokenClassifierOutput`, or a plain tuple when `return_dict=False`.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.transformer(
            input_ids,
            attention_mask=attention_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]
        logits = self.classification_head(sequence_output)

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.config.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )


def process(text, prompt, threshold=0.5):
    """
    Processes text by prepending the prompt and adjusting span indices back to the original text.
    Args:
        text (str): The text to process.
        prompt (str): The prompt to prepend to the text.
        threshold (float): Minimum prediction score required to keep a span.

    Returns:
        list: A list of dicts with adjusted spans and scores.
    """
    # Concatenate prompt and text into the full input
    input_ = f"{prompt}\n{text}"
    results = nlp(input_)  # Run the NER pipeline on the full input
    processed_results = []
    prompt_length = len(prompt)  # Used to shift span indices back into `text`
    for result in results:
        # Skip predictions whose score is below the threshold
        if result['score'] < threshold:
            continue
        # Shift indices past the prompt and the "\n" separator
        start = result['start'] - prompt_length - 1
        end = result['end'] - prompt_length - 1
        # Drop spans that fall inside the prompt itself
        if start < 0:
            continue
        processed_results.append(
            {'span': text[start:end], 'start': start, 'end': end, 'score': result['score']}
        )
    return processed_results
```
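The snippet above never creates the `nlp` pipeline that `process` calls. Below is a minimal sketch of the missing wiring together with a zero-shot NER call; the repo id `knowledgator/UTC-T5-large` and the exact prompt wording are illustrative assumptions, not confirmed by this card:

```python
# NOTE: the repo id below is assumed from this card's title; adjust as needed
repo_id = "knowledgator/UTC-T5-large"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = T5EncoderForTokenClassification.from_pretrained(repo_id)

# "ner" pipeline with sub-token aggregation, as recommended above
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

# Zero-shot NER: name the entity classes to extract in the prompt
prompt = """Identify the following entity classes in the text:
person, company
Text:
"""
text = "Dr. Paul Hammond, a renowned scientist, works at Google."
print(process(text, prompt))
```

Every task in the list above follows this pattern; only the prompt changes while `process` handles thresholding and index adjustment.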
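Question answering fits the same mold: the question acts as the prompt and the model marks the answer span inside the text. Again a hedged sketch, with example strings that are purely illustrative:

```python
# The question itself serves as the prompt; the answer span is
# extracted from `text` by the same process() helper
question = "Who was the CEO of Microsoft in 2020?"
text = "In 2020, Microsoft was led by CEO Satya Nadella."
print(process(text, question))
```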