license: apache-2.0
language:
- en
- gu
- mr
- hi
Model Card for Model ID
Model Details
Part-of-speech tagging (POS tagging, or POST) is the task of labelling each word in a phrase with its appropriate POS tag. POS tagging algorithms are broadly either rule-based or stochastic, and from a modelling standpoint taggers are either monolingual or multilingual. POS tags provide grammatical context for a sentence, which can be used in NLP tasks such as NER, NLU and question-answering systems. Researchers have already proposed many approaches, tag sets and models for this task, including Weightless Artificial Neural Networks (WANN), several forms of CRF, Bi-LSTM CRF and transformers, as well as techniques that combine language tags with POS tags to handle mixed-language text. This work has improved or established benchmarks for both popular and low-resource languages, in monolingual and multilingual settings. With this model we aim for state-of-the-art POS tagging of Indian languages in both their native scripts and their Romanised forms.
Model Description
The model has been trained on English, Hindi, Gujarati and Marathi, as well as on the Romanised forms of the three Indian languages, i.e. (en, gu, mr, hi, gu_romanised, mr_romanised, hi_romanised).
To use this model, you have to import this class:
from transformers import BertPreTrainedModel, BertModel
from transformers.modeling_outputs import TokenClassifierOutput
from torch import nn
from torch.nn import CrossEntropyLoss
import torch
from torchcrf import CRF  # provided by the pytorch-crf package
from transformers import BertTokenizerFast, Trainer, TrainingArguments
from transformers.trainer_utils import IntervalStrategy
class BertCRF(BertPreTrainedModel):
    """BERT encoder with a linear tag classifier and a CRF layer on top."""

    _keys_to_ignore_on_load_unexpected = [r"pooler"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        # The pooled [CLS] output is not needed for token-level tagging
        self.bert = BertModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(num_tags=config.num_labels, batch_first=True)
        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Labels for computing the token classification loss. Indices should be in ``[0, ..., config.num_labels -
            1]``.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]
        sequence_output = self.dropout(sequence_output)
        # Per-token emission scores, shape (batch_size, sequence_length, num_labels)
        logits = self.classifier(sequence_output)

        loss = None
        if labels is not None:
            # The CRF returns the log-likelihood of the gold tags; negate it to get a loss
            log_likelihood = self.crf(logits, labels)
            loss = -log_likelihood
        # Best tag sequence per sentence (no mask is passed, so padding positions are decoded too)
        tags = self.crf.decode(logits)
        tags = torch.Tensor(tags)

        if not return_dict:
            output = (tags,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output
        return loss, tags
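The `self.crf.decode(logits)` call in the class above selects the highest-scoring tag sequence with the Viterbi algorithm rather than taking an independent argmax at each token. The following is a minimal pure-Python sketch of that idea for a single sentence, not the `pytorch-crf` implementation itself. The function name `viterbi_decode` and the toy scores are illustrative assumptions; the `start`, `end` and `transitions` arguments stand in for the learned parameters of the CRF layer.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    start: Sequence[float],
    end: Sequence[float],
) -> List[int]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][j]   : score of tag j at position t (the classifier logits)
    transitions[i][j] : score of moving from tag i to tag j
    start[j], end[j]  : score of starting / ending the sequence in tag j
    """
    num_tags = len(start)
    # score[j] = best score of any path that ends in tag j at the current position
    score = [start[j] + emissions[0][j] for j in range(num_tags)]
    backpointers = []

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag i for arriving at tag j
            best_i = max(range(num_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Add the end scores, then follow the back-pointers from the best final tag
    best_last = max(range(num_tags), key=lambda j: score[j] + end[j])
    path = [best_last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example with two hypothetical tags (0 and 1) and three tokens.
# A per-token argmax would give [0, 1, 1]; the penalty on the 1 -> 1
# transition makes the jointly best sequence different.
emissions = [[2.0, 0.0], [1.0, 1.2], [0.0, 1.0]]
transitions = [[0.0, 1.0], [1.0, -2.0]]
print(viterbi_decode(emissions, transitions, [0.0, 0.0], [0.0, 0.0]))  # [0, 1, 0]
```

This is why a CRF head can correct locally plausible but globally unlikely tag sequences: the transition scores let neighbouring tags influence each other.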
Some sample outputs from the model:
Types | Output |
---|---|
English | [{'words': ['my', 'name', 'is', 'swagat'], 'labels': ['DET', 'NN', 'VB', 'NN']}] |
Hindi | [{'words': ['मेरा', 'नाम', 'स्वागत', 'है'], 'labels': ['PRP', 'NN', 'NNP', 'VM']}] |
Hindi Romanised | [{'words': ['mera', 'naam', 'swagat', 'hai'], 'labels': ['PRP', 'NN', 'NNP', 'VM']}] |
Gujarati | [{'words': ['મારું', 'નામ', 'સ્વગત', 'છે'], 'labels': ['PRP', 'NN', 'NNP', 'VAUX']}] |
Gujarati Romanised | [{'words': ['maru', 'naam', 'swagat', 'che'], 'labels': ['PRP', 'NN', 'NNP', 'VAUX']}] |
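
The sample outputs above are word-level, whereas the `BertCRF` class returns one tag id per sub-token. The card does not say how the two are aligned, so the following is only a sketch under stated assumptions: the first sub-token of each word decides its label, `word_ids` has the shape returned by the `word_ids()` method of a fast tokenizer's encoding, and the `id2label` mapping shown here is hypothetical (the real one lives in the model config). The helper name `tokens_to_word_labels` is also illustrative.

```python
from typing import Dict, List, Optional


def tokens_to_word_labels(
    words: List[str],
    word_ids: List[Optional[int]],
    tag_ids: List[float],
    id2label: Dict[int, str],
) -> Dict[str, List[Optional[str]]]:
    """Collapse one tag per sub-token into one label per word.

    word_ids maps each sub-token to the index of the word it came from
    (None for special tokens such as [CLS] and [SEP]).
    """
    labels: List[Optional[str]] = [None] * len(words)
    for word_index, tag_id in zip(word_ids, tag_ids):
        if word_index is None:
            continue  # skip special tokens
        if labels[word_index] is None:
            # Assumption: the first sub-token decides the word's label.
            # int() is needed because the class returns tags as a float tensor.
            labels[word_index] = id2label[int(tag_id)]
    return {"words": words, "labels": labels}


# Hypothetical inputs: "swagat" is assumed to split into two sub-tokens.
id2label = {0: "PRP", 1: "NN", 2: "NNP", 3: "VM"}  # illustrative mapping only
words = ["mera", "naam", "swagat", "hai"]
word_ids = [None, 0, 1, 2, 2, 3, None]
tag_ids = [3.0, 0.0, 1.0, 2.0, 1.0, 3.0, 3.0]

print([tokens_to_word_labels(words, word_ids, tag_ids, id2label)])
```

Other alignment strategies, such as a majority vote over a word's sub-tokens, are equally possible; the first-sub-token rule is used here only because it is the simplest.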
- Developed by: Swagat Panda
- Finetuned from model: google/muril-base-cased