---
license: apache-2.0
language:
- en
- gu
- mr
- hi
---
# Model Card for BertCRF Multilingual POS Tagger
## Model Details
Part-of-speech tagging (POS tagging, or POST) is the task of marking each
word in a phrase with its appropriate POS tag. POS tagging algorithms fall
into two broad families, rule-based and stochastic; from a modelling
standpoint, taggers can further be monolingual or multilingual. POS tags
give a sentence grammatical context, which can be used in NLP tasks such as
NER, NLU, and question-answering systems.
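As a quick illustration of what a tagger produces (borrowing the English
example from the sample outputs later in this card), each input token
receives exactly one tag:
```python
# Illustration only: POS tagging assigns one grammatical category per token.
sentence = ["my", "name", "is", "swagat"]
pos_tags = ["DET", "NN", "VB", "NN"]  # determiner, noun, verb, noun

print(list(zip(sentence, pos_tags)))
# [('my', 'DET'), ('name', 'NN'), ('is', 'VB'), ('swagat', 'NN')]
```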
In this research area, many researchers have already proposed novel
approaches, tag sets, and models, such as the Weightless Artificial Neural
Network (WANN), different forms of CRF, Bi-LSTM-CRF, and transformers, along
with various language-tag-mixed POS tagging techniques for handling
code-mixed text. This body of work has enhanced, or established, benchmarks
for popular as well as low-resource languages in both monolingual and
multilingual settings. With this model, we aim for a state-of-the-art POS
tagger for the Indian-language context, covering both native and Romanised
scripts.
### Model Description
The model has been trained on English, Hindi, Gujarati, and Marathi, as well as on the Romanised forms of the Indian languages (i.e. en, gu, mr, hi, gu_romanised, mr_romanised, hi_romanised).
To use this model, you first need to define the `BertCRF` class below:
```python
import torch
from torch import nn
from torchcrf import CRF
from transformers import BertPreTrainedModel, BertModel, BertTokenizerFast
from transformers.modeling_outputs import TokenClassifierOutput


class BertCRF(BertPreTrainedModel):
    """BERT encoder with a linear-chain CRF head for token classification."""

    _keys_to_ignore_on_load_unexpected = [r"pooler"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.bert = BertModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(num_tags=config.num_labels, batch_first=True)
        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Labels for computing the token classification loss. Indices should be in
            ``[0, ..., config.num_labels - 1]``; the CRF requires a valid tag index
            at every position.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)  # per-token emission scores

        loss = None
        if labels is not None:
            # The CRF returns the log-likelihood of the gold tag sequence;
            # negating it gives a loss to minimise.
            loss = -self.crf(logits, labels)

        # Viterbi-decode the most likely tag sequence from the emissions.
        tags = torch.tensor(self.crf.decode(logits))

        if not return_dict:
            output = (tags,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss,
            logits=tags,  # decoded tag ids rather than raw emissions
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
```
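Once the class is defined, a minimal inference sketch could look like the
following. The checkpoint id is a hypothetical placeholder (substitute this
model's actual repository id), and it assumes the saved config carries an
`id2label` mapping from tag ids to POS tag strings:
```python
# Inference sketch. The checkpoint id below is a placeholder, and the
# `id2label` mapping in the config is an assumption for illustration.
import torch
from transformers import BertTokenizerFast

checkpoint = "your-username/bert-crf-pos"  # hypothetical repo id
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertCRF.from_pretrained(checkpoint)
model.eval()

words = ["mera", "naam", "swagat", "hai"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc, return_dict=True)

tag_ids = out.logits[0].tolist()  # Viterbi-decoded tag ids (see forward above)

# Keep one tag per input word: take the tag of each word's first sub-token
# and skip special tokens ([CLS]/[SEP]).
labels, seen = [], set()
for pos, wid in enumerate(enc.word_ids()):
    if wid is not None and wid not in seen:
        seen.add(wid)
        labels.append(model.config.id2label[int(tag_ids[pos])])

print([{"words": words, "labels": labels}])
```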
Some sample outputs from the model:
| Types | Output |
|--------------------|----------------------------------------------------------------------------------------|
| English | [{'words': ['my', 'name', 'is', 'swagat'], 'labels': ['DET', 'NN', 'VB', 'NN']}] |
| Hindi | [{'words': ['मेरा', 'नाम', 'स्वागत', 'है'], 'labels': ['PRP', 'NN', 'NNP', 'VM']}] |
| Hindi Romanised    | [{'words': ['mera', 'naam', 'swagat', 'hai'], 'labels': ['PRP', 'NN', 'NNP', 'VM']}]    |
| Gujarati | [{'words': ['મારું', 'નામ', 'સ્વગત', 'છે'], 'labels': ['PRP', 'NN', 'NNP', 'VAUX']}] |
| Gujarati Romanised | [{'words': ['maru', 'naam', 'swagat', 'che'], 'labels': ['PRP', 'NN', 'NNP', 'VAUX']}] |
- **Developed by:** Swagat Panda
- **Fine-tuned from model:** google/muril-base-cased
### Model Sources
- **Paper:** https://www.academia.edu/87916386/MULTILINGUAL_APPROACH_TOWARDS_THE_NATIVE_AND_ROMANISED_SCRIPTS_FOR_INDIAN_LANGUGE_CONTEXT_ON_POS_TAGGING?source=swp_share