---
license: apache-2.0
language:
- en
- gu
- mr
- hi
---
# Model Card for a Multilingual POS Tagger for Indian Languages (Native and Romanised Scripts)


## Model Details
The technique of marking the words in a phrase with their appropriate POS
tags is known as part-of-speech tagging (POS tagging or POST). POS tagging
algorithms fall into two broad families, rule-based and stochastic; from a
modelling standpoint, they can further be divided into monolingual and
multilingual approaches. POS tags provide grammatical context for a
sentence, which can be employed in NLP tasks such as NER, NLU, and
question-answering (QA) systems.

In this research field, many researchers have already proposed novel
approaches, tag sets, and models, such as the Weightless Artificial Neural
Network (WANN), different forms of CRF, Bi-LSTM-CRF, and transformers,
along with various language-tag-mixed POS tagging techniques for handling
code-mixed text. This body of work has enhanced or established benchmarks
for a range of popular and low-resource languages, in both monolingual and
multilingual settings. With this model, we aim for a state-of-the-art POS
tagger for the Indian-language context, covering both native scripts and
their Romanised forms.

### Model Description

The model has been trained on English, Hindi, Gujarati, and Marathi in their native scripts, as well as on the Romanised forms of the Indian languages, i.e. en, gu, mr, hi, gu_romanised, mr_romanised, hi_romanised.

To use this model, first import the dependencies and define the `BertCRF` class:

```python
from transformers import BertPreTrainedModel, BertModel, BertTokenizerFast
from torch import nn
import torch

from torchcrf import CRF  # pip install pytorch-crf


class BertCRF(BertPreTrainedModel):
    """BERT encoder with a token-level linear classifier and a CRF layer on top."""

    _keys_to_ignore_on_load_unexpected = [r"pooler"]

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(num_tags=config.num_labels, batch_first=True)
        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        r"""
        labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
            Labels for computing the token classification loss. Indices should be in
            ``[0, ..., config.num_labels - 1]`` at every position: unlike the usual
            token-classification head, the CRF cannot handle the ``-100`` ignore
            index, so pad labels with a valid tag id.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        # Project the contextual embeddings to per-token emission scores for the CRF.
        sequence_output = self.dropout(outputs[0])
        logits = self.classifier(sequence_output)

        # Exclude padding positions from the CRF log-likelihood.
        mask = attention_mask.bool() if attention_mask is not None else None

        loss = None
        if labels is not None:
            # The CRF returns the sequence log-likelihood; the loss is its negation.
            loss = -self.crf(logits, labels, mask=mask)
        # Viterbi decoding. Decoding without a mask keeps every sequence at full
        # length, so the result stays rectangular; filter padding downstream.
        tags = torch.tensor(self.crf.decode(logits))

        if not return_dict:
            output = (tags,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return loss, tags
```
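
With the class in place, inference follows the usual `transformers` pattern. In the minimal sketch below, the checkpoint path is a placeholder (substitute the actual repository id of this model), and the sub-word filtering via `word_ids()` is one common convention rather than part of the model itself:

```python
# Continues from the block above (BertCRF, torch, and BertTokenizerFast in scope).
checkpoint = "path/or/repo-id/of-this-model"  # placeholder: point at this checkpoint
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertCRF.from_pretrained(checkpoint)
model.eval()

words = ["mera", "naam", "swagat", "hai"]
inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    _, tags = model(**inputs)  # loss is None when no labels are passed

# Keep one tag per word: skip special tokens and sub-word continuations.
labels, prev = [], None
for idx, word_id in enumerate(inputs.word_ids()):
    if word_id is not None and word_id != prev:
        labels.append(model.config.id2label[int(tags[0][idx])])
    prev = word_id

print([{"words": words, "labels": labels}])
# e.g. [{'words': ['mera', 'naam', 'swagat', 'hai'], 'labels': ['PRP', 'NN', 'NNP', 'VM']}]
```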
Some sample outputs from the model:

| Types              | Output                                                                                 |
|--------------------|----------------------------------------------------------------------------------------|
| English            | [{'words': ['my', 'name', 'is', 'swagat'], 'labels': ['DET', 'NN', 'VB', 'NN']}]       |
| Hindi              | [{'words': ['मेरा', 'नाम', 'स्वागत', 'है'], 'labels': ['PRP', 'NN', 'NNP', 'VM']}]        |
| Hindi Romanised    | [{'words': ['mera', 'naam', 'swagat', 'hai'], 'labels': ['PRP', 'NN', 'NNP', 'VM']}]   |
| Gujarati           | [{'words': ['મારું', 'નામ', 'સ્વગત', 'છે'], 'labels': ['PRP', 'NN', 'NNP', 'VAUX']}]       |
| Gujarati Romanised | [{'words': ['maru', 'naam', 'swagat', 'che'], 'labels': ['PRP', 'NN', 'NNP', 'VAUX']}] |




- **Developed by:** Swagat Panda
- **Finetuned from model:** google/muril-base-cased (see the fine-tuning sketch below)
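
A fine-tuning run on top of google/muril-base-cased can be wired up with the standard `Trainer` API. In the sketch below, the tag set, the toy training example, and the hyper-parameters are illustrative placeholders, not the actual settings used for this checkpoint; note that labels must be padded with a valid tag id, since the CRF layer cannot handle the usual `-100` ignore index.

```python
# Continues from the class definition above; everything here is illustrative.
from transformers import AutoConfig, Trainer, TrainingArguments

tag_set = ["PRP", "NN", "NNP", "VM"]  # hypothetical tag set, not the full inventory
config = AutoConfig.from_pretrained(
    "google/muril-base-cased",
    num_labels=len(tag_set),
    id2label=dict(enumerate(tag_set)),
    label2id={t: i for i, t in enumerate(tag_set)},
)
model = BertCRF.from_pretrained("google/muril-base-cased", config=config)
tokenizer = BertTokenizerFast.from_pretrained("google/muril-base-cased")

# One toy training example. Labels must hold a valid tag id at *every* position:
# pad with a real id (here 0), since the CRF cannot handle the -100 ignore index.
enc = tokenizer(["mera", "naam", "swagat", "hai"], is_split_into_words=True,
                padding="max_length", max_length=16, truncation=True)
example = {k: torch.tensor(v) for k, v in enc.items()}
example["labels"] = torch.zeros(16, dtype=torch.long)  # placeholder alignment
train_dataset = [example]

args = TrainingArguments(
    output_dir="bertcrf-pos",
    num_train_epochs=1,
    per_device_train_batch_size=1,
)
Trainer(model=model, args=args, train_dataset=train_dataset).train()
```

A real run would replace the toy example with word-aligned labels over the full training corpus.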

### Model Sources
- **Paper:** https://www.academia.edu/87916386/MULTILINGUAL_APPROACH_TOWARDS_THE_NATIVE_AND_ROMANISED_SCRIPTS_FOR_INDIAN_LANGUGE_CONTEXT_ON_POS_TAGGING?source=swp_share