---
license: apache-2.0
language:
- fa
- en
metrics:
- accuracy
pipeline_tag: token-classification

---

# NER Model using RoBERTa


This model card presents a RoBERTa (Robustly Optimized BERT Pretraining Approach) model trained on a combination of two datasets covering two languages, English and Persian. The English dataset is [CoNLL 2003](https://huggingface.co/datasets/conll2003), and the Persian dataset is [PEYMA-ARMAN-Mixed](https://huggingface.co/datasets/AliFartout/PEYMA-ARMAN-Mixed), a fusion of the "PEYMA" and "ARMAN" datasets, both widely used for Named Entity Recognition (NER).
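
As a rough sketch (assuming the Hugging Face `datasets` library; the split names and record layout shown here are assumptions, not details from the original training code), the two corpora can be loaded directly from the Hub:

```python
# Sketch: loading the two NER corpora from the Hugging Face Hub.
from datasets import load_dataset

english_ner = load_dataset("conll2003")                     # English CoNLL 2003
persian_ner = load_dataset("AliFartout/PEYMA-ARMAN-Mixed")  # Persian PEYMA + ARMAN mix

# Inspect one example from each (column names may differ between the two datasets).
print(english_ner["train"][0])
print(persian_ner["train"][0])
```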

The model training pipeline involves the following steps:

1. **Data preparation:** cleaning, aligning, and mixing the two datasets.
2. **Data loading:** loading the prepared data for subsequent processing.
3. **Tokenization:** tokenizing the text so it can be fed to the model.
4. **Token splitting and label alignment:** handling subword splitting (e.g., "jack" may be split into pieces such as "_ja" and "_ck") and assigning the label `-100` to extra sub-tokens and special tokens so that the loss function ignores them (see the sketch after this list).
5. **Model reconstruction:** adapting the RoBERTa model with a token-classification head for NER.
6. **Model training:** training the adapted model on the combined dataset and evaluating its performance.
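
The token-splitting step is the usual subword/label-alignment problem: when the tokenizer breaks a word into several pieces, only the first piece keeps the word's label, and the remaining pieces (and special tokens) get `-100`, which the cross-entropy loss ignores. Below is a minimal sketch of this alignment, assuming an XLM-RoBERTa-style fast tokenizer and CoNLL-style column names (`tokens`, `ner_tags`); both are assumptions, not taken from the original code:

```python
from transformers import AutoTokenizer

# The base checkpoint is an assumption; any fast RoBERTa-style tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align_labels(batch):
    """Tokenize pre-split words and align NER labels to the resulting sub-tokens."""
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:              # special tokens such as <s> and </s>
                label_ids.append(-100)
            elif word_id != previous_word:   # first sub-token keeps the word's label
                label_ids.append(word_labels[word_id])
            else:                            # remaining sub-tokens are ignored by the loss
                label_ids.append(-100)
            previous_word = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized
```

In practice, a function like this would typically be applied to the mixed dataset with `Dataset.map(..., batched=True)` before training.
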
The model's performance, as shown in the table below, demonstrates promising results:
| Epoch | Training Loss | Validation Loss |  F1   |  Recall  | Precision | Accuracy |
|:-----:|:-------------:|:---------------:|:-----:|:--------:|:---------:|:--------:|
|   1   |   0.072600    |    0.038918     | 0.895 | 0.906680 | 0.883703  | 0.987799 |
|   2   |   0.027600    |    0.030184     | 0.923 | 0.933840 | 0.915573  | 0.991334 |
|   3   |   0.013500    |    0.030962     | 0.940 | 0.946840 | 0.933740  | 0.992702 |
|   4   |   0.006600    |    0.029897     | 0.948 | 0.955207 | 0.941990  | 0.993574 |

The model reaches an F1-score of almost 95% (0.948 after the fourth epoch).
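
For context, entity-level precision, recall, and F1 of this kind are commonly computed with the `seqeval` package; here is a minimal sketch with made-up predictions (the package choice and the labels below are illustrative assumptions, not details from the original evaluation):

```python
# Sketch: entity-level NER metrics with seqeval (inputs are illustrative only).
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
```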

To load the trained model and its tokenizer, the following Python snippet can be used:

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification

# Load the configuration, tokenizer, and model with its token-classification head.
config = AutoConfig.from_pretrained("AliFartout/Roberta-fa-en-ner")
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")
model = AutoModelForTokenClassification.from_pretrained("AliFartout/Roberta-fa-en-ner")
```
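
For end-to-end inference, the checkpoint can also be wrapped in a token-classification `pipeline`; a small sketch follows (the example sentence is made up, and the output format depends on the label mapping stored with the checkpoint):

```python
from transformers import pipeline

# Token-classification pipeline; aggregation groups sub-tokens back into whole entities.
ner = pipeline(
    "token-classification",
    model="AliFartout/Roberta-fa-en-ner",
    aggregation_strategy="simple",
)

print(ner("Barack Obama visited Tehran University."))
```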

By following this approach, you can load the trained multilingual NER model and incorporate it into a variety of Natural Language Processing tasks.