---
license: apache-2.0
language:
- fa
- en
metrics:
- accuracy
pipeline_tag: token-classification
---
# NER Model using RoBERTa

This model card presents a Robustly Optimized BERT Pretraining Approach (RoBERTa) model trained on a combination of two diverse datasets covering two languages, English and Persian. The English dataset is [CoNLL 2003](https://huggingface.co/datasets/conll2003), while the Persian dataset is [PEYMA-ARMAN-Mixed](https://huggingface.co/datasets/AliFartout/PEYMA-ARMAN-Mixed), a fusion of the "PEYMA" and "ARMAN" datasets, both popular for Named Entity Recognition (NER) tasks.

The model training pipeline involves the following steps:

1. Data Preparation: Cleaning, aligning, and mixing the data from the two datasets.
2. Data Loading: Loading the prepared data for subsequent processing.
3. Tokenization: Tokenizing the text so it matches the model's expected input format.
4. Token Splitting: Handling subword splitting (e.g., "jack" may become "_ja", "_ck") and assigning the label -100 to sub-token and special-token positions so that the loss function ignores them during optimization (see the alignment sketch after this list).
5. Model Reconstruction: Adapting the RoBERTa model for token classification in NER tasks (see the reconstruction sketch after this list).
6. Model Training: Training the reconstructed model on the combined dataset and evaluating its performance.
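
A minimal sketch of the token-splitting and label-alignment step, assuming the standard Hugging Face word-ID alignment recipe (the card does not show its exact preprocessing code); the example sentence and label ids below are made up purely for illustration:

```python
from transformers import AutoTokenizer

# Illustrative example: a sentence already split into words, with word-level label ids.
words = ["jack", "lives", "in", "Tehran"]
word_labels = [1, 0, 0, 5]  # made-up label ids, one per word

tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")

# Tokenize pre-split words; a fast tokenizer keeps the word <-> sub-token mapping.
encoding = tokenizer(words, is_split_into_words=True, truncation=True)

aligned_labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:
        # Special tokens such as <s> and </s> get -100 so the loss ignores them.
        aligned_labels.append(-100)
    elif word_id != previous_word_id:
        # The first sub-token of each word keeps the word's label.
        aligned_labels.append(word_labels[word_id])
    else:
        # Remaining sub-tokens of the same word are also masked out with -100.
        aligned_labels.append(-100)
    previous_word_id = word_id

print(encoding.tokens())
print(aligned_labels)
```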
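For the model-reconstruction step, a common approach is to load a pretrained multilingual encoder with a freshly initialized token-classification head, which can then be trained (for example with the `Trainer` API) on the tokenized, label-aligned data. A minimal sketch, assuming `xlm-roberta-base` as the base checkpoint and an illustrative tag set, neither of which is spelled out in this card:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed base checkpoint and tag set; both are illustrative, not stated in this card.
base_checkpoint = "xlm-roberta-base"
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    base_checkpoint,
    num_labels=len(label_names),                                 # size of the new classification head
    id2label=dict(enumerate(label_names)),                       # readable labels stored in the config
    label2id={label: i for i, label in enumerate(label_names)},
)
```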
The model's performance, as shown in the table below, demonstrates promising results:

| Epoch | Training Loss | Validation Loss | F1    | Recall   | Precision | Accuracy |
|:-----:|:-------------:|:---------------:|:-----:|:--------:|:---------:|:--------:|
| 1     | 0.072600      | 0.038918        | 89.5% | 0.906680 | 0.883703  | 0.987799 |
| 2     | 0.027600      | 0.030184        | 92.3% | 0.933840 | 0.915573  | 0.991334 |
| 3     | 0.013500      | 0.030962        | 94%   | 0.946840 | 0.933740  | 0.992702 |
| 4     | 0.006600      | 0.029897        | 94.8% | 0.955207 | 0.941990  | 0.993574 |

The model achieves an impressive F1-score of almost 95%.
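
The F1, Recall, Precision, and Accuracy columns correspond to the entity-level scores produced by a metric such as `seqeval` (the card does not name the exact evaluation code). A small sketch of how comparable numbers can be computed from gold and predicted tag sequences; the toy sequences are illustrative only:

```python
import evaluate

# seqeval reports entity-level precision/recall/F1 plus token-level accuracy.
seqeval = evaluate.load("seqeval")

# Toy gold and predicted tag sequences for two sentences (illustrative only).
references = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
predictions = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "O", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"], results["overall_recall"],
      results["overall_precision"], results["overall_accuracy"])
```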
To use the model, the following Python code snippet can be employed:

```python
from transformers import AutoConfig, AutoTokenizer, AutoModel

# Load the configuration, tokenizer, and model weights from the Hugging Face Hub.
config = AutoConfig.from_pretrained("AliFartout/Roberta-fa-en-ner")
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")
model = AutoModel.from_pretrained("AliFartout/Roberta-fa-en-ner")
```
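Note that `AutoModel` loads only the base encoder without the token-classification head; for end-to-end NER predictions, the `token-classification` pipeline is a more direct route. A short sketch with an illustrative sentence:

```python
from transformers import pipeline

# "simple" aggregation groups sub-tokens back into whole entity spans.
ner = pipeline(
    "token-classification",
    model="AliFartout/Roberta-fa-en-ner",
    aggregation_strategy="simple",
)

print(ner("Jack lives in Tehran."))
```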
By following this approach, you can seamlessly access and incorporate the trained multilingual NER model into various Natural Language Processing tasks.