# NER Model using RoBERTa

This model card presents a RoBERTa (Robustly Optimized BERT Pretraining Approach) model trained on a combination of two datasets covering two languages: English and Persian. The English dataset is CoNLL 2003, while the Persian dataset is PEYMA-ARMAN-Mixed, a fusion of the PEYMA and ARMAN datasets, both popular for Named Entity Recognition (NER) tasks.

The model training pipeline involves the following steps:

1. Data Preparation: Cleaning, aligning, and mixing data from the two datasets.
2. Data Loading: Loading the prepared data for subsequent processing.
3. Tokenization: Tokenizing the text to prepare it for model input.
4. Token Splitting: Handling sub-token splitting (e.g., "jack" becomes "_ja _ck") and assigning the label -100 to continuation sub-tokens so they are ignored during optimization (see the sketch after this list).
5. Model Reconstruction: Adapting the RoBERTa model for token classification in NER tasks.
6. Model Training: Training the reconstructed model on the combined dataset and evaluating its performance.
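The token-splitting step can be made concrete with a small label-alignment sketch. The base checkpoint (`xlm-roberta-base`) and the label list below are illustrative assumptions rather than the card's actual training code; they only show how continuation sub-tokens receive the -100 label that the loss ignores, and how the model is rebuilt with a token-classification head.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical tag set and base checkpoint, for illustration only.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align(words, word_labels):
    """Tokenize pre-split words and keep each word's label only on its first
    sub-token; continuation sub-tokens and special tokens get -100, which the
    cross-entropy loss ignores."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, previous_word = [], None
    for word_id in enc.word_ids():
        if word_id is None:                # special tokens (<s>, </s>)
            aligned.append(-100)
        elif word_id != previous_word:     # first sub-token of a word, e.g. "_ja"
            aligned.append(labels.index(word_labels[word_id]))
        else:                              # continuation sub-token, e.g. "_ck"
            aligned.append(-100)
        previous_word = word_id
    enc["labels"] = aligned
    return enc

example = tokenize_and_align(["jack", "lives", "in", "Tehran"], ["B-PER", "O", "O", "B-LOC"])
print(tokenizer.convert_ids_to_tokens(example["input_ids"]))
print(example["labels"])

# "Model Reconstruction": rebuild the base model with a token-classification head.
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=len(labels))
```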

The model's performance, as shown in the table below, demonstrates promising results:

| Epoch | Training Loss | Validation Loss | F1    | Recall   | Precision | Accuracy |
|-------|---------------|-----------------|-------|----------|-----------|----------|
| 1     | 0.072600      | 0.038918        | 0.895 | 0.906680 | 0.883703  | 0.987799 |
| 2     | 0.027600      | 0.030184        | 0.923 | 0.933840 | 0.915573  | 0.991334 |
| 3     | 0.013500      | 0.030962        | 0.940 | 0.946840 | 0.933740  | 0.992702 |
| 4     | 0.006600      | 0.029897        | 0.948 | 0.955207 | 0.941990  | 0.993574 |

The model reaches an F1-score of roughly 95% (0.948) by the fourth epoch.
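For reference, F1 here is the harmonic mean of precision and recall; substituting the epoch-4 values reproduces the reported score:

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.9420 \times 0.9552}{0.9420 + 0.9552} \approx 0.948
$$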

To use the model, the following Python code snippet can be employed:

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification

# Load the configuration, the tokenizer, and the model with its token-classification head.
config = AutoConfig.from_pretrained("AliFartout/Roberta-fa-en-ner")
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")
model = AutoModelForTokenClassification.from_pretrained("AliFartout/Roberta-fa-en-ner")
```
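
Because the checkpoint carries a token-classification head, it can also be wrapped in a `pipeline` for quick inference. The example sentence below is arbitrary, and the exact entity labels returned depend on the model's tag scheme, so treat this as a minimal sketch:

```python
from transformers import pipeline

# Aggregate sub-tokens back into whole entities; label names come from the model config.
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(ner("Jack lives in Tehran and works for Google."))
```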

By following this approach, you can seamlessly access and incorporate the trained multilingual NER model into various Natural Language Processing tasks.
