---
license: apache-2.0
language:
- fa
- en
metrics:
- accuracy
pipeline_tag: token-classification
---
# NER Model using RoBERTa

This model card presents a Robustly Optimized BERT Pretraining Approach (RoBERTa) model trained on a combination of two diverse datasets covering two languages, English and Persian. The English dataset is [CoNLL 2003](https://huggingface.co/datasets/conll2003), while the Persian dataset is [PEYMA-ARMAN-Mixed](https://huggingface.co/datasets/AliFartout/PEYMA-ARMAN-Mixed), a fusion of the "PEYMA" and "ARMAN" datasets, both popular for Named Entity Recognition (NER) tasks.

The model training pipeline involves the following steps:

1. Data Preparation: Cleaning, aligning, and mixing the data from the two datasets.
2. Data Loading: Loading the prepared data for subsequent processing.
3. Tokenization: Tokenizing the text so it matches the model's expected input format.
4. Token Splitting: Handling subword splitting (e.g., "jack" may become "_ja", "_ck") and assigning the label -100 to sub-token and special-token positions so that the loss function ignores them during optimization (see the alignment sketch after this list).
5. Model Reconstruction: Adapting the RoBERTa model for token classification in NER tasks (see the reconstruction sketch after this list).
6. Model Training: Training the reconstructed model on the combined dataset and evaluating its performance.
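
A minimal sketch of the token-splitting and label-alignment step, assuming the standard Hugging Face word-ID alignment recipe (the card does not show its exact preprocessing code); the example sentence and label ids below are made up purely for illustration:

```python
from transformers import AutoTokenizer

# Illustrative example: a sentence already split into words, with word-level label ids.
words = ["jack", "lives", "in", "Tehran"]
word_labels = [1, 0, 0, 5]  # made-up label ids, one per word

tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")

# Tokenize pre-split words; a fast tokenizer keeps the word <-> sub-token mapping.
encoding = tokenizer(words, is_split_into_words=True, truncation=True)

aligned_labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:
        # Special tokens such as <s> and </s> get -100 so the loss ignores them.
        aligned_labels.append(-100)
    elif word_id != previous_word_id:
        # The first sub-token of each word keeps the word's label.
        aligned_labels.append(word_labels[word_id])
    else:
        # Remaining sub-tokens of the same word are also masked out with -100.
        aligned_labels.append(-100)
    previous_word_id = word_id

print(encoding.tokens())
print(aligned_labels)
```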
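For the model-reconstruction step, a common approach is to load a pretrained multilingual encoder with a freshly initialized token-classification head, which can then be trained (for example with the `Trainer` API) on the tokenized, label-aligned data. A minimal sketch, assuming `xlm-roberta-base` as the base checkpoint and an illustrative tag set, neither of which is spelled out in this card:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Assumed base checkpoint and tag set; both are illustrative, not stated in this card.
base_checkpoint = "xlm-roberta-base"
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    base_checkpoint,
    num_labels=len(label_names),                                 # size of the new classification head
    id2label=dict(enumerate(label_names)),                       # readable labels stored in the config
    label2id={label: i for i, label in enumerate(label_names)},
)
```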
The model's performance, as shown in the table below, demonstrates promising results:

| Epoch | Training Loss | Validation Loss | F1    | Recall   | Precision | Accuracy |
|:-----:|:-------------:|:---------------:|:-----:|:--------:|:---------:|:--------:|
| 1     | 0.072600      | 0.038918        | 89.5% | 0.906680 | 0.883703  | 0.987799 |
| 2     | 0.027600      | 0.030184        | 92.3% | 0.933840 | 0.915573  | 0.991334 |
| 3     | 0.013500      | 0.030962        | 94%   | 0.946840 | 0.933740  | 0.992702 |
| 4     | 0.006600      | 0.029897        | 94.8% | 0.955207 | 0.941990  | 0.993574 |

The model achieves an impressive F1-score of almost 95%.
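
The F1, Recall, Precision, and Accuracy columns correspond to the entity-level scores produced by a metric such as `seqeval` (the card does not name the exact evaluation code). A small sketch of how comparable numbers can be computed from gold and predicted tag sequences; the toy sequences are illustrative only:

```python
import evaluate

# seqeval reports entity-level precision/recall/F1 plus token-level accuracy.
seqeval = evaluate.load("seqeval")

# Toy gold and predicted tag sequences for two sentences (illustrative only).
references = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
predictions = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "O", "O"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"], results["overall_recall"],
      results["overall_precision"], results["overall_accuracy"])
```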
To use the model, the following Python code snippet can be employed:

```python
from transformers import AutoConfig, AutoTokenizer, AutoModel

# Load the configuration, tokenizer, and model weights from the Hugging Face Hub.
config = AutoConfig.from_pretrained("AliFartout/Roberta-fa-en-ner")
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")
model = AutoModel.from_pretrained("AliFartout/Roberta-fa-en-ner")
```
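Note that `AutoModel` loads only the base encoder without the token-classification head; for end-to-end NER predictions, the `token-classification` pipeline is a more direct route. A short sketch with an illustrative sentence:

```python
from transformers import pipeline

# "simple" aggregation groups sub-tokens back into whole entity spans.
ner = pipeline(
    "token-classification",
    model="AliFartout/Roberta-fa-en-ner",
    aggregation_strategy="simple",
)

print(ner("Jack lives in Tehran."))
```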
By following this approach, you can seamlessly access and incorporate the trained multilingual NER model into various Natural Language Processing tasks.