	---
	license: mit
	tags:
	- token-classification
	- ner
	- multilingual
	- tamil
	- hindi
	- panx
	datasets:
	- xtreme
	- pan-x
	language:
	- ta
	- hi
	model-index:
	- name: xlm-roberta-base-fintuned-panx-ta-hi
	  results:
	  - task:
	      type: token-classification
	      name: Named Entity Recognition
	    dataset:
	      name: PAN-X
	      type: pan-x
	    metrics:
	    - type: f1
	      value: 0.8347
	    - type: loss
	      value: 0.248
	metrics:
	- f1
	---

	# xlm-roberta-base-fintuned-panx-ta-hi

	This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the PAN-X dataset for Tamil (ta) and Hindi (hi). It is fine-tuned for Named Entity Recognition (NER) and achieves the following results on the evaluation set:
	- Loss: 0.2480
	- F1: 0.8347

	## Model Description

	The model is based on XLM-RoBERTa, a multilingual transformer-based architecture, and fine-tuned for NER tasks in Tamil and Hindi.
	Entity types: LOC (location), PER (person), and ORG (organization).

	Labels use `B-` and `I-` prefixes: `B-` marks the first token of an entity, and `I-` marks each following token of the same entity.
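
To make the prefix convention concrete, here is a minimal sketch that groups a sequence of per-token tags into entity spans. The function name `bio_to_spans` and the sample tag sequence are hypothetical and not part of this model's code. Treating a stray `I-` tag as the start of a new span is a lenient choice made for illustration only.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group B-/I- tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None  # start index and type of the currently open span
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        # An I- tag continues the open span only if the entity type matches.
        if prefix == "I" and current == etype:
            continue
        # Any other tag closes the open span, if there is one.
        if current is not None:
            spans.append((current, start, i))
            start, current = None, None
        # B- opens a new span; a stray I- is also treated as a span start.
        if prefix in ("B", "I"):
            start, current = i, etype
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical tag sequence for a six-token sentence.
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
print(bio_to_spans(tags))  # [('PER', 0, 2), ('LOC', 3, 4), ('ORG', 5, 6)]
```

In practice the `pipeline` call with `aggregation_strategy="simple"` performs a similar grouping for you, so a helper like this is mainly useful when working with raw per-token predictions.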

	## Intended Uses & Limitations

	### Intended Uses:
	- Named Entity Recognition (NER) tasks in Tamil and Hindi.

	### Limitations:
	- Performance may degrade on languages or domains not included in the training data.
	- Not intended for general text classification or other NLP tasks.

	---

	## How to Use the Model

	You can load and use the model for Named Entity Recognition as follows:

	### Installation
	Ensure you have the `transformers` and `torch` libraries installed. Install them via pip if necessary:

	```bash
	pip install transformers torch
	```

	### Code Example

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	from transformers import pipeline

	# Load the tokenizer and model
	model_name = "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# Create an NER pipeline
	ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

	# Example texts: Tamil and Hindi, plus a Malayalam sentence
	# (Malayalam is not among the fine-tuning languages)
	example_texts = [
	    "அப்துல் கலாம் சென்னை நகரத்தில் ஐஎஸ்ஆர்ஓ நிறுவனத்துக்கு சென்றார்.",  # Tamil: Abdul Kalam went to the ISRO organization in Chennai city.
	    "सचिन तेंदुलकर ने मुंबई में बीसीसीआई के कार्यालय का दौरा किया।",  # Hindi: Sachin Tendulkar visited the BCCI office in Mumbai.
	    "മഹാത്മാ ഗാന്ധി തിരുവനന്തപുരം നഗരത്തിലെ ഐഎസ്ആർഒ ഓഫീസ് സന്ദർശിച്ചു.",  # Malayalam: Mahatma Gandhi visited the ISRO office in Thiruvananthapuram city.
	]

	# Perform Named Entity Recognition
	for text in example_texts:
	    results = ner_pipeline(text)
	    print(f"Input Text: {text}")
	    # Each result is one aggregated entity with its label and confidence score
	    for entity in results:
	        print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")
	    print()
	```
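
The pipeline returns a list of dictionaries, one per aggregated entity. As a rough sketch of post-processing, the snippet below drops low-confidence predictions and groups the rest by label. The helper `group_entities`, the `min_score` threshold of 0.5, and the sample results (including their scores and offsets) are all hypothetical and only mimic the shape of the pipeline output; they are not real model predictions.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect entity strings per label, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hypothetical data shaped like pipeline output with aggregation_strategy="simple".
sample_results = [
    {"entity_group": "PER", "score": 0.98, "word": "Sachin Tendulkar", "start": 0, "end": 16},
    {"entity_group": "LOC", "score": 0.95, "word": "Mumbai", "start": 20, "end": 26},
    {"entity_group": "ORG", "score": 0.41, "word": "BCCI", "start": 30, "end": 34},
]
print(group_entities(sample_results))  # {'PER': ['Sachin Tendulkar'], 'LOC': ['Mumbai']}
```

A suitable threshold depends on the application; the value here is only a placeholder.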

	---

	## Training and Evaluation Data

	The model was fine-tuned on the Tamil (ta) and Hindi (hi) subsets of the PAN-X dataset, which is part of the XTREME benchmark.

	---

	## Training Procedure

	### Hyperparameters
	- Learning Rate: `5e-05`
	- Batch Size: `24` (both training and evaluation)
	- Epochs: `3`
	- Optimizer: `AdamW` with `betas=(0.9, 0.999)` and `epsilon=1e-08`
	- Learning Rate Scheduler: `Linear`
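
To illustrate what a linear scheduler does with the learning rate listed above, here is a small sketch of linear decay to zero with optional warmup. It assumes zero warmup steps, which this card does not state, and the total step count of 3000 is a made-up figure; the real count depends on the dataset size, the batch size, and the number of epochs.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 3000  # hypothetical value, for illustration only
for step in (0, 1500, 3000):
    print(step, linear_lr(step, total_steps))
```

With no warmup, the rate starts at the base value, reaches half of it at the midpoint, and hits zero on the final step.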

	---

	## Results

	| Epoch | Training Loss | Validation Loss | F1     |
	|-------|---------------|-----------------|--------|
	| 1.0   | 0.1886        | 0.2413          | 0.8096 |
	| 2.0   | 0.1252        | 0.2415          | 0.8201 |
	| 3.0   | 0.0752        | 0.2480          | 0.8347 |

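
This card does not say how F1 was computed. NER results are commonly reported as entity-level F1, where a predicted entity counts as correct only if both its type and its boundaries match the reference. Assuming that convention, a minimal sketch looks like this; the function `entity_f1` and the example spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end)


def entity_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Entity-level F1: a prediction is correct only on an exact type and boundary match."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: the predicted ORG entity has a wrong start boundary.
gold = {("PER", 0, 2), ("LOC", 3, 4), ("ORG", 5, 6)}
pred = {("PER", 0, 2), ("LOC", 3, 4), ("ORG", 4, 6)}
print(round(entity_f1(gold, pred), 4))  # 0.6667
```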
	---

	## Framework Versions

	- Transformers: 4.47.1
	- PyTorch: 2.5.1+cu121
	- Datasets: 3.2.0
	- Tokenizers: 0.21.0