---
license: mit
tags:
- token-classification
- ner
- multilingual
- tamil
- hindi
- panx
datasets:
- xtreme
- pan-x
language:
- ta
- hi
model-index:
- name: xlm-roberta-base-fintuned-panx-ta-hi
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: PAN-X
      type: pan-x
    metrics:
    - type: f1
      value: 0.8347
    - type: loss
      value: 0.248
metrics:
- f1
---
# xlm-roberta-base-fintuned-panx-ta-hi
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) for Named Entity Recognition (NER) on the PAN-X dataset for **Tamil (ta)** and **Hindi (hi)**. It achieves the following results on the evaluation set:
- Loss: 0.2480
- F1: 0.8347
## Model Description
The model is based on XLM-RoBERTa, a multilingual transformer-based architecture, and fine-tuned for NER tasks in Tamil and Hindi.
The model predicts three entity types: LOC (Location), PER (Person), and ORG (Organization).
Labels follow the standard BIO scheme: a `B-` prefix marks the first token of an entity, and an `I-` prefix marks a token that continues the same entity.
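You can inspect the label set directly from the model configuration. This is a minimal sketch and assumes the checkpoint stores the usual `id2label` mapping:
```python
from transformers import AutoConfig

# Load only the configuration and print the BIO label mapping
config = AutoConfig.from_pretrained("Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi")
print(config.id2label)
# Expected to contain O plus B-/I- variants of PER, ORG, and LOC
```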
## Intended Uses & Limitations
### Intended Uses:
- Named Entity Recognition (NER) tasks in Tamil and Hindi.
### Limitations:
- Performance may degrade on languages or domains not included in the training data.
- Not intended for general text classification or other NLP tasks.
---
## How to Use the Model
You can load and use the model for Named Entity Recognition as follows:
### Installation
Ensure you have the `transformers` and `torch` libraries installed. Install them via pip if necessary:
```bash
pip install transformers torch
```
### Code Example
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and model
model_name = "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline that merges word pieces into whole entities
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example texts in Tamil and Hindi, plus a Malayalam sentence to probe cross-lingual transfer
example_texts = [
    "அப்துல் கலாம் சென்னை நகரத்தில் ஐஎஸ்ஆர்ஓ நிறுவனத்துக்கு சென்றார்.",  # Tamil: Abdul Kalam went to the ISRO organization in Chennai city.
    "सचिन तेंदुलकर ने मुंबई में बीसीसीआई के कार्यालय का दौरा किया।",  # Hindi: Sachin Tendulkar visited the BCCI office in Mumbai.
    "മഹാത്മാ ഗാന്ധി തിരുവനന്തപുരം നഗരത്തിലെ ഐഎസ്ആർഒ ഓഫീസ് സന്ദർശിച്ചു."  # Malayalam: Mahatma Gandhi visited the ISRO office in Thiruvananthapuram city.
]

# Perform Named Entity Recognition on each sentence
for text in example_texts:
    results = ner_pipeline(text)
    print(f"Input Text: {text}")
    for entity in results:
        print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")
    print()
```
---
## Training and Evaluation Data
The model was fine-tuned on the **PAN-X** dataset, which is part of the XTREME benchmark, specifically for Tamil and Hindi.
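For reference, the Tamil and Hindi PAN-X splits can be loaded from the XTREME benchmark with the `datasets` library. This is a sketch, assuming the standard `PAN-X.ta` and `PAN-X.hi` configuration names and that the hub copy loads without extra setup:
```python
from datasets import load_dataset, concatenate_datasets

# Load the Tamil and Hindi PAN-X subsets of XTREME
panx_ta = load_dataset("xtreme", name="PAN-X.ta")
panx_hi = load_dataset("xtreme", name="PAN-X.hi")

# Combine the training splits for joint fine-tuning on both languages
train_data = concatenate_datasets([panx_ta["train"], panx_hi["train"]])
print(train_data[0])  # fields: tokens, ner_tags, langs
```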
---
## Training Procedure
### Hyperparameters
- Learning Rate: `5e-05`
- Batch Size: `24` (both training and evaluation)
- Epochs: `3`
- Optimizer: `AdamW` with `betas=(0.9, 0.999)` and `epsilon=1e-08`
- Learning Rate Scheduler: `Linear`
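A minimal sketch of how these hyperparameters map onto `TrainingArguments`; the output directory name is illustrative and the exact training script is not included in this repository:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-fintuned-panx-ta-hi",  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    num_train_epochs=3,
    lr_scheduler_type="linear",  # linear decay, default warmup of 0
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    eval_strategy="epoch",       # evaluate at the end of each epoch
    save_strategy="epoch",
)
```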
---
## Results
| Epoch | Training Loss | Validation Loss | F1 |
|-------|---------------|-----------------|--------|
| 1.0 | 0.1886 | 0.2413 | 0.8096 |
| 2.0 | 0.1252 | 0.2415 | 0.8201 |
| 3.0 | 0.0752 | 0.2480 | 0.8347 |
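The F1 reported above is presumably the entity-level (span) F1 that is conventional for PAN-X NER. A small illustration of how such a score is computed with `seqeval` (not the exact evaluation script used here):
```python
from seqeval.metrics import f1_score  # pip install seqeval

# Toy example: entity-level F1 over BIO-tagged sequences
y_true = [["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "O", "O"]]
print(f"F1: {f1_score(y_true, y_pred):.4f}")  # 2 of 3 gold entities found -> F1 = 0.8
```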
---
## Framework Versions
- Transformers: 4.47.1
- PyTorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0