---
license: mit
tags:
- token-classification
- ner
- multilingual
- tamil
- hindi
- panx
datasets:
- xtreme
- pan-x
language:
- ta
- hi
model-index:
- name: xlm-roberta-base-fintuned-panx-ta-hi
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: PAN-X
      type: pan-x
    metrics:
    - type: f1
      value: 0.8347
    - type: loss
      value: 0.248
metrics:
- f1
---

# xlm-roberta-base-fintuned-panx-ta-hi

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the PAN-X dataset for **Tamil (ta)** and **Hindi (hi)**. It is fine-tuned for Named Entity Recognition (NER) and achieves the following results on the evaluation set:
- Loss: 0.2480
- F1: 0.8347

## Model Description

The model is based on XLM-RoBERTa, a multilingual transformer architecture, and is fine-tuned for NER in Tamil and Hindi.

It predicts three entity types: LOC (location), PER (person), and ORG (organization).

Labels follow the BIO scheme: the `B-` prefix marks the first token of an entity, and the `I-` prefix marks each subsequent token of the same entity.
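To make the `B-`/`I-` prefixes concrete, here is a minimal sketch of grouping token-level tags into entity spans. The helper `bio_to_spans` and the English sample tokens are illustrative assumptions, not part of this model or of the `transformers` library; the pipeline's `aggregation_strategy` performs a similar grouping internally.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, None
    for i, tag in enumerate(tags):
        # "B-PER" -> ("B", "PER"); "O" -> ("O", "")
        prefix, _, etype = tag.partition("-")
        # An I- tag continues the open span only if the entity type matches.
        continues = prefix == "I" and etype == current_type
        if not continues:
            # Close the span that was open, if any.
            if current_type is not None:
                spans.append((current_type, start, i))
            # B- opens a new span; a stray I- is leniently treated as a start.
            if prefix in ("B", "I"):
                current_type, start = etype, i
            else:
                current_type, start = None, None
    # Close a span that runs to the end of the sequence.
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


# Hypothetical pre-tokenized sentence with one tag per token.
tokens = ["Sachin", "Tendulkar", "visited", "Mumbai"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]

for etype, start, end in bio_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```

Two adjacent `B-` tags of the same type produce two separate entities, which is the reason the scheme distinguishes `B-` from `I-`.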

## Intended Uses & Limitations

### Intended Uses:
- Named Entity Recognition (NER) tasks in Tamil and Hindi.

### Limitations:
- Performance may degrade on languages or domains not included in the training data.
- Not intended for general text classification or other NLP tasks.

---

## How to Use the Model

You can load and use the model for Named Entity Recognition as follows:

### Installation
Ensure you have the `transformers` and `torch` libraries installed. Install them via pip if necessary:

```bash
pip install transformers torch
```

### Code Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and model
model_name = "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example texts in Tamil and Hindi, plus one in Malayalam (not a fine-tuning language)
example_texts = [
    "அப்துல் கலாம் சென்னை நகரத்தில் ஐஎஸ்ஆர்ஓ நிறுவனத்துக்கு சென்றார்.",  # Tamil: Abdul Kalam went to the ISRO organization in Chennai city.
    "सचिन तेंदुलकर ने मुंबई में बीसीसीआई के कार्यालय का दौरा किया।",  # Hindi: Sachin Tendulkar visited the BCCI office in Mumbai.
    "മഹാത്മാ ഗാന്ധി തിരുവനന്തപുരം നഗരത്തിലെ ഐഎസ്ആർഒ ഓഫീസ് സന്ദർശിച്ചു.",  # Malayalam: Mahatma Gandhi visited the ISRO office in Thiruvananthapuram city.
]

# Perform Named Entity Recognition
for text in example_texts:
    results = ner_pipeline(text)
    print(f"Input Text: {text}")
    for entity in results:
        print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")
    print()
```

---

## Training and Evaluation Data

The model was fine-tuned on the Tamil and Hindi subsets of the **PAN-X** dataset, which is part of the XTREME benchmark.

---

## Training Procedure

### Hyperparameters
- Learning Rate: `5e-05`
- Batch Size: `24` (both training and evaluation)
- Epochs: `3`
- Optimizer: `AdamW` with `betas=(0.9, 0.999)` and `epsilon=1e-08`
- Learning Rate Scheduler: `Linear`
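
As a rough illustration of the settings above, the sketch below collects them under the corresponding `transformers.TrainingArguments` parameter names and shows how a linear schedule scales the learning rate over training. It assumes zero warmup steps and that the batch size of 24 is per device, neither of which this card states; `linear_lr` and `total_steps` are hypothetical names used only for this example.

```python
# Hyperparameters from this card, keyed by the matching
# transformers.TrainingArguments parameter names.
hparams = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 24,  # assumption: 24 is the per-device size
    "per_device_eval_batch_size": 24,   # assumption: 24 is the per-device size
    "num_train_epochs": 3,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
}


def linear_lr(step: int, total_steps: int,
              base_lr: float = hparams["learning_rate"]) -> float:
    """Linear decay from base_lr to 0, assuming zero warmup steps."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Hypothetical step count; the real value depends on the dataset size.
total_steps = 300
for step in (0, 150, 300):
    print(step, linear_lr(step, total_steps))
```

The dictionary can be unpacked into `TrainingArguments(output_dir=..., **hparams)` when reproducing the run, with the output directory chosen by the user.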

---

## Results

| Epoch | Training Loss | Validation Loss | F1     |
|-------|---------------|-----------------|--------|
| 1.0   | 0.1886        | 0.2413          | 0.8096 |
| 2.0   | 0.1252        | 0.2415          | 0.8201 |
| 3.0   | 0.0752        | 0.2480          | 0.8347 |
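
This card does not state how F1 is computed. For NER it is commonly an entity-level score in which a prediction counts only when both the span boundaries and the entity type match exactly. The sketch below shows that calculation under this assumption; `entity_f1` and the sample spans are hypothetical and are not taken from the training run.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end)


def entity_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Exact-match entity-level F1 between gold and predicted spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: the predicted ORG span has the wrong start index,
# so it counts as both a false positive and a false negative.
gold = {("PER", 0, 2), ("LOC", 3, 4), ("ORG", 6, 7)}
pred = {("PER", 0, 2), ("LOC", 3, 4), ("ORG", 5, 7)}

print(f"F1: {entity_f1(gold, pred):.4f}")
```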

---

## Framework Versions

- Transformers: 4.47.1
- PyTorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0