---
license: mit
tags:
- token-classification
- ner
- multilingual
- tamil
- hindi
- panx
datasets:
- xtreme
- pan-x
language:
- ta
- hi
model-index:
- name: xlm-roberta-base-fintuned-panx-ta-hi
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: PAN-X
      type: pan-x
    metrics:
    - type: f1
      value: 0.8347
    - type: loss
      value: 0.248
metrics:
- f1
---

# xlm-roberta-base-fintuned-panx-ta-hi

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the PAN-X dataset for **Tamil (ta)** and **Hindi (hi)**. It is fine-tuned for Named Entity Recognition (NER) and achieves the following results on the evaluation set:
- Loss: 0.2480
- F1: 0.8347

## Model Description

The model is based on XLM-RoBERTa, a multilingual transformer architecture, and fine-tuned for NER in Tamil and Hindi.

Entity types: LOC (Location), PER (Person), and ORG (Organization).

The B- prefix marks the first token of an entity, the I- prefix marks subsequent tokens inside the same entity, and O marks tokens outside any entity.
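
A minimal sketch of inspecting the label set (the exact id order shown in the comment is an assumption; `model.config.id2label` is the authoritative mapping):

```python
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
)

# Print the id -> label mapping, e.g. {0: "O", 1: "B-PER", 2: "I-PER", ...}
print(model.config.id2label)

# Example BIO tagging for "Sachin Tendulkar visited Mumbai":
#   Sachin -> B-PER, Tendulkar -> I-PER, visited -> O, Mumbai -> B-LOC
```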

## Intended Uses & Limitations

### Intended Uses:
- Named Entity Recognition (NER) tasks in Tamil and Hindi.

### Limitations:
- Performance may degrade on languages or domains not included in the training data.
- Not intended for general text classification or other NLP tasks.

---

## How to Use the Model

You can load and use the model for Named Entity Recognition as follows:

### Installation
Ensure you have the `transformers` and `torch` libraries installed. Install them via pip if necessary:

```bash
pip install transformers torch
```

### Code Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Load the tokenizer and model
model_name = "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text in Tamil and Hindi
example_texts = [
    "அப்துல் கலாம் சென்னை நகரத்தில் ஐஎஸ்ஆர்ஓ நிறுவனத்துக்கு சென்றார்.",  # Abdul Kalam went to the ISRO organization in Chennai city.
    "सचिन तेंदुलकर ने मुंबई में बीसीसीआई के कार्यालय का दौरा किया।",  # Hindi: Sachin Tendulkar visited the BCCI office in Mumbai.
    "മഹാത്മാ ഗാന്ധി തിരുവനന്തപുരം നഗരത്തിലെ ഐഎസ്ആർഒ ഓഫീസ് സന്ദർശിച്ചു." # Malayalam: Mahatma Gandhi visited the ISRO office in Thiruvananthapuram city.
]

# Perform Named Entity Recognition
for text in example_texts:
    results = ner_pipeline(text)
    print(f"Input Text: {text}")
    for entity in results:
        print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")
    print()
```
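
With `aggregation_strategy="simple"`, the pipeline merges subword pieces back into whole words, so each result carries a `word`, an `entity_group` (PER, ORG, or LOC), and a confidence `score`. Note that the third example is Malayalam, which is outside the fine-tuning languages; it probes XLM-RoBERTa's cross-lingual transfer, so scores there should be read with the limitations above in mind.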

---

## Training and Evaluation Data

The model was fine-tuned on the **PAN-X** dataset, which is part of the XTREME benchmark, specifically for Tamil and Hindi.
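
A hedged sketch of how the two PAN-X subsets can be loaded and combined with 🤗 Datasets (config names follow the Hub's `PAN-X.<lang>` pattern; the concatenate-and-shuffle step is an assumption, since this card does not state the exact mixing strategy):

```python
from datasets import load_dataset, concatenate_datasets

# Load the PAN-X splits for Tamil and Hindi from the XTREME benchmark
panx_ta = load_dataset("xtreme", name="PAN-X.ta")
panx_hi = load_dataset("xtreme", name="PAN-X.hi")

# Combine the training splits of both languages (assumed mixing strategy)
train_data = concatenate_datasets(
    [panx_ta["train"], panx_hi["train"]]
).shuffle(seed=42)

print(train_data[0])  # {'tokens': [...], 'ner_tags': [...], 'langs': [...]}
```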

---

## Training Procedure

### Hyperparameters
- Learning Rate: `5e-05`
- Batch Size: `24` (both training and evaluation)
- Epochs: `3`
- Optimizer: `AdamW` with `betas=(0.9, 0.999)` and `epsilon=1e-08`
- Learning Rate Scheduler: `Linear`
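
These settings map onto a `TrainingArguments` configuration roughly like the sketch below. This is illustrative, not the author's exact script: `output_dir` and the evaluation strategy are assumptions, and the AdamW betas/epsilon and linear scheduler are the `transformers` defaults.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-fintuned-panx-ta-hi",  # assumed
    learning_rate=5e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    num_train_epochs=3,
    eval_strategy="epoch",       # assumed; matches the per-epoch table below
    lr_scheduler_type="linear",  # the transformers default
)
# AdamW with betas=(0.9, 0.999) and epsilon=1e-08 is the default optimizer.
```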

---

## Results

| Epoch | Training Loss | Validation Loss | F1     |
|-------|---------------|-----------------|--------|
| 1.0   | 0.1886        | 0.2413          | 0.8096 |
| 2.0   | 0.1252        | 0.2415          | 0.8201 |
| 3.0   | 0.0752        | 0.2480          | 0.8347 |
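
The reported F1 is the entity-level micro-averaged F1, conventionally computed with `seqeval` for PAN-X (an assumption consistent with standard practice; the card does not name the metric implementation):

```python
import evaluate

# Hedged sketch of the metric computation; requires `pip install evaluate seqeval`
seqeval = evaluate.load("seqeval")

predictions = [["B-PER", "I-PER", "O", "B-LOC"]]
references = [["B-PER", "I-PER", "O", "B-LOC"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"])  # 1.0 for this toy example
```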

---

## Framework Versions

- Transformers: 4.47.1
- PyTorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0