---
license: mit
tags:
- token-classification
- ner
- multilingual
- tamil
- hindi
- panx
datasets:
- xtreme
- pan-x
language:
- ta
- hi
metrics:
- f1
model-index:
- name: xlm-roberta-base-fintuned-panx-ta-hi
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: PAN-X
      type: pan-x
    metrics:
    - type: f1
      value: 0.8347
    - type: loss
      value: 0.248
---
# xlm-roberta-base-fintuned-panx-ta-hi
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the PAN-X dataset for **Tamil (ta)** and **Hindi (hi)**. It is fine-tuned for Named Entity Recognition (NER) and achieves the following results on the evaluation set:
- Loss: 0.2480
- F1: 0.8347
## Model Description
The model is based on XLM-RoBERTa, a multilingual transformer-based architecture, and fine-tuned for NER tasks in Tamil and Hindi.
The model predicts three entity types: LOC (location), PER (person), and ORG (organization), using the BIO tagging scheme: the `B-` prefix marks the first token of an entity, and the `I-` prefix marks tokens that continue the same entity.
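To confirm the exact tag set and ordering, you can inspect the checkpoint's config (a quick check; the mapping shown in the comment is the typical PAN-X/WikiANN layout, not verified against this checkpoint):
```python
from transformers import AutoConfig

# Inspect the label mapping shipped with the checkpoint
config = AutoConfig.from_pretrained("Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi")
print(config.id2label)
# Typical PAN-X/WikiANN layout (may differ for this checkpoint):
# {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC'}
```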
## Intended Uses & Limitations
### Intended Uses:
- Named Entity Recognition (NER) tasks in Tamil and Hindi.
### Limitations:
- Performance may degrade on languages or domains not included in the training data.
- Not intended for general text classification or other NLP tasks.
---
## How to Use the Model
You can load and use the model for Named Entity Recognition as follows:
### Installation
Ensure you have the `transformers` and `torch` libraries installed. Install them via pip if necessary:
```bash
pip install transformers torch
```
### Code Example
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and model
model_name = "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline that merges subword tokens into whole entities
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example texts in Tamil, Hindi, and Malayalam
example_texts = [
    "அப்துல் கலாம் சென்னை நகரத்தில் ஐஎஸ்ஆர்ஓ நிறுவனத்துக்கு சென்றார்.",  # Tamil: Abdul Kalam went to the ISRO organization in Chennai city.
    "सचिन तेंदुलकर ने मुंबई में बीसीसीआई के कार्यालय का दौरा किया।",  # Hindi: Sachin Tendulkar visited the BCCI office in Mumbai.
    "മഹാത്മാ ഗാന്ധി തിരുവനന്തപുരം നഗരത്തിലെ ഐഎസ്ആർഒ ഓഫീസ് സന്ദർശിച്ചു."  # Malayalam: Mahatma Gandhi visited the ISRO office in Thiruvananthapuram city.
]

# Perform Named Entity Recognition and print each detected entity
for text in example_texts:
    results = ner_pipeline(text)
    print(f"Input Text: {text}")
    for entity in results:
        print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")
    print()
```
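If a GPU is available, you can pass a `device` argument when building the pipeline (an optional tweak, not required to use the model):
```python
import torch
from transformers import pipeline

# Run on GPU 0 when available, otherwise fall back to CPU (-1)
device = 0 if torch.cuda.is_available() else -1
ner_pipeline = pipeline(
    "ner",
    model="Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi",
    aggregation_strategy="simple",
    device=device,
)
```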
---
## Training and Evaluation Data
The model was fine-tuned on the **PAN-X** dataset, which is part of the XTREME benchmark, specifically for Tamil and Hindi.
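For reference, the two PAN-X subsets can be loaded through the `datasets` library using the XTREME config names (a sketch; splits follow the standard train/validation/test layout):
```python
from datasets import load_dataset

# PAN-X (WikiANN) subsets for Tamil and Hindi from the XTREME benchmark
panx_ta = load_dataset("xtreme", name="PAN-X.ta")
panx_hi = load_dataset("xtreme", name="PAN-X.hi")

print(panx_ta)               # DatasetDict with train/validation/test splits
print(panx_ta["train"][0])   # tokens and NER tags for one example
```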
---
## Training Procedure
### Hyperparameters
- Learning Rate: `5e-05`
- Batch Size: `24` (both training and evaluation)
- Epochs: `3`
- Optimizer: `AdamW` with `betas=(0.9, 0.999)` and `epsilon=1e-08`
- Learning Rate Scheduler: `Linear`
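These settings correspond to a `Trainer` configuration roughly like the following (a sketch; `output_dir` and the per-epoch evaluation strategy are assumptions, while AdamW with the listed betas/epsilon and the linear schedule are the `Trainer` defaults):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-fintuned-panx-ta-hi",  # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=24,
    num_train_epochs=3,
    eval_strategy="epoch",        # assumption: evaluate once per epoch, matching the results below
    lr_scheduler_type="linear",   # Trainer default
)
```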
---
## Results
| Epoch | Training Loss | Validation Loss | F1 |
|-------|---------------|-----------------|--------|
| 1.0 | 0.1886 | 0.2413 | 0.8096 |
| 2.0 | 0.1252 | 0.2415 | 0.8201 |
| 3.0 | 0.0752 | 0.2480 | 0.8347 |
---
## Framework Versions
- Transformers: 4.47.1
- PyTorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0 |