---
license: mit
tags:
- token-classification
- ner
- multilingual
- tamil
- hindi
- panx
datasets:
- xtreme
- pan-x
language:
- ta
- hi
model-index:
- name: xlm-roberta-base-fintuned-panx-ta-hi
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: PAN-X
      type: pan-x
    metrics:
    - type: f1
      value: 0.8347
    - type: loss
      value: 0.248
metrics:
- f1
---

# xlm-roberta-base-fintuned-panx-ta-hi

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the PAN-X dataset for **Tamil (ta)** and **Hindi (hi)**. It is fine-tuned for Named Entity Recognition (NER) and achieves the following results on the evaluation set:

- Loss: 0.2480
- F1: 0.8347

## Model Description

The model is based on XLM-RoBERTa, a multilingual transformer architecture, and is fine-tuned for NER in Tamil and Hindi.

Entity types: LOC (Location), PER (Person), and ORG (Organization).

The `B-` prefix marks the beginning of an entity, and the `I-` prefix marks a token that continues the same entity.

## Intended Uses & Limitations

### Intended Uses:

- Named Entity Recognition (NER) tasks in Tamil and Hindi.

### Limitations:

- Performance may degrade on languages or domains not included in the training data.
- Not intended for general text classification or other NLP tasks.

---

## How to Use the Model

You can load and use the model for Named Entity Recognition as follows:

### Installation

Ensure you have the `transformers` and `torch` libraries installed.
Install them via pip if necessary:

```bash
pip install transformers torch
```

### Code Example

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the tokenizer and model
model_name = "Lokeshwaran/xlm-roberta-base-fintuned-panx-ta-hi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create an NER pipeline; "simple" aggregation merges sub-word tokens into whole entities
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example sentences in Tamil and Hindi, plus one in Malayalam
# (Malayalam is not one of the fine-tuning languages)
example_texts = [
    "அப்துல் கலாம் சென்னை நகரத்தில் ஐஎஸ்ஆர்ஓ நிறுவனத்துக்கு சென்றார்.",  # Tamil: Abdul Kalam went to the ISRO organization in Chennai city.
    "सचिन तेंदुलकर ने मुंबई में बीसीसीआई के कार्यालय का दौरा किया।",  # Hindi: Sachin Tendulkar visited the BCCI office in Mumbai.
    "മഹാത്മാ ഗാന്ധി തിരുവനന്തപുരം നഗരത്തിലെ ഐഎസ്ആർഒ ഓഫീസ് സന്ദർശിച്ചു.",  # Malayalam: Mahatma Gandhi visited the ISRO office in Thiruvananthapuram city.
]

# Perform Named Entity Recognition on each sentence
for text in example_texts:
    results = ner_pipeline(text)
    print(f"Input Text: {text}")
    for entity in results:
        print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")
    print()
```

---

## Training and Evaluation Data

The model was fine-tuned on the Tamil and Hindi subsets of the **PAN-X** dataset, which is part of the XTREME benchmark.

---

## Training Procedure

### Hyperparameters

- Learning Rate: `5e-05`
- Batch Size: `24` (both training and evaluation)
- Epochs: `3`
- Optimizer: `AdamW` with `betas=(0.9, 0.999)` and `epsilon=1e-08`
- Learning Rate Scheduler: `Linear`

---

## Results

| Epoch | Training Loss | Validation Loss | F1     |
|-------|---------------|-----------------|--------|
| 1.0   | 0.1886        | 0.2413          | 0.8096 |
| 2.0   | 0.1252        | 0.2415          | 0.8201 |
| 3.0   | 0.0752        | 0.2480          | 0.8347 |

---

## Framework Versions

- Transformers: 4.47.1
- PyTorch: 2.5.1+cu121
- Datasets: 3.2.0
- Tokenizers: 0.21.0
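
---

## Note on B-/I- Labels

To make the `B-`/`I-` prefixes from the Model Description concrete, here is a minimal sketch of how a sequence of per-token labels can be grouped into whole entities. It does not use the model itself: the `group_bio_tags` helper, the English sample tokens, and the hand-written tags are hypothetical illustrations, and the `O` tag for non-entity tokens is an assumption about the label set.

```python
from typing import List, Optional, Tuple


def group_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens tagged with B-/I- prefixes into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-PER" -> ("B", "PER"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")

        if prefix == "B" or (prefix == "I" and label != current_type):
            # A B- tag (or an I- tag with no matching open entity) starts a new entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label
        elif prefix == "I":
            # An I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # Any other tag (assumed "O") closes the open entity, if there is one
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sequence
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Hypothetical, hand-tagged sample (not model output)
tokens = ["Sachin", "Tendulkar", "visited", "the", "BCCI", "office", "in", "Mumbai"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "O", "B-LOC"]

entities = group_bio_tags(tokens, tags)
for text, label in entities:
    print(f"{label}: {text}")
```

In the pipeline example above, `aggregation_strategy="simple"` already performs a comparable grouping, so a helper like this is only needed when working with raw per-token predictions.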