---
license: apache-2.0
language:
- ne
base_model: NepBERTa/NepBERTa
tags:
- token-classification
- ner
- nepali
datasets:
- custom
metrics:
- f1
- precision
- recall
---
# Model Card for Finetuned NepBertA-NER
This model is a fine-tuned version of the **NepBERTa** model, specifically trained for Named Entity Recognition (NER) tasks in the Nepali language. It recognizes entities such as persons (PER), organizations (ORG), and locations (LOC) in Nepali text. The model has been trained on a custom dataset and supports token classification for the following entity tags:
- `O` (Outside any named entity)
- `B-PER` (Beginning of a person’s name)
- `I-PER` (Inside of a person’s name)
- `B-ORG` (Beginning of an organization)
- `I-ORG` (Inside of an organization)
- `B-LOC` (Beginning of a location)
- `I-LOC` (Inside of a location)
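
The tags above follow the BIO scheme, so consecutive `B-`/`I-` tags of the same type form one entity. A minimal sketch of how such tags could be grouped into entity spans is shown below. The helper name `group_bio_spans` and the sample labels are illustrative assumptions, not part of the model's API or actual model output.

```python
def group_bio_spans(tokens, labels):
    """Merge parallel token/BIO-label lists into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A B- tag (or a stray I- tag of a different type) starts a new span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continuation of the span that is currently open
            current_tokens.append(token)
        else:
            # "O" closes any open span
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written labels for illustration only (not model predictions)
tokens = ["प्रधानमन्त्री", "शेरबहादुर", "देउवा", "हुन्"]
labels = ["O", "B-PER", "I-PER", "O"]
print(group_bio_spans(tokens, labels))
```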
## Model Details
### Model Description
- **Developed by:** Priyanshu Koirala (Synapse Technologies)
- **Model type:** Token Classification (NER)
- **Language(s) (NLP):** Nepali
- **License:** Apache 2.0
- **Finetuned from model:** NepBERTa
## Uses
### Direct Use
The model can be directly used to recognize and classify named entities in Nepali text, such as persons, organizations, and locations. This is useful for text analysis tasks like extracting important information from Nepali documents, news articles, and customer feedback.
### Downstream Use
The model can be further fine-tuned on other similar datasets or integrated into applications for Nepali language processing.
### Out-of-Scope Use
The model is not expected to perform well on text outside the scope of its training data, such as non-Nepali text or text containing entity types other than PER, ORG, and LOC.
## Bias, Risks, and Limitations
As with any NER model, there may be biases in the data that influence how the model identifies and classifies entities. It may struggle with unseen entities, domain-specific jargon, or ambiguous contexts.
### Recommendations
Users should evaluate the model on their own use case and check that their input data resembles the training data. Specialized tasks may require further fine-tuning.
## How to Get Started with the Model
Use the following code to start using the model:
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
tokenizer = AutoTokenizer.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
model.to(device)

# All entity tags the model predicts; "O" is excluded from the results
ENTITY_LABELS = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

def predict_ner_chunked(text, model, tokenizer, device, max_length=512):
    """Return (token, label) pairs for every token tagged as part of an entity."""
    model.eval()
    words = text.split()
    ner_results = []
    # Process the text in chunks of at most `max_length` whitespace-separated words.
    # A chunk that still exceeds `max_length` subword tokens is truncated by the tokenizer.
    for i in range(0, len(words), max_length):
        chunk = ' '.join(words[i:i+max_length])
        tokens = tokenizer(chunk, return_tensors="pt", truncation=True, padding=True, max_length=max_length)
        tokens = {k: v.to(device) for k, v in tokens.items()}
        with torch.no_grad():
            outputs = model(**tokens)
        # Pick the highest-scoring label for each token
        predictions = torch.argmax(outputs.logits, dim=2)
        predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]
        chunk_words = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
        for word, label in zip(chunk_words, predicted_labels):
            # Skip special tokens and anything tagged "O"
            if label in ENTITY_LABELS and word not in ["[CLS]", "[SEP]", "[PAD]"]:
                ner_results.append((word, label))
    return ner_results

# Test the model
text = "सङ्घीय लोकतान्त्रिक गणतन्त्र नेपालको प्रधानमन्त्री शेरबहादुर देउवा हुन्।"
ner_results = predict_ner_chunked(text, model, tokenizer, device)
print(ner_results)
```
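
The `(token, label)` pairs returned by `predict_ner_chunked` are tokenizer pieces, so a single word may appear as several entries. The sketch below shows one possible way to join the pieces back into words. It assumes the tokenizer marks continuation pieces with a `##` prefix (the BERT WordPiece convention), which should be checked against the actual tokenizer output; `merge_wordpieces` is a hypothetical helper, not part of the model or of `transformers`.

```python
def merge_wordpieces(token_label_pairs, continuation_prefix="##"):
    """Join subword pieces into whole words, keeping the first piece's label."""
    merged = []
    for token, label in token_label_pairs:
        if token.startswith(continuation_prefix) and merged:
            # Glue the continuation piece onto the previous word
            word, first_label = merged[-1]
            merged[-1] = (word + token[len(continuation_prefix):], first_label)
        else:
            merged.append((token, label))
    return merged


# Hand-written pieces for illustration only (not model predictions)
pieces = [("शेर", "B-PER"), ("##बहादुर", "I-PER"), ("देउवा", "I-PER")]
print(merge_wordpieces(pieces))
```

This sketch assumes that the pieces of a word arrive consecutively. If tokens tagged `O` were filtered out between two pieces, a continuation piece could attach to the wrong word, so merging before filtering is the safer order.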
## Training Details
### Training Data
The model was trained on a custom-labeled dataset in Nepali, consisting of sentences annotated with named entities for People (PER), Organizations (ORG), and Locations (LOC).
### Training Procedure
- **Optimizer:** AdamW
- **Learning Rate:** 5e-5
- **Batch Size:** 16
- **Epochs:** 5
- **Validation Split:** 20% of the dataset was reserved for validation.
- **Hardware:** Trained on a single GPU.
### Training Hyperparameters
- **Number of labels:** 7 (including O label)
- **Maximum sequence length:** 128 tokens
- **Gradient accumulation:** 1
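
To make the numbers in this section concrete, the sketch below holds out 20% of a dataset for validation and derives the number of optimizer steps implied by a batch size of 16, 5 epochs, and no gradient accumulation. The dataset size of 1,000, the random seed, and the shuffle-then-split strategy are assumptions for illustration; the card does not state how the split was actually made or how large the dataset is.

```python
import math
import random


def train_val_split(examples, val_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and hold out `val_fraction` for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical dataset of 1,000 sentence IDs (the real size is not stated)
train, val = train_val_split(range(1000))

batch_size = 16
epochs = 5
gradient_accumulation = 1

# One optimizer step per batch when gradient accumulation is 1
steps_per_epoch = math.ceil(len(train) / (batch_size * gradient_accumulation))
total_steps = steps_per_epoch * epochs
print(len(train), len(val), steps_per_epoch, total_steps)
```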
## Evaluation
### Metrics
The model was evaluated using the seqeval metric, with the following results on the validation set:
- **F1 Score:** 0.89
- **Precision:** 0.86
- **Recall:** 0.90
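
seqeval scores predictions at the entity level: a predicted entity counts as correct only if its type and boundaries both match the gold annotation. The sketch below illustrates that idea on sets of `(type, start, end)` spans. It is not seqeval's implementation, the function name `span_prf` and the sample spans are made up for illustration, and the averaging used for the scores reported above is not stated in this card.

```python
def span_prf(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 over (type, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative spans only: the LOC prediction has the wrong end boundary,
# so it does not count as a match, and the ORG entity is missed entirely.
gold = {("PER", 4, 5), ("LOC", 2, 2), ("ORG", 7, 8)}
pred = {("PER", 4, 5), ("LOC", 2, 3)}
precision, recall, f1 = span_prf(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```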
## Citation for the Base Model
If you use this model or the base model in your work, please consider citing **NepBERTa** as follows:
```bibtex
@inproceedings{timilsina2022nepberta,
title={NepBERTa: Nepali language model trained in a large corpus},
author={Timilsina, Sulav and Gautam, Milan and Bhattarai, Binod},
booktitle={Proceedings of the 2nd conference of the Asia-pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing},
year={2022},
organization={Association for Computational Linguistics (ACL)}
}
```
## Citation
If you use this model in your research, please consider citing it:
```bibtex
@misc{nepali_ner,
author = {Synapse Technologies},
title = {Finetuned NepBertA-NER for Nepali},
year = {2024},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/SynapseHQ/Finetuned-NER-NepBertA}},
}
```