	---
	license: apache-2.0
	language:
	- ne
	base_model: NepBERTa/NepBERTa
	tags:
	- token-classification
	- ner
	- nepali
	datasets:
	- custom
	metrics:
	- f1
	- precision
	- recall
	---

	# Model Card for Finetuned NepBertA-NER

	This model is a fine-tuned version of the NepBERTa model, specifically trained for Named Entity Recognition (NER) tasks in the Nepali language. It recognizes entities such as persons (PER), organizations (ORG), and locations (LOC) in Nepali text. The model has been trained on a custom dataset and supports token classification for the following entity tags:

	- `O` (Outside: the token is not part of any named entity)
	- `B-PER` (Beginning of a person’s name)
	- `I-PER` (Inside of a person’s name)
	- `B-ORG` (Beginning of an organization)
	- `I-ORG` (Inside of an organization)
	- `B-LOC` (Beginning of a location)
	- `I-LOC` (Inside of a location)
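To make the tag scheme concrete, the sketch below builds a label-to-id mapping for these seven tags. The ordering shown is an assumption made for illustration. The authoritative mapping is whatever `model.config.id2label` reports for the released checkpoint.

```python
# Hypothetical label order, for illustration only; the released checkpoint's
# own mapping lives in model.config.id2label and may differ.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

label2id = {label: idx for idx, label in enumerate(LABELS)}
id2label = {idx: label for label, idx in label2id.items()}


def entity_type(label):
    """Return the entity type of a BIO tag, or None for the `O` tag."""
    return None if label == "O" else label.split("-", 1)[1]


print(id2label)
print(entity_type("B-ORG"))  # ORG
```

The `B-`/`I-` prefix only marks position inside an entity, so `B-ORG` and `I-ORG` both map to the type `ORG`.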

	## Model Details

	### Model Description

	- Developed by: Priyanshu Koirala (Synapse Technologies)
	- Model type: Token Classification (NER)
	- Language(s) (NLP): Nepali
	- License: Apache 2.0
	- Finetuned from model: NepBERTa


	## Uses

	### Direct Use
	The model can be directly used to recognize and classify named entities in Nepali text, such as persons, organizations, and locations. This is useful for text analysis tasks like extracting important information from Nepali documents, news articles, and customer feedback.

	### Downstream Use
	The model can be further fine-tuned on other similar datasets or integrated into applications for Nepali language processing.

	### Out-of-Scope Use
	The model is not expected to perform well on text outside the scope of its training data, such as text containing entity types it was not trained on or text in languages other than Nepali.

	## Bias, Risks, and Limitations

	As with any NER model, there may be biases in the data that influence how the model identifies and classifies entities. It may struggle with unseen entities, domain-specific jargon, or ambiguous contexts.

	### Recommendations
	Users should evaluate the model on data from their own use case before relying on it. Performance is likely to be best when the input text resembles the training data, and specialized domains may require further fine-tuning.

	## How to Get Started with the Model

	The following code loads the model and tokenizer, then tags a sample sentence:

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

	# Load model and tokenizer
	model = AutoModelForTokenClassification.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
	tokenizer = AutoTokenizer.from_pretrained("SynapseHQ/Finetuned-NER-NepBertA")
	model.to(device)

	def predict_ner_chunked(text, model, tokenizer, device, max_length=512):
	    """Tag `text` chunk by chunk and return (token, label) pairs for entity tokens."""
	    model.eval()
	    words = text.split()
	    ner_results = []

	    # Chunks are measured in words, but the tokenizer limit is in tokens,
	    # so a very long chunk can still be truncated by the tokenizer.
	    for i in range(0, len(words), max_length):
	        chunk = ' '.join(words[i:i+max_length])
	        tokens = tokenizer(chunk, return_tensors="pt", truncation=True, padding=True, max_length=max_length)
	        tokens = {k: v.to(device) for k, v in tokens.items()}

	        with torch.no_grad():
	            outputs = model(**tokens)

	        # Pick the highest-scoring label id for every token
	        predictions = torch.argmax(outputs.logits, dim=2)
	        predicted_labels = [model.config.id2label[p.item()] for p in predictions[0]]

	        chunk_words = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
	        for word, label in zip(chunk_words, predicted_labels):
	            # Keep every entity tag (PER, ORG, LOC); skip `O` and special tokens
	            if label != "O" and word not in ["[CLS]", "[SEP]", "[PAD]"]:
	                ner_results.append((word, label))

	    return ner_results

	# Test the model
	text = "सङ्घीय लोकतान्त्रिक गणतन्त्र नेपालको प्रधानमन्त्री शेरबहादुर देउवा हुन्।"
	ner_results = predict_ner_chunked(text, model, tokenizer, device)
	print(ner_results)
	```
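The function above returns one label per sub-word token. A common next step is to merge those tokens into whole entities, which might look like the sketch below. It assumes WordPiece-style continuation pieces prefixed with `##`, which this card does not confirm for the tokenizer, and it expects the unfiltered token/label sequence so that `O` tokens mark entity boundaries. `group_entities` is a hypothetical helper, and the sample input is hand-written for illustration rather than actual model output.

```python
def group_entities(tagged_tokens):
    """Merge (token, label) pairs into (entity_text, entity_type) spans.

    Assumes WordPiece-style continuation pieces that start with "##".
    """
    entities = []
    current_words, current_type = [], None

    def flush():
        # Close the entity currently being built, if any
        nonlocal current_words, current_type
        if current_words:
            entities.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for token, label in tagged_tokens:
        if token.startswith("##"):
            # Continuation piece: glue it onto the previous word of an open entity
            if current_words:
                current_words[-1] += token[2:]
            continue
        if label == "O":
            flush()
            continue
        prefix, etype = label.split("-", 1)
        if prefix == "B" or etype != current_type:
            # A B- tag or a change of type starts a new entity
            flush()
            current_type = etype
        current_words.append(token)

    flush()
    return entities


# Hand-written sample, not real model output
sample = [
    ("नेपाल", "B-LOC"), ("##को", "I-LOC"), ("प्रधानमन्त्री", "O"),
    ("शेर", "B-PER"), ("##बहादुर", "I-PER"), ("देउवा", "I-PER"), ("हुन्", "O"),
]
print(group_entities(sample))
```

Continuation pieces inherit the entity of the word they belong to, so a stray label on a `##` piece does not split a word in two.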

	## Training Details
	### Training Data
	The model was trained on a custom-labeled Nepali dataset consisting of sentences annotated with named entities for persons (PER), organizations (ORG), and locations (LOC).

	### Training Procedure
	- Optimizer: AdamW
	- Learning Rate: 5e-5
	- Batch Size: 16
	- Epochs: 5
	- Validation Split: 20% of the dataset was reserved for validation.
	- Hardware: Trained on a single GPU.

	### Training Hyperparameters
	- Number of labels: 7 (including O label)
	- Maximum sequence length: 128 tokens
	- Gradient accumulation: 1
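As a rough illustration, the settings listed above can be gathered into one configuration and used to estimate the length of a training run. The key names follow `transformers.TrainingArguments`, but the card does not say which training loop was used, so treat that as an assumption. The dataset size below is hypothetical because the card does not report one.

```python
import math

# Values come from the lists above; the key names mirror
# transformers.TrainingArguments, whose use here is an assumption.
training_kwargs = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,
    "num_train_epochs": 5,
    "gradient_accumulation_steps": 1,
}
MAX_SEQ_LENGTH = 128      # tokens per example
VALIDATION_PERCENT = 20   # share of the dataset held out for validation


def total_optimizer_steps(num_examples):
    """Estimate optimizer steps for a dataset of `num_examples` sentences."""
    val_examples = num_examples * VALIDATION_PERCENT // 100
    train_examples = num_examples - val_examples
    effective_batch = (training_kwargs["per_device_train_batch_size"]
                       * training_kwargs["gradient_accumulation_steps"])
    steps_per_epoch = math.ceil(train_examples / effective_batch)
    return steps_per_epoch * training_kwargs["num_train_epochs"]


# Hypothetical dataset size: 8,000 training sentences -> 500 steps per epoch
print(total_optimizer_steps(10_000))  # 2500
```

With gradient accumulation set to 1, the effective batch size equals the per-device batch size of 16 on a single GPU.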

	## Evaluation

	### Metrics

	The model was evaluated using the seqeval metric, with the following results on the validation set:

	- F1 Score: 0.89
	- Precision: 0.86
	- Recall: 0.90
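For readers unfamiliar with entity-level scoring, the sketch below shows how precision, recall, and F1 can be computed over whole entity spans rather than individual tokens. It is a simplified, micro-averaged re-implementation written for illustration, not the `seqeval` code itself. The card does not state which averaging mode was used, so the numbers above may have been computed differently. `extract_spans` and `entity_scores` are hypothetical helper names.

```python
def extract_spans(labels):
    """Return the set of (type, start, end) entity spans in a BIO sequence."""
    spans, start, etype = set(), None, None
    # A trailing "O" sentinel closes any span still open at the end
    for i, label in enumerate(list(labels) + ["O"]):
        inside = label.startswith("I-") and label[2:] == etype
        if not inside:
            if etype is not None:
                spans.add((etype, start, i))
            if label == "O":
                start, etype = None, None
            else:
                start, etype = i, label[2:]
    return spans


def entity_scores(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = n_pred = n_true = 0
    for true, pred in zip(true_seqs, pred_seqs):
        true_spans, pred_spans = extract_spans(true), extract_spans(pred)
        tp += len(true_spans & pred_spans)  # exact type and boundary match
        n_pred += len(pred_spans)
        n_true += len(true_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the person span matches, the location is mislabelled as ORG
y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_scores(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

A predicted entity counts as correct only when both its type and its exact boundaries match the reference, which is stricter than token-level accuracy.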

	## Citation for the Base Model

	If you use this model or the base model in your work, please consider citing NepBERTa as follows:

	```bibtex
	@inproceedings{timilsina2022nepberta,
	  title={NepBERTa: Nepali language model trained in a large corpus},
	  author={Timilsina, Sulav and Gautam, Milan and Bhattarai, Binod},
	  booktitle={Proceedings of the 2nd conference of the Asia-pacific chapter of the association for computational linguistics and the 12th international joint conference on natural language processing},
	  year={2022},
	  organization={Association for Computational Linguistics (ACL)}
	}
	```

	## Citation

	If you use this model in your research, please consider citing it:

	```bibtex
	@misc{nepali_ner,
	  author = {Synapse Technologies},
	  title = {Finetuned NepBertA-NER for Nepali},
	  year = {2024},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/SynapseHQ/Finetuned-NER-NepBertA}},
	}
	```