	---
	language:
	- az
	- tr
	thumbnail: "URL_to_thumbnail_image" # Replace with an actual URL or remove this line if unavailable
	tags:
	- NER
	- token-classification
	- Azerbaijani
	- Turkish
	- transformers
	license: "mit" # Adjust to the correct license you wish to use
	datasets:
	- LocalDoc/azerbaijani-ner-dataset
	metrics:
	- precision
	- recall
	- f1
	base_model: "akdeniz27/bert-base-turkish-cased-ner"
	pipeline_tag: "token-classification"
	---

	# Azeri-Turkish-BERT-NER

	## Model Description

Azeri-Turkish-BERT-NER is a fine-tuned version of `akdeniz27/bert-base-turkish-cased-ner` for Named Entity Recognition (NER) in Azerbaijani and Turkish. It starts from a Turkish BERT checkpoint that was already trained for NER and adapts it to Azerbaijani data, with the aim of remaining usable on Turkish text.

The model classifies named entities into categories such as persons, organizations, locations, and dates, which makes it suitable for information extraction and text-processing pipelines over Azerbaijani and Turkish text.

	## Model Details

- Base Model: `akdeniz27/bert-base-turkish-cased-ner` (from the Hugging Face Hub)
- Task: Named Entity Recognition (NER)
- Languages: Azerbaijani, Turkish
- Fine-Tuned On: `LocalDoc/azerbaijani-ner-dataset` (Azerbaijani NER data)
- Input Text Format: Plain text (tokenized by the model's tokenizer)
- Model Type: BERT-based transformer for token classification

	## Training Details

The model was fine-tuned using the Hugging Face `transformers` and `datasets` libraries. The fine-tuning configuration was:

- Tokenizer: `AutoTokenizer` loaded from the `akdeniz27/bert-base-turkish-cased-ner` checkpoint
- Max Sequence Length: 128 tokens
- Batch Size: 128 (training and evaluation)
- Learning Rate: 2e-5
- Number of Epochs: 10 (maximum; the Training Metrics table logs 9 epochs)
- Weight Decay: 0.005
- Early Stopping: patience of 5 epochs, monitored on the F1 metric
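
The configuration above could be expressed roughly as follows. This is a minimal sketch, not the author's training script: the keyword names follow `transformers.TrainingArguments` and `EarlyStoppingCallback`, while the output directory, the per-epoch evaluation and save schedule, and reading the batch size as per-device are assumptions. The helper name `build_training_setup` is hypothetical.

```python
import inspect

# Hyperparameters copied from the list above. Reading "Batch Size: 128" as a
# per-device batch size is an assumption.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 128,
    "per_device_eval_batch_size": 128,
    "num_train_epochs": 10,
    "weight_decay": 0.005,
    "save_strategy": "epoch",          # assumed: checkpoint once per epoch
    "load_best_model_at_end": True,    # needed for early stopping on a metric
    "metric_for_best_model": "f1",
    "greater_is_better": True,
}
EARLY_STOPPING_PATIENCE = 5
MAX_SEQ_LENGTH = 128  # passed to the tokenizer as max_length, not to TrainingArguments


def build_training_setup(output_dir="ner-output"):
    """Hypothetical helper: returns (TrainingArguments, EarlyStoppingCallback)."""
    # Imported lazily so the constants above can be inspected without transformers.
    from transformers import EarlyStoppingCallback, TrainingArguments

    # Newer releases call the evaluation schedule `eval_strategy`,
    # older ones `evaluation_strategy`; pick whichever this install accepts.
    params = inspect.signature(TrainingArguments.__init__).parameters
    eval_key = "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"

    args = TrainingArguments(output_dir=output_dir, **{eval_key: "epoch"}, **HYPERPARAMS)
    callback = EarlyStoppingCallback(early_stopping_patience=EARLY_STOPPING_PATIENCE)
    return args, callback
```

Both objects would then be handed to a `Trainer` (the callback through its `callbacks` argument) together with a `compute_metrics` function that reports an `f1` key.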

	### Training Dataset

The training data comes from [LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset). The dataset was preprocessed so that each token is aligned with its NER tag.
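
The source does not describe the alignment step in detail. A common convention for BERT-style token classification, shown here as a sketch under that assumption, is to give each word's label to its first subword and mark special tokens and continuation subwords with `-100` so the loss ignores them. The function name `align_labels` and the example values are hypothetical; `word_ids` stands for the list returned by a fast tokenizer's `word_ids()` method.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Map word-level label ids onto subword tokens.

    word_ids: one entry per subword token, giving the index of the word it
              came from, or None for special tokens such as [CLS] and [SEP].
    word_labels: one label id per original word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never contributes to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword: ignored unless explicitly requested.
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two subwords.
example_word_ids = [None, 0, 1, 1, 2, None]
example_word_labels = [1, 2, 0]
print(align_labels(example_word_ids, example_word_labels))
```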

	### Label Categories

	The model supports the following entity categories:
	- Person (B-PERSON, I-PERSON)
	- Location (B-LOCATION, I-LOCATION)
	- Organization (B-ORGANISATION, I-ORGANISATION)
	- Date (B-DATE, I-DATE)
	- Time (B-TIME, I-TIME)
	- Money (B-MONEY, I-MONEY)
	- Percentage (B-PERCENTAGE, I-PERCENTAGE)
	- Facility (B-FACILITY, I-FACILITY)
	- Product (B-PRODUCT, I-PRODUCT)
	- ... (additional categories as specified in the training label list)
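
Each category contributes a `B-` (beginning) and an `I-` (inside) tag, plus a shared `O` tag for tokens outside any entity. The sketch below builds label maps for the nine categories listed above only. The ordering and the helper name `build_label_maps` are assumptions for illustration; the mapping actually used by the model is stored in its configuration (`model.config.id2label`).

```python
# Categories named in the list above; the full training label list has more.
ENTITY_TYPES = [
    "PERSON", "LOCATION", "ORGANISATION", "DATE", "TIME",
    "MONEY", "PERCENTAGE", "FACILITY", "PRODUCT",
]


def build_label_maps(entity_types):
    """Build BIO label maps: 'O' first, then a B-/I- pair per entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


id2label, label2id = build_label_maps(ENTITY_TYPES)
print(len(id2label), id2label[1], id2label[2])
```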

	### Training Metrics

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|--------|----------|
| 1 | 0.433100 | 0.306711 | 0.739000 | 0.693282 | 0.715412 |
| 2 | 0.292700 | 0.275796 | 0.781565 | 0.688937 | 0.732334 |
| 3 | 0.250600 | 0.275115 | 0.758261 | 0.709425 | 0.733031 |
| 4 | 0.233700 | 0.273087 | 0.756184 | 0.716277 | 0.735689 |
| 5 | 0.214800 | 0.278477 | 0.756051 | 0.710996 | 0.732832 |
| 6 | 0.199200 | 0.286102 | 0.755068 | 0.717012 | 0.735548 |
| 7 | 0.192800 | 0.297157 | 0.742326 | 0.725802 | 0.733971 |
| 8 | 0.178900 | 0.304510 | 0.743206 | 0.723930 | 0.733442 |
| 9 | 0.171700 | 0.313845 | 0.743145 | 0.725535 | 0.734234 |
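
The table stops at epoch 9 although 10 epochs were configured. One plausible reading, not confirmed by the source, is that early stopping ended the run: F1 peaks at epoch 4, and a patience of 5 is exhausted at epoch 9. The sketch below replays that rule on the F1 column; `simulate_early_stopping` is a hypothetical helper, not the `transformers` callback itself.

```python
# F1 column of the training metrics table, epochs 1 to 9.
F1_BY_EPOCH = [
    0.715412, 0.732334, 0.733031, 0.735689, 0.732832,
    0.735548, 0.733971, 0.733442, 0.734234,
]


def simulate_early_stopping(scores, patience):
    """Return (best_epoch, stop_epoch); stop_epoch is None if never triggered."""
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score = score
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, None


best_epoch, stop_epoch = simulate_early_stopping(F1_BY_EPOCH, patience=5)
print(f"best epoch: {best_epoch}, stopped after epoch: {stop_epoch}")
```

Under this reading, the checkpoint from epoch 4 would be the one kept, which is also the epoch with the lowest validation loss in the table.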

	### Category-Wise Evaluation Metrics

| Category | Precision | Recall | F1-Score | Support |
|---------------|-----------|--------|----------|---------|
| ART | 0.49 | 0.14 | 0.21 | 1988 |
| DATE | 0.49 | 0.48 | 0.49 | 844 |
| EVENT | 0.88 | 0.36 | 0.51 | 84 |
| FACILITY | 0.72 | 0.68 | 0.70 | 1146 |
| LAW | 0.57 | 0.64 | 0.60 | 1103 |
| LOCATION | 0.77 | 0.79 | 0.78 | 8806 |
| MONEY | 0.62 | 0.57 | 0.59 | 532 |
| ORGANISATION | 0.64 | 0.65 | 0.64 | 527 |
| PERCENTAGE | 0.77 | 0.83 | 0.80 | 3679 |
| PERSON | 0.87 | 0.81 | 0.84 | 6924 |
| PRODUCT | 0.82 | 0.80 | 0.81 | 2653 |
| TIME | 0.55 | 0.50 | 0.52 | 1634 |

	- Micro Average: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
	- Macro Average: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
	- Weighted Average: Precision: 0.74, Recall: 0.72, F1-Score: 0.72
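
The macro and weighted averages can be re-derived from the category table as a sanity check. The macro average is the unweighted mean over categories, and the weighted average weights each category by its support. The sketch below uses the rounded table values, so results can differ from the reported figures in the second decimal place. The micro average is not recomputed here because it depends on raw counts that the table does not give.

```python
# (precision, recall, f1, support) per category, copied from the table above.
PER_CATEGORY = {
    "ART": (0.49, 0.14, 0.21, 1988),
    "DATE": (0.49, 0.48, 0.49, 844),
    "EVENT": (0.88, 0.36, 0.51, 84),
    "FACILITY": (0.72, 0.68, 0.70, 1146),
    "LAW": (0.57, 0.64, 0.60, 1103),
    "LOCATION": (0.77, 0.79, 0.78, 8806),
    "MONEY": (0.62, 0.57, 0.59, 532),
    "ORGANISATION": (0.64, 0.65, 0.64, 527),
    "PERCENTAGE": (0.77, 0.83, 0.80, 3679),
    "PERSON": (0.87, 0.81, 0.84, 6924),
    "PRODUCT": (0.82, 0.80, 0.81, 2653),
    "TIME": (0.55, 0.50, 0.52, 1634),
}
METRIC_INDEX = {"precision": 0, "recall": 1, "f1": 2}
SUPPORT_INDEX = 3


def macro_average(rows, metric):
    """Unweighted mean of a metric over all categories."""
    idx = METRIC_INDEX[metric]
    return sum(row[idx] for row in rows) / len(rows)


def weighted_average(rows, metric):
    """Mean of a metric weighted by each category's support."""
    idx = METRIC_INDEX[metric]
    total_support = sum(row[SUPPORT_INDEX] for row in rows)
    return sum(row[idx] * row[SUPPORT_INDEX] for row in rows) / total_support


rows = list(PER_CATEGORY.values())
for metric in METRIC_INDEX:
    print(f"{metric}: macro={macro_average(rows, metric):.3f}, "
          f"weighted={weighted_average(rows, metric):.3f}")
```

The gap between the macro and weighted F1 reflects that the weakest categories, such as ART and DATE, have much smaller support than PERSON and LOCATION.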

	## Usage

	### Loading the Model

	To use the model for NER tasks, you can load it using the Hugging Face `transformers` library:

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

	# Load the model and tokenizer
	tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
	model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")

	# Initialize the NER pipeline
	ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

	# Example text
	text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."

	# Run NER
	results = ner_pipeline(text)
	print(results)
	```

	### Inputs and Outputs

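- Input: Plain text in Azerbaijani or Turkish.
- Output: A list of dictionaries, one per detected entity, each with the entity type, the matched text, character offsets, and a confidence score.

With `aggregation_strategy="simple"`, the pipeline merges `B-`/`I-` pieces into one span and reports the entity type without the prefix. Example output (illustrative values; actual spans and scores may differ):
```
[
{'entity_group': 'PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
{'entity_group': 'ORGANISATION', 'word': 'Pasha Sığorta', 'start': 20, 'end': 33, 'score': 0.95}
]
```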
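
Downstream code often needs the entities grouped by type, with low-confidence predictions dropped. The sketch below shows one way to do that, using the character offsets to recover the exact surface text. It runs on a hand-written sample in the shape of the pipeline output rather than on real model predictions; `group_entities`, `strip_bio_prefix`, and the `min_score` threshold are hypothetical choices.

```python
from collections import defaultdict


def strip_bio_prefix(label):
    """Drop a leading 'B-' or 'I-' if the label still carries one."""
    return label[2:] if label[:2] in ("B-", "I-") else label


def group_entities(results, text, min_score=0.0):
    """Group entity surface forms by type, keeping scores >= min_score."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] < min_score:
            continue
        entity_type = strip_bio_prefix(item["entity_group"])
        # Slice the original text so casing and spacing are preserved.
        grouped[entity_type].append(text[item["start"]:item["end"]])
    return dict(grouped)


text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
# Hand-written sample, not actual model output.
sample_results = [
    {"entity_group": "PERSON", "word": "Shahla", "start": 0, "end": 6, "score": 0.98},
    {"entity_group": "ORGANISATION", "word": "Pasha Sığorta", "start": 20, "end": 33, "score": 0.95},
]
print(group_entities(sample_results, text))
print(group_entities(sample_results, text, min_score=0.96))
```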

	### Evaluation Metrics

The model was evaluated with precision, recall, and F1-score. Per-epoch and per-category results are reported in the Training Metrics and Category-Wise Evaluation Metrics sections.

	## Limitations

- The model may perform poorly on texts that diverge significantly from the training data distribution.
- Rare or unseen entities in Turkish and Azerbaijani may receive lower confidence scores.
- Performance varies by category: in the category-wise evaluation, F1 ranges from 0.21 (ART) to 0.84 (PERSON).
- Further fine-tuning on larger and more diverse datasets may improve generalizability.

	## Model Card

	A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the [Hugging Face model page](https://huggingface.co/IsmatS/Azeri-Turkish-BERT-NER).

	## Citation

	If you use this model, please consider citing:
	```
	@misc{azeri-turkish-bert-ner,
	author = {Ismat Samadov},
	title = {Azeri-Turkish-BERT-NER},
	year = {2024},
	howpublished = {Hugging Face repository},
	}
	```