language:
- az
- tr
tags:
- NER
- token-classification
- Azerbaijani
- Turkish
- transformers
license: mit
datasets:
- LocalDoc/azerbaijani-ner-dataset
metrics:
- precision
- recall
- f1
base_model: akdeniz27/bert-base-turkish-cased-ner
pipeline_tag: token-classification
Azeri-Turkish-BERT-NER
Model Description
The Azeri-Turkish-BERT-NER model is a fine-tuned version of the bert-base-turkish-cased-ner model for Named Entity Recognition (NER) in Azerbaijani and Turkish. It builds on a pre-trained Turkish BERT model and adapts it to Azerbaijani data while preserving compatibility with Turkish entities.
The model identifies and classifies named entities into categories such as persons, organizations, locations, and dates, which makes it suitable for information extraction and data processing over Azerbaijani and Turkish text.
Model Details
- Base Model: bert-base-turkish-cased-ner (adapted from Hugging Face)
- Task: Named Entity Recognition (NER)
- Languages: Azerbaijani, Turkish
- Fine-Tuned On: Custom Azerbaijani NER dataset
- Input Text Format: Plain text with tokenized words
- Model Type: BERT-based transformer for token classification
Training Details
The model was fine-tuned using the Hugging Face transformers and datasets libraries. Here is a brief summary of the fine-tuning configuration:
- Tokenizer: AutoTokenizer from the bert-base-turkish-cased-ner model
- Max Sequence Length: 128 tokens
- Batch Size: 128 (training and evaluation)
- Learning Rate: 2e-5
- Number of Epochs: 10 (maximum; the Training Metrics table ends at epoch 9, which is consistent with early stopping)
- Weight Decay: 0.005
- Optimization Strategy: Early stopping with a patience of 5 epochs based on the F1 metric
Training Dataset
The training dataset is a custom Azerbaijani NER dataset sourced from LocalDoc/azerbaijani-ner-dataset. The dataset was preprocessed to align tokens and NER tags accurately.
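The card does not show the alignment step itself. A common approach with fast tokenizers is to map each subword back to its source word and mask everything else with -100, so the sketch below assumes that approach; it is not necessarily what was done for this model. The function name `align_labels` and the sample inputs are hypothetical.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label id that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to subword level.

    `word_ids` has the shape returned by a fast tokenizer's `word_ids()`:
    None for special tokens, otherwise the index of the source word.
    Only the first subword of each word keeps its label; the rest are masked.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two subwords.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 2, 0]
print(align_labels(word_ids, word_labels))  # [-100, 1, 2, -100, 0, -100]
```

Masking continuation subwords is one design choice; another is to repeat the word's label on every subword, which changes how token-level loss is weighted.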
Label Categories
The model supports the following entity categories:
- Person (B-PERSON, I-PERSON)
- Location (B-LOCATION, I-LOCATION)
- Organization (B-ORGANISATION, I-ORGANISATION)
- Date (B-DATE, I-DATE)
- Time (B-TIME, I-TIME)
- Money (B-MONEY, I-MONEY)
- Percentage (B-PERCENTAGE, I-PERCENTAGE)
- Facility (B-FACILITY, I-FACILITY)
- Product (B-PRODUCT, I-PRODUCT)
- ... (additional categories as specified in the training label list)
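The labels follow the BIO scheme: a B- tag opens an entity and I- tags of the same type extend it. As a rough illustration of how a tag sequence maps to entity spans, here is a minimal sketch; `bio_to_spans` is a hypothetical helper and not part of the model or of transformers.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans = []
    current_type = None
    start = 0
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        # An I- tag continues the open span only if the type matches.
        if prefix == "I" and entity_type == current_type:
            continue
        if current_type is not None:
            spans.append((current_type, start, i))  # close the open span
            current_type = None
        if prefix in ("B", "I"):
            # B- opens a span; a stray I- is treated as a span start too.
            current_type, start = entity_type, i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans


tags = ["B-PERSON", "I-PERSON", "O", "B-ORGANISATION", "I-ORGANISATION", "O"]
print(bio_to_spans(tags))  # [('PERSON', 0, 2), ('ORGANISATION', 3, 5)]
```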
Training Metrics
Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
---|---|---|---|---|---|
1 | 0.433100 | 0.306711 | 0.739000 | 0.693282 | 0.715412 |
2 | 0.292700 | 0.275796 | 0.781565 | 0.688937 | 0.732334 |
3 | 0.250600 | 0.275115 | 0.758261 | 0.709425 | 0.733031 |
4 | 0.233700 | 0.273087 | 0.756184 | 0.716277 | 0.735689 |
5 | 0.214800 | 0.278477 | 0.756051 | 0.710996 | 0.732832 |
6 | 0.199200 | 0.286102 | 0.755068 | 0.717012 | 0.735548 |
7 | 0.192800 | 0.297157 | 0.742326 | 0.725802 | 0.733971 |
8 | 0.178900 | 0.304510 | 0.743206 | 0.723930 | 0.733442 |
9 | 0.171700 | 0.313845 | 0.743145 | 0.725535 | 0.734234 |
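The table stops at epoch 9 although 10 epochs were configured. That is what early stopping with a patience of 5 on F1 would produce, since F1 peaks at epoch 4. The sketch below replays the F1 column under a simple "strictly better than the best so far" rule; this rule is an assumption, and the actual callback used in training may apply an additional improvement threshold.

```python
from typing import List, Tuple

# F1 per epoch, copied from the Training Metrics table.
F1_BY_EPOCH = [0.715412, 0.732334, 0.733031, 0.735689, 0.732832,
               0.735548, 0.733971, 0.733442, 0.734234]


def simulate_early_stopping(scores: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive evaluations fail to beat
    the best score. If patience is never exhausted, stop_epoch is the
    last epoch in `scores`.
    """
    best_score = float("-inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(scores)


best_epoch, stop_epoch = simulate_early_stopping(F1_BY_EPOCH, patience=5)
print(best_epoch, stop_epoch)  # 4 9
```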
Category-Wise Evaluation Metrics
Category | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
ART | 0.49 | 0.14 | 0.21 | 1988 |
DATE | 0.49 | 0.48 | 0.49 | 844 |
EVENT | 0.88 | 0.36 | 0.51 | 84 |
FACILITY | 0.72 | 0.68 | 0.70 | 1146 |
LAW | 0.57 | 0.64 | 0.60 | 1103 |
LOCATION | 0.77 | 0.79 | 0.78 | 8806 |
MONEY | 0.62 | 0.57 | 0.59 | 532 |
ORGANISATION | 0.64 | 0.65 | 0.64 | 527 |
PERCENTAGE | 0.77 | 0.83 | 0.80 | 3679 |
PERSON | 0.87 | 0.81 | 0.84 | 6924 |
PRODUCT | 0.82 | 0.80 | 0.81 | 2653 |
TIME | 0.55 | 0.50 | 0.52 | 1634 |
- Micro Average: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
- Macro Average: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
- Weighted Average: Precision: 0.74, Recall: 0.72, F1-Score: 0.72
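The macro and weighted averages can be approximately recomputed from the per-category rows: the macro average is the unweighted mean over categories, and the weighted average weights each category by its support. The sketch below does this from the rounded table values, so results may differ from the reported figures in the last digit. The micro average is not recomputed because it needs raw true-positive and false-positive counts, which the card does not list.

```python
# (precision, recall, f1, support) per category, copied from the table above.
REPORT = {
    "ART": (0.49, 0.14, 0.21, 1988),
    "DATE": (0.49, 0.48, 0.49, 844),
    "EVENT": (0.88, 0.36, 0.51, 84),
    "FACILITY": (0.72, 0.68, 0.70, 1146),
    "LAW": (0.57, 0.64, 0.60, 1103),
    "LOCATION": (0.77, 0.79, 0.78, 8806),
    "MONEY": (0.62, 0.57, 0.59, 532),
    "ORGANISATION": (0.64, 0.65, 0.64, 527),
    "PERCENTAGE": (0.77, 0.83, 0.80, 3679),
    "PERSON": (0.87, 0.81, 0.84, 6924),
    "PRODUCT": (0.82, 0.80, 0.81, 2653),
    "TIME": (0.55, 0.50, 0.52, 1634),
}
SUPPORT = 3  # index of the support column in each row


def macro_average(report: dict, column: int) -> float:
    """Unweighted mean of one metric column over all categories."""
    values = [row[column] for row in report.values()]
    return sum(values) / len(values)


def weighted_average(report: dict, column: int) -> float:
    """Mean of one metric column, weighted by each category's support."""
    total = sum(row[SUPPORT] for row in report.values())
    return sum(row[column] * row[SUPPORT] for row in report.values()) / total


for name, column in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(name,
          round(macro_average(REPORT, column), 3),
          round(weighted_average(REPORT, column), 3))
```

The gap between macro F1 and weighted F1 comes from low-scoring categories such as ART and DATE counting as much as LOCATION and PERSON in the macro mean.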
Usage
Loading the Model
To use the model for NER tasks, you can load it using the Hugging Face transformers library:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
# Initialize the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Example text
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
# Run NER
results = ner_pipeline(text)
print(results)
Inputs and Outputs
- Input: Plain text in Azerbaijani or Turkish.
- Output: List of detected entities with entity types and character offsets.
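Example output (illustrative; actual spans and scores depend on the model's predictions). With aggregation_strategy="simple", the B-/I- prefixes are stripped from entity_group:
[
{'entity_group': 'PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
{'entity_group': 'ORGANISATION', 'word': 'Pasha Sığorta', 'start': 20, 'end': 33, 'score': 0.95}
]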
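Downstream code usually filters this list by score and groups entities by type. A minimal sketch of that post-processing follows; `group_entities`, the 0.5 threshold, and the sample records are hypothetical and hand-written in the pipeline's output format, not real model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect entity strings by type, dropping low-confidence predictions."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] < min_score:
            continue
        label = item["entity_group"]
        # Drop a leading B-/I- marker if one is present.
        if label[:2] in ("B-", "I-"):
            label = label[2:]
        grouped[label].append(item["word"])
    return dict(grouped)


# Hypothetical records, shaped like the pipeline output.
sample = [
    {"entity_group": "PERSON", "word": "Shahla", "start": 0, "end": 6, "score": 0.98},
    {"entity_group": "ORGANISATION", "word": "Pasha Sığorta", "start": 20, "end": 33, "score": 0.95},
    {"entity_group": "LOCATION", "word": "Bakı", "start": 40, "end": 44, "score": 0.31},
]
print(group_entities(sample, min_score=0.5))
# {'PERSON': ['Shahla'], 'ORGANISATION': ['Pasha Sığorta']}
```

The threshold trades recall for precision, so a suitable value depends on the application and should be tuned on held-out data.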
Evaluation Metrics
The model was evaluated using precision, recall, and F1-score, as reported in the Training Metrics and Category-Wise Evaluation Metrics sections above.
Limitations
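- The model may have limited performance on texts that diverge significantly from the training data distribution.
- Performance varies widely by category: F1 ranges from 0.21 (ART, with recall of only 0.14) to 0.84 (PERSON), so predictions for ART, DATE, EVENT, and TIME should be treated with more caution.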
- Handling of rare or unseen entities in Turkish and Azerbaijani may result in lower confidence scores.
- Further fine-tuning on larger and more diverse datasets may improve generalizability.
Model Card
A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the Hugging Face model page.
Citation
If you use this model, please consider citing:
@misc{azeri-turkish-bert-ner,
author = {Ismat Samadov},
title = {Azeri-Turkish-BERT-NER},
year = {2024},
howpublished = {Hugging Face repository},
}