---
language:
- az
- tr
thumbnail: "URL_to_thumbnail_image"
tags:
- NER
- token-classification
- Azerbaijani
- Turkish
- transformers
license: "mit"
datasets:
- LocalDoc/azerbaijani-ner-dataset
metrics:
- precision
- recall
- f1
base_model: "akdeniz27/bert-base-turkish-cased-ner"
pipeline_tag: "token-classification"
---

# Azeri-Turkish-BERT-NER

## Model Description

The **Azeri-Turkish-BERT-NER** model is a fine-tuned version of the `bert-base-turkish-cased-ner` model for Named Entity Recognition (NER) in Azerbaijani and Turkish. It starts from a pre-trained Turkish BERT NER model and adapts it to Azerbaijani data while preserving compatibility with Turkish entities.

The model identifies named entities and classifies them into categories such as persons, organizations, locations, and dates, which makes it suitable for information extraction and data processing on Azerbaijani and Turkish texts.

## Model Details

- **Base Model**: `akdeniz27/bert-base-turkish-cased-ner` (from the Hugging Face Hub)
- **Task**: Named Entity Recognition (NER)
- **Languages**: Azerbaijani, Turkish
- **Fine-Tuned On**: Custom Azerbaijani NER dataset ([LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset))
- **Input Text Format**: Plain text, which the tokenizer splits into WordPiece subword tokens
- **Model Type**: BERT-based transformer for token classification

## Training Details

The model was fine-tuned with the Hugging Face `transformers` and `datasets` libraries. The fine-tuning configuration was:

- **Tokenizer**: `AutoTokenizer` loaded from the `bert-base-turkish-cased-ner` model
- **Max Sequence Length**: 128 tokens
- **Batch Size**: 128 (training and evaluation)
- **Learning Rate**: 2e-5
- **Number of Epochs**: up to 10 (the training metrics table ends at epoch 9, which is consistent with early stopping)
- **Weight Decay**: 0.005
- **Optimization Strategy**: Early stopping with a patience of 5 epochs, monitored on the F1 metric

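The interaction between the 10-epoch budget and the patience of 5 can be checked with a small simulation. The sketch below is a plain-Python re-implementation of a patience rule, not the training script itself: `early_stopping_epoch` is a hypothetical helper, and the F1 values are copied from the Training Metrics table in this card. It assumes that an epoch counts as an improvement only when its F1 strictly exceeds the best value seen so far.

```python
# Validation F1 per epoch, copied from the Training Metrics table in this card.
f1_per_epoch = [0.715412, 0.732334, 0.733031, 0.735689, 0.732832,
                0.735548, 0.733971, 0.733442, 0.734234]


def early_stopping_epoch(scores, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience budget is never exhausted.
    Hypothetical helper for illustration; not part of any library.
    """
    best_score = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            # New best: remember it and reset the patience counter.
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch


stop_epoch, best_epoch = early_stopping_epoch(f1_per_epoch, patience=5)
print(f"best epoch: {best_epoch}, stopped after epoch: {stop_epoch}")
```

Under these assumptions the best F1 falls at epoch 4 and the rule triggers after epoch 9, which matches the number of rows in the table. Whether the published weights correspond to the best or the last epoch is not stated in this card.
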
### Training Dataset

The training data is a custom Azerbaijani NER dataset sourced from [LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset). The dataset was preprocessed so that each token is aligned with its NER tag.

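The card does not describe the alignment step in detail. A common convention for BERT-style token classification is sketched below, assuming word-level labels and the `word_ids()` mapping that fast Hugging Face tokenizers return: the first subword of each word keeps the word's label, and special tokens and continuation subwords are masked with `-100`. The helper name `align_labels` and the sample inputs are hypothetical.

```python
IGNORE_INDEX = -100  # value ignored by PyTorch's cross-entropy loss by default


def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Map word-level label ids onto subword tokens.

    word_ids: one entry per subword token; None marks special tokens,
              an int is the index of the source word.
    word_labels: one label id per word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS], [SEP], or padding.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword: mask it unless asked to repeat the label.
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous_word = word_id
    return aligned


# Hypothetical input: three words, the second split into two subword pieces.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 2, 0]
print(align_labels(word_ids, word_labels))
```

Masking continuation subwords keeps the loss and the metrics at word level; passing `label_all_subwords=True` is the usual alternative when every piece should contribute to the loss.
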
### Label Categories

The model supports the following entity categories:

- **Person (B-PERSON, I-PERSON)**
- **Location (B-LOCATION, I-LOCATION)**
- **Organization (B-ORGANISATION, I-ORGANISATION)**
- **Date (B-DATE, I-DATE)**
- **Time (B-TIME, I-TIME)**
- **Money (B-MONEY, I-MONEY)**
- **Percentage (B-PERCENTAGE, I-PERCENTAGE)**
- **Facility (B-FACILITY, I-FACILITY)**
- **Product (B-PRODUCT, I-PRODUCT)**
- ... (additional categories as specified in the training label list)

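The `B-`/`I-` pairs above follow the BIO tagging scheme: `B-` marks the first token of an entity, `I-` marks a continuation, and `O` marks tokens outside any entity. As a rough sketch, a label list in this scheme can be built from category names as shown below. Only the nine categories named explicitly in this list are used, and the id order is an assumption for illustration; the authoritative mapping is the `id2label` entry in the published model's configuration.

```python
# Categories named explicitly in this card; the real training label list
# contains additional categories and may use a different id order.
categories = [
    "PERSON", "LOCATION", "ORGANISATION", "DATE", "TIME",
    "MONEY", "PERCENTAGE", "FACILITY", "PRODUCT",
]

# "O" first, then a B-/I- pair per category (illustrative ordering).
labels = ["O"] + [f"{prefix}-{cat}" for cat in categories for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}


def split_tag(tag):
    """Split a BIO tag into (prefix, category); 'O' has no category."""
    if tag == "O":
        return "O", None
    prefix, _, category = tag.partition("-")
    return prefix, category


print(labels[:5])
print(split_tag("I-ORGANISATION"))
```
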
### Training Metrics

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       |
|-------|---------------|-----------------|-----------|----------|----------|
| 1     | 0.433100      | 0.306711        | 0.739000  | 0.693282 | 0.715412 |
| 2     | 0.292700      | 0.275796        | 0.781565  | 0.688937 | 0.732334 |
| 3     | 0.250600      | 0.275115        | 0.758261  | 0.709425 | 0.733031 |
| 4     | 0.233700      | 0.273087        | 0.756184  | 0.716277 | 0.735689 |
| 5     | 0.214800      | 0.278477        | 0.756051  | 0.710996 | 0.732832 |
| 6     | 0.199200      | 0.286102        | 0.755068  | 0.717012 | 0.735548 |
| 7     | 0.192800      | 0.297157        | 0.742326  | 0.725802 | 0.733971 |
| 8     | 0.178900      | 0.304510        | 0.743206  | 0.723930 | 0.733442 |
| 9     | 0.171700      | 0.313845        | 0.743145  | 0.725535 | 0.734234 |

### Category-Wise Evaluation Metrics

| Category      | Precision | Recall | F1-Score | Support |
|---------------|-----------|--------|----------|---------|
| ART           | 0.49      | 0.14   | 0.21     | 1988    |
| DATE          | 0.49      | 0.48   | 0.49     | 844     |
| EVENT         | 0.88      | 0.36   | 0.51     | 84      |
| FACILITY      | 0.72      | 0.68   | 0.70     | 1146    |
| LAW           | 0.57      | 0.64   | 0.60     | 1103    |
| LOCATION      | 0.77      | 0.79   | 0.78     | 8806    |
| MONEY         | 0.62      | 0.57   | 0.59     | 532     |
| ORGANISATION  | 0.64      | 0.65   | 0.64     | 527     |
| PERCENTAGE    | 0.77      | 0.83   | 0.80     | 3679    |
| PERSON        | 0.87      | 0.81   | 0.84     | 6924    |
| PRODUCT       | 0.82      | 0.80   | 0.81     | 2653    |
| TIME          | 0.55      | 0.50   | 0.52     | 1634    |

- **Micro Average**: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
- **Macro Average**: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
- **Weighted Average**: Precision: 0.74, Recall: 0.72, F1-Score: 0.72

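To make the difference between the averages concrete, the sketch below recomputes the macro and support-weighted F1 from the per-category rows of the table. The macro average treats every category equally, while the weighted average scales each category by its support, so large classes such as LOCATION and PERSON dominate it. The helper names are hypothetical, and the inputs are the rounded table values, so small rounding differences from the reported figures are possible. The micro average cannot be recomputed this way because it needs the raw true-positive, false-positive, and false-negative counts.

```python
# (category, precision, recall, f1, support) rows copied from the table above.
rows = [
    ("ART", 0.49, 0.14, 0.21, 1988),
    ("DATE", 0.49, 0.48, 0.49, 844),
    ("EVENT", 0.88, 0.36, 0.51, 84),
    ("FACILITY", 0.72, 0.68, 0.70, 1146),
    ("LAW", 0.57, 0.64, 0.60, 1103),
    ("LOCATION", 0.77, 0.79, 0.78, 8806),
    ("MONEY", 0.62, 0.57, 0.59, 532),
    ("ORGANISATION", 0.64, 0.65, 0.64, 527),
    ("PERCENTAGE", 0.77, 0.83, 0.80, 3679),
    ("PERSON", 0.87, 0.81, 0.84, 6924),
    ("PRODUCT", 0.82, 0.80, 0.81, 2653),
    ("TIME", 0.55, 0.50, 0.52, 1634),
]


def macro_average(values):
    """Unweighted mean: every category counts equally."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Mean weighted by support: frequent categories count more."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


f1_scores = [row[3] for row in rows]
supports = [row[4] for row in rows]

macro_f1 = macro_average(f1_scores)
weighted_f1 = weighted_average(f1_scores, supports)
print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}")
```

The gap between the two F1 averages reflects low-scoring categories such as ART, which pull the macro average down more than the weighted one.
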
## Usage

### Loading the Model

To use the model for NER tasks, load it with the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")

# Initialize the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."

# Run NER
results = ner_pipeline(text)
print(results)
```

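### Inputs and Outputs

- **Input**: Plain text in Azerbaijani or Turkish.
- **Output**: A list of detected entities, each with its entity type, matched text, character offsets, and confidence score.

With `aggregation_strategy="simple"`, the pipeline merges subword tokens into entity spans and reports the category in `entity_group` without the `B-`/`I-` prefix. Illustrative output (actual spans and scores depend on the model's predictions):

```
[
  {'entity_group': 'PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
  {'entity_group': 'ORGANISATION', 'word': 'Pasha Sığorta', 'start': 20, 'end': 33, 'score': 0.95}
]
```
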
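The `start` and `end` fields are character offsets into the input string, so the original surface form can be recovered by slicing the text, and `score` can be used to drop low-confidence predictions. The sketch below assumes output in the aggregated pipeline format. The `results` list is a hand-written stand-in with offsets computed from the example sentence, not real model predictions, and `extract_entities` and its `min_score` default are hypothetical.

```python
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."

# Hand-written stand-in for pipeline output; scores are illustrative only.
results = [
    {"entity_group": "PERSON", "word": "Shahla", "start": 0, "end": 6, "score": 0.98},
    {"entity_group": "ORGANISATION", "word": "Pasha Sığorta", "start": 20, "end": 33, "score": 0.95},
]


def extract_entities(text, results, min_score=0.5):
    """Return (entity_type, surface_text) pairs above a confidence threshold."""
    entities = []
    for item in results:
        if item["score"] < min_score:
            continue  # skip low-confidence predictions
        # Slice the original text so casing and diacritics are preserved.
        entities.append((item["entity_group"], text[item["start"]:item["end"]]))
    return entities


print(extract_entities(text, results))
print(extract_entities(text, results, min_score=0.96))
```

Slicing by offsets is usually safer than relying on the `word` field, which is reconstructed from tokens and may differ from the input in spacing or casing.
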
### Evaluation Metrics

The model was evaluated with precision, recall, and F1-score. The per-epoch and per-category values are listed in the Training Metrics and Category-Wise Evaluation Metrics sections.

## Limitations

- Performance may degrade on texts that diverge significantly from the training data distribution.
- Performance is uneven across categories: per-category F1 ranges from 0.21 (ART) to 0.84 (PERSON), so predictions for low-scoring categories should be treated with caution.
- Rare or unseen entities in Turkish and Azerbaijani may receive lower confidence scores.
- Further fine-tuning on larger and more diverse datasets may improve generalizability.

## Model Card

A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the [Hugging Face model page](https://huggingface.co/IsmatS/Azeri-Turkish-BERT-NER).

## Citation

If you use this model, please consider citing:

```
@misc{azeri-turkish-bert-ner,
  author = {Ismat Samadov},
  title = {Azeri-Turkish-BERT-NER},
  year = {2024},
  howpublished = {Hugging Face repository},
}
```