---
language:
- az
- tr
thumbnail: "URL_to_thumbnail_image" # Replace with an actual URL or remove this line if unavailable
tags:
- NER
- token-classification
- Azerbaijani
- Turkish
- transformers
license: "mit" # Adjust to the correct license you wish to use
datasets:
- LocalDoc/azerbaijani-ner-dataset
metrics:
- precision
- recall
- f1
base_model: "akdeniz27/bert-base-turkish-cased-ner"
pipeline_tag: "token-classification"
---
# Azeri-Turkish-BERT-NER
## Model Description
The **Azeri-Turkish-BERT-NER** model is a fine-tuned version of `akdeniz27/bert-base-turkish-cased-ner` for Named Entity Recognition (NER) in Azerbaijani and Turkish. It starts from a Turkish BERT NER model and adapts it to Azerbaijani data, with the aim of preserving compatibility with Turkish entities.
The model identifies named entities and classifies them into categories such as persons, organizations, locations, and dates, which makes it suitable for information extraction and data processing on Azerbaijani and Turkish texts.
## Model Details
- **Base Model**: `akdeniz27/bert-base-turkish-cased-ner` (hosted on Hugging Face)
- **Task**: Named Entity Recognition (NER)
- **Languages**: Azerbaijani, Turkish
- **Fine-Tuned On**: Azerbaijani NER dataset (`LocalDoc/azerbaijani-ner-dataset`)
- **Input Text Format**: Plain text, split into subword tokens by the model's tokenizer
- **Model Type**: BERT-based transformer for token classification
## Training Details
The model was fine-tuned using the Hugging Face `transformers` and `datasets` libraries. The fine-tuning configuration is summarized below:
- **Tokenizer**: `AutoTokenizer` loaded from the `bert-base-turkish-cased-ner` base model
- **Max Sequence Length**: 128 tokens
- **Batch Size**: 128 (training and evaluation)
- **Learning Rate**: 2e-5
- **Number of Epochs**: up to 10 (the logged metrics end at epoch 9, consistent with early stopping)
- **Weight Decay**: 0.005
- **Early Stopping**: patience of 5 epochs, monitored on the F1 metric
### Training Dataset
The model was trained on the Azerbaijani NER dataset [LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset). The dataset was preprocessed so that each NER tag stays aligned with the tokens produced by the tokenizer.
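The alignment step is not described in detail in this card. As a minimal sketch of one common approach, assume word-level tag IDs and a fast tokenizer whose `word_ids()` output maps each subword token to the index of its source word. Only the first subword of each word keeps the tag; special tokens and continuation subwords receive `-100`, the default index that PyTorch's cross-entropy loss ignores. The function name `align_labels` and the sample values are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level label IDs onto subword tokens.

    word_ids:    one entry per token, e.g. from BatchEncoding.word_ids();
                 None marks special tokens such as [CLS] and [SEP].
    word_labels: one label ID per original word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: excluded from the loss.
            aligned.append(ignore_index)
        elif word_id != previous_word:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword: excluded from the loss (an assumption;
            # some setups repeat the label instead).
            aligned.append(ignore_index)
        previous_word = word_id
    return aligned


# Hypothetical example: 3 words, the first split into 2 subwords,
# the third into 3, wrapped in [CLS] ... [SEP].
example_word_ids = [None, 0, 0, 1, 2, 2, 2, None]
example_word_labels = [1, 2, 0]
print(align_labels(example_word_ids, example_word_labels))
```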
### Label Categories
The model supports the following entity categories:
- **Person (B-PERSON, I-PERSON)**
- **Location (B-LOCATION, I-LOCATION)**
- **Organization (B-ORGANISATION, I-ORGANISATION)**
- **Date (B-DATE, I-DATE)**
- **Time (B-TIME, I-TIME)**
- **Money (B-MONEY, I-MONEY)**
- **Percentage (B-PERCENTAGE, I-PERCENTAGE)**
- **Facility (B-FACILITY, I-FACILITY)**
- **Product (B-PRODUCT, I-PRODUCT)**
- Additional categories as specified in the training label list (the category-wise evaluation below also reports ART, EVENT, and LAW)
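In this tagging scheme the `B-` prefix marks the first token of an entity and `I-` marks a continuation. A minimal sketch of turning a tag sequence into entity spans follows; it assumes tokens outside any entity carry the conventional `O` tag, which this card does not list explicitly. The function name `bio_to_spans` and the sample tags are hypothetical.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (label, start, end) spans, end exclusive."""
    spans = []
    current_label = None
    start = None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == current_label
        if continues:
            continue
        if current_label is not None:
            spans.append((current_label, start, i))
            current_label = None
        if prefix in ("B", "I") and name:
            # A stray I- tag without a preceding B- is treated leniently
            # as the start of a new span.
            current_label, start = name, i
    return spans


tags = ["B-PERSON", "I-PERSON", "O", "B-ORGANISATION", "I-ORGANISATION", "O"]
print(bio_to_spans(tags))
# [('PERSON', 0, 2), ('ORGANISATION', 3, 5)]
```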
### Training Metrics
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|--------|-------|
| 1 | 0.433100 | 0.306711 | 0.739000 | 0.693282 | 0.715412 |
| 2 | 0.292700 | 0.275796 | 0.781565 | 0.688937 | 0.732334 |
| 3 | 0.250600 | 0.275115 | 0.758261 | 0.709425 | 0.733031 |
| 4 | 0.233700 | 0.273087 | 0.756184 | 0.716277 | 0.735689 |
| 5 | 0.214800 | 0.278477 | 0.756051 | 0.710996 | 0.732832 |
| 6 | 0.199200 | 0.286102 | 0.755068 | 0.717012 | 0.735548 |
| 7 | 0.192800 | 0.297157 | 0.742326 | 0.725802 | 0.733971 |
| 8 | 0.178900 | 0.304510 | 0.743206 | 0.723930 | 0.733442 |
| 9 | 0.171700 | 0.313845 | 0.743145 | 0.725535 | 0.734234 |
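The table stops at epoch 9 although 10 epochs were configured. A minimal sketch shows how a patience of 5 on the F1 metric would produce that, using the F1 values from the table. This is a simplified stand-in for the early stopping logic, not the training code itself; `simulate_early_stopping` is a hypothetical helper.

```python
# F1 per epoch, copied from the table above (epochs 1 to 9).
f1_by_epoch = [
    0.715412, 0.732334, 0.733031, 0.735689, 0.732832,
    0.735548, 0.733971, 0.733442, 0.734234,
]


def simulate_early_stopping(scores, patience):
    """Return (best_epoch, stop_epoch); stop_epoch is None if training
    is never stopped early. Epochs are 1-indexed."""
    best_score = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, None


best_epoch, stop_epoch = simulate_early_stopping(f1_by_epoch, patience=5)
print(best_epoch, stop_epoch)  # 4 9
```

Under these assumptions the best F1 (0.735689) is reached at epoch 4, and five consecutive epochs without improvement end training after epoch 9.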
### Category-Wise Evaluation Metrics
| Category | Precision | Recall | F1-Score | Support |
|---------------|-----------|--------|----------|---------|
| ART | 0.49 | 0.14 | 0.21 | 1988 |
| DATE | 0.49 | 0.48 | 0.49 | 844 |
| EVENT | 0.88 | 0.36 | 0.51 | 84 |
| FACILITY | 0.72 | 0.68 | 0.70 | 1146 |
| LAW | 0.57 | 0.64 | 0.60 | 1103 |
| LOCATION | 0.77 | 0.79 | 0.78 | 8806 |
| MONEY | 0.62 | 0.57 | 0.59 | 532 |
| ORGANISATION | 0.64 | 0.65 | 0.64 | 527 |
| PERCENTAGE | 0.77 | 0.83 | 0.80 | 3679 |
| PERSON | 0.87 | 0.81 | 0.84 | 6924 |
| PRODUCT | 0.82 | 0.80 | 0.81 | 2653 |
| TIME | 0.55 | 0.50 | 0.52 | 1634 |
- **Micro Average**: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
- **Macro Average**: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
- **Weighted Average**: Precision: 0.74, Recall: 0.72, F1-Score: 0.72
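The macro and weighted averages can be approximately reproduced from the per-category table. The sketch below assumes the macro average is the unweighted mean over categories and the weighted average uses support as the weight. Because the table values are rounded to two decimals, recomputed results may differ from the reported ones in the last digit. The micro average needs raw counts and is not recomputed here.

```python
# (category, precision, recall, f1, support), copied from the table above.
rows = [
    ("ART", 0.49, 0.14, 0.21, 1988),
    ("DATE", 0.49, 0.48, 0.49, 844),
    ("EVENT", 0.88, 0.36, 0.51, 84),
    ("FACILITY", 0.72, 0.68, 0.70, 1146),
    ("LAW", 0.57, 0.64, 0.60, 1103),
    ("LOCATION", 0.77, 0.79, 0.78, 8806),
    ("MONEY", 0.62, 0.57, 0.59, 532),
    ("ORGANISATION", 0.64, 0.65, 0.64, 527),
    ("PERCENTAGE", 0.77, 0.83, 0.80, 3679),
    ("PERSON", 0.87, 0.81, 0.84, 6924),
    ("PRODUCT", 0.82, 0.80, 0.81, 2653),
    ("TIME", 0.55, 0.50, 0.52, 1634),
]

SUPPORT = 4  # column index of the support count


def macro_average(rows, column):
    """Unweighted mean of one metric column over all categories."""
    return sum(row[column] for row in rows) / len(rows)


def weighted_average(rows, column):
    """Mean of one metric column, weighted by each category's support."""
    total_support = sum(row[SUPPORT] for row in rows)
    return sum(row[column] * row[SUPPORT] for row in rows) / total_support


for name, column in (("precision", 1), ("recall", 2), ("f1", 3)):
    print(f"{name}: macro={macro_average(rows, column):.3f} "
          f"weighted={weighted_average(rows, column):.3f}")
```

The gap between the macro F1 (0.62) and the weighted F1 (0.72) reflects that the largest categories, LOCATION and PERSON, are also among the best-scoring ones.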
## Usage
### Loading the Model
To use the model for NER tasks, you can load it using the Hugging Face `transformers` library:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
# Initialize the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
# Example text
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
# Run NER
results = ner_pipeline(text)
print(results)
```
### Inputs and Outputs
- **Input**: Plain text in Azerbaijani or Turkish.
- **Output**: List of detected entities, each with its entity type, matched text, character offsets, and confidence score.
Example output (illustrative):
```
[
{'entity_group': 'PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
{'entity_group': 'ORGANISATION', 'word': 'Pasha Sığorta', 'start': 20, 'end': 33, 'score': 0.95}
]
```
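A minimal sketch of post-processing the pipeline output, assuming each result is a dict with `entity_group`, `start`, `end`, and `score` keys as produced with an aggregation strategy. It filters by confidence and groups the matched text by entity type, reading the text back through the character offsets. The helper `group_entities`, the threshold, and the sample results are hypothetical.

```python
from collections import defaultdict


def group_entities(text, entities, min_score=0.5):
    """Group entity surface strings by type, dropping low-confidence hits."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] < min_score:
            continue
        # Character offsets index directly into the input string.
        surface = text[entity["start"]:entity["end"]]
        grouped[entity["entity_group"]].append(surface)
    return dict(grouped)


text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
# Hypothetical results in the shape returned by the pipeline.
sample_results = [
    {"entity_group": "PERSON", "start": 0, "end": 6, "score": 0.98},
    {"entity_group": "ORGANISATION", "start": 20, "end": 33, "score": 0.95},
    {"entity_group": "LOCATION", "start": 34, "end": 41, "score": 0.30},
]
print(group_entities(text, sample_results, min_score=0.5))
```

Reading the surface string from the offsets rather than from the `word` field avoids artifacts that subword detokenization can introduce.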
### Evaluation Metrics
The model was evaluated with precision, recall, and F1-score, as reported in the Training Metrics and Category-Wise Evaluation Metrics sections above.
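This card does not state which tool computed the metrics. As a sketch of how entity-level precision, recall, and F1 are commonly defined, the code below assumes exact-match scoring: a predicted entity counts as correct only if its label and boundaries both match a gold entity. The helper names and sample spans are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def entity_prf1(gold_spans, predicted_spans):
    """Exact-match precision, recall, and F1 over (label, start, end) spans."""
    gold, predicted = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical spans: one exact match, one boundary error, one missed entity.
gold = [("PERSON", 0, 2), ("ORGANISATION", 3, 5), ("LOCATION", 7, 8)]
predicted = [("PERSON", 0, 2), ("ORGANISATION", 3, 4)]
precision, recall, f1 = entity_prf1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The same harmonic mean ties the table columns together: for epoch 1, `f1_from(0.739000, 0.693282)` agrees with the reported F1 of 0.715412 to within rounding.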
## Limitations
- Performance may degrade on texts that diverge significantly from the training data distribution.
- Performance varies widely by category: in the category-wise evaluation, F1 ranges from 0.21 (ART) to 0.84 (PERSON).
- Rare or unseen entities in Turkish and Azerbaijani may be missed or receive lower confidence scores.
- Further fine-tuning on larger and more diverse datasets may improve generalizability.
## Model Card
A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the [Hugging Face model page](https://huggingface.co/IsmatS/Azeri-Turkish-BERT-NER).
## Citation
If you use this model, please consider citing:
```
@misc{azeri-turkish-bert-ner,
author = {Ismat Samadov},
title = {Azeri-Turkish-BERT-NER},
year = {2024},
howpublished = {Hugging Face repository},
}
```