metadata
language:
  - az
  - tr
thumbnail: URL_to_thumbnail_image
tags:
  - NER
  - token-classification
  - Azerbaijani
  - Turkish
  - transformers
license: mit
datasets:
  - LocalDoc/azerbaijani-ner-dataset
metrics:
  - precision
  - recall
  - f1
base_model: akdeniz27/bert-base-turkish-cased-ner
pipeline_tag: token-classification

Azeri-Turkish-BERT-NER

Model Description

The Azeri-Turkish-BERT-NER model is a fine-tuned version of akdeniz27/bert-base-turkish-cased-ner for Named Entity Recognition (NER) in Azerbaijani and Turkish. It starts from a Turkish BERT model already trained for NER and adapts it to Azerbaijani data while aiming to preserve compatibility with Turkish entities.

The model identifies and classifies named entities into categories such as persons, organizations, locations, and dates, which makes it suitable for information extraction and data processing on Azerbaijani and Turkish texts.

Model Details

  • Base Model: akdeniz27/bert-base-turkish-cased-ner (from the Hugging Face Hub)
  • Task: Named Entity Recognition (NER)
  • Languages: Azerbaijani, Turkish
  • Fine-Tuned On: LocalDoc/azerbaijani-ner-dataset (Azerbaijani NER dataset)
  • Input Text Format: Plain text (tokenized by the model's tokenizer)
  • Model Type: BERT-based transformer for token classification

Training Details

The model was fine-tuned using the Hugging Face transformers and datasets libraries. Here is a brief summary of the fine-tuning configuration:

  • Tokenizer: AutoTokenizer loaded from akdeniz27/bert-base-turkish-cased-ner
  • Max Sequence Length: 128 tokens
  • Batch Size: 128 (training and evaluation)
  • Learning Rate: 2e-5
  • Number of Epochs: 10 (maximum; the training metrics table ends at epoch 9)
  • Weight Decay: 0.005
  • Early Stopping: Patience of 5 epochs, monitored on the validation F1 score

Training Dataset

The training dataset is a custom Azerbaijani NER dataset sourced from LocalDoc/azerbaijani-ner-dataset. The dataset was preprocessed to align tokens and NER tags accurately.
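The source does not say how tokens and tags were aligned. One common approach with subword tokenizers is to keep each word's label on its first subword and mask the remaining subwords and special tokens with -100 so the loss ignores them. The sketch below assumes that strategy; `align_labels` and the hand-written `word_ids` list are hypothetical stand-ins for what a fast tokenizer's `word_ids()` method would return.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto subword tokens.

    word_ids[i] is the index of the word that token i came from,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword keeps the word's label
        else:
            aligned.append(IGNORE_INDEX)          # later subwords of the same word are masked
        previous = word_id
    return aligned


# Hypothetical example: two words, the second split into two subwords.
word_ids = [None, 0, 1, 1, None]  # [CLS] w0 w1a w1b [SEP]
word_labels = [1, 0]              # made-up label ids, one per word
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Masking continuation subwords is only one option; labelling every subword with the word's tag is another, and the two can give slightly different metrics.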

Label Categories

The model supports the following entity categories:

  • Person (B-PERSON, I-PERSON)
  • Location (B-LOCATION, I-LOCATION)
  • Organization (B-ORGANISATION, I-ORGANISATION)
  • Date (B-DATE, I-DATE)
  • Time (B-TIME, I-TIME)
  • Money (B-MONEY, I-MONEY)
  • Percentage (B-PERCENTAGE, I-PERCENTAGE)
  • Facility (B-FACILITY, I-FACILITY)
  • Product (B-PRODUCT, I-PRODUCT)
  • ... (additional categories as specified in the training label list)
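The B- and I- prefixes follow the BIO scheme: B- marks the first token of an entity and I- marks its continuation. As a rough illustration of how such tags collapse into entity spans, here is a minimal sketch. The function name `bio_to_spans` and the hand-written tag sequence are assumptions for illustration, not model output.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIO tags into (category, start, end) spans; end is exclusive."""
    spans = []
    current = None  # (category, start index) of the open span, if any
    for i, tag in enumerate(tags):
        prefix, _, category = tag.partition("-")
        continues = prefix == "I" and current is not None and current[0] == category
        if continues:
            continue  # the open span simply grows by one token
        if current is not None:
            spans.append((current[0], current[1], i))  # close the open span
            current = None
        if prefix in ("B", "I"):
            current = (category, i)  # a stray I- tag is treated as a span start
    if current is not None:
        spans.append((current[0], current[1], len(tags)))
    return spans


# Hand-written tags for five words (illustrative only).
words = ["Shahla", "Khuduyeva", "və", "Pasha", "Sığorta"]
tags = ["B-PERSON", "I-PERSON", "O", "B-ORGANISATION", "I-ORGANISATION"]
spans = bio_to_spans(tags)
for category, start, end in spans:
    print(category, " ".join(words[start:end]))
```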

Training Metrics

Epoch  Training Loss  Validation Loss  Precision  Recall    F1
1      0.433100       0.306711         0.739000   0.693282  0.715412
2      0.292700       0.275796         0.781565   0.688937  0.732334
3      0.250600       0.275115         0.758261   0.709425  0.733031
4      0.233700       0.273087         0.756184   0.716277  0.735689
5      0.214800       0.278477         0.756051   0.710996  0.732832
6      0.199200       0.286102         0.755068   0.717012  0.735548
7      0.192800       0.297157         0.742326   0.725802  0.733971
8      0.178900       0.304510         0.743206   0.723930  0.733442
9      0.171700       0.313845         0.743145   0.725535  0.734234
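To show how patience-based early stopping interacts with these numbers, the sketch below replays the F1 column through a simplified stopping rule. It is a plain-Python approximation written for illustration, not the callback used in training; `early_stopping_trace` is a hypothetical helper, and it assumes any strict improvement in F1 resets the patience counter.

```python
from typing import List, Tuple

# Validation F1 per epoch, copied from the table above (epochs 1 to 9).
f1_by_epoch = [0.715412, 0.732334, 0.733031, 0.735689, 0.732832,
               0.735548, 0.733971, 0.733442, 0.734234]


def early_stopping_trace(scores: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to beat the best score.
    If that never happens, stop_epoch is the last epoch in `scores`.
    """
    best_score = float("-inf")
    best_epoch = 0
    epochs_without_gain = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return best_epoch, epoch
    return best_epoch, len(scores)


best_epoch, stop_epoch = early_stopping_trace(f1_by_epoch, patience=5)
print(f"best F1 at epoch {best_epoch}, stop after epoch {stop_epoch}")
```

Under this rule the best F1 falls at epoch 4 and the fifth non-improving epoch is epoch 9, which would be consistent with the table ending there.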

Category-Wise Evaluation Metrics

Category      Precision  Recall  F1-Score  Support
ART           0.49       0.14    0.21      1988
DATE          0.49       0.48    0.49      844
EVENT         0.88       0.36    0.51      84
FACILITY      0.72       0.68    0.70      1146
LAW           0.57       0.64    0.60      1103
LOCATION      0.77       0.79    0.78      8806
MONEY         0.62       0.57    0.59      532
ORGANISATION  0.64       0.65    0.64      527
PERCENTAGE    0.77       0.83    0.80      3679
PERSON        0.87       0.81    0.84      6924
PRODUCT       0.82       0.80    0.81      2653
TIME          0.55       0.50    0.52      1634
  • Micro Average: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
  • Macro Average: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
  • Weighted Average: Precision: 0.74, Recall: 0.72, F1-Score: 0.72
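The macro and weighted averages can be recomputed from the per-category rows as a sanity check. The sketch below does that arithmetic in plain Python. Because the inputs are the two-decimal figures from the table, the results should match the reported averages only to within rounding. The helper names are illustrative.

```python
from typing import Dict, Tuple

# (precision, recall, f1, support) per category, copied from the table above.
REPORT: Dict[str, Tuple[float, float, float, int]] = {
    "ART": (0.49, 0.14, 0.21, 1988),
    "DATE": (0.49, 0.48, 0.49, 844),
    "EVENT": (0.88, 0.36, 0.51, 84),
    "FACILITY": (0.72, 0.68, 0.70, 1146),
    "LAW": (0.57, 0.64, 0.60, 1103),
    "LOCATION": (0.77, 0.79, 0.78, 8806),
    "MONEY": (0.62, 0.57, 0.59, 532),
    "ORGANISATION": (0.64, 0.65, 0.64, 527),
    "PERCENTAGE": (0.77, 0.83, 0.80, 3679),
    "PERSON": (0.87, 0.81, 0.84, 6924),
    "PRODUCT": (0.82, 0.80, 0.81, 2653),
    "TIME": (0.55, 0.50, 0.52, 1634),
}


def macro_average(report: Dict[str, Tuple[float, float, float, int]], column: int) -> float:
    """Unweighted mean of one metric column (0=precision, 1=recall, 2=f1)."""
    return sum(row[column] for row in report.values()) / len(report)


def weighted_average(report: Dict[str, Tuple[float, float, float, int]], column: int) -> float:
    """Mean of one metric column, weighted by each category's support."""
    total_support = sum(row[3] for row in report.values())
    return sum(row[column] * row[3] for row in report.values()) / total_support


macro = [macro_average(REPORT, c) for c in range(3)]
weighted = [weighted_average(REPORT, c) for c in range(3)]
print("macro   :", [f"{v:.3f}" for v in macro])
print("weighted:", [f"{v:.3f}" for v in weighted])
```

The macro average treats all twelve categories equally, so weak categories such as ART pull it well below the weighted average, which is dominated by high-support categories like LOCATION and PERSON.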

Usage

Loading the Model

To use the model for NER tasks, you can load it using the Hugging Face transformers library:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")

# Initialize the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."

# Run NER
results = ner_pipeline(text)
print(results)

Inputs and Outputs

  • Input: Plain text in Azerbaijani or Turkish.
  • Output: List of detected entities, each with its entity type, matched text, character offsets, and confidence score.

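Example output (illustrative; with aggregation_strategy="simple" the B-/I- prefixes are stripped from entity_group, and start/end are character offsets into the input text):

[
  {'entity_group': 'PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
  {'entity_group': 'ORGANISATION', 'word': 'Pasha Sığorta', 'start': 20, 'end': 33, 'score': 0.95}
]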
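Since start and end index directly into the input string, downstream code can slice the original text rather than rely on the word field. The sketch below groups entity strings by type and drops low-confidence hits. The `results` list is hand-written for illustration rather than real model output, and `entities_by_type` and its `min_score` threshold are hypothetical.

```python
from typing import Dict, List

text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."

# Hypothetical pipeline result, hand-written for illustration (not real model output).
results = [
    {"entity_group": "PERSON", "word": "Shahla", "start": 0, "end": 6, "score": 0.98},
    {"entity_group": "ORGANISATION", "word": "Pasha Sığorta", "start": 20, "end": 33, "score": 0.95},
]


def entities_by_type(text: str, results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group entity surface strings by type, skipping entries below min_score."""
    grouped: Dict[str, List[str]] = {}
    for item in results:
        if item["score"] < min_score:
            continue
        surface = text[item["start"]:item["end"]]  # slice the original text by offsets
        grouped.setdefault(item["entity_group"], []).append(surface)
    return grouped


grouped = entities_by_type(text, results)
print(grouped)
```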

Evaluation Metrics

The model was evaluated using precision, recall, and F1-score, as reported in the Training Metrics and Category-Wise Evaluation Metrics sections above.

Limitations

  • The model may perform poorly on texts that diverge significantly from the training data distribution.
  • Rare or unseen entities in Turkish and Azerbaijani may be missed or receive lower confidence scores; recall is already low for some categories in the evaluation above, such as ART (0.14) and EVENT (0.36).
  • Further fine-tuning on larger and more diverse datasets may improve generalizability.

Model Card

A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the Hugging Face model page.

Citation

If you use this model, please consider citing:

@misc{azeri-turkish-bert-ner,
  author = {Ismat Samadov},
  title = {Azeri-Turkish-BERT-NER},
  year = {2024},
  howpublished = {Hugging Face repository},
}