---
language:
- az
- tr
tags:
- NER
- token-classification
- Azerbaijani
- Turkish
- transformers
license: "mit"
datasets:
- LocalDoc/azerbaijani-ner-dataset
metrics:
- precision
- recall
- f1
base_model: "akdeniz27/bert-base-turkish-cased-ner"
pipeline_tag: "token-classification"
---

# Azeri-Turkish-BERT-NER

## Model Description

The **Azeri-Turkish-BERT-NER** model is a fine-tuned version of the `bert-base-turkish-cased-ner` model for Named Entity Recognition (NER) in Azerbaijani and Turkish. It builds on a pre-trained Turkish BERT model and adapts it to Azerbaijani data while preserving compatibility with Turkish entities.

The model identifies named entities and classifies them into categories such as persons, organizations, locations, and dates, which makes it suitable for text extraction and data processing in Azerbaijani and Turkish texts.

## Model Details

- **Base Model**: `akdeniz27/bert-base-turkish-cased-ner` (from the Hugging Face Hub)
- **Task**: Named Entity Recognition (NER)
- **Languages**: Azerbaijani, Turkish
- **Fine-Tuned On**: Custom Azerbaijani NER dataset
- **Input Text Format**: Plain text; tokenization is handled by the model's tokenizer
- **Model Type**: BERT-based transformer for token classification

## Training Details

The model was fine-tuned using the Hugging Face `transformers` and `datasets` libraries.
Here is a brief summary of the fine-tuning configuration:

- **Tokenizer**: `AutoTokenizer` from the `bert-base-turkish-cased-ner` model
- **Max Sequence Length**: 128 tokens
- **Batch Size**: 128 (training and evaluation)
- **Learning Rate**: 2e-5
- **Number of Epochs**: 10
- **Weight Decay**: 0.005
- **Optimization Strategy**: Early stopping with a patience of 5 epochs based on the F1 metric

### Training Dataset

The training dataset is a custom Azerbaijani NER dataset sourced from [LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset). The dataset was preprocessed to align tokens and NER tags accurately.

### Label Categories

The model supports the following entity categories:

- **Person (B-PERSON, I-PERSON)**
- **Location (B-LOCATION, I-LOCATION)**
- **Organization (B-ORGANISATION, I-ORGANISATION)**
- **Date (B-DATE, I-DATE)**
- **Time (B-TIME, I-TIME)**
- **Money (B-MONEY, I-MONEY)**
- **Percentage (B-PERCENTAGE, I-PERCENTAGE)**
- **Facility (B-FACILITY, I-FACILITY)**
- **Product (B-PRODUCT, I-PRODUCT)**
- ...
(additional categories as specified in the training label list)

### Training Metrics

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       |
|-------|---------------|-----------------|-----------|----------|----------|
| 1     | 0.433100      | 0.306711        | 0.739000  | 0.693282 | 0.715412 |
| 2     | 0.292700      | 0.275796        | 0.781565  | 0.688937 | 0.732334 |
| 3     | 0.250600      | 0.275115        | 0.758261  | 0.709425 | 0.733031 |
| 4     | 0.233700      | 0.273087        | 0.756184  | 0.716277 | 0.735689 |
| 5     | 0.214800      | 0.278477        | 0.756051  | 0.710996 | 0.732832 |
| 6     | 0.199200      | 0.286102        | 0.755068  | 0.717012 | 0.735548 |
| 7     | 0.192800      | 0.297157        | 0.742326  | 0.725802 | 0.733971 |
| 8     | 0.178900      | 0.304510        | 0.743206  | 0.723930 | 0.733442 |
| 9     | 0.171700      | 0.313845        | 0.743145  | 0.725535 | 0.734234 |

The log ends at epoch 9 of the 10 configured epochs, which is consistent with early stopping: validation F1 peaked at epoch 4 (0.735689) and did not improve over the next 5 epochs.

### Category-Wise Evaluation Metrics

| Category     | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| ART          | 0.49      | 0.14   | 0.21     | 1988    |
| DATE         | 0.49      | 0.48   | 0.49     | 844     |
| EVENT        | 0.88      | 0.36   | 0.51     | 84      |
| FACILITY     | 0.72      | 0.68   | 0.70     | 1146    |
| LAW          | 0.57      | 0.64   | 0.60     | 1103    |
| LOCATION     | 0.77      | 0.79   | 0.78     | 8806    |
| MONEY        | 0.62      | 0.57   | 0.59     | 532     |
| ORGANISATION | 0.64      | 0.65   | 0.64     | 527     |
| PERCENTAGE   | 0.77      | 0.83   | 0.80     | 3679    |
| PERSON       | 0.87      | 0.81   | 0.84     | 6924    |
| PRODUCT      | 0.82      | 0.80   | 0.81     | 2653    |
| TIME         | 0.55      | 0.50   | 0.52     | 1634    |

- **Micro Average**: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
- **Macro Average**: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
- **Weighted Average**: Precision: 0.74, Recall: 0.72, F1-Score: 0.72

## Usage

### Loading the Model

To use the model for NER tasks, load it with the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")

# Initialize the NER pipeline; "simple" aggregation merges sub-word tokens into entity spans
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."

# Run NER
results = ner_pipeline(text)
print(results)
```

### Inputs and Outputs

- **Input**: Plain text in Azerbaijani or Turkish.
- **Output**: List of detected entities with entity types, character offsets, and confidence scores.

Illustrative output (actual spans and scores will vary). With `aggregation_strategy="simple"`, the pipeline reports the entity type in `entity_group` without the `B-`/`I-` prefix, and `start`/`end` are character offsets into the input text:

```
[
  {'entity_group': 'PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
  {'entity_group': 'ORGANISATION', 'word': 'Pasha Sığorta', 'start': 20, 'end': 33, 'score': 0.95}
]
```

### Evaluation Metrics

The model was evaluated using precision, recall, and F1-score, as detailed in the Training Metrics and Category-Wise Evaluation Metrics sections.

## Limitations

- The model may have limited performance on texts that diverge significantly from the training data distribution.
- Performance varies widely by category: on the evaluation set, F1 ranges from 0.21 (ART) to 0.84 (PERSON).
- Rare or unseen entities in Turkish and Azerbaijani may be detected with lower confidence scores.
- Further fine-tuning on larger and more diverse datasets may improve generalizability.

## Model Card

A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the [Hugging Face model page](https://huggingface.co/IsmatS/Azeri-Turkish-BERT-NER).

## Citation

If you use this model, please consider citing:

```
@misc{azeri-turkish-bert-ner,
  author = {Ismat Samadov},
  title = {Azeri-Turkish-BERT-NER},
  year = {2024},
  howpublished = {Hugging Face repository},
}
```
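
## Appendix: Reproducing the Averages

The macro and weighted averages reported under Category-Wise Evaluation Metrics can be checked against the per-category rows. The sketch below is not part of the original training code; it is a minimal illustration that assumes the macro average is the unweighted mean across categories and the weighted average is the support-weighted mean. The names `ROWS`, `macro_average`, and `weighted_average` are hypothetical. Because the inputs are the two-decimal values from the table, a recomputed figure can differ from the reported one in the last digit.

```python
# Per-category (precision, recall, F1, support), copied from the
# Category-Wise Evaluation Metrics table in this model card.
ROWS = {
    "ART": (0.49, 0.14, 0.21, 1988),
    "DATE": (0.49, 0.48, 0.49, 844),
    "EVENT": (0.88, 0.36, 0.51, 84),
    "FACILITY": (0.72, 0.68, 0.70, 1146),
    "LAW": (0.57, 0.64, 0.60, 1103),
    "LOCATION": (0.77, 0.79, 0.78, 8806),
    "MONEY": (0.62, 0.57, 0.59, 532),
    "ORGANISATION": (0.64, 0.65, 0.64, 527),
    "PERCENTAGE": (0.77, 0.83, 0.80, 3679),
    "PERSON": (0.87, 0.81, 0.84, 6924),
    "PRODUCT": (0.82, 0.80, 0.81, 2653),
    "TIME": (0.55, 0.50, 0.52, 1634),
}


def macro_average(rows):
    """Unweighted mean of precision, recall, and F1 across categories."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows.values()) / n for i in range(3))


def weighted_average(rows):
    """Mean of precision, recall, and F1 weighted by each category's support."""
    total = sum(r[3] for r in rows.values())
    return tuple(sum(r[i] * r[3] for r in rows.values()) / total for i in range(3))


macro = macro_average(ROWS)
weighted = weighted_average(ROWS)
print("macro   :", [round(v, 2) for v in macro])
print("weighted:", [round(v, 2) for v in weighted])
```

The gap between the macro F1 (0.62) and the weighted F1 (0.72) reflects that the largest categories, LOCATION and PERSON, are also among the best-scoring ones, while low-scoring categories such as ART carry less weight.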