XLM-RoBERTa Azerbaijani NER Model

This model is a fine-tuned version of XLM-RoBERTa (xlm-roberta-base) for Named Entity Recognition (NER) in Azerbaijani. It extracts entities such as personal names, locations, organizations, and dates from Azerbaijani text, and reaches a micro-averaged F1-score of about 0.75 on the validation set (see Performance Metrics).

Model Details

Base Model: xlm-roberta-base
Fine-tuned on: Azerbaijani Named Entity Recognition Dataset
Task: Named Entity Recognition (NER)
Language: Azerbaijani (az)
Dataset: Custom Azerbaijani NER dataset with entity tags such as PERSON, LOCATION, ORGANISATION, DATE, etc.

Data Source

The model was trained on the Azerbaijani NER Dataset, which provides annotated Azerbaijani text covering 25 distinct entity types.

Entity Types

The model recognizes the following entities:

PERSON: Names of people
LOCATION: Geographical locations
ORGANISATION: Companies, institutions
DATE: Dates and periods
MONEY: Monetary values
TIME: Time expressions
GPE: Countries, cities, states
FACILITY: Buildings, landmarks, etc.
EVENT: Events and occurrences
...and more

For the full list of entities, please refer to the dataset description.

Performance Metrics

Epoch-wise Performance

Epoch	Training Loss	Validation Loss	Precision	Recall	F1
1	0.323100	0.275503	0.775799	0.694886	0.733117
2	0.272500	0.262481	0.739266	0.739900	0.739583
3	0.248600	0.252498	0.751478	0.741152	0.746280
4	0.236800	0.249968	0.754882	0.741449	0.748105
5	0.223800	0.252187	0.764390	0.740460	0.752235
6	0.218600	0.249887	0.756352	0.741646	0.748927
7	0.209700	0.250748	0.760696	0.739438	0.749916
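As a quick sanity check on the table above, the sketch below recomputes F1 as the harmonic mean of precision and recall for each epoch and reports which epoch scores best on F1 and on validation loss. The `history` list is simply the table copied by hand; the helper name `harmonic_f1` is illustrative and not part of the original training code.

```python
# (epoch, training loss, validation loss, precision, recall, F1), copied from the table above
history = [
    (1, 0.323100, 0.275503, 0.775799, 0.694886, 0.733117),
    (2, 0.272500, 0.262481, 0.739266, 0.739900, 0.739583),
    (3, 0.248600, 0.252498, 0.751478, 0.741152, 0.746280),
    (4, 0.236800, 0.249968, 0.754882, 0.741449, 0.748105),
    (5, 0.223800, 0.252187, 0.764390, 0.740460, 0.752235),
    (6, 0.218600, 0.249887, 0.756352, 0.741646, 0.748927),
    (7, 0.209700, 0.250748, 0.760696, 0.739438, 0.749916),
]


def harmonic_f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Largest gap between the reported F1 and the recomputed one
max_gap = max(abs(harmonic_f1(p, r) - f1) for _, _, _, p, r, f1 in history)

# Epoch with the highest F1 and epoch with the lowest validation loss
best_by_f1 = max(history, key=lambda row: row[5])[0]
best_by_val_loss = min(history, key=lambda row: row[2])[0]

print(f"max F1 gap: {max_gap:.6f}")
print(f"best epoch by F1: {best_by_f1}, by validation loss: {best_by_val_loss}")
```

The two criteria pick different epochs here, which is worth keeping in mind when choosing a checkpoint.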

Detailed Classification Report (Epoch 7)

This table summarizes the precision, recall, and F1-score for each entity type, calculated on the validation dataset.

Entity Type	Precision	Recall	F1-Score	Support
ART	0.54	0.20	0.29	1857
DATE	0.52	0.47	0.50	880
EVENT	0.69	0.35	0.47	96
FACILITY	0.69	0.69	0.69	1170
LAW	0.60	0.61	0.60	1122
LOCATION	0.77	0.82	0.80	9132
MONEY	0.61	0.57	0.59	540
ORGANISATION	0.69	0.68	0.69	544
PERCENTAGE	0.79	0.82	0.81	3591
PERSON	0.87	0.83	0.85	7037
PRODUCT	0.83	0.85	0.84	2808
TIME	0.55	0.51	0.53	1569

Overall Metrics:

Micro Average: Precision = 0.76, Recall = 0.74, F1-Score = 0.75
Macro Average: Precision = 0.68, Recall = 0.62, F1-Score = 0.64
Weighted Average: Precision = 0.75, Recall = 0.74, F1-Score = 0.74
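The macro and weighted averages can be reproduced from the per-entity rows of the classification report. The sketch below does this with the two-decimal values copied from the table, so the results are only expected to agree with the reported averages after rounding. The micro average cannot be rebuilt this way because it needs the raw true-positive counts, which the card does not list.

```python
# (entity type, precision, recall, F1, support), copied from the classification report
report = [
    ("ART", 0.54, 0.20, 0.29, 1857),
    ("DATE", 0.52, 0.47, 0.50, 880),
    ("EVENT", 0.69, 0.35, 0.47, 96),
    ("FACILITY", 0.69, 0.69, 0.69, 1170),
    ("LAW", 0.60, 0.61, 0.60, 1122),
    ("LOCATION", 0.77, 0.82, 0.80, 9132),
    ("MONEY", 0.61, 0.57, 0.59, 540),
    ("ORGANISATION", 0.69, 0.68, 0.69, 544),
    ("PERCENTAGE", 0.79, 0.82, 0.81, 3591),
    ("PERSON", 0.87, 0.83, 0.85, 7037),
    ("PRODUCT", 0.83, 0.85, 0.84, 2808),
    ("TIME", 0.55, 0.51, 0.53, 1569),
]

total_support = sum(row[4] for row in report)


def macro(column):
    """Unweighted mean over entity types: every type counts equally."""
    return sum(row[column] for row in report) / len(report)


def weighted(column):
    """Mean over entity types, weighted by the number of gold entities (support)."""
    return sum(row[column] * row[4] for row in report) / total_support


for name, column in (("precision", 1), ("recall", 2), ("f1", 3)):
    print(f"{name}: macro={macro(column):.2f} weighted={weighted(column):.2f}")
```

The gap between the macro and weighted F1 reflects that the largest classes (LOCATION, PERSON) score well, while smaller or harder classes such as ART and EVENT pull the unweighted mean down.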

Usage

You can use this model with the Hugging Face transformers library to perform NER on Azerbaijani text.

Installation

Make sure you have the transformers library and PyTorch installed (the example below loads the PyTorch model class):

pip install transformers torch

Inference Example

Load the model and tokenizer, then run the NER pipeline on Azerbaijani text:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
model_name = "IsmatS/xlm-roberta-az-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Set up the NER pipeline
nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example sentence
sentence = "Bakı şəhərində Azərbaycan Respublikasının prezidenti İlham Əliyev."
entities = nlp_ner(sentence)

# Display entities
for entity in entities:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']}")

Sample Output

[
    {
        "entity_group": "LOCATION",
        "score": 0.98,
        "word": "Bakı",
        "start": 0,
        "end": 4
    },
    {
        "entity_group": "PERSON",
        "score": 0.99,
        "word": "İlham Əliyev",
        "start": 53,
        "end": 65
    }
]
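A common next step is to filter the returned entities by confidence and group them by label. The sketch below works on a hand-built list in the same shape as the sample output, so it runs without downloading the model. The scores are illustrative, and the helpers `make_entity` and `group_by_label` as well as the `min_score` threshold are hypothetical names introduced here, not part of the transformers API.

```python
from collections import defaultdict

sentence = "Bakı şəhərində Azərbaycan Respublikasının prezidenti İlham Əliyev."


def make_entity(text, word, label, score):
    """Build a dict shaped like one pipeline result; offsets come from the text itself."""
    start = text.index(word)
    return {
        "entity_group": label,
        "score": score,
        "word": word,
        "start": start,
        "end": start + len(word),
    }


# Stand-in for `entities = nlp_ner(sentence)`; scores are illustrative only
entities = [
    make_entity(sentence, "Bakı", "LOCATION", 0.98),
    make_entity(sentence, "İlham Əliyev", "PERSON", 0.99),
]


def group_by_label(entities, text, min_score=0.5):
    """Drop low-confidence entities and collect the surface strings per label."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        # Slice the original text by character offsets rather than trusting "word",
        # which the tokenizer may have normalised.
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


print(group_by_label(entities, sentence))
```

Slicing by `start` and `end` keeps the original casing and spacing of the input, which matters when the extracted strings are stored or matched against other data.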

Training Details

Training Data: This model was fine-tuned on the Azerbaijani NER Dataset with 25 entity types.
Training Framework: Hugging Face transformers
Optimizer: AdamW
Epochs: 8
Batch Size: 64
Evaluation Metric: F1-score
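The card does not say how the F1-score was computed. A common choice for NER is entity-level scoring, where a prediction counts only if both the label and the exact span match. The sketch below illustrates that idea in plain Python under two assumptions that are not confirmed by the source: that labels follow a BIO scheme (B-PERSON, I-PERSON, O) and that matching is exact. The function names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) spans in a BIO tag sequence.

    A stray I- tag that does not continue the current span is treated
    leniently as the start of a new span.
    """
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues_span = tag.startswith("I-") and tag[2:] == label
        if continues_span:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def entity_f1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1 with exact span and label matching."""
    gold = extract_spans(gold_tags)
    pred = extract_spans(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: the predicted GPE span is one token too short, so it does not count
gold = ["B-LOCATION", "O", "B-GPE", "I-GPE", "O", "B-PERSON", "I-PERSON"]
pred = ["B-LOCATION", "O", "B-GPE", "O", "O", "B-PERSON", "I-PERSON"]
precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under exact matching a partially correct span is penalised twice, once as a missed gold entity and once as a spurious prediction, which is one reason entity-level scores tend to be lower than token-level accuracy.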

Limitations

The model is trained specifically for the Azerbaijani language and may not generalize well to other languages.
Certain rare entities may be misclassified due to limited training data in those categories.

Citation

If you use this model in your research or application, please consider citing:

@misc{ismats_az_ner_2024,
  title={XLM-RoBERTa Azerbaijani NER Model},
  author={Ismat Samadov},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/IsmatS/xlm-roberta-az-ner}
}

License

This model is available under the MIT License.
