---
library_name: transformers
tags:
  - Persian
  - Named Entity Recognition
  - NER
  - Albert
---

# Model Card for Behpouyan-NER

Behpouyan-NER is a fine-tuned ALBERT model for Named Entity Recognition (NER) in Persian. It is based on the HooshvareLab/albert-fa-zwnj-base-v2-ner model and identifies ten entity types: Date (DAT), Event (EVE), Facility (FAC), Location (LOC), Money (MON), Organization (ORG), Percent (PCT), Person (PER), Product (PRO), and Time (TIM).
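
The exact tag inventory shipped with the checkpoint is not listed here, but it can be read from the model configuration. The snippet below is a generic `transformers` sketch that assumes the standard `id2label` mapping stored with token-classification checkpoints:

```python
from transformers import AutoConfig

# Load only the configuration (no model weights are needed for this).
config = AutoConfig.from_pretrained("Behpouyan/Behpouyan-NER")

# id2label maps class indices to BIO tags such as "B-PER" or "I-LOC".
for idx in sorted(config.id2label):
    print(idx, config.id2label[idx])
```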

## Model Details

### Model Description

Behpouyan-NER is designed to recognize named entities in Persian text, building on its base model, HooshvareLab/albert-fa-zwnj-base-v2-ner. It was fine-tuned on a combination of the ARMAN, PEYMA, and WikiANN datasets, which are widely used for Persian NER.

- Developed by: Behpouyan
- Model type: ALBERT for token classification
- Language(s) (NLP): Persian (fa)
- License: MIT

### Model Sources

## Direct Use

This model can be used directly for Named Entity Recognition on Persian text, for example in text analysis, information extraction, and other Persian-language NLP pipelines.

## Downstream Use

The model can be fine-tuned further for domain-specific NER tasks or combined with other models for complex NLP pipelines.
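
For illustration, here is a minimal fine-tuning sketch using the standard `transformers` `Trainer`. The dataset variables (`train_ds`, `eval_ds`), column names (`tokens`, `ner_tags`), output directory, and hyperparameters are placeholders, not part of this model card, and the label ids are assumed to match the checkpoint's label set:

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

checkpoint = "Behpouyan/Behpouyan-NER"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

def tokenize_and_align_labels(batch):
    # Expects pre-split words in "tokens" and per-word label ids in "ner_tags".
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    all_labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        # Special tokens get -100 so the loss ignores them.
        all_labels.append([-100 if w is None else word_labels[w] for w in word_ids])
    enc["labels"] = all_labels
    return enc

# train_ds / eval_ds are hypothetical datasets.Dataset objects with "tokens" and "ner_tags":
# train_ds = train_ds.map(tokenize_and_align_labels, batched=True)
# eval_ds = eval_ds.map(tokenize_and_align_labels, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="behpouyan-ner-finetuned", num_train_epochs=3),
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```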

## Out-of-Scope Use

The model is not designed for languages other than Persian or for tasks other than token classification. Using its outputs to produce biased or harmful content is discouraged.

## Recommendations

While the model performs well for general-purpose NER in Persian, users should validate its performance on their own datasets and be cautious of biases in the training data, especially for less-represented entity types.
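
As a quick way to validate the model against your own annotated data, entity-level scores can be computed with the `seqeval` package; this is not something the model card itself prescribes, and the tag sequences below are purely illustrative:

```python
from seqeval.metrics import classification_report

# Gold tags and model predictions for two toy sentences, in BIO format.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"], ["O", "B-ORG", "O"]]

# Entity-level precision, recall, and F1 per entity type.
print(classification_report(y_true, y_pred))
```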

## How to Get Started with the Model

Here’s how you can use the model:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("Behpouyan/Behpouyan-NER")
model = AutoModelForTokenClassification.from_pretrained("Behpouyan/Behpouyan-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# Input example (Persian). English gloss: "In 1401, the Alibaba company announced that,
# in cooperation with Bank Mellat, it would launch a major project to develop e-commerce
# infrastructure in Iran. The project will run in Tehran and Isfahan and is expected to
# be completed by the end of 1402."
example = '''
"در سال ۱۴۰۱، شرکت علی‌بابا اعلام کرد که با همکاری بانک ملت، یک پروژه بزرگ برای توسعه زیرساخت‌های تجارت الکترونیک در ایران آغاز خواهد کرد. 
این پروژه در تهران و اصفهان اجرا می‌شود و پیش‌بینی می‌شود تا پایان سال ۱۴۰۲ تکمیل شود."
'''
# Get NER results
ner_results = nlp(example)

# Function to merge subword entities
def merge_entities(entities):
    merged_results = []
    current_entity = None

    for entity in entities:
        if entity['entity'].startswith("B-") or current_entity is None:
            # Start a new entity
            if current_entity:
                merged_results.append(current_entity)
            current_entity = {
                "word": entity['word'].strip(),
                "entity": entity['entity'][2:],  # Remove "B-" prefix
                "score": entity['score'],
                "start": entity['start'],
                "end": entity['end'],
            }
        elif entity['entity'].startswith("I-") and current_entity:
            # Continue the current entity
            current_entity['word'] += entity['word'].strip()
            current_entity['score'] = min(current_entity['score'], entity['score'])  # Use the lowest score
            current_entity['end'] = entity['end']
    
    # Add the last entity if any
    if current_entity:
        merged_results.append(current_entity)

    return merged_results

# Merge the entities
merged_results = merge_entities(ner_results)

# Display the merged results
print("Named Entity Recognition Results:")
for entity in merged_results:
    print(f"- Entity: {entity['word']}")
    print(f"  Type: {entity['entity']}")
    print(f"  Score: {entity['score']:.2f}")
    print(f"  Start: {entity['start']}, End: {entity['end']}")
    print("-" * 40)