Scandi NER Model 🏔️

A multilingual Named Entity Recognition (NER) model trained on multiple Scandinavian-language datasets plus English. The model identifies Person (PER), Organization (ORG), and Location (LOC) entities.

Model Description

This model is based on MediaCatch/mmBERT-base-scandi-ner and has been fine-tuned for token classification on a combined dataset of Scandi NER corpora. It supports:

  • 🇩🇰 Danish - Multiple high-quality datasets including DaNE
  • 🇸🇪 Swedish - SUC 3.0, Swedish NER corpus, and more
  • 🇳🇴 Norwegian - NorNE (Bokmål and Nynorsk)
  • 🇬🇧 English - CoNLL-2003 and additional datasets

Performance

The model achieves the following performance on the held-out test set:

  • F1 Score: 0.8330
  • Precision: 0.8455
  • Recall: 0.8208
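
These are entity-level scores of the kind the seqeval library computes (whether seqeval was used here is an assumption). A minimal sketch of how such metrics are calculated, with made-up gold and predicted BIO sequences rather than real model output:

# Entity-level precision/recall/F1 with seqeval; the tag sequences are illustrative
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "O"]]

print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1:        {f1_score(y_true, y_pred):.4f}")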

Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("MediaCatch/mmBERT-base-scandi-ner-gold")
model = AutoModelForTokenClassification.from_pretrained("MediaCatch/mmBERT-base-scandi-ner-gold")

# Create NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example usage (Swedish: "Barack Obama visited Stockholm and met Stefan Löfven.")
text = "Barack Obama besökte Stockholm och träffade Stefan Löfven."
entities = ner_pipeline(text)

for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.3f})")

Supported Entity Types

The model predicts the following entity types using BIO tagging:

  • PER (Person): names of people
  • ORG (Organization): companies, institutions, organizations
  • LOC (Location): geographic locations and places
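
Under the BIO scheme, B- marks the first token of an entity, I- marks tokens that continue it, and O marks everything else. A small illustration (the sentence and tags below are our own example, not training data):

# BIO-tagged tokens for "Stefan Löfven besökte Oslo." ("Stefan Löfven visited Oslo.")
tokens = ["Stefan", "Löfven", "besökte", "Oslo", "."]
labels = ["B-PER", "I-PER", "O", "B-LOC", "O"]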

Training Data

The model was trained on a combination of the following datasets:

  • eriktks/conll2003: 20,682 examples
  • NbAiLab/norne_bokmaal-7: 20,044 examples
  • NbAiLab/norne_nynorsk-7: 17,575 examples
  • KBLab/sucx3_ner_original_lower: 71,915 examples
  • alexandrainst/dane: 5,508 examples
  • ljos/norwegian_ner_nynorsk: 17,575 examples
  • ljos/norwegian_ner_bokmaal: 20,044 examples
  • chcaa/dansk-ner: 14,651 examples
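
A minimal sketch of how corpora like these can be combined with the datasets library. It assumes each corpus has already been remapped to the shared BIO tag set over PER/ORG/LOC (that remapping is not shown, and the split names are assumptions):

from datasets import load_dataset, concatenate_datasets

# Load two of the source corpora (split names are assumptions)
dane = load_dataset("alexandrainst/dane", split="train")
conll = load_dataset("eriktks/conll2003", split="train")

# Once both datasets use identical features and the shared label set,
# the combined training set is a plain concatenation
combined = concatenate_datasets([dane, conll])
print(len(combined))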

Dataset Statistics

Total examples: 187,994
Average sequence length: 13.8 tokens
Languages: en, no, sv, da, unknown
Label distribution:

  • B-ORG: 34,074 (1.3%)
  • O: 2,430,605 (93.5%)
  • B-PER: 49,389 (1.9%)
  • I-PER: 28,150 (1.1%)
  • B-LOC: 37,803 (1.5%)
  • I-ORG: 13,647 (0.5%)
  • I-LOC: 5,203 (0.2%)
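
A distribution like this can be recomputed from any processed split; a small sketch using collections.Counter (the ner_tags field name and the in-line examples are assumptions, not the actual data):

from collections import Counter

# Count BIO labels over a token-classification dataset (structure is illustrative)
examples = [
    {"ner_tags": ["B-PER", "I-PER", "O", "O"]},
    {"ner_tags": ["O", "B-LOC", "O"]},
]

counts = Counter(tag for ex in examples for tag in ex["ner_tags"])
total = sum(counts.values())
for tag, n in counts.most_common():
    print(f"{tag}: {n} ({100 * n / total:.1f}%)")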

Training Details

Training Hyperparameters

  • Base model: MediaCatch/mmBERT-base-scandi-ner
  • Training epochs: 30
  • Batch size: 16
  • Learning rate: 2e-05
  • Warmup steps: 5000
  • Weight decay: 0.01

Training Infrastructure

  • Mixed precision: False
  • Gradient accumulation: 1
  • Early stopping: enabled with patience=3
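
A minimal sketch of these settings expressed as transformers TrainingArguments (the output directory, evaluation/saving strategy, and best-model metric are assumptions; early stopping additionally requires load_best_model_at_end):

from transformers import TrainingArguments, EarlyStoppingCallback

# Hyperparameters as reported above; output_dir and eval settings are assumptions
args = TrainingArguments(
    output_dir="scandi-ner",
    num_train_epochs=30,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    warmup_steps=5000,
    weight_decay=0.01,
    fp16=False,                     # mixed precision disabled
    gradient_accumulation_steps=1,
    eval_strategy="epoch",          # "evaluation_strategy" on transformers < 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# Passed to Trainer via callbacks=[...]
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)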

Usage Examples

Basic NER Tagging

text = "Olof Palme var Sveriges statsminister."
entities = ner_pipeline(text)
# Output: [{'entity_group': 'PER', 'word': 'Olof Palme', 'start': 0, 'end': 10, 'score': 0.999}]

Batch Processing

texts = [
    "Microsoft was founded by Bill Gates.",
    "Angela Merkel var förbundskansler i Tyskland.",  # Swedish: "Angela Merkel was chancellor of Germany."
    "Universitetet i Oslo ligger i Norge."            # Norwegian: "The University of Oslo is located in Norway."
]

# Passing the full list lets the pipeline process all texts in one call
for text, entities in zip(texts, ner_pipeline(texts)):
    print(f"Text: {text}")
    for entity in entities:
        print(f"  {entity['word']} -> {entity['entity_group']}")

Limitations and Considerations

  • Domain: primarily trained on news and Wikipedia text; performance may vary on other domains
  • Subword handling: the model uses subword tokenization; ensure predictions are properly aggregated back to words
  • Language mixing: while multilingual, performance is best when languages are not mixed within a sentence
  • Entity coverage: limited to PER, ORG, and LOC; the model does not detect MISC, DATE, or other entity types
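
The subword-handling point in practice: with aggregation_strategy="none" the pipeline returns one prediction per subword piece, while "simple" (used in Quick Start) merges pieces back into entity spans. A quick comparison, reusing model and tokenizer from Quick Start:

from transformers import pipeline

# Per-piece predictions ('entity' key) vs. merged spans ('entity_group' key)
raw = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="none")
merged = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Universitetet i Oslo ligger i Norge."
print(raw(text))     # may split long words into several subword entries
print(merged(text))  # one entry per detected entity span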
