This is a SpanMarker model trained on the Acronym Identification dataset that can be used for Named Entity Recognition. This SpanMarker model uses bert-base-cased as the underlying encoder. See train.py for the training script.
Is your data not (always) capitalized correctly? Then consider using the uncased variant of this model instead for better performance: tomaarsen/span-marker-bert-base-uncased-acronyms.
| Label | Examples | 
|---|---|
| long | "Conversational Question Answering", "controlled natural language", "successive convex approximation" | 
| short | "SODA", "CNL", "CoQA" | 

Evaluation metrics:

| Label | Precision | Recall | F1 | 
|---|---|---|---|
| all | 0.9422 | 0.9252 | 0.9336 | 
| long | 0.9308 | 0.9013 | 0.9158 | 
| short | 0.9479 | 0.9374 | 0.9426 | 
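
As a quick sanity check, the F1 column can be reproduced from the precision and recall columns, since F1 is their harmonic mean. The sketch below hard-codes the values from the table above; the helper name `f1_score` is chosen here for illustration and is not part of SpanMarker.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "all": (0.9422, 0.9252, 0.9336),
    "long": (0.9308, 0.9013, 0.9158),
    "short": (0.9479, 0.9374, 0.9426),
}

for label, (p, r, f1) in reported.items():
    print(f"{label}: computed F1 = {f1_score(p, r):.4f}, reported F1 = {f1}")
```
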
from span_marker import SpanMarkerModel
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-acronyms")
# Run inference
entities = model.predict("Compression algorithms like Principal Component Analysis (PCA) can reduce noise and complexity.")
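
`model.predict` returns a list of dictionaries, one per detected span. The sketch below assumes each dictionary carries `span`, `label`, `char_start_index`, and `char_end_index` keys, as in SpanMarker's documented output. The `sample_entities` list is hand-written to illustrate the shape and is not actual model output, and `pair_acronyms` is a hypothetical helper that links each short form to the closest long form preceding it.

```python
from typing import Dict, List, Tuple


def pair_acronyms(entities: List[Dict]) -> List[Tuple[str, str]]:
    """Pair each "short" span with the nearest "long" span that precedes it."""
    pairs = []
    last_long = None
    for entity in sorted(entities, key=lambda e: e["char_start_index"]):
        if entity["label"] == "long":
            last_long = entity["span"]
        elif entity["label"] == "short" and last_long is not None:
            pairs.append((entity["span"], last_long))
            last_long = None  # use each long form at most once
    return pairs


# Hand-written stand-in for `entities = model.predict(...)`; scores are made up.
sample_entities = [
    {"span": "Principal Component Analysis", "label": "long", "score": 0.99,
     "char_start_index": 28, "char_end_index": 56},
    {"span": "PCA", "label": "short", "score": 0.99,
     "char_start_index": 58, "char_end_index": 61},
]

print(pair_acronyms(sample_entities))
# [('PCA', 'Principal Component Analysis')]
```

This simple nearest-preceding heuristic only fits the common "Long Form (LF)" pattern; sentences that introduce the acronym first would need a different pairing rule.
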
You can finetune this model on your own dataset.
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer
# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-acronyms")
# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003
# Initialize a Trainer using the pretrained model & dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-bert-base-acronyms-finetuned")
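
For your own data, each training example needs a `tokens` list and a parallel `ner_tags` list. The following sketch converts span-level annotations over whitespace tokens into integer IOB2 tags. The label list and its ordering, the `to_example` helper, and the sample sentence are illustrative assumptions; check the label scheme your dataset actually uses before training.

```python
from typing import Dict, List, Tuple

# Hypothetical IOB2 label list; the order defines the integer ids.
LABELS = ["O", "B-long", "I-long", "B-short", "I-short"]
LABEL2ID = {label: idx for idx, label in enumerate(LABELS)}


def to_example(tokens: List[str], spans: List[Tuple[int, int, str]]) -> Dict[str, list]:
    """Build one row with "tokens" and "ner_tags".

    Each span is (start_token, end_token_exclusive, entity_type).
    """
    tags = [LABEL2ID["O"]] * len(tokens)
    for start, end, entity_type in spans:
        tags[start] = LABEL2ID[f"B-{entity_type}"]
        for i in range(start + 1, end):
            tags[i] = LABEL2ID[f"I-{entity_type}"]
    return {"tokens": tokens, "ner_tags": tags}


tokens = "We use Principal Component Analysis ( PCA ) here".split()
example = to_example(tokens, [(2, 5, "long"), (6, 7, "short")])
print(example["ner_tags"])
```

Rows built this way could then be collected into a `datasets.Dataset` (for instance with `Dataset.from_list`) and passed to the `Trainer` as `train_dataset`.
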
| Training set | Min | Median | Max | 
|---|---|---|---|
| Sentence length | 4 | 32.3372 | 170 | 
| Entities per sentence | 0 | 2.6775 | 24 | 

Training results:

| Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy | 
|---|---|---|---|---|---|---|
| 0.3101 | 200 | 0.0083 | 0.9170 | 0.8894 | 0.9030 | 0.9766 | 
| 0.6202 | 400 | 0.0063 | 0.9329 | 0.9149 | 0.9238 | 0.9807 | 
| 0.9302 | 600 | 0.0060 | 0.9279 | 0.9338 | 0.9309 | 0.9819 | 
| 1.2403 | 800 | 0.0058 | 0.9406 | 0.9092 | 0.9247 | 0.9812 | 
| 1.5504 | 1000 | 0.0056 | 0.9453 | 0.9155 | 0.9302 | 0.9825 | 
| 1.8605 | 1200 | 0.0054 | 0.9411 | 0.9271 | 0.9340 | 0.9831 | 
Carbon emissions were measured using CodeCarbon.
@software{Aarsen_SpanMarker,
    author = {Aarsen, Tom},
    license = {Apache-2.0},
    title = {{SpanMarker for Named Entity Recognition}},
    url = {https://github.com/tomaarsen/SpanMarkerNER}
}
Base model: google-bert/bert-base-cased