---
datasets:
- aehrm/dtaec-lexica
language: de
pipeline_tag: translation
model-index:
- name: aehrm/dtaec-type-normalizer
  results:
  - task:
      name: Historic Text Normalization (type-level)
      type: translation
    dataset:
      name: DTA EvalCorpus Lexicon
      type: aehrm/dtaec-lexicon
      split: dev
    metrics:
    - name: Word Accuracy
      type: accuracy
      value: 0.9546
    - name: Word Accuracy OOV
      type: accuracy
      value: 0.9096
---

# DTAEC Type Normalizer
This model is trained from scratch to normalize historic German spelling to contemporary orthography. It is type-based, meaning it takes only a single token (without whitespace) as input and generates the normalized variant. It achieves the following results on the evaluation set:
- Loss: 0.0308
- Wordacc: 0.9546
- Wordacc Oov: 0.9096
Note: This model is part of a larger system that uses an additional GPT-based model to disambiguate between different normalization forms by taking the full context into account. See https://github.com/aehrm/hybrid_textnorm.
## Training and evaluation data
The model has been trained on the DTA-EC Parallel Corpus Lexicon (aehrm/dtaec-lexica), which is derived from a parallel corpus of the Deutsches Textarchiv (German Text Archive) that aligns historic prints of documents with their modern editions in contemporary orthography.

Training was done on the type level: given the historic form of a type, the model must predict the corresponding normalized type that appeared most frequently in the parallel corpus.
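For illustration, here is a minimal sketch of how such type-level training pairs could be derived with the `datasets` library. The column names `orig` and `norm` and the split name are assumptions, not the documented schema of the lexicon; consult the dataset card for the actual fields.

```python
from collections import Counter
from datasets import load_dataset

# Hypothetical schema: one row per aligned token pair, with the historic
# form in 'orig' and the normalized form in 'norm'. The real column and
# split names may differ; see the dataset card of aehrm/dtaec-lexica.
lexicon = load_dataset('aehrm/dtaec-lexica', split='train')

candidates = {}
for row in lexicon:
    candidates.setdefault(row['orig'], Counter())[row['norm']] += 1

# For each historic type, keep the normalization that appears most frequently.
training_pairs = {
    orig: counts.most_common(1)[0][0] for orig, counts in candidates.items()
}
```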
## Demo Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('aehrm/dtaec-type-normalizer')
model = AutoModelForSeq2SeqLM.from_pretrained('aehrm/dtaec-type-normalizer')

# Note: you CANNOT normalize full sentences, only word for word!
model_in = tokenizer(['Freyheit', 'seyn', 'ſelbstthätig'], return_tensors='pt', padding=True)
model_out = model.generate(**model_in)

print(tokenizer.batch_decode(model_out, skip_special_tokens=True))
# >>> ['Freiheit', 'sein', 'selbsttätig']
```
Or, more compactly, using the Hugging Face `pipeline`:
```python
from transformers import pipeline

pipe = pipeline(model="aehrm/dtaec-type-normalizer")
out = pipe(['Freyheit', 'seyn', 'ſelbstthätig'])

print(out)
# >>> [{'generated_text': 'Freiheit'}, {'generated_text': 'sein'}, {'generated_text': 'selbsttätig'}]
```
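Because the model is strictly type-based, a full sentence has to be split into tokens and normalized token by token. Below is a minimal sketch using naive whitespace tokenization; punctuation handling and context-dependent disambiguation (which the hybrid system linked above provides) are out of scope here.

```python
from transformers import pipeline

pipe = pipeline(model="aehrm/dtaec-type-normalizer")

def normalize_sentence(sentence):
    """Normalize a sentence token by token using naive whitespace splitting."""
    tokens = sentence.split()
    outputs = pipe(tokens)
    return ' '.join(o['generated_text'] for o in outputs)

print(normalize_sentence('Die Freyheit muß ſelbstthätig seyn'))
```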
## Training hyperparameters
The following hyperparameters were used during training (see the sketch after this list):
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 64
- seed: 12345
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 20
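For illustration, these settings roughly correspond to the following `Seq2SeqTrainingArguments`. This is a reconstruction under the standard `transformers` Trainer API, not the original training script; `output_dir` is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Approximate reconstruction of the settings above; the Adam betas/epsilon
# are the library defaults, and output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir='dtaec-type-normalizer',
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=64,
    seed=12345,
    lr_scheduler_type='linear',
    num_train_epochs=20,
)
```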
## Training results
Training Loss | Epoch | Step | Validation Loss | Wordacc | Wordacc Oov | Gen Len |
---|---|---|---|---|---|---|
0.0912 | 1.0 | 12628 | 0.0698 | 0.8984 | 0.8421 | 12.3456 |
0.0746 | 2.0 | 25256 | 0.0570 | 0.9124 | 0.8584 | 12.3442 |
0.0622 | 3.0 | 37884 | 0.0493 | 0.9195 | 0.8717 | 12.3512 |
0.0584 | 4.0 | 50512 | 0.0465 | 0.9221 | 0.8749 | 12.3440 |
0.0497 | 5.0 | 63140 | 0.0436 | 0.9274 | 0.8821 | 12.3552 |
0.0502 | 6.0 | 75768 | 0.0411 | 0.9311 | 0.8858 | 12.3519 |
0.0428 | 7.0 | 88396 | 0.0396 | 0.9336 | 0.8878 | 12.3444 |
0.0416 | 8.0 | 101024 | 0.0372 | 0.9339 | 0.8887 | 12.3471 |
0.042 | 9.0 | 113652 | 0.0365 | 0.9396 | 0.8944 | 12.3485 |
0.0376 | 10.0 | 126280 | 0.0353 | 0.9412 | 0.8962 | 12.3485 |
0.031 | 11.0 | 138908 | 0.0339 | 0.9439 | 0.9008 | 12.3519 |
0.0298 | 12.0 | 151536 | 0.0337 | 0.9454 | 0.9013 | 12.3479 |
0.0302 | 13.0 | 164164 | 0.0322 | 0.9470 | 0.9043 | 12.3483 |
0.0277 | 14.0 | 176792 | 0.0316 | 0.9479 | 0.9040 | 12.3506 |
0.0277 | 15.0 | 189420 | 0.0323 | 0.9488 | 0.9030 | 12.3514 |
0.0245 | 16.0 | 202048 | 0.0314 | 0.9513 | 0.9072 | 12.3501 |
0.0235 | 17.0 | 214676 | 0.0313 | 0.9520 | 0.9071 | 12.3511 |
0.0206 | 18.0 | 227304 | 0.0310 | 0.9531 | 0.9084 | 12.3502 |
0.0178 | 19.0 | 239932 | 0.0307 | 0.9545 | 0.9094 | 12.3507 |
0.016 | 20.0 | 252560 | 0.0308 | 0.9546 | 0.9096 | 12.3516 |
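The Wordacc column reports exact-match accuracy over types, and Wordacc Oov restricts the score to types not seen in the training lexicon. A minimal sketch of such a metric follows; the actual evaluation script may differ, and the example values are made up for illustration.

```python
def word_accuracy(predictions, references, mask=None):
    """Fraction of types whose predicted normalization exactly matches the reference.

    Passing a boolean mask (e.g. marking types unseen in the training lexicon)
    restricts the score to those types, giving an OOV variant of the metric.
    """
    if mask is None:
        mask = [True] * len(predictions)
    selected = [(p, r) for p, r, keep in zip(predictions, references, mask) if keep]
    return sum(p == r for p, r in selected) / len(selected)

# made-up example values, for illustration only
preds = ['Freiheit', 'sein', 'selbsttätig']
refs  = ['Freiheit', 'sein', 'selbstthätig']
print(word_accuracy(preds, refs))  # 2/3 ≈ 0.667
```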
## Framework versions
- Transformers 4.41.2
- Pytorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1
## License
The model weights are marked with CC0 1.0 Universal.
NOTE: This model and its inferences or derivative works may be considered an Adaptation of the DTA EvalCorpus by Bryan Jurish, Henriette Ast, Marko Drotschmann, and Christian Thomas, licensed under the Creative Commons Attribution-NonCommercial 3.0 Unported License. Limitations to commercial use may apply.