DTAEC Type Normalizer

This model was trained from scratch to normalize historic German spelling to contemporary orthography. It is type-based: it takes a single token (without whitespace) as input and generates the normalized variant. It achieves the following results on the evaluation set:

  • Loss: 0.0308
  • Wordacc: 0.9546
  • Wordacc Oov: 0.9096
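
The two accuracy figures can be read as exact-match rates over types. The sketch below is an illustration only, not the evaluation code used for this model: it assumes that "Wordacc" is the share of predictions identical to the reference form, and that "Wordacc Oov" is the same measure restricted to historic types that did not occur in the training data. The function names and the toy data are hypothetical.

```python
def word_accuracy(predictions, references):
    """Share of predictions that exactly match their reference (assumed definition)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty evaluation set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def oov_word_accuracy(inputs, predictions, references, train_vocab):
    """Word accuracy over inputs that are not in the training vocabulary (assumed definition)."""
    kept = [(p, r) for x, p, r in zip(inputs, predictions, references)
            if x not in train_vocab]
    if not kept:
        raise ValueError("no out-of-vocabulary items in the evaluation set")
    return word_accuracy([p for p, _ in kept], [r for _, r in kept])


# Toy data, invented for illustration.
inputs = ["Freyheit", "seyn", "Thür"]
predictions = ["Freiheit", "sein", "Thür"]
references = ["Freiheit", "sein", "Tür"]
train_vocab = {"seyn"}  # hypothetical set of historic types seen in training

acc = word_accuracy(predictions, references)
acc_oov = oov_word_accuracy(inputs, predictions, references, train_vocab)
```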

Note: This model is part of a larger system, which uses an additional GPT-based model to disambiguate different normalization forms by taking in the full context. See https://github.com/aehrm/hybrid_textnorm.

Training and evaluation data

The model has been trained on the DTA-EC Parallel Corpus Lexicon (aehrm/dtaec-lexica), which is derived from a parallel corpus of the Deutsches Textarchiv (German Text Archive) that aligns historic prints of documents with their modern editions in contemporary orthography.

Training was done at the type level: given the historic form of a type, the model must predict the corresponding normalized type that appeared most frequently in the parallel corpus.
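
As a rough illustration of this type-level target, the sketch below derives a lexicon from aligned (historic, normalized) token pairs by keeping the most frequent normalization per historic type. It is not the procedure used to build aehrm/dtaec-lexica; the function name `build_type_lexicon` and the toy pairs are made up for the example.

```python
from collections import Counter, defaultdict


def build_type_lexicon(aligned_pairs):
    """Map each historic type to its most frequent normalized form."""
    counts = defaultdict(Counter)
    for historic, normalized in aligned_pairs:
        counts[historic][normalized] += 1
    # most_common(1) returns the top (form, count) entry for each historic type.
    return {historic: forms.most_common(1)[0][0]
            for historic, forms in counts.items()}


# Toy aligned token pairs, invented for illustration.
pairs = [
    ("seyn", "sein"),
    ("seyn", "sein"),
    ("seyn", "seien"),
    ("Freyheit", "Freiheit"),
]
lexicon = build_type_lexicon(pairs)
```

Each resulting (historic, normalized) entry would then serve as one training example, which is why a type with several possible normalizations always maps to a single target here.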

Demo Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('aehrm/dtaec-type-normalizer')
model = AutoModelForSeq2SeqLM.from_pretrained('aehrm/dtaec-type-normalizer')

# Note: you CANNOT normalize full sentences, only word for word!
model_in = tokenizer(['Freyheit', 'seyn', 'ſelbstthätig'], return_tensors='pt', padding=True)
model_out = model.generate(**model_in)

print(tokenizer.batch_decode(model_out, skip_special_tokens=True))
# >>> ['Freiheit', 'sein', 'selbsttätig']

Or, more compactly, using the Hugging Face pipeline:

from transformers import pipeline

pipe = pipeline(model="aehrm/dtaec-type-normalizer")
out = pipe(['Freyheit', 'seyn', 'ſelbstthätig'])

print(out)
# >>> [{'generated_text': 'Freiheit'}, {'generated_text': 'sein'}, {'generated_text': 'selbsttätig'}]
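
Because the model only accepts single types, running text has to be tokenized first and normalized word by word. The helper below is a minimal sketch of that idea, assuming the text is already split into tokens: it sends each distinct type to the normalizer once and maps the results back to the token sequence. `normalize_tokens` and its `normalize_types` argument are hypothetical names, not part of the model or of hybrid_textnorm.

```python
def normalize_tokens(tokens, normalize_types):
    """Normalize a token sequence type by type.

    normalize_types is any callable that takes a list of historic types
    and returns their normalized forms in the same order.
    """
    tokens = list(tokens)
    # Order-preserving deduplication, so each type is normalized only once.
    unique_types = list(dict.fromkeys(tokens))
    if not unique_types:
        return []
    lookup = dict(zip(unique_types, normalize_types(unique_types)))
    return [lookup[token] for token in tokens]


# Possible use with the pipeline shown above (requires transformers and
# downloads the model, so it is left as a comment here):
#   pipe = pipeline(model="aehrm/dtaec-type-normalizer")
#   run = lambda types: [r["generated_text"] for r in pipe(types)]
#   normalize_tokens(["die", "Freyheit", "zu", "seyn"], run)
```

Splitting on whitespace alone leaves punctuation attached to words, so a proper tokenizer is advisable before calling such a helper.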

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 64
  • seed: 12345
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 20
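
To make the linear schedule concrete, the sketch below computes the learning rate at a given optimizer step. It assumes no warmup (none is listed above) and takes the total step count from the training results, which report 12628 steps per epoch over 20 epochs. The function `linear_lr` is a hypothetical helper written for this illustration, not the scheduler implementation used in training.

```python
BASE_LR = 1e-4           # learning_rate listed above
STEPS_PER_EPOCH = 12628  # step count after epoch 1 in the training results
NUM_EPOCHS = 20
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Linear decay to zero, with an optional linear warmup (assumed to be 0 here)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


lr_start = linear_lr(0)
lr_halfway = linear_lr(TOTAL_STEPS // 2)
lr_end = linear_lr(TOTAL_STEPS)
```

Under these assumptions the rate starts at the base value, is halved after epoch 10, and reaches zero at the final step.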

Training results

Training Loss  Epoch  Step    Validation Loss  Wordacc  Wordacc Oov  Gen Len
0.0912         1.0    12628   0.0698           0.8984   0.8421       12.3456
0.0746         2.0    25256   0.0570           0.9124   0.8584       12.3442
0.0622         3.0    37884   0.0493           0.9195   0.8717       12.3512
0.0584         4.0    50512   0.0465           0.9221   0.8749       12.3440
0.0497         5.0    63140   0.0436           0.9274   0.8821       12.3552
0.0502         6.0    75768   0.0411           0.9311   0.8858       12.3519
0.0428         7.0    88396   0.0396           0.9336   0.8878       12.3444
0.0416         8.0    101024  0.0372           0.9339   0.8887       12.3471
0.042          9.0    113652  0.0365           0.9396   0.8944       12.3485
0.0376         10.0   126280  0.0353           0.9412   0.8962       12.3485
0.031          11.0   138908  0.0339           0.9439   0.9008       12.3519
0.0298         12.0   151536  0.0337           0.9454   0.9013       12.3479
0.0302         13.0   164164  0.0322           0.9470   0.9043       12.3483
0.0277         14.0   176792  0.0316           0.9479   0.9040       12.3506
0.0277         15.0   189420  0.0323           0.9488   0.9030       12.3514
0.0245         16.0   202048  0.0314           0.9513   0.9072       12.3501
0.0235         17.0   214676  0.0313           0.9520   0.9071       12.3511
0.0206         18.0   227304  0.0310           0.9531   0.9084       12.3502
0.0178         19.0   239932  0.0307           0.9545   0.9094       12.3507
0.016          20.0   252560  0.0308           0.9546   0.9096       12.3516

Framework versions

  • Transformers 4.41.2
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1

License

The model weights are marked with CC0 1.0 Universal.

NOTE: This model and its inferences or derivative works may be considered an Adaptation of

Conditions on attribution and/or restrictions to commercial use may apply.

Model size (Safetensors): 7.93M params, tensor type F32