datasets:
  - aehrm/dtaec-lexica
language: de
pipeline_tag: translation
model-index:
  - name: aehrm/dtaec-type-normalizer
    results:
      - task:
          name: Historic Text Normalization (type-level)
          type: translation
        dataset:
          name: DTA-EC Lexicon (dev)
          type: aehrm/dtaec-lexica
          split: dev
        metrics:
          - name: Word Accuracy
            type: accuracy
            value: 0.9546
          - name: Word Accuracy OOV
            type: accuracy
            value: 0.9096

DTAEC Type Normalizer

This model was trained from scratch to normalize historic German spelling into contemporary orthography. It is type-based: it takes a single token (without whitespace) as input and generates the normalized variant. It achieves the following results on the evaluation set:

  • Loss: 0.0308
  • Wordacc: 0.9546
  • Wordacc Oov: 0.9096
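For orientation, word accuracy can be read as the fraction of types whose predicted normalization exactly matches the reference. The sketch below assumes that reading, and further assumes that the OOV variant restricts the score to historic types not seen during training. The function name `word_accuracy` and the toy data are illustrative and not taken from the model's evaluation code.

```python
def word_accuracy(predictions, references, historic_types=None, train_vocab=None):
    """Fraction of exact matches; optionally restricted to out-of-vocabulary types."""
    pairs = list(zip(predictions, references))
    if train_vocab is not None:
        # Assumption: "OOV" means the historic input type was not seen in training.
        pairs = [p for p, h in zip(pairs, historic_types) if h not in train_vocab]
    if not pairs:
        return 0.0
    return sum(pred == ref for pred, ref in pairs) / len(pairs)


# Toy data for illustration only.
historic = ["Freyheit", "seyn", "selbstthätig", "Thür"]
predicted = ["Freiheit", "sein", "selbsttätig", "Thür"]
reference = ["Freiheit", "sein", "selbsttätig", "Tür"]
train_vocab = {"Freyheit", "seyn"}  # toy set of types seen during training

overall = word_accuracy(predicted, reference)
oov = word_accuracy(predicted, reference, historic, train_vocab)
print(overall, oov)
```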

Note: This model is part of a larger system that uses an additional GPT-based model to disambiguate between alternative normalizations by taking the full context into account.

Training and evaluation data

The model was trained on the DTA-EC Parallel Corpus Lexicon (aehrm/dtaec-lexica), which is derived from a parallel corpus of the Deutsches Textarchiv (German Text Archive), in which historic prints of documents are aligned with their modern editions in contemporary orthography.

Training was done at the type level: given the historic form of a type, the model must predict the corresponding normalized type that appeared most frequently in the parallel corpus.
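The selection of the most frequent normalization per historic type can be sketched as follows. This is a minimal illustration on made-up token pairs, not the procedure used to build aehrm/dtaec-lexica; the function name `build_type_lexicon` is hypothetical.

```python
from collections import Counter, defaultdict


def build_type_lexicon(aligned_tokens):
    """Map each historic type to its most frequent normalized type."""
    counts = defaultdict(Counter)
    for historic, normalized in aligned_tokens:
        counts[historic][normalized] += 1
    # most_common(1) returns the normalization with the highest count;
    # ties are resolved in favor of the form encountered first.
    return {hist: norm.most_common(1)[0][0] for hist, norm in counts.items()}


# Toy aligned token pairs (historic, normalized); not taken from the corpus.
aligned = [
    ("seyn", "sein"),
    ("seyn", "sein"),
    ("seyn", "seien"),
    ("Freyheit", "Freiheit"),
]
lexicon = build_type_lexicon(aligned)
print(lexicon)
# {'seyn': 'sein', 'Freyheit': 'Freiheit'}
```

Each entry of such a lexicon would then serve as one input/target pair for the sequence-to-sequence model.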

Demo Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('aehrm/dtaec-type-normalizer')
model = AutoModelForSeq2SeqLM.from_pretrained('aehrm/dtaec-type-normalizer')

# Note: you CANNOT normalize full sentences, only word for word!
model_in = tokenizer(['Freyheit', 'seyn', 'selbstthätig'], return_tensors='pt', padding=True)
model_out = model.generate(**model_in)

print(tokenizer.batch_decode(model_out, skip_special_tokens=True))
# >>> ['Freiheit', 'sein', 'selbsttätig']

Or, more compactly, using the Hugging Face pipeline:

from transformers import pipeline

pipe = pipeline(model="aehrm/dtaec-type-normalizer")
out = pipe(['Freyheit', 'seyn', 'selbstthätig'])

print(out)
# >>> [{'generated_text': 'Freiheit'}, {'generated_text': 'sein'}, {'generated_text': 'selbsttätig'}]
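Because the model only accepts single tokens, running text has to be split before normalization. The following sketch shows one way to do that: split on whitespace, normalize each unique type once, and reassemble the sentence. Here `normalize_types` is a hypothetical callable standing in for the tokenizer and `model.generate` calls from the demo, and whitespace splitting is a simplifying assumption (punctuation attached to words is not handled).

```python
def normalize_sentence(sentence, normalize_types):
    """Normalize a sentence token by token using a type-level normalizer.

    `normalize_types` is a hypothetical callable taking a list of types and
    returning the normalized forms in the same order.
    """
    tokens = sentence.split()  # simplifying assumption: whitespace tokenization
    types = list(dict.fromkeys(tokens))  # unique types, order preserved
    mapping = dict(zip(types, normalize_types(types)))
    return " ".join(mapping[tok] for tok in tokens)


# Stand-in normalizer for illustration; replace it with a function that calls
# tokenizer(...) and model.generate(...) as in the demo.
toy_lexicon = {"Freyheit": "Freiheit", "seyn": "sein"}


def toy_normalizer(types):
    return [toy_lexicon.get(t, t) for t in types]


result = normalize_sentence("Die Freyheit muß seyn", toy_normalizer)
print(result)
```

Normalizing each unique type only once keeps the number of model calls small for longer texts, where many tokens repeat.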

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0001
  • train_batch_size: 8
  • eval_batch_size: 64
  • seed: 12345
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 20
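With a linear scheduler, the learning rate decays linearly from its initial value to zero over the course of training. The sketch below reproduces that shape in plain Python. It assumes zero warmup steps, since no warmup setting is listed here, and takes the total of 252560 optimizer steps from the final row of the training results; the function name `linear_lr` is illustrative.

```python
def linear_lr(step, base_lr=1e-4, total_steps=252560, warmup_steps=0):
    """Learning rate under a linear warmup/decay schedule.

    Assumption: warmup_steps=0, since the card does not list a warmup setting.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


steps_per_epoch = 252560 // 20  # 12628 steps per epoch
for epoch in (0, 10, 20):
    # Print the learning rate at the start, midpoint, and end of training.
    print(epoch, linear_lr(epoch * steps_per_epoch))
```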

Training results

| Training Loss | Epoch | Step | Validation Loss | Wordacc | Wordacc Oov | Gen Len |
|---|---|---|---|---|---|---|
| 0.0912 | 1.0 | 12628 | 0.0698 | 0.8984 | 0.8421 | 12.3456 |
| 0.0746 | 2.0 | 25256 | 0.0570 | 0.9124 | 0.8584 | 12.3442 |
| 0.0622 | 3.0 | 37884 | 0.0493 | 0.9195 | 0.8717 | 12.3512 |
| 0.0584 | 4.0 | 50512 | 0.0465 | 0.9221 | 0.8749 | 12.3440 |
| 0.0497 | 5.0 | 63140 | 0.0436 | 0.9274 | 0.8821 | 12.3552 |
| 0.0502 | 6.0 | 75768 | 0.0411 | 0.9311 | 0.8858 | 12.3519 |
| 0.0428 | 7.0 | 88396 | 0.0396 | 0.9336 | 0.8878 | 12.3444 |
| 0.0416 | 8.0 | 101024 | 0.0372 | 0.9339 | 0.8887 | 12.3471 |
| 0.042 | 9.0 | 113652 | 0.0365 | 0.9396 | 0.8944 | 12.3485 |
| 0.0376 | 10.0 | 126280 | 0.0353 | 0.9412 | 0.8962 | 12.3485 |
| 0.031 | 11.0 | 138908 | 0.0339 | 0.9439 | 0.9008 | 12.3519 |
| 0.0298 | 12.0 | 151536 | 0.0337 | 0.9454 | 0.9013 | 12.3479 |
| 0.0302 | 13.0 | 164164 | 0.0322 | 0.9470 | 0.9043 | 12.3483 |
| 0.0277 | 14.0 | 176792 | 0.0316 | 0.9479 | 0.9040 | 12.3506 |
| 0.0277 | 15.0 | 189420 | 0.0323 | 0.9488 | 0.9030 | 12.3514 |
| 0.0245 | 16.0 | 202048 | 0.0314 | 0.9513 | 0.9072 | 12.3501 |
| 0.0235 | 17.0 | 214676 | 0.0313 | 0.9520 | 0.9071 | 12.3511 |
| 0.0206 | 18.0 | 227304 | 0.0310 | 0.9531 | 0.9084 | 12.3502 |
| 0.0178 | 19.0 | 239932 | 0.0307 | 0.9545 | 0.9094 | 12.3507 |
| 0.016 | 20.0 | 252560 | 0.0308 | 0.9546 | 0.9096 | 12.3516 |

Framework versions

  • Transformers 4.41.2
  • Pytorch 2.3.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1