---
datasets:
- aehrm/dtaec-lexica
language: de
pipeline_tag: translation
model-index:
  - name: aehrm/dtaec-type-normalizer
    results:
      - task:
          name: Historic Text Normalization (type-level)
          type: translation
        dataset:
          name: DTA EvalCorpus Lexicon
          type: aehrm/dtaec-lexicon
          split: dev
        metrics:
          - name: Word Accuracy
            type: accuracy
            value: 0.9546
          - name: Word Accuracy OOV
            type: accuracy
            value: 0.9096
---

# DTAEC Type Normalizer

This model was trained from scratch to normalize historic German spelling to contemporary orthography. It is type-based: it takes a single token (without whitespace) as input and generates the normalized variant.
It achieves the following results on the evaluation set:
- Loss: 0.0308
- Word accuracy (Wordacc): 0.9546
- Word accuracy on out-of-vocabulary types (Wordacc Oov): 0.9096
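
For readers who want to reproduce numbers of this kind, here is a minimal sketch of how the two accuracy figures could be computed. It is not the evaluation code used for this model. It assumes that word accuracy means exact string match per type, and that "OOV" refers to historic types absent from the training lexicon; both function names and the toy data are hypothetical.

```python
def word_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def oov_word_accuracy(inputs, predictions, references, train_vocab):
    """Word accuracy restricted to inputs not seen in training.

    ASSUMPTION: an input counts as OOV when its historic form is not
    contained in `train_vocab`; the model card does not define this.
    """
    kept = [(p, r) for x, p, r in zip(inputs, predictions, references)
            if x not in train_vocab]
    return word_accuracy([p for p, _ in kept], [r for _, r in kept])


# Toy data for illustration only.
inputs = ["seyn", "Thür", "Theil"]
predictions = ["sein", "Tür", "Theil"]
references = ["sein", "Tür", "Teil"]
train_vocab = {"seyn"}

acc = word_accuracy(predictions, references)
acc_oov = oov_word_accuracy(inputs, predictions, references, train_vocab)
print(round(acc, 4), round(acc_oov, 4))
# >>> 0.6667 0.5
```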

Note: This model is part of a larger system, which uses an additional GPT-based model to disambiguate between different normalization forms by taking the full context into account. See <https://github.com/aehrm/hybrid_textnorm>.

## Training and evaluation data

The model was trained on the DTA-EC Parallel Corpus Lexicon ([aehrm/dtaec-lexicon](https://huggingface.co/datasets/aehrm/dtaec-lexicon)), which is derived from a [parallel corpus](https://kaskade.dwds.de/~moocow/software/dtaec/) of the Deutsches Textarchiv (German Text Archive) that aligns historic prints of documents with their modern editions in contemporary orthography.

Training was done at the type level: given the historic form of a type, the model must predict the corresponding normalized type *that appeared most frequently in the parallel corpus*.
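
To make the "most frequent normalized type" target concrete, the following is a rough sketch of how such type-level training pairs might be derived from aligned tokens. It is an illustration under assumptions, not the preprocessing actually used to build the lexicon; `build_type_lexicon` and the toy pairs are hypothetical.

```python
from collections import Counter, defaultdict


def build_type_lexicon(aligned_tokens):
    """Map each historic type to its most frequent normalized form."""
    counts = defaultdict(Counter)
    for historic, normalized in aligned_tokens:
        counts[historic][normalized] += 1
    # most_common(1) yields the highest count; ties keep first-seen order.
    return {hist: c.most_common(1)[0][0] for hist, c in counts.items()}


# Toy aligned (historic, normalized) token pairs, not taken from the corpus.
pairs = [
    ("seyn", "sein"),
    ("seyn", "sein"),
    ("seyn", "seien"),
    ("Freyheit", "Freiheit"),
]

lexicon = build_type_lexicon(pairs)
print(lexicon)
# >>> {'seyn': 'sein', 'Freyheit': 'Freiheit'}
```

Because only the majority form survives, a historic type with several valid normalizations loses its minority readings at this stage, which is why context-free prediction cannot resolve such ambiguities.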

## Demo Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('aehrm/dtaec-type-normalizer')
model = AutoModelForSeq2SeqLM.from_pretrained('aehrm/dtaec-type-normalizer')

# Note: you CANNOT normalize full sentences, only word for word!
model_in = tokenizer(['Freyheit', 'seyn', 'ſelbstthätig'], return_tensors='pt', padding=True)
model_out = model.generate(**model_in)

print(tokenizer.batch_decode(model_out, skip_special_tokens=True))
# >>> ['Freiheit', 'sein', 'selbsttätig']
```

Or, more compactly, using the Hugging Face `pipeline`:

```python
from transformers import pipeline

pipe = pipeline(model="aehrm/dtaec-type-normalizer")
out = pipe(['Freyheit', 'seyn', 'ſelbstthätig'])

print(out)
# >>> [{'generated_text': 'Freiheit'}, {'generated_text': 'sein'}, {'generated_text': 'selbsttätig'}]
```
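
Since the model only accepts single tokens, running text has to be split first and reassembled afterwards. The sketch below shows one possible way to do this. It assumes a naive regex tokenizer (which will split words containing combining diacritics) and leaves punctuation and whitespace untouched. `normalize_text` and `toy_normalizer` are hypothetical names; `normalize_types` stands for any callable that maps a list of types to a list of normalized types, for example a wrapper around the `tokenizer`/`model.generate` calls shown above.

```python
import re


def normalize_text(text, normalize_types):
    """Normalize running text word by word with a type-level normalizer."""
    # Alternate between word chunks and non-word chunks (spaces, punctuation).
    chunks = re.findall(r"\w+|\W+", text)
    # Collect unique word types, preserving first-seen order.
    types = list(dict.fromkeys(c for c in chunks if re.match(r"\w", c)))
    mapping = dict(zip(types, normalize_types(types)))
    # Non-word chunks are not in the mapping and pass through unchanged.
    return "".join(mapping.get(c, c) for c in chunks)


# Stand-in for the model, so that the example runs without downloads.
def toy_normalizer(types):
    table = {"Freyheit": "Freiheit", "seyn": "sein"}
    return [table.get(t, t) for t in types]


print(normalize_text("Die Freyheit zu seyn.", toy_normalizer))
# >>> Die Freiheit zu sein.
```

Deduplicating types before calling the normalizer keeps the batch small for long texts, since each distinct type is normalized only once.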


## Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 8
- eval_batch_size: 64
- seed: 12345
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 20
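
As a rough illustration of what the `linear` scheduler type implies for the learning rate over the run, here is a small sketch. The total of 252560 steps is taken from the final row of the training results table; the number of warmup steps is not stated in this card, so zero warmup is an assumption, and `linear_lr` is a hypothetical helper rather than a library function.

```python
def linear_lr(step, total_steps, base_lr=1e-4, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 252560  # 20 epochs at 12628 steps each, per the results table

# ASSUMPTION: warmup_steps=0, since the card does not report a warmup.
for step in (0, 126280, 252560):
    print(step, linear_lr(step, TOTAL_STEPS))
```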

## Training results

| Training Loss | Epoch | Step   | Validation Loss | Wordacc | Wordacc Oov | Gen Len |
|:-------------:|:-----:|:------:|:---------------:|:-------:|:-----------:|:-------:|
| 0.0912        | 1.0   | 12628  | 0.0698          | 0.8984  | 0.8421      | 12.3456 |
| 0.0746        | 2.0   | 25256  | 0.0570          | 0.9124  | 0.8584      | 12.3442 |
| 0.0622        | 3.0   | 37884  | 0.0493          | 0.9195  | 0.8717      | 12.3512 |
| 0.0584        | 4.0   | 50512  | 0.0465          | 0.9221  | 0.8749      | 12.3440 |
| 0.0497        | 5.0   | 63140  | 0.0436          | 0.9274  | 0.8821      | 12.3552 |
| 0.0502        | 6.0   | 75768  | 0.0411          | 0.9311  | 0.8858      | 12.3519 |
| 0.0428        | 7.0   | 88396  | 0.0396          | 0.9336  | 0.8878      | 12.3444 |
| 0.0416        | 8.0   | 101024 | 0.0372          | 0.9339  | 0.8887      | 12.3471 |
| 0.042         | 9.0   | 113652 | 0.0365          | 0.9396  | 0.8944      | 12.3485 |
| 0.0376        | 10.0  | 126280 | 0.0353          | 0.9412  | 0.8962      | 12.3485 |
| 0.031         | 11.0  | 138908 | 0.0339          | 0.9439  | 0.9008      | 12.3519 |
| 0.0298        | 12.0  | 151536 | 0.0337          | 0.9454  | 0.9013      | 12.3479 |
| 0.0302        | 13.0  | 164164 | 0.0322          | 0.9470  | 0.9043      | 12.3483 |
| 0.0277        | 14.0  | 176792 | 0.0316          | 0.9479  | 0.9040      | 12.3506 |
| 0.0277        | 15.0  | 189420 | 0.0323          | 0.9488  | 0.9030      | 12.3514 |
| 0.0245        | 16.0  | 202048 | 0.0314          | 0.9513  | 0.9072      | 12.3501 |
| 0.0235        | 17.0  | 214676 | 0.0313          | 0.9520  | 0.9071      | 12.3511 |
| 0.0206        | 18.0  | 227304 | 0.0310          | 0.9531  | 0.9084      | 12.3502 |
| 0.0178        | 19.0  | 239932 | 0.0307          | 0.9545  | 0.9094      | 12.3507 |
| 0.016         | 20.0  | 252560 | 0.0308          | 0.9546  | 0.9096      | 12.3516 |


### Framework versions

- Transformers 4.41.2
- PyTorch 2.3.0+cu121
- Datasets 2.19.1
- Tokenizers 0.19.1