---
license: apache-2.0
base_model: google/byt5-small
tags:
- generated_from_trainer
model-index:
- name: model
  results: []
new_version: ybracke/transnormer-19c-beta-v02
---

# Transnormer 19th century (beta v01)

This model normalizes spelling variants in historical German text to their modern spelling.
It is a fine-tuned version of [google/byt5-small](https://huggingface.co/google/byt5-small) on a modified version of the [DTA EvalCorpus](https://kaskade.dwds.de/~moocow/software/dtaec/) (1780-1901).

### Demo Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers.generation import GenerationConfig

# Load the tokenizer and the fine-tuned model from the Hub
tokenizer = AutoTokenizer.from_pretrained("ybracke/transnormer-19c-beta-v01")
model = AutoModelForSeq2SeqLM.from_pretrained("ybracke/transnormer-19c-beta-v01")

# Start from the model's default generation settings and cap the output length
gen_cfg = GenerationConfig.from_model_config(model.config)
gen_cfg.max_new_tokens = 512

sentence = "Der Officier mußte ſich dazu setzen, man trank und ließ ſich’s wohl ſeyn."
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model.generate(**inputs, generation_config=gen_cfg)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# >>> ['Der Offizier musste sich dazusetzen, man trank und ließ sich es wohl sein.']
```

Here is how to use this model with the [pipeline API](https://huggingface.co/transformers/main_classes/pipelines.html):

```python
from transformers import pipeline

# The pipeline wraps tokenization, generation and decoding in a single call
transnormer = pipeline('text2text-generation', model='ybracke/transnormer-19c-beta-v01')
sentence = "Der Officier mußte ſich dazu setzen, man trank und ließ ſich’s wohl ſeyn."
print(transnormer(sentence))
# >>> [{'generated_text': 'Der Offizier musste sich dazusetzen, man trank und ließ sich es wohl sein.'}]
```

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.76

### Framework versions

- Transformers 4.31.0
- Pytorch 2.1.0+cu121
- Datasets 2.18.0
- Tokenizers 0.13.3
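
### Handling long inputs

The demo usage sets `max_new_tokens = 512`. Because ByT5 works on UTF-8 bytes rather than subword tokens, that budget corresponds to roughly 512 bytes of output, and characters such as `ſ` or `ß` take two bytes each. The following is a minimal sketch, not part of the model's own tooling, for checking the byte length of an input and splitting longer text into chunks that stay within the budget. The helper names `byte_length` and `split_into_chunks` are hypothetical, and splitting on whitespace is an assumption; splitting at sentence boundaries may give better normalizations.

```python
from typing import List


def byte_length(text: str) -> int:
    """Return the number of UTF-8 bytes in text.

    For a byte-level model such as ByT5 this approximates the token count
    (the tokenizer additionally appends one end-of-sequence token).
    """
    return len(text.encode("utf-8"))


def split_into_chunks(text: str, max_bytes: int = 512) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most max_bytes.

    Assumption: whitespace is an acceptable split point. A single word longer
    than max_bytes is kept whole and forms its own (oversized) chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in text.split():
        word_len = byte_length(word)
        # Account for the joining space when the chunk already has content
        added = word_len + (1 if current else 0)
        if current and current_len + added > max_bytes:
            # The current chunk is full: store it and start a new one
            chunks.append(" ".join(current))
            current = [word]
            current_len = word_len
        else:
            current.append(word)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


sentence = "Der Officier mußte ſich dazu setzen, man trank und ließ ſich’s wohl ſeyn."
print(len(sentence), "characters,", byte_length(sentence), "bytes")
print(split_into_chunks(sentence, max_bytes=40))
```

Each chunk can then be passed to `model.generate` or to the pipeline separately, and the normalized chunks joined with spaces afterwards.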