metadata

language: 'no'
tags:
  - translation
widget:
  - text: >-
      moscow says deployments in eastern europe increase tensions at the same
      time nato says russia has moved troops to belarus
  - text: >-
      dette er en liten test som er laget av per egil kummervold han er en
      forsker som tidligere jobbet ved nasjonalbiblioteket
  - text: >-
      tirsdag var travel for ukrainas president volodymyr zelenskyj på morgenen
      tok han imot polens statsminister mateusz morawiecki
  - text: >-
      el presidente de estados unidos aprovecha su visita al país fronterizo con
      ucrania para reunirse con los ministros de defensa y exteriores en un
      encuentro con refugiados el mandatario calificó al líder ruso como
      carnicero 
license: cc-by-4.0

DeUnCaser

The output from Automatic Speech Recognition (ASR) software is usually uncased and lacks punctuation, which makes the text hard to read.

The DeUnCaser is a sequence-to-sequence model that reverses this process. It adds punctuation and capitalises the correct words. In some languages this means capitalising the start of sentences and all proper nouns; in others, such as German, it means capitalising the first letter of every noun. It also attempts to add hyphens and parentheses where this makes the meaning clearer.
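To make the task concrete, the sketch below applies the forward transformation that the model is meant to undo: it lowercases a cased, punctuated sentence and strips its punctuation, producing ASR-style text. The function names `uncase` and `make_pair` are hypothetical, as is the choice to keep apostrophes; the preprocessing actually used for this model may differ.

```python
import re
import unicodedata
from typing import Tuple

# Assumption: apostrophes are kept so that contractions stay in one piece.
KEEP = {"'"}


def uncase(text: str) -> str:
    """Lowercase the text and strip punctuation, mimicking raw ASR output."""
    lowered = text.lower()
    # Replace every Unicode punctuation character (category P*) with a space.
    stripped = "".join(
        " " if unicodedata.category(ch).startswith("P") and ch not in KEEP else ch
        for ch in lowered
    )
    # Collapse the whitespace runs left behind by removed punctuation.
    return re.sub(r"\s+", " ", stripped).strip()


def make_pair(cased: str) -> Tuple[str, str]:
    """Return an (input, target) pair: uncased text in, original text out."""
    return uncase(cased), cased


if __name__ == "__main__":
    source, target = make_pair("Hello, World! This is a (small) test.")
    print(source)
    print(target)
```

A model trained on such pairs learns the mapping from the first element back to the second.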

It is based on the multilingual T5 model. It was fine-tuned for 130,000 steps on a TPU v4-16 using T5X, starting from the mT5.1.1 pretrained model. The fine-tuning uses up to 1,000,000 training examples (or as many as exist in OSCAR) from each of the 42 Latin-alphabet languages that are part of both OSCAR and the mT5 training set: Afrikaans, Albanian, Basque, Catalan, Cebuano, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, German, Hungarian, Icelandic, Indonesian, Irish, Italian, Kurdish, Latin, Latvian, Lithuanian, Luxembourgish, Malagasy, Malay, Maltese, Norwegian Bokmål, Norwegian Nynorsk, Polish, Portuguese, Romanian, Slovak, Spanish, Swahili, Swedish, Turkish, Uzbek, Vietnamese, Welsh, West Frisian.
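The per-language cap described above can be sketched as follows. This is only an illustration: `build_corpus` and its arguments are hypothetical names, and it simply takes the first examples from each language stream, whereas the actual corpus may have been sampled differently.

```python
from itertools import islice
from typing import Dict, Iterable, List

# Cap stated in the text: at most 1,000,000 examples per language.
MAX_EXAMPLES_PER_LANGUAGE = 1_000_000


def build_corpus(
    streams: Dict[str, Iterable[str]],
    limit: int = MAX_EXAMPLES_PER_LANGUAGE,
) -> Dict[str, List[str]]:
    """Take up to `limit` examples per language, or all of them if fewer exist."""
    return {lang: list(islice(stream, limit)) for lang, stream in streams.items()}


if __name__ == "__main__":
    # Toy streams standing in for per-language text sources.
    toy = {"da": iter(["a", "b", "c"]), "cy": iter(["x"])}
    print(build_corpus(toy, limit=2))
```

Languages with fewer examples than the cap contribute everything they have, so low-resource languages end up under-represented relative to the larger ones.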

A Notebook for creating the training corpus is available here.