pere
/

nb-nn-translation

Inference Endpoints

Model card Files Files and versions Community

pere commited on Jul 19, 2021

Commit

16faa53

·

1 Parent(s): a012eee

Update README.md

Files changed (1) hide show

README.md +2 -0

README.md CHANGED Viewed

@@ -27,6 +27,8 @@ Apart from some word-list based engines, there are not any working off-the-shelf
 There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately a very small part of this is Nynorsk; there is only around 1GB Nynorsk text in mC4. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway, and matched it (by material type i.e. book, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
 ## Finetuning - BLEU-SCORE 88.17 🎉
 Training for [10] epochs with a learning rate of [7e-4], a batch size of [32] and a max source and target length of [512] fine tuning reached a SACREBLEU score of [88.03] at training and a test score of [**88.17**] after training.
 ## This is not a translator

 There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately a very small part of this is Nynorsk; there is only around 1GB Nynorsk text in mC4. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway, and matched it (by material type i.e. book, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
 ## Finetuning - BLEU-SCORE 88.17 🎉
+The central finetuning data of the project have been 200k translation units (TU) i.e. aligned pairs of sentences in the respective languages extracted from textbooks of various subjects and newspapers.
 Training for [10] epochs with a learning rate of [7e-4], a batch size of [32] and a max source and target length of [512] fine tuning reached a SACREBLEU score of [88.03] at training and a test score of [**88.17**] after training.
 ## This is not a translator