pere commited on
Commit
16faa53
·
1 Parent(s): a012eee

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +2 -0
README.md CHANGED
@@ -27,6 +27,8 @@ Apart from some word-list based engines, there are not any working off-the-shelf
27
  There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately a very small part of this is Nynorsk; there is only around 1GB Nynorsk text in mC4. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway, and matched it (by material type i.e. book, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
28
 
29
  ## Finetuning - BLEU-SCORE 88.17 🎉
 
 
30
  Training for [10] epochs with a learning rate of [7e-4], a batch size of [32] and a max source and target length of [512] fine tuning reached a SACREBLEU score of [88.03] at training and a test score of [**88.17**] after training.
31
 
32
  ## This is not a translator
 
27
  There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately a very small part of this is Nynorsk; there is only around 1GB Nynorsk text in mC4. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway, and matched it (by material type i.e. book, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
28
 
29
  ## Finetuning - BLEU-SCORE 88.17 🎉
30
+ The central finetuning data of the project have been 200k translation units (TU) i.e. aligned pairs of sentences in the respective languages extracted from textbooks of various subjects and newspapers.
31
+
32
  Training for [10] epochs with a learning rate of [7e-4], a batch size of [32] and a max source and target length of [512] fine tuning reached a SACREBLEU score of [88.03] at training and a test score of [**88.17**] after training.
33
 
34
  ## This is not a translator