Update README.md
README.md (CHANGED)
@@ -26,8 +26,8 @@ Apart from some word-list based engines, there are not any working off-the-shelf
## Pretraining a T5-base

There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately, only a very small part of this is Nynorsk; there is only around 1GB of Nynorsk text in mC4. Despite this, mT5 still achieves a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway and matched it (by material type, i.e. books, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md); its total size is 19GB.

## Finetuning - BLEU-SCORE 88.17 🎉

Training for [10] epochs with a learning rate of [7e-4], a batch size of [32], and a max source and target length of [512], fine-tuning reached a SacreBLEU score of [88.03] during training and a test score of [**88.17**] after training.
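For reference, here is a minimal sketch of what such a fine-tuning run could look like with the Hugging Face `transformers` Seq2Seq trainer. The checkpoint name, CSV files, and column names (`nb`/`nn`) are illustrative assumptions, not the exact training setup used in this project; only the hyperparameters (10 epochs, 7e-4, batch size 32, max length 512) come from the paragraph above.

```python
# Hypothetical fine-tuning sketch; model name and data files are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "north/t5_base_NCC"  # assumption: any T5-style checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Assumption: a parallel corpus with "nb" (Bokmål) and "nn" (Nynorsk) columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def preprocess(batch):
    # Tokenize source and target with the max length quoted in the README.
    inputs = tokenizer(batch["nb"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["nn"], max_length=512, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["nb", "nn"])

args = Seq2SeqTrainingArguments(
    output_dir="nb-nn-t5",
    num_train_epochs=10,
    learning_rate=7e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    predict_with_generate=True,  # run generation during eval, as SacreBLEU needs
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    # A compute_metrics hook with sacrebleu would be added to reproduce the
    # scores quoted above; it is omitted here for brevity.
)
trainer.train()
```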
## This is not a translator

We found that we could get an almost identical BLEU score by training in both directions and letting the model decide whether the input is Bokmål or Nynorsk. This way we can train one model instead of two. We call it a language switcher.
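A minimal sketch of how such a switcher could be called at inference time; the checkpoint path reuses the hypothetical output directory from the fine-tuning sketch above. Note that no direction prefix is passed, since the model is trained to detect the input variant and generate the other one.

```python
# Hypothetical usage of the "language switcher": one model, both directions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nb-nn-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("nb-nn-t5")

def switch(text: str) -> str:
    # No language tag is given: the model infers whether the input is
    # Bokmål or Nynorsk and generates the other written standard.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    output = model.generate(**inputs, max_length=512)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(switch("Dette er en setning på bokmål."))   # -> Nynorsk
print(switch("Dette er ei setning på nynorsk."))  # -> Bokmål
```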