Update README.md
README.md (CHANGED)
@@ -26,8 +26,8 @@ Apart from some word-list based engines, there are not any working off-the-shelf
## Pretraining a T5-base

There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately, only a very small part of this is Nynorsk; there is only around 1GB of Nynorsk text in mC4. Despite this, mT5 still achieves a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway and matched it (by material type, i.e. books, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md); its total size is 19GB.

## Finetuning - BLEU-SCORE 88.17 🎉

Training for [10] epochs with a learning rate of [7e-4], a batch size of [32], and a max source and target length of [512], fine-tuning reached a SacreBLEU score of [88.03] during training and a test score of [**88.17**] after training.
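For reference, here is a minimal sketch of what such a fine-tuning run could look like with the Hugging Face `transformers` Seq2Seq trainer. The checkpoint name, CSV files, and column names (`nb`/`nn`) are illustrative assumptions, not the exact training setup used in this project; only the hyperparameters (10 epochs, 7e-4, batch size 32, max length 512) come from the paragraph above.

```python
# Hypothetical fine-tuning sketch; model name and data files are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "north/t5_base_NCC"  # assumption: any T5-style checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Assumption: a parallel corpus with "nb" (Bokmål) and "nn" (Nynorsk) columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def preprocess(batch):
    # Tokenize source and target with the max length quoted in the README.
    inputs = tokenizer(batch["nb"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["nn"], max_length=512, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=["nb", "nn"])

args = Seq2SeqTrainingArguments(
    output_dir="nb-nn-t5",
    num_train_epochs=10,
    learning_rate=7e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    predict_with_generate=True,  # run generation during eval, as SacreBLEU needs
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    # A compute_metrics hook with sacrebleu would be added to reproduce the
    # scores quoted above; it is omitted here for brevity.
)
trainer.train()
```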
## This is not a translator

We found that we could get an almost identical BLEU score by training in both directions and letting the model decide whether the input is Bokmål or Nynorsk. This way we can train one model instead of two. We call it a language switcher.
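A minimal sketch of how such a switcher could be called at inference time; the checkpoint path reuses the hypothetical output directory from the fine-tuning sketch above. Note that no direction prefix is passed, since the model is trained to detect the input variant and generate the other one.

```python
# Hypothetical usage of the "language switcher": one model, both directions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nb-nn-t5")
model = AutoModelForSeq2SeqLM.from_pretrained("nb-nn-t5")

def switch(text: str) -> str:
    # No language tag is given: the model infers whether the input is
    # Bokmål or Nynorsk and generates the other written standard.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    output = model.generate(**inputs, max_length=512)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(switch("Dette er en setning på bokmål."))   # -> Nynorsk
print(switch("Dette er ei setning på nynorsk."))  # -> Bokmål
```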