pere committed on
Commit
2747052
1 Parent(s): 4bf062a

Update README.md

Files changed (1)
  1. README.md +4 -5
README.md CHANGED
@@ -8,7 +8,7 @@ datasets:
  widget:
  - text: "Skriv inn en tekst som du ønsker å oversette til en annen målform."
  ---
- # BLEU-SCORE 88.16 !!!
 
  # 🇳🇴 Bokmål ⇔ Nynorsk 🇳🇴
  Norwegian has two relatively similar written languages; Bokmål and Nynorsk. Historically Nynorsk is a written norm based on dialects curated by the linguist Ivar Aasen in the mid-to-late 1800s, whereas Bokmål is a gradual 'Norwegization' of written Danish.
@@ -18,16 +18,15 @@ For translating between the two languages, there are not any working off-the-she
  | | |
  |---|---|
  | Widget | Try the widget in the top right corner |
- | Huggingface Spaces | Go to [mt5](https://huggingface.co/google/mt5-base) |
- | Google Docs Add-on (waiting approval) | Watch Gif-demo |
  | | |
  ## Pretraining a T5-base
  There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately a very small part of this is Nynorsk; there is only around 1GB Nynorsk text in mC4. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway, and matched it (by material type i.e. book, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
 
  ## Finetuning
- Training for [30] epochs with a learning rate of [7e-4], a batch size of [32] and a max source and target length of [512] fine tuning reached a BLEU score of [87.94] at training and a test score of [88.16] after training. Considering the similarity of the two languages a high score is expected, however a score above 60 is usually taken as a high score.
- ![Add-on](bm2nn_demo.gif)
 
 
  ```python
  # Set up the pipeline
  from transformers import pipeline
 
  widget:
  - text: "Skriv inn en tekst som du ønsker å oversette til en annen målform."
  ---
+ # RECORD BLEU-SCORE 88.16 !!!
 
  # 🇳🇴 Bokmål ⇔ Nynorsk 🇳🇴
  Norwegian has two relatively similar written languages; Bokmål and Nynorsk. Historically Nynorsk is a written norm based on dialects curated by the linguist Ivar Aasen in the mid-to-late 1800s, whereas Bokmål is a gradual 'Norwegization' of written Danish.
 
  | | |
  |---|---|
  | Widget | Try the widget in the top right corner |
+ | Huggingface Spaces | Go to [mt5](https://huggingface.co/google/mt5-base) | |
 
  | | |
  ## Pretraining a T5-base
  There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately a very small part of this is Nynorsk; there is only around 1GB Nynorsk text in mC4. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway, and matched it (by material type i.e. book, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
 
  ## Finetuning
+ Training for [30] epochs with a learning rate of [7e-4], a batch size of [32] and a max source and target length of [512] fine tuning reached a SACREBLEU score of [87.94] at training and a test score of [**88.16**] after training.
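
The score above is a corpus-level BLEU figure. As a rough illustration of what such a number measures, here is a minimal sketch of corpus-level BLEU in plain Python. It is not the SacreBLEU implementation: it assumes whitespace tokenization, one reference per hypothesis and no smoothing, so its output would not match the scores reported here. The function names and the two example sentences are made up for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions (not SacreBLEU): whitespace tokenization, exactly one
    reference per hypothesis, no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams in the hypotheses, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp_tokens, n)
            ref_ngrams = ngram_counts(ref_tokens, n)
            # Counter intersection clips each count to the reference count.
            matches[n - 1] += sum((hyp_ngrams & ref_ngrams).values())
            totals[n - 1] += sum(hyp_ngrams.values())

    # Without smoothing, a single order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Hypothetical model output and reference translation, differing in one word.
hypotheses = ["Eg heiter Per og bur i Oslo"]
references = ["Eg heiter Per og bur i Bergen"]
print(round(corpus_bleu(hypotheses, references), 2))
```

Because the metric rewards overlapping word sequences, a single differing word in a short sentence already lowers the score noticeably, which is worth keeping in mind when reading scores in the high 80s.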
 
 
+ ## How to use the model
  ```python
  # Set up the pipeline
  from transformers import pipeline