Update README.md
README.md
CHANGED
@@ -8,34 +8,30 @@ datasets:
|
|
8 |
widget:
|
9 |
- text: "Skriv inn en tekst som du ønsker å oversette til en annen målform."
|
10 |
---
|
11 |
-
#
|
12 |
-
|
13 |
-
|
14 |
-
|
15 |
-
|
16 |
-
|
17 |
-
|
18 |
-
|
19 |
-
|
20 |
-
|
21 |
-
|
22 |
-
|
23 |
-
|
24 |
-
|
25 |
-
|
26 |
-
|
27 |
-
|
28 |
-
|
29 |
-
|
30 |
-
|
31 |
-
```
|
32 |
-
|
33 |
-
Or if you like to use the pipeline instead
|
34 |
|
35 |
```python
|
36 |
# Set up the pipeline
|
37 |
from transformers import pipeline
|
38 |
-
translator = pipeline("translation", model='
|
39 |
|
40 |
# Do the translation
|
41 |
text = "Hun vil ikke gi bort sine personlige data."
|
|
|
8 |
widget:
|
9 |
- text: "Skriv inn en tekst som du ønsker å oversette til en annen målform."
|
10 |
---
|
11 |
+
# BLEU-SCORE 88.16 !!!
|
12 |
+
|
13 |
+
# 🇳🇴 Bokmål ⇔ Nynorsk 🇳🇴
|
14 |
+
Norwegian has two relatively similar written languages: Bokmål and Nynorsk. Historically, Nynorsk is a written norm based on dialects curated by the linguist Ivar Aasen in the mid-to-late 1800s, whereas Bokmål is a gradual 'Norwegization' of written Danish.
|
15 |
+
The two written languages are considered equal, and citizens have a right to receive public service information in their primary and preferred language. Even though this right has existed for a long time, only 5-10% of Norwegian texts are written in Nynorsk. Nynorsk is therefore a low-resource language within a low-resource language.
|
16 |
+
|
17 |
+
No working off-the-shelf machine learning-based translation models exist for translating between the two languages.
|
18 |
+
| | |
|
19 |
+
|---|---|
|
20 |
+
| Widget | Try the widget in the top right corner |
|
21 |
+
| Hugging Face Spaces | Go to [mt5](https://huggingface.co/google/mt5-base) |
|
22 |
+
| Google Docs Add-on (awaiting approval) | Watch GIF demo |
|
23 |
+
| | |
|
24 |
+
## Pretraining a T5-base
|
25 |
+
There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately, only a very small part of its training data is Nynorsk: mC4 contains only around 1GB of Nynorsk text. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway and matched it (by material type, i.e. books, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
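The matching step described above can be illustrated with a small sketch. It is not the project's actual pipeline: the function name `balance_corpus`, the `"nn"`/`"nb"` language codes, the tuple layout and the choice to match by document count (rather than by text size) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_corpus(docs, seed=0):
    """Keep every Nynorsk document and sample as many Bokmål documents per material type.

    docs: iterable of (lang, material_type, text) tuples.
    The "nn"/"nb" codes and matching by document count are illustrative assumptions.
    """
    rng = random.Random(seed)
    groups = defaultdict(lambda: {"nn": [], "nb": []})
    for lang, material_type, text in docs:
        if lang in ("nn", "nb"):
            groups[material_type][lang].append(text)

    balanced = []
    for material_type in sorted(groups):
        nynorsk = groups[material_type]["nn"]
        bokmaal = groups[material_type]["nb"]
        # We cannot sample more Bokmål documents than exist for this type.
        k = min(len(nynorsk), len(bokmaal))
        balanced += [("nn", material_type, t) for t in nynorsk]
        balanced += [("nb", material_type, t) for t in rng.sample(bokmaal, k)]
    return balanced


# Toy input: Bokmål is over-represented in both material types.
docs = [
    ("nn", "book", "nn book 1"),
    ("nb", "book", "nb book 1"),
    ("nb", "book", "nb book 2"),
    ("nb", "book", "nb book 3"),
    ("nn", "newspaper", "nn news 1"),
    ("nn", "newspaper", "nn news 2"),
    ("nb", "newspaper", "nb news 1"),
    ("nb", "newspaper", "nb news 2"),
    ("nb", "newspaper", "nb news 3"),
]
balanced = balance_corpus(docs)
```

Sampling within each material type keeps the genre mix of the Bokmål half close to that of the Nynorsk half, which is the point of matching by type rather than sampling Bokmål globally.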
|
26 |
+
|
27 |
+
## Finetuning
|
28 |
+
With 30 epochs of training, a learning rate of 7e-4, a batch size of 32 and a maximum source and target length of 512, fine-tuning reached a BLEU score of 87.94 during training and a test score of 88.16 after training. Given the similarity of the two languages, a high score is expected; for comparison, a BLEU score above 60 is usually considered high.
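For reference, the hyperparameters stated above can be collected in one place. The sketch below is only a hypothetical container: the class name `FinetuneConfig` and its field names are assumptions, not the project's training script, and would need to be mapped onto whatever trainer is actually used.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hypothetical container for the fine-tuning values reported in this README."""

    num_train_epochs: int = 30
    learning_rate: float = 7e-4
    batch_size: int = 32
    max_source_length: int = 512
    max_target_length: int = 512


config = FinetuneConfig()
# Print the settings as a plain dict, e.g. for logging alongside a run.
print(asdict(config))
```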
|
29 |
+
![Add-on](bm2nn_demo.gif)
|
30 |
|
31 |
```python
|
32 |
# Set up the pipeline
|
33 |
from transformers import pipeline
|
34 |
+
translator = pipeline("translation", model='pere/nb-nn-translation')
|
35 |
|
36 |
# Do the translation
|
37 |
text = "Hun vil ikke gi bort sine personlige data."
|