---
language: "no"
license: cc-by-4.0
tags:
- translation
datasets:
- oscar
widget:
- text: "Skriv inn en tekst som du ønsker å oversette til en annen målform."
---
# RECORD BLEU-SCORE 88.16 !!!

# 🇳🇴 Bokmål ⇔ Nynorsk 🇳🇴
Norwegian has two relatively similar written languages: Bokmål and Nynorsk. Historically, Nynorsk is a written norm based on dialects curated by the linguist Ivar Aasen in the mid-to-late 1800s, whereas Bokmål is a gradual 'Norwegization' of written Danish.
The two written languages are considered equal, and citizens have a right to receive public service information in their primary and preferred language. Even though this right has existed for a long time, only 5-10% of Norwegian texts are written in Nynorsk. Nynorsk is therefore a low-resource language within a low-resource language.

There are no working off-the-shelf machine-learning translation models for translating between the two languages.

|   |   |
|---|---|
| Widget | Try the widget in the top right corner |
| Hugging Face Spaces | Go to [mt5](https://huggingface.co/google/mt5-base) |

## Pretraining a T5-base
There is an [mt5](https://huggingface.co/google/mt5-base) model that includes Norwegian. Unfortunately, only a very small part of its training data is Nynorsk: mC4 contains only around 1GB of Nynorsk text. Despite this, mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway and matched it by material type (books, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md), and the total size is 19GB.
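
The matching step can be pictured roughly as follows. This is only a minimal sketch, not the project's actual pipeline: the function name `balance_by_material_type`, the `{"type", "text"}` document layout, and the use of character counts as the measure of "equal amount" are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_by_material_type(nynorsk_docs, bokmaal_docs, seed=0):
    """Pick Bokmål documents so each material type roughly matches the Nynorsk volume.

    Hypothetical helper: each document is a dict with "type" and "text" keys,
    and volume is measured in characters (an assumption, not the project's rule).
    """
    rng = random.Random(seed)

    # Character budget per material type, set by the Nynorsk side.
    budget = defaultdict(int)
    for doc in nynorsk_docs:
        budget[doc["type"]] += len(doc["text"])

    # Group the Bokmål candidates by material type.
    by_type = defaultdict(list)
    for doc in bokmaal_docs:
        by_type[doc["type"]].append(doc)

    selected = []
    for material_type, target in budget.items():
        candidates = by_type[material_type]
        rng.shuffle(candidates)
        taken = 0
        for doc in candidates:
            if taken >= target:
                break  # budget for this material type is filled
            selected.append(doc)
            taken += len(doc["text"])
    return selected


if __name__ == "__main__":
    # Toy data, invented purely to exercise the function.
    nynorsk = [{"type": "book", "text": "a" * 100},
               {"type": "newspaper", "text": "b" * 50}]
    bokmaal = ([{"type": "book", "text": "c" * 30} for _ in range(10)]
               + [{"type": "newspaper", "text": "d" * 25} for _ in range(10)])
    picked = balance_by_material_type(nynorsk, bokmaal)
    print(len(picked), "Bokmål documents selected")
```

Material types that have no Nynorsk text get a budget of zero, so no Bokmål documents of that type are selected.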

## Finetuning
Fine-tuning for 30 epochs with a learning rate of 7e-4, a batch size of 32, and a maximum source and target length of 512 reached a SacreBLEU score of 87.94 during training and a test score of **88.16** after training.
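
To give a feel for what these scores measure, here is a simplified corpus-level BLEU in plain Python. It is a sketch only and not the SacreBLEU implementation used for the numbers above: it assumes whitespace tokenization, a single reference per sentence, and no smoothing, so its values will not match SacreBLEU exactly. The function names and the example sentences are made up for illustration.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale (assumption: one reference each)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    if min(matches) == 0:
        return 0.0  # no smoothing in this sketch

    # Geometric mean of the 1- to max_n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    # Hand-written illustrative strings, not model output.
    reference = ["ho vil ikkje gje bort dei personlege dataa sine"]
    hypothesis = ["ho vil ikkje gje bort dei personlege dataa sine"]
    print(simple_corpus_bleu(hypothesis, reference))
```

An exact match scores 100, and any change to a word lowers every n-gram precision that covers it, which is why scores in the high 80s indicate output very close to the reference translations.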

## How to use the model
```python
# Set up the translation pipeline
from transformers import pipeline

translator = pipeline("translation", model="pere/nb-nn-translation")

# Translate a Bokmål sentence; the pipeline returns a list of dicts
text = "Hun vil ikke gi bort sine personlige data."
print(translator(text, max_length=255))
```