pere commited on
Commit
4bf062a
1 Parent(s): 875edb4

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +20 -24
README.md CHANGED
@@ -8,34 +8,30 @@ datasets:
8
  widget:
9
  - text: "Skriv inn en tekst som du ønsker å oversette til en annen målform."
10
  ---
11
- # Norwegian mT5 - Translation Bokmål Nynorsk
12
-
13
- ## Description
14
-
15
- This is a sample reference model.
16
-
17
- Here is an example of how to use the model from Python
18
- ```python
19
- # Import libraries
20
- from transformers import T5ForConditionalGeneration, AutoTokenizer
21
- model = T5ForConditionalGeneration.from_pretrained('andrek/nb2nn',from_flax=True)
22
- tokenizer = AutoTokenizer.from_pretrained('andrek/nb2nn')
23
- #Encode the text
24
- text = "Hun vil ikke gi bort sine personlige data."
25
- inputs = tokenizer.encode(text, return_tensors="pt")
26
- outputs = model.generate(inputs, max_length=255, num_beams=4, early_stopping=True)
27
-
28
- #Decode and print the result
29
- print(tokenizer.decode(outputs[0]))
30
-
31
- ```
32
-
33
- Or if you like to use the pipeline instead
34
 
35
  ```python
36
  # Set up the pipeline
37
  from transformers import pipeline
38
- translator = pipeline("translation", model='andrek/nb2nn')
39
 
40
  # Do the translation
41
  text = "Hun vil ikke gi bort sine personlige data."
 
8
  widget:
9
  - text: "Skriv inn en tekst som du ønsker å oversette til en annen målform."
10
  ---
11
+ # BLEU-SCORE 88.16 !!!
12
+
13
+ # 🇳🇴 Bokmål ⇔ Nynorsk 🇳🇴
14
+ Norwegian has two relatively similar written languages; Bokmål and Nynorsk. Historically Nynorsk is a written norm based on dialects curated by the linguist Ivar Aasen in the mid-to-late 1800s, whereas Bokmål is a gradual 'Norwegization' of written Danish.
15
+ The two written languages are considered equal and citizens have a right to receive public service information in their primary and prefered language. Even though this right has been around for a long time only between 5-10% of Norwegian texts are written in Nynorsk. Nynorsk is therefore a low-resource language within a low-resource language.
16
+
17
+ For translating between the two languages, there are not any working off-the-shelf machine learning-based translation models.
18
+ | | |
19
+ |---|---|
20
+ | Widget | Try the widget in the top right corner |
21
+ | Huggingface Spaces | Go to [mt5](https://huggingface.co/google/mt5-base) |
22
+ | Google Docs Add-on (waiting approval) | Watch Gif-demo |
23
+ | | |
24
+ ## Pretraining a T5-base
25
+ There is an [mt5](https://huggingface.co/google/mt5-base) that includes Norwegian. Unfortunately a very small part of this is Nynorsk; there is only around 1GB Nynorsk text in mC4. Despite this, the mt5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway, and matched it (by material type i.e. book, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md) and the total size is 19GB.
26
+
27
+ ## Finetuning
28
+ Training for [30] epochs with a learning rate of [7e-4], a batch size of [32] and a max source and target length of [512] fine tuning reached a BLEU score of [87.94] at training and a test score of [88.16] after training. Considering the similarity of the two languages a high score is expected, however a score above 60 is usually taken as a high score.
29
+ ![Add-on](bm2nn_demo.gif)
 
 
 
 
30
 
31
  ```python
32
  # Set up the pipeline
33
  from transformers import pipeline
34
+ translator = pipeline("translation", model='pere/nb-nn-translation')
35
 
36
  # Do the translation
37
  text = "Hun vil ikke gi bort sine personlige data."