pere/nb-nn-translation

pere committed on Jul 18, 2021

Commit

4bf062a

1 Parent(s): 875edb4

Update README.md


Files changed (1)

README.md +20 -24

README.md CHANGED

@@ -8,34 +8,30 @@ datasets:
 widget:
 - text: "Skriv inn en tekst som du ønsker å oversette til en annen målform."
 ---
-# Norwegian mT5 - Translation Bokmål Nynorsk
-## Description
-This is a sample reference model.
-Here is an example of how to use the model from Python
-```python
-# Import libraries
-from transformers import T5ForConditionalGeneration, AutoTokenizer
-model = T5ForConditionalGeneration.from_pretrained('andrek/nb2nn',from_flax=True)
-tokenizer = AutoTokenizer.from_pretrained('andrek/nb2nn')
-#Encode the text
-text = "Hun vil ikke gi bort sine personlige data."
-inputs = tokenizer.encode(text, return_tensors="pt")
-outputs = model.generate(inputs, max_length=255, num_beams=4, early_stopping=True)
-#Decode and print the result
-print(tokenizer.decode(outputs[0]))
-```
-Or, if you would rather use the pipeline:
 ```python
 # Set up the pipeline
 from transformers import pipeline
-translator = pipeline("translation", model='andrek/nb2nn')
 # Do the translation
 text = "Hun vil ikke gi bort sine personlige data."

 widget:
 - text: "Skriv inn en tekst som du ønsker å oversette til en annen målform."
 ---
+# BLEU-SCORE 88.16 !!!
+# 🇳🇴 Bokmål ⇔ Nynorsk 🇳🇴
+Norwegian has two relatively similar written languages: Bokmål and Nynorsk. Historically, Nynorsk is a written norm based on dialects curated by the linguist Ivar Aasen in the mid-to-late 1800s, whereas Bokmål is a gradual 'Norwegianization' of written Danish.
+The two written languages are considered equal, and citizens have a right to receive public service information in their primary and preferred language. Even though this right has existed for a long time, only 5-10% of Norwegian texts are written in Nynorsk. Nynorsk is therefore a low-resource language within a low-resource language.
+There are no working off-the-shelf machine-learning-based models for translating between the two languages.
+|   |   |
+|---|---|
+| Widget                                | Try the widget in the top right corner |
+| Hugging Face Spaces                   | Go to [mt5](https://huggingface.co/google/mt5-base)                           |
+| Google Docs Add-on (waiting approval) | Watch the GIF demo                     |
+|   |   |
+## Pretraining a T5-base
+There is an [mT5](https://huggingface.co/google/mt5-base) model that includes Norwegian. Unfortunately, only a very small part of its training data is Nynorsk; there is only around 1GB of Nynorsk text in mC4. Despite this, mT5 also gives a BLEU score above 80. During the project we extracted all available Nynorsk text from the [Norwegian Colossal Corpus](https://github.com/NBAiLab/notram/blob/master/guides/corpus_v2_summary.md) at the National Library of Norway and matched it (by material type, i.e. books, newspapers and so on) with an equal amount of Bokmål. The corpus collection is described [here](https://github.com/NBAiLab/notram/blob/master/guides/nb_nn_balanced_corpus.md), and the total size is 19GB.
+## Finetuning
+Training for [30] epochs with a learning rate of [7e-4], a batch size of [32] and a max source and target length of [512], fine-tuning reached a BLEU score of [87.94] during training and a test score of [88.16] after training. Considering the similarity of the two languages, a high score is expected; for comparison, a score above 60 is usually considered high.
+![Add-on](bm2nn_demo.gif)
 ```python
 # Set up the pipeline
 from transformers import pipeline
+translator = pipeline("translation", model='pere/nb-nn-translation')
 # Do the translation
 text = "Hun vil ikke gi bort sine personlige data."
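
The diff context cuts the pipeline snippet off after the `text = ...` line. The sketch below shows one way the call might be completed. It assumes the standard `transformers` translation pipeline, which returns a list of dicts with a `translation_text` key, and it reuses `max_length=255` from the removed `generate` example. The helper names `load_translator` and `translate` are hypothetical and do not appear in the README.

```python
def load_translator(model_id="pere/nb-nn-translation"):
    """Build the translation pipeline for the model named in the README."""
    # Imported lazily so translate() below can be used without transformers.
    from transformers import pipeline
    return pipeline("translation", model=model_id)


def translate(translator, text, max_length=255):
    """Run one text through a translation pipeline and return the string."""
    # Assumption: the pipeline returns [{"translation_text": "..."}].
    result = translator(text, max_length=max_length)
    return result[0]["translation_text"]


# Example usage (downloads the model weights on first call):
#   translator = load_translator()
#   print(translate(translator, "Hun vil ikke gi bort sine personlige data."))
```

Keeping the pipeline construction separate from the call means the model is loaded once and can be reused for many sentences.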