|
--- |
|
license: cc-by-2.5 |
|
language: |
|
- lt |
|
- en |
|
datasets: |
|
- scoris/en-lt-merged-data |
|
--- |
|
# Overview |
|
![Scoris logo](https://scoris.lt/logo_smaller.png) |
|
This is a Lithuanian-English translation model (Seq2Seq). For English-Lithuanian translation, see the companion model [scoris/scoris-mt-en-lt](https://huggingface.co/scoris/scoris-mt-en-lt).
|
|
|
Original model: [Helsinki-NLP/opus-mt-tc-big-lt-en](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-lt-en) |
|
|
|
Fine-tuned on a large merged dataset: [scoris/en-lt-merged-data](https://huggingface.co/datasets/scoris/en-lt-merged-data) (5.4 million sentence pairs).
|
|
|
|
|
Trained for 3 epochs.
|
|
|
Made by the [Scoris](https://scoris.lt) team.
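
The exact fine-tuning recipe is not published here, so the following is only a minimal sketch of how a comparable run could be set up with the Hugging Face `Seq2SeqTrainer`. The dataset column names (`lt`, `en`), the split name, the batch size, and the learning rate are assumptions for illustration; only the base model, the dataset, and the 3 epochs come from the description above.

```python
# Illustrative sketch only: hyperparameters and dataset column names are assumptions.
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "Helsinki-NLP/opus-mt-tc-big-lt-en"
tokenizer = MarianTokenizer.from_pretrained(base_model)
model = MarianMTModel.from_pretrained(base_model)

# Assumed split and column names; adjust to the dataset's actual schema.
dataset = load_dataset("scoris/en-lt-merged-data", split="train")

def preprocess(batch):
    inputs = tokenizer(batch["lt"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["en"], truncation=True, max_length=512)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="scoris-mt-lt-en",
    num_train_epochs=3,              # matches the 3 epochs stated above
    per_device_train_batch_size=16,  # illustrative value
    learning_rate=2e-5,              # illustrative value
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```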
|
|
|
# Evaluation
|
| System (LT-EN) | BLEU |
|----------------|------|
| scoris/scoris-mt-lt-en | 43.8 |
| Helsinki-NLP/opus-mt-tc-big-lt-en | 36.8 |
| Google Translate | 31.9 |
| DeepL | 36.1 |
|
|
|
_Evaluated on the scoris/en-lt-merged-data validation set. Google Translate and DeepL were evaluated on a random sample of 1,000 sentence pairs._
|
|
|
According to [Google](https://cloud.google.com/translate/automl/docs/evaluate), BLEU scores can be interpreted as follows:
|
|
|
| BLEU Score | Interpretation |
|------------|----------------|
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| **40 - 50** | **High quality translations** |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |
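
A minimal sketch of how a comparable BLEU score could be computed with `sacrebleu` is shown below. The scoring tool actually used for the table above is not stated here, and the two sentence pairs are placeholders standing in for the validation set.

```python
# Minimal BLEU evaluation sketch; sacrebleu is an assumption, and the
# sentence pairs below are placeholders, not the real validation data.
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

model_name = "scoris/scoris-mt-lt-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

lt_sources = ["Labas rytas.", "Kaip tau sekasi?"]          # placeholder source sentences
en_references = [["Good morning.", "How are you doing?"]]  # one reference stream

batch = tokenizer(lt_sources, return_tensors="pt", padding=True)
outputs = model.generate(**batch)
hypotheses = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# corpus_bleu takes the hypotheses plus a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, en_references)
print(f"BLEU: {bleu.score:.1f}")
```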
|
|
|
# Usage |
|
You can use the model in the following way: |
|
```python
from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on the Hugging Face Model Hub
model_name = "scoris/scoris-mt-lt-en"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Kartą, senų senovėje, buvo viena mergaitė ir gyveno ji su savo mama mažoje jaukioje trobelėje prie miško. ",
    "Mergaitę žmonės vadino Raudonkepuraite, nes ji dažnai dėvėdavo raudoną apsiaustėlį su kapišonu. ",
    "Mergaitė mielai gobdavosi šiuo apsiaustėliu, nes jį buvo gavusi iš savo močiutės, kuri gyveno namelyje už miško ir labai mylėjo Raudonkepuraitę. ",
    "Vieną dieną mama priruošė Raudonkepuraitei pilną krepšelį įvairiausių gėrybių.",
    "Pridėjo obuoliukų, kriaušaičių, braškių, taip pat skanių pyragėlių, kuriuos pati buvo iškepusi, sūrio ir gabalėlį mėsos bei didelį išdabintą tortą."
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Example output:
# Once upon a time there was a girl, and she lived with her mother in a small cozy hut by the forest.
# The girl was called the Red cape because she often wore a red cape.
# The girl would gladly wear this coat, because she had it from her grandmother, who lived in a house outside the forest and loved Redcape very much.
# One day my mother prepared a basket full of all kinds of good things for the Red cape.
# He added apples, pears, strawberries, as well as delicious cakes that he had baked, cheese and a piece of meat, and a large cake.
```
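
For quick experiments, the same checkpoint can also be loaded through the high-level `pipeline` API; this is a minimal alternative sketch, and generation defaults may differ slightly from the explicit `generate` call above.

```python
from transformers import pipeline

# Load the checkpoint through the translation pipeline
translator = pipeline("translation", model="scoris/scoris-mt-lt-en")

result = translator("Kartą, senų senovėje, buvo viena mergaitė.")
print(result[0]["translation_text"])
```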