---
license: cc-by-2.5
language:
- lt
- en
datasets:
- scoris/en-lt-merged-data
---
# Overview
![Scoris logo](https://scoris.lt/logo_smaller.png)
This is a Lithuanian-English translation model.

Original model: [Helsinki-NLP/opus-mt-tc-big-lt-en](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-lt-en)

Fine-tuned on a large merged dataset: [scoris/en-lt-merged-data](https://huggingface.co/datasets/scoris/en-lt-merged-data) (5.4 million sentence pairs)



For English-Lithuanian translation, see the companion model [scoris/scoris-mt-en-lt](https://huggingface.co/scoris/scoris-mt-en-lt)

Trained for 3 epochs.

Made by the [Scoris](https://scoris.lt) team

# Evaluation
| LT-EN| BLEU |
|-|------|
| scoris/scoris-mt-lt-en| 43.8 |
| Helsinki-NLP/opus-mt-tc-big-lt-en| 36.8 |
| Google Translate| 31.9 |
| DeepL| 36.1 |

_Evaluated on the scoris/en-lt-merged-data validation set. Google Translate and DeepL were evaluated on a random sample of 1,000 sentence pairs._

According to [Google](https://cloud.google.com/translate/automl/docs/evaluate), BLEU scores can be interpreted as follows:

| BLEU Score | Interpretation
|----------|---------|
| < 10 | Almost useless
| 10 - 19 | Hard to get the gist
| 20 - 29 | The gist is clear, but has significant grammatical errors
| 30 - 40 | Understandable to good translations
| **40 - 50** | **High quality translations**
| 50 - 60 | Very high quality, adequate, and fluent translations
| > 60 | Quality often better than human
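To make the table easier to apply to the scores reported above, here is a minimal sketch of a lookup helper. The function name `interpret_bleu` is hypothetical, and the handling of boundary values is an assumption: the table's ranges overlap at 40, 50 and 60, so this sketch assigns a boundary score to the higher band.

```python
from bisect import bisect_right

# Lower bounds of each band after the first, taken from the table above.
# Assumption: a score exactly on a boundary falls into the higher band.
_THRESHOLDS = [10, 20, 30, 40, 50, 60]
_LABELS = [
    "Almost useless",
    "Hard to get the gist",
    "The gist is clear, but has significant grammatical errors",
    "Understandable to good translations",
    "High quality translations",
    "Very high quality, adequate, and fluent translations",
    "Quality often better than human",
]


def interpret_bleu(score: float) -> str:
    """Map a BLEU score on the 0-100 scale to its interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError("BLEU score must be between 0 and 100")
    # bisect_right counts how many thresholds are <= score,
    # which is exactly the index of the matching label.
    return _LABELS[bisect_right(_THRESHOLDS, score)]


if __name__ == "__main__":
    # Scores from the evaluation table in this model card.
    for name, score in [("scoris/scoris-mt-lt-en", 43.8), ("Google Translate", 31.9)]:
        print(f"{name}: {score} -> {interpret_bleu(score)}")
```

With this mapping, the 43.8 reported for this model falls in the "High quality translations" band, while the other three systems in the evaluation table fall in "Understandable to good translations".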

# Usage
You can use the model in the following way:
```python
from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on Hugging Face Model Hub
model_name = "scoris/scoris-mt-lt-en"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Kartą, senų senovėje, buvo viena mergaitė ir gyveno ji su savo mama mažoje jaukioje trobelėje prie miško. ",
    "Mergaitę žmonės vadino Raudonkepuraite, nes ji dažnai dėvėdavo raudoną apsiaustėlį su kapišonu. ",
    "Mergaitė mielai gobdavosi šiuo apsiaustėliu, nes jį buvo gavusi iš savo močiutės, kuri gyveno namelyje už miško ir labai mylėjo Raudonkepuraitę. ",
    "Vieną dieną mama priruošė Raudonkepuraitei pilną krepšelį įvairiausių gėrybių.",
    "Pridėjo obuoliukų, kriaušaičių, braškių, taip pat skanių pyragėlių, kuriuos pati buvo iškepusi, sūrio ir gabalėlį mėsos bei didelį išdabintą tortą."
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

#Once upon a time there was a girl, and she lived with her mother in a small cozy hut by the forest.
#The girl was called the Red cape because she often wore a red cape.
#The girl would gladly wear this coat, because she had it from her grandmother, who lived in a house outside the forest and loved Redcape very much.
#One day my mother prepared a basket full of all kinds of good things for the Red cape.
#He added apples, pears, strawberries, as well as delicious cakes that he had baked, cheese and a piece of meat, and a large cake.
```
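The snippet above translates all sentences in a single call, which may use a lot of memory for longer inputs. As a rough sketch, the sentences can be split into fixed-size batches instead. The helper names `batched` and `translate_in_batches` and the default batch size of 8 are assumptions for illustration, not part of the model or of `transformers`; the translation step is passed in as a function so the batching logic can be tried without loading the model.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    sentences: Sequence[str],
    translate_fn: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # hypothetical default, tune for your hardware
) -> List[str]:
    """Apply `translate_fn` batch by batch, preserving the input order."""
    results: List[str] = []
    for batch in batched(sentences, batch_size):
        results.extend(translate_fn(batch))
    return results


# With the tokenizer and model loaded as in the example above,
# `translate_fn` could look like this (untested here, shown as comments):
#
# def translate_fn(batch):
#     inputs = tokenizer(batch, return_tensors="pt", padding=True)
#     outputs = model.generate(**inputs)
#     return tokenizer.batch_decode(outputs, skip_special_tokens=True)

if __name__ == "__main__":
    # Stand-in "translator" that upper-cases text, to exercise the batching only.
    print(translate_in_batches(["labas", "rytas", "vakaras"], lambda b: [s.upper() for s in b], batch_size=2))
```

Passing the translation step as a callable keeps the batching independent of the model, so the same helper works with either translation direction.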