|
--- |
|
license: cc-by-2.5 |
|
language: |
|
- lt |
|
- en |
|
datasets: |
|
- scoris/en-lt-merged-data |
|
--- |
|
# Overview |
|
![Scoris logo](https://scoris.lt/logo_smaller.png) |
|
This is a Lithuanian-English translation model (Seq2Seq). For English-Lithuanian translation, see the companion model [scoris/scoris-mt-en-lt](https://huggingface.co/scoris/scoris-mt-en-lt).
|
|
|
Original model: [Helsinki-NLP/opus-mt-tc-big-lt-en](https://huggingface.co/Helsinki-NLP/opus-mt-tc-big-lt-en) |
|
|
|
Fine-tuned on a large merged dataset: [scoris/en-lt-merged-data](https://huggingface.co/datasets/scoris/en-lt-merged-data) (5.4 million sentence pairs).
|
|
|
|
|
Trained for 3 epochs.
|
|
|
Made by the [Scoris](https://scoris.lt) team.
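
The exact fine-tuning recipe is not published here, so the following is only a minimal sketch of how a comparable run could be set up with the Hugging Face `Seq2SeqTrainer`. The dataset column names (`lt`, `en`), the split name, the batch size, and the learning rate are assumptions for illustration; only the base model, the dataset, and the 3 epochs come from the description above.

```python
# Illustrative sketch only: hyperparameters and dataset column names are assumptions.
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model = "Helsinki-NLP/opus-mt-tc-big-lt-en"
tokenizer = MarianTokenizer.from_pretrained(base_model)
model = MarianMTModel.from_pretrained(base_model)

# Assumed split and column names; adjust to the dataset's actual schema.
dataset = load_dataset("scoris/en-lt-merged-data", split="train")

def preprocess(batch):
    inputs = tokenizer(batch["lt"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["en"], truncation=True, max_length=512)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="scoris-mt-lt-en",
    num_train_epochs=3,              # matches the 3 epochs stated above
    per_device_train_batch_size=16,  # illustrative value
    learning_rate=2e-5,              # illustrative value
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```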
|
|
|
# Evaluation
|
| System (LT-EN) | BLEU |
|----------------|------|
| scoris/scoris-mt-lt-en | 43.8 |
| Helsinki-NLP/opus-mt-tc-big-lt-en | 36.8 |
| Google Translate | 31.9 |
| DeepL | 36.1 |
|
|
|
_Evaluated on the scoris/en-lt-merged-data validation set. Google Translate and DeepL were evaluated on a random sample of 1,000 sentence pairs._
|
|
|
According to [Google](https://cloud.google.com/translate/automl/docs/evaluate), BLEU scores can be interpreted as follows:
|
|
|
| BLEU Score | Interpretation |
|------------|----------------|
| < 10 | Almost useless |
| 10 - 19 | Hard to get the gist |
| 20 - 29 | The gist is clear, but has significant grammatical errors |
| 30 - 40 | Understandable to good translations |
| **40 - 50** | **High quality translations** |
| 50 - 60 | Very high quality, adequate, and fluent translations |
| > 60 | Quality often better than human |
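
A minimal sketch of how a comparable BLEU score could be computed with `sacrebleu` is shown below. The scoring tool actually used for the table above is not stated here, and the two sentence pairs are placeholders standing in for the validation set.

```python
# Minimal BLEU evaluation sketch; sacrebleu is an assumption, and the
# sentence pairs below are placeholders, not the real validation data.
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

model_name = "scoris/scoris-mt-lt-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

lt_sources = ["Labas rytas.", "Kaip tau sekasi?"]          # placeholder source sentences
en_references = [["Good morning.", "How are you doing?"]]  # one reference stream

batch = tokenizer(lt_sources, return_tensors="pt", padding=True)
outputs = model.generate(**batch)
hypotheses = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# corpus_bleu takes the hypotheses plus a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, en_references)
print(f"BLEU: {bleu.score:.1f}")
```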
|
|
|
# Usage |
|
You can use the model in the following way: |
|
```python
from transformers import MarianMTModel, MarianTokenizer

# Specify the model identifier on the Hugging Face Model Hub
model_name = "scoris/scoris-mt-lt-en"

# Load the model and tokenizer from Hugging Face
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

src_text = [
    "Kartą, senų senovėje, buvo viena mergaitė ir gyveno ji su savo mama mažoje jaukioje trobelėje prie miško. ",
    "Mergaitę žmonės vadino Raudonkepuraite, nes ji dažnai dėvėdavo raudoną apsiaustėlį su kapišonu. ",
    "Mergaitė mielai gobdavosi šiuo apsiaustėliu, nes jį buvo gavusi iš savo močiutės, kuri gyveno namelyje už miško ir labai mylėjo Raudonkepuraitę. ",
    "Vieną dieną mama priruošė Raudonkepuraitei pilną krepšelį įvairiausių gėrybių.",
    "Pridėjo obuoliukų, kriaušaičių, braškių, taip pat skanių pyragėlių, kuriuos pati buvo iškepusi, sūrio ir gabalėlį mėsos bei didelį išdabintą tortą."
]

# Tokenize the text and generate translations
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

# Print out the translations
for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

# Example output:
# Once upon a time there was a girl, and she lived with her mother in a small cozy hut by the forest.
# The girl was called the Red cape because she often wore a red cape.
# The girl would gladly wear this coat, because she had it from her grandmother, who lived in a house outside the forest and loved Redcape very much.
# One day my mother prepared a basket full of all kinds of good things for the Red cape.
# He added apples, pears, strawberries, as well as delicious cakes that he had baked, cheese and a piece of meat, and a large cake.
```
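
For quick experiments, the same checkpoint can also be loaded through the high-level `pipeline` API; this is a minimal alternative sketch, and generation defaults may differ slightly from the explicit `generate` call above.

```python
from transformers import pipeline

# Load the checkpoint through the translation pipeline
translator = pipeline("translation", model="scoris/scoris-mt-lt-en")

result = translator("Kartą, senų senovėje, buvo viena mergaitė.")
print(result[0]["translation_text"])
```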