metadata
language:
- en
- el
- multilingual
license: apache-2.0
tags:
- translation
metrics:
- bleu
widget:
- text: '''Katerina'', is the best name for a girl.'
English to Greek NMT
By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)
- source languages: en
- target languages: el
- licence: apache-2.0
- dataset: Opus, CCmatrix
- model: transformer(fairseq)
- pre-processing: tokenization + BPE segmentation
- metrics: bleu, chrf
Model description
Trained using the Fairseq framework, transformer_iwslt_de_en architecture.\ BPE segmentation (20k codes).\ Mixed-case model.
How to use
from transformers import FSMTTokenizer, FSMTForConditionalGeneration
mname = "lighteternal/SSE-TUC-mt-en-el-cased"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)
text = " 'Katerina', is the best name for a girl."
encoded = tokenizer.encode(text, return_tensors='pt')
outputs = model.generate(encoded, num_beams=5, num_return_sequences=5, early_stopping=True)
for i, output in enumerate(outputs):
i += 1
print(f"{i}: {output.tolist()}")
decoded = tokenizer.decode(output, skip_special_tokens=True)
print(f"{i}: {decoded}")
Training data
Consolidated corpus from Opus and CC-Matrix (~6.6GB in total)
Eval results
Results on Tatoeba testset (EN-EL):
BLEU | chrF |
---|---|
76.9 | 0.733 |
Results on XNLI parallel (EN-EL):
BLEU | chrF |
---|---|
65.4 | 0.624 |
BibTeX entry and citation info
Dimitris Papadopoulos, et al. "PENELOPIE: Enabling Open Information Extraction for the Greek Language through Machine Translation." (2021). Accepted at EACL 2021 SRW
Acknowledgement
The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number:50, 2nd call)