---
language:
- en
- el
tags:
- translation
widget:
- text: "'Katerina', is the best name for a girl."
license: apache-2.0
metrics:
- bleu
---

## English to Greek NMT
## By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)

* source languages: en
* target languages: el
* license: apache-2.0
* dataset: OPUS, CCMatrix
* model: transformer (fairseq)
* pre-processing: tokenization + BPE segmentation
* metrics: bleu, chrf

### Model description

Trained using the Fairseq framework with the transformer_iwslt_de_en architecture.\
BPE segmentation (20k codes).\
Mixed-case model.
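
As a rough illustration of the pre-processing step, the snippet below learns 20k BPE merge operations and applies them to a tokenized sentence. It is a minimal sketch assuming the subword-nmt package; the corpus and codes file paths are hypothetical placeholders, not the authors' actual pipeline.

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 20k merge operations from a tokenized training corpus
# ("train.tok.en" is a hypothetical file path)
with open("train.tok.en", encoding="utf-8") as infile, \
        open("bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=20000)

# Segment an already-tokenized sentence with the learned codes
with open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("'Katerina' , is the best name for a girl ."))
```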

### How to use

```python
from transformers import FSMTTokenizer, FSMTForConditionalGeneration

# Path to the locally downloaded model folder
mname = "<your_downloaded_model_folderpath_here>"

tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

text = "'Katerina', is the best name for a girl."

encoded = tokenizer.encode(text, return_tensors="pt")

# Return the 5 best beam-search hypotheses
outputs = model.generate(encoded, num_beams=5, num_return_sequences=5, early_stopping=True)

for i, output in enumerate(outputs, start=1):
    # Raw token ids of the i-th hypothesis
    print(f"{i}: {output.tolist()}")
    # Detokenized translation
    decoded = tokenizer.decode(output, skip_special_tokens=True)
    print(f"{i}: {decoded}")
```
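
For quick experiments, the same checkpoint should also work through the high-level translation pipeline. This is a minimal sketch, reusing the folder-path placeholder from above; "translation_en_to_el" follows the generic translation_XX_to_YY task pattern the pipeline accepts for arbitrary language pairs.

```python
from transformers import pipeline

# Hypothetical local path; replace with your downloaded model folder
mname = "<your_downloaded_model_folderpath_here>"

translator = pipeline("translation_en_to_el", model=mname, tokenizer=mname)
print(translator("'Katerina', is the best name for a girl.", num_beams=5))
```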

## Training data

Consolidated corpus from OPUS and CCMatrix (~6.6 GB in total).

## Eval results

Results on the Tatoeba testset (EN-EL):

| BLEU | chrF |
| ---- | ---- |
| 76.9 | 0.733 |

Results on the XNLI parallel corpus (EN-EL):

| BLEU | chrF |
| ---- | ---- |
| 65.4 | 0.624 |
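
Corpus-level BLEU and chrF scores like those above can be computed with sacrebleu. A minimal sketch follows; the hypothesis and reference file names are hypothetical, and this is not necessarily the exact evaluation setup used by the authors.

```python
import sacrebleu

# Hypothetical files: one detokenized sentence per line
with open("hypotheses.el", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.el", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# sacrebleu takes a list of reference streams (here, a single one)
bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.3f}")
```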

### BibTeX entry and citation info

TODO
|