---
language:
- en
- el
tags:
- translation
widget:
- text: "'Katerina', is the best name for a girl."
license: apache-2.0
metrics:
- bleu
---
## English to Greek NMT from Hellenic Army Academy (SSE) and Technical University of Crete (TUC)
* source languages: en
* target languages: el
* license: apache-2.0
* dataset: OPUS, CCMatrix
* model: transformer (Fairseq)
* pre-processing: tokenization + BPE segmentation
* metrics: BLEU, chrF
### Model description
Trained using the Fairseq framework with the transformer_iwslt_de_en architecture.\
BPE segmentation (20k codes).\
Mixed-case model.
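
To see what the BPE segmentation looks like, you can inspect the subword tokens the tokenizer produces. A minimal sketch, assuming the model has been downloaded to a local folder (same placeholder path as in the usage example below):

```python
from transformers import FSMTTokenizer

# Placeholder path; replace with your downloaded model folder.
mname = "<your_downloaded_model_folderpath_here>"
tokenizer = FSMTTokenizer.from_pretrained(mname)

# BPE splits rare words into smaller subword units, keeping the vocabulary compact.
print(tokenizer.tokenize("'Katerina', is the best name for a girl."))
```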
### How to use
```python
from transformers import FSMTTokenizer, FSMTForConditionalGeneration

mname = "<your_downloaded_model_folderpath_here>"
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

text = "'Katerina', is the best name for a girl."
encoded = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(encoded, num_beams=5, num_return_sequences=5, early_stopping=True)

# Print each beam hypothesis, first as token ids, then decoded.
for i, output in enumerate(outputs, start=1):
    print(f"{i}: {output.tolist()}")
    decoded = tokenizer.decode(output, skip_special_tokens=True)
    print(f"{i}: {decoded}")
```
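
To translate several sentences at once, a minimal batched variant of the above (assuming the same `tokenizer` and `model`, and that the tokenizer defines a pad token, as FSMT checkpoints normally do):

```python
texts = [
    "'Katerina', is the best name for a girl.",
    "How are you today?",
]
batch = tokenizer(texts, return_tensors="pt", padding=True)
outputs = model.generate(**batch, num_beams=5, early_stopping=True)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```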
## Training data
Consolidated corpus from OPUS and CCMatrix (~6.6 GB in total).
## Eval results
Results on the Tatoeba test set (EN-EL):

| BLEU | chrF |
| ------ | ------ |
| 76.9 | 0.733 |

Results on the XNLI parallel corpus (EN-EL):

| BLEU | chrF |
| ------ | ------ |
| 65.4 | 0.624 |
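
BLEU and chrF scores of this kind are commonly computed with `sacrebleu`; a minimal sketch with hypothetical system outputs and references (not the actual Tatoeba/XNLI data):

```python
import sacrebleu

# Hypothetical system outputs and references, for illustration only.
hyps = ["Η Κατερίνα είναι το καλύτερο όνομα για ένα κορίτσι."]
refs = [["Η Κατερίνα είναι το καλύτερο όνομα για ένα κορίτσι."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs)

print(f"BLEU: {bleu.score:.1f}")  # 0-100 scale
print(f"chrF: {chrf.score:.1f}")  # recent sacrebleu versions use a 0-100 scale; the tables above use 0-1
```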
### BibTeX entry and citation info
TODO