---
language:
- el
- en
tags:
- translation
widget:
- text: "Κάνω διδακτορικό στην υπολογιστική γλωσσολογία."
license: apache-2.0
metrics:
- bleu
---

## Greek to English NMT
## By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)

* source languages: el
* target languages: en
* licence: apache-2.0
* dataset: Opus, CCMatrix
* model: transformer (fairseq)
* pre-processing: tokenization + BPE segmentation
* metrics: bleu, chrf

### Model description

Trained using the Fairseq framework, `transformer_iwslt_de_en` architecture.\
BPE segmentation (20k codes).\
Mixed-case model.

### How to use

```python
from transformers import FSMTTokenizer, FSMTForConditionalGeneration

# Set this to the model's Hub identifier or a local path.
mname = " "

tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)

# Greek input: "I am doing a PhD in computational linguistics."
text = "Κάνω διδακτορικό στην υπολογιστική γλωσσολογία."

# Encode the sentence as a PyTorch tensor of token ids.
encoded = tokenizer.encode(text, return_tensors='pt')

# Beam search with 5 beams, returning the 5 best hypotheses.
outputs = model.generate(encoded, num_beams=5, num_return_sequences=5, early_stopping=True)

# Print the token ids and the decoded text of each hypothesis.
for i, output in enumerate(outputs, start=1):
    print(f"{i}: {output.tolist()}")
    decoded = tokenizer.decode(output, skip_special_tokens=True)
    print(f"{i}: {decoded}")
```

## Training data

Consolidated corpus from Opus and CCMatrix (~6.6 GB in total).

## Eval results

Results on the Tatoeba test set (EL-EN):

| BLEU | chrF |
| ------ | ------ |
| 79.3 | 0.795 |

Results on XNLI parallel (EL-EN):

| BLEU | chrF |
| ------ | ------ |
| 66.2 | 0.623 |

### BibTeX entry and citation info

TODO
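
### BPE segmentation sketch

The pre-processing listed above uses BPE segmentation with 20k codes. As a rough illustration of what those merge codes are, here is a minimal pure-Python sketch of learning and applying BPE merges on a toy word list. It is not the pipeline used to train this model: the function names (`learn_bpe`, `merge_pair`, `apply_bpe`), the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic (an arbitrary choice for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the new merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["low", "low", "lower", "newest", "newest", "widest"]
    codes = learn_bpe(corpus, num_merges=10)
    print(apply_bpe("lowest", codes))
```

With 20k merges learned on a large corpus, frequent words end up as single symbols while rare words are split into smaller, reusable subword units.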