opus-mt-tc-bible-big-ira-deu_eng_fra_por_spa

Model Details
Uses
Risks, Limitations and Biases
How to Get Started With the Model
Training
Evaluation
Citation Information
Acknowledgements

Model Details

Neural machine translation model for translating from Iranian languages (ira) to German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models are originally trained with Marian NMT, an efficient NMT framework written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model Description:

Developed by: Language Technology Research Group at the University of Helsinki
Model Type: Translation (transformer-big)
Release: 2024-05-30
License: Apache-2.0
Language(s):
- Source Language(s): bal ckb diq fas glk jdt kmr kur lrc mzn oss pal pes prs pus sdh tgk tly zza
- Target Language(s): deu eng fra por spa
- Valid Target Language Labels: >>deu<< >>eng<< >>fra<< >>por<< >>spa<< >>xxx<<
Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token of the form >>id<< is required, where id is a valid target language ID, e.g. >>deu<<.
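The token requirement can be checked before text reaches the tokenizer. The following is a minimal sketch, not part of transformers or OPUS-MT: `add_target_token` and `VALID_TARGETS` are hypothetical names, and the label set is an assumption copied from the target languages listed above (the >>xxx<< label is left out because this card does not describe its purpose).

```python
# Hypothetical helper, not part of transformers or OPUS-MT.
# Assumption: only the five documented target languages are accepted.
VALID_TARGETS = {"deu", "eng", "fra", "por", "spa"}


def add_target_token(text: str, target: str) -> str:
    """Prefix `text` with the sentence-initial >>id<< token for `target`."""
    if target not in VALID_TARGETS:
        raise ValueError(f"unsupported target language: {target!r}")
    # Strip surrounding whitespace so the token is followed by one space.
    return f">>{target}<< {text.strip()}"
```

Calling `add_target_token(sentence, "deu")` returns the sentence with `>>deu<< ` prepended, and an unknown label raises `ValueError` instead of silently producing input without a valid token.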

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

# Each input starts with the >>id<< token that selects the target language.
src_text = [
    ">>deu<< Replace this with text in an accepted source language.",
    ">>spa<< This is the second sentence."
]

# Hub ID of this model (the same ID is used in the pipeline example below).
model_name = "Helsinki-NLP/opus-mt-tc-bible-big-ira-deu_eng_fra_por_spa"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the batch with padding, then generate the translations.
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-ira-deu_eng_fra_por_spa")
print(pipe(">>deu<< Replace this with text in an accepted source language."))
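Because the target language is selected per input, one source sentence can be sent to all five targets in a single batch. The sketch below assumes a `translate` callable that takes a list of strings and returns a list of dicts with a `translation_text` key, which is the shape the transformers translation pipeline returns; `translate_to_all` and `TARGETS` are hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: the five target language IDs listed in this card.
TARGETS = ("deu", "eng", "fra", "por", "spa")


def translate_to_all(
    text: str,
    translate: Callable[[List[str]], List[dict]],
    targets: Sequence[str] = TARGETS,
) -> Dict[str, str]:
    """Translate `text` into every language in `targets` with one batched call."""
    # Build one input per target, each with its own >>id<< token.
    inputs = [f">>{t}<< {text}" for t in targets]
    outputs = translate(inputs)
    # Outputs come back in input order, so pair them with the target IDs.
    return {t: out["translation_text"] for t, out in zip(targets, outputs)}
```

With the `pipe` object created above, `translate_to_all("...", pipe)` would return a dict keyed by target language ID. Keeping the callable as a parameter also makes the helper easy to exercise with a stub, without downloading the model.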

Training

Data: opusTCv20230926max50+bt+jhubc (source)
Pre-processing: SentencePiece (spm32k,spm32k)
Model Type: transformer-big
Original MarianNMT Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
Training Scripts: GitHub Repo

Evaluation

Model scores at the OPUS-MT dashboard
test set translations: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.test.txt
test set scores: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-29.eval.txt
benchmark results: benchmark_results.txt
benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
fas-deu	tatoeba-test-v2021-08-07	0.59737	36.1	3185	25590
fas-eng	tatoeba-test-v2021-08-07	0.59871	35.8	3762	31480
fas-fra	tatoeba-test-v2021-08-07	0.58095	36.3	376	3377
kur_Latn-deu	tatoeba-test-v2021-08-07	0.40276	24.9	223	1323
pes-eng	tatoeba-test-v2021-08-07	0.60717	42.3	3757	31411
ckb-deu	flores101-devtest	0.40117	11.6	1012	25094
ckb-eng	flores101-devtest	0.48321	21.6	1012	24721
ckb-fra	flores101-devtest	0.44260	17.2	1012	28343
ckb-por	flores101-devtest	0.43179	16.2	1012	26519
fas-eng	flores101-devtest	0.61134	34.4	1012	24721
pus-eng	flores101-devtest	0.49556	22.7	1012	24721
pus-fra	flores101-devtest	0.45248	17.8	1012	28343
tgk-eng	flores101-devtest	0.53630	25.4	1012	24721
tgk-fra	flores101-devtest	0.49084	21.0	1012	28343
tgk-spa	flores101-devtest	0.43524	15.5	1012	29199
ckb-deu	flores200-devtest	0.40369	11.7	1012	25094
ckb-eng	flores200-devtest	0.48447	21.5	1012	24721
ckb-fra	flores200-devtest	0.44026	17.1	1012	28343
ckb-por	flores200-devtest	0.43192	16.4	1012	26519
pes-deu	flores200-devtest	0.51542	21.5	1012	25094
pes-eng	flores200-devtest	0.61372	34.9	1012	24721
pes-fra	flores200-devtest	0.56347	29.2	1012	28343
pes-por	flores200-devtest	0.55676	28.5	1012	26519
pes-spa	flores200-devtest	0.48334	19.8	1012	29199
prs-deu	flores200-devtest	0.50562	21.2	1012	25094
prs-eng	flores200-devtest	0.60716	35.1	1012	24721
prs-fra	flores200-devtest	0.54769	27.8	1012	28343
prs-por	flores200-devtest	0.54073	27.2	1012	26519
prs-spa	flores200-devtest	0.46850	18.6	1012	29199
tgk-deu	flores200-devtest	0.43115	14.2	1012	25094
tgk-eng	flores200-devtest	0.53705	25.6	1012	24721
tgk-fra	flores200-devtest	0.48902	20.7	1012	28343
tgk-por	flores200-devtest	0.48519	20.7	1012	26519
tgk-spa	flores200-devtest	0.43563	15.7	1012	29199
fas-deu	ntrex128	0.47408	16.7	1997	48761
fas-eng	ntrex128	0.55350	26.4	1997	47673
fas-fra	ntrex128	0.50311	22.1	1997	53481
fas-por	ntrex128	0.48005	19.1	1997	51631
fas-spa	ntrex128	0.50973	23.6	1997	54107
prs-deu	ntrex128	0.45191	14.9	1997	48761
prs-eng	ntrex128	0.54761	26.6	1997	47673
prs-fra	ntrex128	0.47819	19.9	1997	53481
prs-por	ntrex128	0.46241	17.4	1997	51631
prs-spa	ntrex128	0.48712	21.4	1997	54107
pus-eng	ntrex128	0.43901	17.4	1997	47673
pus-spa	ntrex128	0.40812	14.1	1997	54107
tgk_Cyrl-eng	ntrex128	0.46839	18.6	1997	47673
tgk_Cyrl-fra	ntrex128	0.42569	15.1	1997	53481
tgk_Cyrl-por	ntrex128	0.41632	13.7	1997	51631
tgk_Cyrl-spa	ntrex128	0.43763	16.8	1997	54107
ckb-eng	tico19-test	0.61905	40.1	2100	56315
ckb-fra	tico19-test	0.45070	19.7	2100	64661
ckb-por	tico19-test	0.49617	22.9	2100	62729
ckb-spa	tico19-test	0.50543	24.9	2100	66563
fas-eng	tico19-test	0.64016	37.3	2100	56315
fas-fra	tico19-test	0.53319	26.1	2100	64661
fas-por	tico19-test	0.58008	30.6	2100	62729
fas-spa	tico19-test	0.59239	33.3	2100	66563
prs-eng	tico19-test	0.61702	34.8	2100	56824
prs-fra	tico19-test	0.51218	24.0	2100	64661
prs-por	tico19-test	0.55888	28.6	2100	62729
prs-spa	tico19-test	0.57494	31.1	2100	66563
pus-eng	tico19-test	0.57586	32.1	2100	56315
pus-fra	tico19-test	0.46091	19.2	2100	64661
pus-por	tico19-test	0.51033	24.1	2100	62729
pus-spa	tico19-test	0.51857	25.9	2100	66563
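The scores above are tab-separated, so they can be summarised with the Python standard library. This is a minimal sketch under stated assumptions: the table is available as tab-separated text with the header row shown above, `mean_bleu_by_testset` is a hypothetical helper, and `SAMPLE` embeds only three rows of the table for illustration. An unweighted mean of BLEU scores is a rough summary, not a corpus-level metric.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the table above, for illustration only.
SAMPLE = (
    "langpair\ttestset\tchr-F\tBLEU\t#sent\t#words\n"
    "fas-deu\ttatoeba-test-v2021-08-07\t0.59737\t36.1\t3185\t25590\n"
    "fas-eng\ttatoeba-test-v2021-08-07\t0.59871\t35.8\t3762\t31480\n"
    "ckb-eng\ttico19-test\t0.61905\t40.1\t2100\t56315\n"
)


def mean_bleu_by_testset(tsv_text: str) -> dict:
    """Return the unweighted mean BLEU score for each test set."""
    scores = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        scores[row["testset"]].append(float(row["BLEU"]))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}


if __name__ == "__main__":
    for testset, bleu in mean_bleu_by_testset(SAMPLE).items():
        print(f"{testset}\t{bleu:.2f}")
```

The same function can be applied to the full table text; grouping on `langpair` instead of `testset` would give per-language-pair averages.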

Citation Information

Publications: "Democratizing neural machine translation with OPUS-MT", "OPUS-MT – Building open translation services for the World", and "The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT". Please cite them if you use this model.

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  number={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

transformers version: 4.45.1
OPUS-MT git hash: 0882077
port time: Tue Oct 8 11:54:09 EEST 2024
port machine: LM0-400-22516.local
