
opus-mt-tc-bible-big-deu_eng_fra_por_spa-inc

Table of Contents

  • Model Details
  • Uses
  • Risks, Limitations and Biases
  • How to Get Started With the Model
  • Training
  • Evaluation
  • Citation Information
  • Acknowledgements

Model Details

Neural machine translation model for translating from German, English, French, Portuguese and Spanish (deu+eng+fra+por+spa) to Indic languages (inc).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages of the world. All models are originally trained with the Marian NMT framework, an efficient NMT implementation written in pure C++, and have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS and training pipelines follow the procedures of OPUS-MT-train.

Model Description:

  • Developed by: Language Technology Research Group at the University of Helsinki
  • Model Type: Translation (transformer-big)
  • Release: 2024-05-30
  • License: Apache-2.0
  • Language(s):
    • Source Language(s): deu eng fra por spa
    • Target Language(s): anp asm awa ben bho bpy div dty gbm guj hif hin hne hns kas kok lah mag mai mar nep npi ori pan pli rhg rmy rom san sin skr snd syl urd
    • Valid Target Language Labels: >>aee<< >>aeq<< >>anp<< >>anr<< >>asm<< >>awa<< >>bdv<< >>ben<< >>bfb<< >>bfy<< >>bfz<< >>bgc<< >>bgd<< >>bge<< >>bgw<< >>bha<< >>bhb<< >>bhd<< >>bhe<< >>bhi<< >>bho<< >>bht<< >>bhu<< >>bjj<< >>bkk<< >>bmj<< >>bns<< >>bpx<< >>bpy<< >>bra<< >>btv<< >>ccp<< >>cdh<< >>cdi<< >>cdj<< >>cih<< >>clh<< >>ctg<< >>dcc<< >>dhn<< >>dho<< >>div<< >>dmk<< >>dml<< >>doi<< >>dry<< >>dty<< >>dub<< >>duh<< >>dwz<< >>emx<< >>gas<< >>gbk<< >>gbl<< >>gbm<< >>gdx<< >>ggg<< >>ghr<< >>gig<< >>gjk<< >>glh<< >>gra<< >>guj<< >>gwc<< >>gwf<< >>gwt<< >>haj<< >>hca<< >>hif<< >>hif_Latn<< >>hii<< >>hin<< >>hin_Latn<< >>hlb<< >>hne<< >>hns<< >>jdg<< >>jml<< >>jnd<< >>jns<< >>kas<< >>kas_Arab<< >>kas_Deva<< >>kbu<< >>keq<< >>key<< >>kfr<< >>kfs<< >>kft<< >>kfu<< >>kfv<< >>kfx<< >>kfy<< >>khn<< >>khw<< >>kjo<< >>kls<< >>kok<< >>kra<< >>ksy<< >>kvx<< >>kxp<< >>kyw<< >>lah<< >>lbm<< >>lhl<< >>lmn<< >>lss<< >>luv<< >>mag<< >>mai<< >>mar<< >>mby<< >>mjl<< >>mjz<< >>mkb<< >>mke<< >>mki<< >>mvy<< >>mwr<< >>nag<< >>nep<< >>nhh<< >>nli<< >>nlx<< >>noe<< >>noi<< >>npi<< >>odk<< >>omr<< >>ori<< >>ort<< >>pan<< >>pan_Guru<< >>paq<< >>pcl<< >>pgg<< >>phd<< >>phl<< >>pli<< >>plk<< >>plp<< >>pmh<< >>psh<< >>psi<< >>psu<< >>pwr<< >>raj<< >>rei<< >>rhg<< >>rhg_Latn<< >>rjs<< >>rkt<< >>rmi<< >>rmq<< >>rmt<< >>rmy<< >>rom<< >>rtw<< >>san<< >>san_Deva<< >>saz<< >>sbn<< >>sck<< >>scl<< >>sdg<< >>sdr<< >>shd<< >>sin<< >>sjp<< >>skr<< >>smm<< >>smv<< >>snd<< >>snd_Arab<< >>soi<< >>srx<< >>ssi<< >>sts<< >>syl<< >>syl_Sylo<< >>tdb<< >>the<< >>thl<< >>thq<< >>thr<< >>tkb<< >>tkt<< >>tnv<< >>tra<< >>trw<< >>urd<< >>ush<< >>vaa<< >>vah<< >>vas<< >>vav<< >>ved<< >>vgr<< >>wsv<< >>wtm<< >>xka<< >>xxx<<
  • Original Model: opusTCv20230926max50+bt+jhubc_transformer-big_2024-05-30.zip
  • Resources for more information:

This is a multilingual translation model with multiple target languages. A sentence-initial language token is required in the form >>id<< (where id is a valid target language ID), e.g. >>anp<<.
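
For illustration, the target-language token can also be prepended programmatically before tokenization. The snippet below is a minimal sketch (the helper name tag_for_target is made up for this card); it additionally checks that the requested label exists in the tokenizer vocabulary:

from transformers import MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-inc"
tokenizer = MarianTokenizer.from_pretrained(model_name)

def tag_for_target(sentences, target_lang):
    # Prepend the >>id<< label that tells the model which target language to produce.
    label = f">>{target_lang}<<"
    if label not in tokenizer.get_vocab():
        raise ValueError(f"{label} is not a valid target language label for this model")
    return [f"{label} {s}" for s in sentences]

tagged = tag_for_target(["This is a test sentence."], "hin")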

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing or offensive, and that can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short example code:

from transformers import MarianMTModel, MarianTokenizer

# Each input sentence starts with the target-language token, e.g. >>anp<< or >>urd<<.
src_text = [
    ">>anp<< Replace this with text in an accepted source language.",
    ">>urd<< This is the second sentence."
]

model_name = "Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-inc"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print(tokenizer.decode(t, skip_special_tokens=True))
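
The generate() call also accepts the usual decoding arguments from transformers, such as a beam size or a length limit. A minimal sketch, assuming the model and tokenizer from the example above are already loaded:

# num_beams and max_new_tokens are standard transformers generation arguments;
# the values here are illustrative, not tuned for this model.
batch = tokenizer(src_text, return_tensors="pt", padding=True)
translated = model.generate(**batch, num_beams=4, max_new_tokens=256)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))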

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-inc")
print(pipe(">>anp<< Replace this with text in an accepted source language."))
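
The pipeline also accepts a list of inputs, so a whole batch of tagged sentences can be translated in one call, and each sentence can carry its own target-language label. A minimal sketch with made-up example sentences:

from transformers import pipeline

pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-bible-big-deu_eng_fra_por_spa-inc")

# One batch can mix target languages because the label travels with each sentence.
batch = [
    ">>hin<< Replace this with text in an accepted source language.",
    ">>ben<< This is the second sentence.",
]
for result in pipe(batch):
    print(result["translation_text"])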

Training

Evaluation

langpair testset chr-F BLEU #sent #words
eng-ben tatoeba-test-v2021-08-07 0.48316 18.1 2500 11654
eng-hin tatoeba-test-v2021-08-07 0.52587 28.1 5000 32904
eng-mar tatoeba-test-v2021-08-07 0.52516 24.2 10396 61140
eng-urd tatoeba-test-v2021-08-07 0.46228 18.8 1663 12155
deu-ben flores101-devtest 0.44269 10.8 1012 21155
deu-hin flores101-devtest 0.48314 21.9 1012 27743
eng-ben flores101-devtest 0.51768 17.4 1012 21155
eng-guj flores101-devtest 0.54325 22.7 1012 23840
eng-hin flores101-devtest 0.58472 34.1 1012 27743
fra-ben flores101-devtest 0.44304 11.1 1012 21155
fra-hin flores101-devtest 0.48245 22.5 1012 27743
deu-ben flores200-devtest 0.44696 11.3 1012 21155
deu-guj flores200-devtest 0.40939 12.0 1012 23840
deu-hin flores200-devtest 0.48864 22.7 1012 27743
deu-hne flores200-devtest 0.43166 14.2 1012 26582
deu-mag flores200-devtest 0.43058 14.2 1012 26516
deu-urd flores200-devtest 0.41167 14.3 1012 28098
eng-ben flores200-devtest 0.52088 17.7 1012 21155
eng-guj flores200-devtest 0.54758 23.2 1012 23840
eng-hin flores200-devtest 0.58825 34.4 1012 27743
eng-hne flores200-devtest 0.46144 19.1 1012 26582
eng-mag flores200-devtest 0.50291 21.9 1012 26516
eng-mar flores200-devtest 0.49344 15.6 1012 21810
eng-pan flores200-devtest 0.45635 18.4 1012 27451
eng-sin flores200-devtest 0.45683 11.8 1012 23278
eng-urd flores200-devtest 0.48224 20.6 1012 28098
fra-ben flores200-devtest 0.44486 11.1 1012 21155
fra-guj flores200-devtest 0.41021 12.2 1012 23840
fra-hin flores200-devtest 0.48632 22.7 1012 27743
fra-hne flores200-devtest 0.42777 13.8 1012 26582
fra-mag flores200-devtest 0.42725 14.3 1012 26516
fra-urd flores200-devtest 0.40901 13.6 1012 28098
por-ben flores200-devtest 0.43877 10.7 1012 21155
por-hin flores200-devtest 0.50121 23.9 1012 27743
por-hne flores200-devtest 0.42270 14.1 1012 26582
por-mag flores200-devtest 0.42146 13.7 1012 26516
por-san flores200-devtest 9.879 0.4 1012 18253
por-urd flores200-devtest 0.41225 14.5 1012 28098
spa-ben flores200-devtest 0.42040 8.8 1012 21155
spa-hin flores200-devtest 0.43977 16.4 1012 27743
eng-hin newstest2014 0.51541 24.0 2507 60872
eng-guj newstest2019 0.57815 25.7 998 21924
deu-ben ntrex128 0.44384 9.9 1997 40095
deu-hin ntrex128 0.43252 17.0 1997 55219
deu-urd ntrex128 0.41844 14.8 1997 54259
eng-ben ntrex128 0.52381 17.3 1997 40095
eng-guj ntrex128 0.49386 17.2 1997 45335
eng-hin ntrex128 0.52696 27.4 1997 55219
eng-mar ntrex128 0.45244 10.8 1997 42375
eng-nep ntrex128 0.43339 8.8 1997 40570
eng-pan ntrex128 0.46534 19.5 1997 54355
eng-sin ntrex128 0.44124 10.5 1997 44429
eng-urd ntrex128 0.50060 22.4 1997 54259
fra-ben ntrex128 0.42857 9.4 1997 40095
fra-hin ntrex128 0.42777 17.4 1997 55219
fra-urd ntrex128 0.41229 14.3 1997 54259
por-ben ntrex128 0.44134 10.1 1997 40095
por-hin ntrex128 0.43461 17.7 1997 55219
por-urd ntrex128 0.41777 14.5 1997 54259
spa-ben ntrex128 0.45329 10.6 1997 40095
spa-hin ntrex128 0.43747 17.9 1997 55219
spa-urd ntrex128 0.41929 14.6 1997 54259
eng-ben tico19-test 0.51850 18.6 2100 51695
eng-hin tico19-test 0.62999 41.9 2100 62680
eng-mar tico19-test 0.45968 13.0 2100 50872
eng-nep tico19-test 0.54373 18.7 2100 48363
eng-urd tico19-test 0.50920 21.7 2100 65312
fra-hin tico19-test 0.48666 25.6 2100 62680
fra-nep tico19-test 0.41414 10.0 2100 48363
por-ben tico19-test 0.45609 12.7 2100 51695
por-hin tico19-test 0.55530 31.2 2100 62680
por-mar tico19-test 0.40344 9.7 2100 50872
por-nep tico19-test 0.47698 12.4 2100 48363
por-urd tico19-test 0.44747 15.6 2100 65312
spa-ben tico19-test 0.46418 13.3 2100 51695
spa-hin tico19-test 0.55526 31.0 2100 62680
spa-mar tico19-test 0.41189 10.0 2100 50872
spa-nep tico19-test 0.47414 12.1 2100 48363
spa-urd tico19-test 0.44788 15.6 2100 65312
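
For reference, scores like the BLEU and chr-F values above can be reproduced with sacrebleu once system outputs and reference translations are available. The snippet below is a minimal, hypothetical sketch; hyp.txt and ref.txt are placeholder file names, not files shipped with this model:

import sacrebleu

# Hypothetical files with one sentence per line: system output and reference.
with open("hyp.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])

# Note: sacrebleu reports chr-F on a 0-100 scale; the table above uses a 0-1 scale.
print(f"BLEU:  {bleu.score:.1f}")
print(f"chr-F: {chrf.score:.5f}")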

Citation Information

@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and Bakshandaeva, Daria and Boggia, Michele and Gr{\"o}nroos, Stig-Arne and Nieminen, Tommi and Raganato, Alessandro and Scherrer, Yves and Vazquez, Raul and Virpioja, Sami},
  journal={Language Resources and Evaluation},
  volume={58},
  pages={713--755},
  year={2023},
  publisher={Springer Nature},
  issn={1574-0218},
  doi={10.1007/s10579-023-09704-w}
}

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the HPLT project, funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland, and the EuroHPC supercomputer LUMI.

Model conversion info

  • transformers version: 4.45.1
  • OPUS-MT git hash: 0882077
  • port time: Tue Oct 8 10:09:07 EEST 2024
  • port machine: LM0-400-22516.local