opus-mt-tc-big-gmw-gmw

Model Details
Uses
Risks, Limitations and Biases
How to Get Started With the Model
Training
Evaluation
Citation Information
Acknowledgements

Model Details

Neural machine translation model for translating from West Germanic languages (gmw) to West Germanic languages (gmw).

This model is part of the OPUS-MT project, an effort to make neural machine translation models widely available and accessible for many languages in the world. All models were originally trained with Marian NMT, an efficient NMT implementation written in pure C++. The models have been converted to PyTorch using the transformers library by Hugging Face. Training data is taken from OPUS, and the training pipelines follow the procedures of OPUS-MT-train.

Model Description:

Developed by: Language Technology Research Group at the University of Helsinki
Model Type: Translation (transformer-big)
Release: 2022-08-11
License: CC-BY-4.0
Language(s):
- Source Language(s): afr deu eng enm fry gos gsw hrx ksh ltz multi nds nld pdc sco stq swg tpi yid
- Target Language(s): afr ang deu eng enm fry gos ltz multi nds nld sco tpi yid
- Language Pair(s): afr-deu afr-eng afr-nld deu-afr deu-deu deu-eng deu-nds deu-nld eng-afr eng-deu eng-eng eng-nld fry-eng fry-nld gos-deu gos-eng gos-nld hrx-deu hrx-eng ltz-deu ltz-eng ltz-nld multi-multi nds-deu nds-eng nds-nld nld-afr nld-deu nld-eng nld-fry nld-nds nld-nld
- Valid Target Language Labels: >>act<< >>afr<< >>afs<< >>aig<< >>ang<< >>ang_Latn<< >>bah<< >>bar<< >>bis<< >>bjs<< >>brc<< >>bzj<< >>bzj_Latn<< >>bzk<< >>cim<< >>dcr<< >>deu<< >>djk<< >>djk_Latn<< >>drt<< >>drt_Latn<< >>dum<< >>eng<< >>enm<< >>enm_Latn<< >>fpe<< >>frk<< >>frr<< >>fry<< >>gcl<< >>gct<< >>geh<< >>gmh<< >>gml<< >>goh<< >>gos<< >>gpe<< >>gsw<< >>gul<< >>gyn<< >>hrx<< >>hrx_Latn<< >>hwc<< >>icr<< >>jam<< >>jvd<< >>kri<< >>ksh<< >>kww<< >>lim<< >>lng<< >>ltz<< >>mhn<< >>nds<< >>nld<< >>odt<< >>ofs<< >>ofs_Latn<< >>oor<< >>osx<< >>pcm<< >>pdc<< >>pdt<< >>pey<< >>pfl<< >>pih<< >>pih_Latn<< >>pis<< >>pis_Latn<< >>qlm<< >>rop<< >>sco<< >>sdz<< >>skw<< >>sli<< >>srm<< >>srm_Latn<< >>srn<< >>stl<< >>stq<< >>svc<< >>swg<< >>sxu<< >>tch<< >>tcs<< >>tgh<< >>tpi<< >>trf<< >>twd<< >>uln<< >>vel<< >>vic<< >>vls<< >>vmf<< >>wae<< >>wep<< >>wes<< >>wes_Latn<< >>wym<< >>ydd<< >>yec<< >>yid<< >>yih<< >>zea<<
Original Model: opusTCv20210807_transformer-big_2022-08-11.zip
Resources for more information:
- OPUS-MT-train GitHub Repo
- More information about released models for this language pair: OPUS-MT gmw-gmw README
- More information about MarianNMT models in the transformers library
- [Tatoeba Translation Challenge](https://github.com/Helsinki-NLP/Tatoeba-Challenge/)

This is a multilingual translation model with multiple target languages. Every input sentence must start with a language token of the form >>id<<, where id is a valid target language ID, e.g. >>afr<<.
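The prefixing rule can be sketched as a small helper. This is a minimal illustration, not part of the OPUS-MT tooling: the function name `with_target_label` and the `VALID_TARGETS` subset are assumptions made for the example, and the set covers only a handful of the labels listed above.

```python
import re

# Hypothetical subset of the valid target labels listed above; extend as needed.
VALID_TARGETS = {"afr", "deu", "eng", "fry", "gos", "ltz", "nds", "nld", "sco", "yid"}

# Matches one leading >>id<< token plus surrounding whitespace.
_LABEL = re.compile(r"^\s*>>[^<>\s]+<<\s*")


def with_target_label(text: str, lang: str) -> str:
    """Return `text` prefixed with the >>lang<< token the model expects."""
    if lang not in VALID_TARGETS:
        raise ValueError(f"unknown target language label: {lang!r}")
    # Drop a label that is already present so the token is never doubled.
    bare = _LABEL.sub("", text, count=1).strip()
    return f">>{lang}<< {bare}"


print(with_target_label("Red keinen Quatsch.", "nds"))
print(with_target_label(">>eng<< Red keinen Quatsch.", "nld"))
```

Rejecting unknown labels early is a design choice for this sketch: it surfaces a typo in the language ID before any text reaches the model.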

Uses

This model can be used for translation and text-to-text generation.

Risks, Limitations and Biases

CONTENT WARNING: Readers should be aware that the model is trained on various public data sets that may contain content that is disturbing, offensive, and can propagate historical and current stereotypes.

Significant research has explored bias and fairness issues with language models (see, e.g., Sheng et al. (2021) and Bender et al. (2021)).

How to Get Started With the Model

A short code example:

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>nds<< Red keinen Quatsch.",
    ">>eng<< Findet ihr das nicht etwas übereilt?"
]

model_name = "Helsinki-NLP/opus-mt-tc-big-gmw-gmw"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))

for t in translated:
    print( tokenizer.decode(t, skip_special_tokens=True) )

# expected output:
#     Kiek ok bi: Rott.
#     Aren't you in a hurry?

You can also use OPUS-MT models with the transformers pipelines, for example:

from transformers import pipeline
pipe = pipeline("translation", model="Helsinki-NLP/opus-mt-tc-big-gmw-gmw")
print(pipe(">>nds<< Red keinen Quatsch."))

# expected output: Kiek ok bi: Rott.
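Because the target language is selected by the prefix token, one source sentence can be sent to several target languages in a single batch. The following is a rough sketch of that idea. The helper names `build_inputs` and `translate_to_many` are hypothetical, and the checkpoint name is assumed to be the same one used in the pipeline example above.

```python
from typing import Dict, List, Sequence


def build_inputs(sentence: str, targets: Sequence[str]) -> List[str]:
    """Create one prefixed copy of `sentence` per target language label."""
    return [f">>{lang}<< {sentence}" for lang in targets]


def translate_to_many(sentence: str, targets: Sequence[str]) -> Dict[str, str]:
    """Translate `sentence` into every language in `targets` with one generate call."""
    # Imported lazily so build_inputs stays usable without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    model_name = "Helsinki-NLP/opus-mt-tc-big-gmw-gmw"  # assumed checkpoint name
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    batch = tokenizer(build_inputs(sentence, targets), return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)
    return dict(zip(targets, decoded))
```

Calling `translate_to_many("Findet ihr das nicht etwas übereilt?", ["eng", "nld", "afr"])` downloads the checkpoint on first use and returns a dictionary keyed by target label. Translation quality will differ by language pair, as the evaluation scores indicate.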

Training

Data: opusTCv20210807 (source)
Pre-processing: SentencePiece (spm32k,spm32k)
Model Type: transformer-big
Original MarianNMT Model: opusTCv20210807_transformer-big_2022-08-11.zip
Training Scripts: GitHub Repo

Evaluation

test set translations: opusTCv20210807_transformer-big_2022-08-11.test.txt
test set scores: opusTCv20210807_transformer-big_2022-08-11.eval.txt
benchmark results: benchmark_results.txt
benchmark output: benchmark_translations.zip

langpair	testset	chr-F	BLEU	#sent	#words
afr-deu	tatoeba-test-v2020-07-28-v2021-08-07	0.68633	50.3	1583	9105
afr-eng	tatoeba-test-v2020-07-28-v2021-08-07	0.70502	56.4	1374	9622
afr-nld	tatoeba-test-v2020-07-28-v2021-08-07	0.71500	55.5	1056	6710
deu-afr	tatoeba-test-v2020-07-28-v2021-08-07	0.70191	54.2	1583	9507
deu-deu	tatoeba-test-v2020-07-28-v2021-08-07	0.57304	34.6	2500	20797
deu-eng	tatoeba-test-v2020-07-28-v2021-08-07	0.65919	48.4	17565	149415
deu-nds	tatoeba-test-v2020-07-28-v2021-08-07	0.48028	23.2	9999	76119
deu-nld	tatoeba-test-v2020-07-28-v2021-08-07	0.71366	54.4	10218	75208
deu-yid	tatoeba-test-v2020-07-28-v2021-08-07	9.234	0.4	853	5353
eng-afr	tatoeba-test-v2020-07-28-v2021-08-07	0.71940	56.4	1374	10314
eng-deu	tatoeba-test-v2020-07-28-v2021-08-07	0.62912	41.8	17565	151539
eng-eng	tatoeba-test-v2020-07-28-v2021-08-07	0.80136	66.3	12062	115099
eng-nld	tatoeba-test-v2020-07-28-v2021-08-07	0.70929	54.3	12696	91769
eng-yid	tatoeba-test-v2020-07-28-v2021-08-07	9.648	0.4	2483	16388
fry-eng	tatoeba-test-v2020-07-28-v2021-08-07	0.40304	24.5	220	1573
fry-nld	tatoeba-test-v2020-07-28-v2021-08-07	0.54939	40.5	260	1854
gos-deu	tatoeba-test-v2020-07-28-v2021-08-07	0.45302	25.4	207	1168
gos-eng	tatoeba-test-v2020-07-28-v2021-08-07	0.37587	23.9	1154	5634
gos-nld	tatoeba-test-v2020-07-28-v2021-08-07	0.45701	26.1	1852	9902
hrx-deu	tatoeba-test-v2020-07-28-v2021-08-07	0.51840	30.0	471	2805
hrx-eng	tatoeba-test-v2020-07-28-v2021-08-07	0.42778	29.2	221	1235
ltz-deu	tatoeba-test-v2020-07-28-v2021-08-07	0.37005	21.0	347	2208
ltz-eng	tatoeba-test-v2020-07-28-v2021-08-07	0.37764	30.1	293	1840
ltz-nld	tatoeba-test-v2020-07-28-v2021-08-07	0.32392	26.4	292	1685
multi-multi	tatoeba-test-v2020-07-28-v2021-08-07	0.59400	40.4	10000	74505
nds-deu	tatoeba-test-v2020-07-28-v2021-08-07	0.63898	45.5	9999	74544
nds-eng	tatoeba-test-v2020-07-28-v2021-08-07	0.55112	38.4	2500	17584
nds-nld	tatoeba-test-v2020-07-28-v2021-08-07	0.66676	49.8	1657	11489
nld-afr	tatoeba-test-v2020-07-28-v2021-08-07	0.76610	62.3	1056	6823
nld-deu	tatoeba-test-v2020-07-28-v2021-08-07	0.73047	56.7	10218	74121
nld-eng	tatoeba-test-v2020-07-28-v2021-08-07	0.73940	60.2	12696	89970
nld-fry	tatoeba-test-v2020-07-28-v2021-08-07	0.47959	31.0	260	1857
nld-nds	tatoeba-test-v2020-07-28-v2021-08-07	0.43743	20.0	1657	11711
nld-nld	tatoeba-test-v2020-07-28-v2021-08-07	0.63646	44.9	1000	7196
swg-deu	tatoeba-test-v2020-07-28-v2021-08-07	0.40319	16.3	1523	15630
yid-deu	tatoeba-test-v2020-07-28-v2021-08-07	6.304	0.1	853	5172
yid-eng	tatoeba-test-v2020-07-28-v2021-08-07	3.715	0.1	2483	15449
yid-yid	tatoeba-test-v2020-07-28-v2021-08-07	6.596	0.1	292	1802
deu-eng	newssyscomb2009	0.54992	28.2	502	11821
eng-deu	newssyscomb2009	0.53867	23.2	502	11271
deu-eng	news-test2008	0.54584	27.2	2051	49380
eng-deu	news-test2008	0.53204	23.7	2051	47427
deu-eng	newstest2009	0.53749	25.9	2525	65402
eng-deu	newstest2009	0.53283	22.9	2525	62816
deu-eng	newstest2010	0.58356	30.6	2489	61724
eng-deu	newstest2010	0.54886	25.8	2489	61511
deu-eng	newstest2011	0.54883	26.3	3003	74681
eng-deu	newstest2011	0.52712	23.1	3003	72981
deu-eng	newstest2012	0.56160	28.5	3003	72812
eng-deu	newstest2012	0.52662	23.3	3003	72886
deu-eng	newstest2013	0.57770	31.4	3000	64505
eng-deu	newstest2013	0.55774	27.8	3000	63737
deu-eng	newstest2014-deen	0.59826	33.2	3003	67337
eng-deu	newstest2014-deen	0.59441	29.6	3003	62964
deu-eng	newstest2015-ende	0.59660	33.4	2169	46443
eng-deu	newstest2015-ende	0.59889	32.3	2169	44260
deu-eng	newstest2016-ende	0.64736	39.8	2999	64126
eng-deu	newstest2016-ende	0.64429	38.3	2999	62670
deu-eng	newstest2017-ende	0.60933	35.2	3004	64399
eng-deu	newstest2017-ende	0.59258	30.7	3004	61291
deu-eng	newstest2018-ende	0.66796	42.6	2998	67013
eng-deu	newstest2018-ende	0.69605	46.5	2998	64276
deu-eng	newstest2019-deen	0.63766	39.8	2000	39282
eng-deu	newstest2019-ende	0.66880	43.3	1997	48969
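The score table is tab-separated, so it can be loaded with the standard library for quick comparisons. The sketch below copies four rows from the table above. The function names `load_scores` and `mean_bleu` are illustrative assumptions, not part of any OPUS-MT tool.

```python
import csv
import io
from statistics import mean

# Four rows copied from the table above (tab-separated, same column order).
TABLE = (
    "langpair\ttestset\tchr-F\tBLEU\t#sent\t#words\n"
    "deu-eng\tnewstest2017-ende\t0.60933\t35.2\t3004\t64399\n"
    "eng-deu\tnewstest2017-ende\t0.59258\t30.7\t3004\t61291\n"
    "deu-eng\tnewstest2018-ende\t0.66796\t42.6\t2998\t67013\n"
    "eng-deu\tnewstest2018-ende\t0.69605\t46.5\t2998\t64276\n"
)


def load_scores(tsv_text: str) -> list:
    """Parse the benchmark table into a list of typed dictionaries."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {
            "langpair": row["langpair"],
            "testset": row["testset"],
            "chrf": float(row["chr-F"]),
            "bleu": float(row["BLEU"]),
            "sentences": int(row["#sent"]),
            "words": int(row["#words"]),
        }
        for row in reader
    ]


def mean_bleu(scores: list, langpair: str) -> float:
    """Average BLEU over all test sets available for one language pair."""
    return mean(s["bleu"] for s in scores if s["langpair"] == langpair)


scores = load_scores(TABLE)
for pair in ("deu-eng", "eng-deu"):
    print(pair, round(mean_bleu(scores, pair), 1))
```

An unweighted mean treats every test set equally. Weighting by `sentences` or `words` would be a reasonable alternative when test sets differ widely in size.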

Citation Information

Publications: OPUS-MT – Building open translation services for the World and The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT (please cite both if you use this model).

@inproceedings{tiedemann-thottingal-2020-opus,
    title = "{OPUS}-{MT} {--} Building open translation services for the World",
    author = {Tiedemann, J{\"o}rg  and Thottingal, Santhosh},
    booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
    month = nov,
    year = "2020",
    address = "Lisboa, Portugal",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2020.eamt-1.61",
    pages = "479--480",
}

@inproceedings{tiedemann-2020-tatoeba,
    title = "The Tatoeba Translation Challenge {--} Realistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wmt-1.139",
    pages = "1174--1182",
}

Acknowledgements

The work is supported by the European Language Grid as pilot project 2866, by the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113), and the MeMAD project, funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No 780069. We are also grateful for the generous computational resources and IT infrastructure provided by CSC -- IT Center for Science, Finland.

Model conversion info

transformers version: 4.16.2
OPUS-MT git hash: c1980b5
port time: Sun Oct 8 14:39:59 EEST 2023
port machine: LM0-400-22516.local
