---
language:
- ru
- zh
tags:
- translation
license: mit
datasets:
- ccmatrix
- joefox/newstest-2017-2019-ru_zh
metrics:
- sacrebleu
---

# mBART-large Russian to Chinese and Chinese to Russian multilingual machine translation

This model is a fine-tuned checkpoint of [mBART-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt). `joefox/mbart-large-ru-zh-ru-many-to-many-mmt` is fine-tuned for ru-zh and zh-ru machine translation and translates directly between Russian and Chinese in either direction.

To translate into a target language, the target language id is forced as the first generated token. To do this, pass the `forced_bos_token_id` parameter to the `generate` method. This post in Russian gives more details.

Example: translating Russian to Chinese

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "joefox/mbart-large-ru-zh-ru-many-to-many-mmt", src_lang="ru_RU", tgt_lang="zh_CN"
)

src_text = "Съешь ещё этих мягких французских булок."

# translate Russian to Chinese
tokenizer.src_lang = "ru_RU"
encoded_ru = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_ru,
    forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]
)
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# 再吃一些法国面包。
```

Example: translating Chinese to Russian

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained(
    "joefox/mbart-large-ru-zh-ru-many-to-many-mmt", src_lang="zh_CN", tgt_lang="ru_RU"
)

src_text = "吃一些软的法国面包。"

# translate Chinese to Russian
tokenizer.src_lang = "zh_CN"
encoded_zh = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_zh,
    forced_bos_token_id=tokenizer.lang_code_to_id["ru_RU"]
)
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# Ешьте немного французского хлеба.
```

## Languages covered

Russian (ru_RU), Sinhala (si_LK), Chinese (zh_CN)
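Both examples above use mBART-50 language codes (`ru_RU`, `zh_CN`) rather than plain ISO codes, since `tokenizer.lang_code_to_id` is keyed by these codes. A minimal sketch of a lookup helper for picking the right code; the mapping and the function name are illustrative, not part of the model's or the `transformers` API:

```python
# Illustrative mapping from short language names to the mBART-50 codes
# used by this model. The helper is hypothetical, for demonstration only.
MBART_LANG_CODES = {"ru": "ru_RU", "zh": "zh_CN"}

def mbart_lang_code(lang: str) -> str:
    """Return the mBART-50 code whose token id is passed as forced_bos_token_id."""
    if lang not in MBART_LANG_CODES:
        raise ValueError(f"unsupported language: {lang!r}")
    return MBART_LANG_CODES[lang]

print(mbart_lang_code("zh"))  # zh_CN
```

With such a helper, `tokenizer.lang_code_to_id[mbart_lang_code("zh")]` yields the token id to pass to `generate` as `forced_bos_token_id`.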