metadata
language:
- ru
- zh
tags:
- translation
license: mit
datasets:
- ccmatrix
- joefox/newstest-2017-2019-ru_zh
metrics:
- sacrebleu
mBART-large Russian to Chinese and Chinese to Russian multilingual machine translation
This model is a fine-tuned checkpoint of mBART-large-50-many-to-many-mmt: joefox/mbart-large-ru-zh-ru-many-to-many-mmt is specialized for Russian-to-Chinese (ru-zh) and Chinese-to-Russian (zh-ru) machine translation.
The model translates directly between Russian and Chinese in both directions. To translate into a target language, the target language id is forced as the first generated token: pass the forced_bos_token_id parameter to the generate method.
This post in Russian gives more details.
Example: translating Russian to Chinese
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt", src_lang="ru_RU", tgt_lang="zh_CN")
src_text = "Съешь ещё этих мягких французских булок."
# translate Russian to Chinese
tokenizer.src_lang = "ru_RU"
encoded_ru = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_ru,
    forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]  # force Chinese as the first generated token
)
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# ['再吃一些法国面包。']
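As an alternative to calling generate directly, the same checkpoint can be run through the transformers translation pipeline, which sets the language codes for you. This is a minimal sketch, assuming a transformers version whose translation pipeline accepts src_lang and tgt_lang:
from transformers import pipeline

# The pipeline forces the target language token internally for mBART-50 tokenizers.
translator = pipeline(
    "translation",
    model="joefox/mbart-large-ru-zh-ru-many-to-many-mmt",
    src_lang="ru_RU",
    tgt_lang="zh_CN",
)
print(translator("Съешь ещё этих мягких французских булок."))
# [{'translation_text': '...'}]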
Example: translating Chinese to Russian
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt", src_lang="zh_CN", tgt_lang="ru_RU")
src_text = "吃一些软的法国面包。"
# translate Chinese to Russian
tokenizer.src_lang = "zh_CN"
encoded_zh = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_zh,
    forced_bos_token_id=tokenizer.lang_code_to_id["ru_RU"]  # force Russian as the first generated token
)
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# ['Ешьте немного французского хлеба.']
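The metadata above lists sacrebleu as the evaluation metric. Model outputs can be scored the same way with the sacrebleu package; this is a minimal sketch, and the hypothesis/reference sentences below are placeholders rather than the newstest-2017-2019-ru_zh data:
import sacrebleu

# Placeholder hypothesis and reference; in practice these come from the test set.
hypotheses = ["再吃一些法国面包。"]
references = [["再吃一些软的法国面包。"]]  # one reference stream, aligned with hypotheses

# Chinese output has no spaces, so use sacrebleu's character-based "zh" tokenizer.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(bleu.score)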
Languages covered
Russian (ru_RU), Chinese (zh_CN)
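These language codes map to the token ids that generate expects for forced_bos_token_id, and they can be inspected directly on the tokenizer. A minimal sketch using the same MBart50TokenizerFast as above:
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")

# lang_code_to_id maps each mBART-50 language code to its special-token id,
# which is the value passed as forced_bos_token_id.
for code in ("ru_RU", "zh_CN"):
    print(code, tokenizer.lang_code_to_id[code])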