---
language:
- ru
- zh
tags:
- translation
license: mit
datasets:
- ccmatrix
- joefox/newstest-2017-2019-ru_zh
metrics:
- sacrebleu
---
|
|
|
# mBART-large Russian to Chinese and Chinese to Russian multilingual machine translation |
|
|
|
This model is a fine-tuned checkpoint of [mBART-large-50-many-to-many-mmt](https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt). `joefox/mbart-large-ru-zh-ru-many-to-many-mmt` is fine-tuned for Russian-to-Chinese (ru-zh) and Chinese-to-Russian (zh-ru) machine translation.
|
|
The model translates directly between Russian and Chinese in both directions. To translate into the target language, the target language id has to be forced as the first generated token: pass the `forced_bos_token_id` parameter to the `generate` method.
|
|
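For reference, the language codes and their corresponding token ids are exposed through the tokenizer's `lang_code_to_id` mapping, which is what the examples below rely on; a minimal sketch of looking them up:

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")

# `forced_bos_token_id` expects the token id of the target language code
print(tokenizer.lang_code_to_id["ru_RU"])  # id of the Russian language token
print(tokenizer.lang_code_to_id["zh_CN"])  # id of the Chinese language token
```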
|
This [post in Russian](https://habr.com/ru/post/721330/) gives more details. |
|
|
Example: translating Russian to Chinese
|
|
|
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")

src_text = "Съешь ещё этих мягких французских булок."

# translate Russian to Chinese
tokenizer.src_lang = "ru_RU"
encoded_ru = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_ru,
    forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]
)
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

print(result)
# 再吃一些法国面包。
```
|
|
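The same call also works for a batch of sentences. Below is a minimal sketch that reuses the `model` and `tokenizer` loaded above; the second input sentence and the generation settings (`num_beams`, `max_length`) are illustrative choices, not part of the original example:

```python
# Batch translation, Russian to Chinese (the input sentences are illustrative)
src_texts = [
    "Съешь ещё этих мягких французских булок.",
    "Сегодня хорошая погода.",
]

tokenizer.src_lang = "ru_RU"
encoded_ru = tokenizer(src_texts, return_tensors="pt", padding=True)
generated_tokens = model.generate(
    **encoded_ru,
    forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"],
    num_beams=5,      # beam search width (illustrative)
    max_length=128,   # limit on generated length (illustrative)
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```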
|
Example: translating Chinese to Russian
|
|
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")

src_text = "吃一些软的法国面包。"

# translate Chinese to Russian
tokenizer.src_lang = "zh_CN"
encoded_zh = tokenizer(src_text, return_tensors="pt")
generated_tokens = model.generate(
    **encoded_zh,
    forced_bos_token_id=tokenizer.lang_code_to_id["ru_RU"]
)
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
# Ешьте немного французского хлеба.
```
|
|
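If a GPU is available, inference can be moved onto it in the standard Transformers way; a minimal sketch, reusing the `model`, `tokenizer` and `src_text` from the example above:

```python
import torch

# Pick a device and move the model there
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

tokenizer.src_lang = "zh_CN"
# BatchEncoding.to() moves the input tensors to the same device as the model
encoded_zh = tokenizer(src_text, return_tensors="pt").to(device)
generated_tokens = model.generate(
    **encoded_zh,
    forced_bos_token_id=tokenizer.lang_code_to_id["ru_RU"]
)
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```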
|
## Languages covered |
|
|
|
Russian (ru_RU), Chinese (zh_CN) |
|
|
|
|