|
---
|
|
language:
|
|
- zh
|
|
- ja
|
|
- en
|
|
|
|
tags:
|
|
- translation
|
|
|
|
widget:
|
|
- text: "ja2zh: 吾輩は猫である。名前はまだ無い。"
|
|
|
|
license: cc-by-nc-sa-4.0
|
|
---
|
|
|
|
This model is finetuned from [mt5-base](https://huggingface.co/google/mt5-base).
|
|
|
|
The model vocabulary is trimmed to ~1/3 by selecting top 85000 tokens in the training data. The code to trim the vocabulary can be found [here](https://gist.github.com/K024/4a100a0f4f4b07208958e0f3244da6ad).
|
|
|
|
Usage:
|
|
```python
|
|
from transformers import (
|
|
T5Tokenizer,
|
|
MT5ForConditionalGeneration,
|
|
Text2TextGenerationPipeline,
|
|
)
|
|
|
|
path = "K024/mt5-zh-ja-en-trimmed"
|
|
pipe = Text2TextGenerationPipeline(
|
|
model=MT5ForConditionalGeneration.from_pretrained(path),
|
|
tokenizer=T5Tokenizer.from_pretrained(path),
|
|
)
|
|
|
|
sentence = "ja2zh: 吾輩は猫である。名前はまだ無い。"
|
|
res = pipe(sentence, max_length=100, num_beams=4)
|
|
res[0]['generated_text']
|
|
```
|
|
|
|
Training data:
|
|
```
|
|
wikimedia-en-ja
|
|
wikimedia-en-zh
|
|
wikimedia-ja-zh
|
|
wikititles-ja-en
|
|
wikititles-zh-en
|
|
wikimatrix-ja-zh
|
|
news-commentary-en-ja
|
|
news-commentary-en-zh
|
|
news-commentary-ja-zh
|
|
ted2020-en-ja
|
|
ted2020-en-zh
|
|
ted2020-ja-zh
|
|
```
|
|
|
|
License: [![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]
|
|
|
|
[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
|
|
[cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png
|
|
|