K024
/

mt5-zh-ja-en-trimmed

text2text-generation

Inference Endpoints

Model card Files Files and versions Community

mt5-zh-ja-en-trimmed / README.md

K024's picture

Update README.md

6da3352 over 2 years ago

|

history blame contribute delete

1.39 kB

	---
	language:
	- zh
	- ja
	- en

	tags:
	- translation

	widget:
	- text: "ja2zh: 吾輩は猫である。名前はまだ無い。"

	license: cc-by-nc-sa-4.0
	---

	This model is finetuned from [mt5-base](https://huggingface.co/google/mt5-base).

	The model vocabulary is trimmed to ~1/3 by selecting top 85000 tokens in the training data. The code to trim the vocabulary can be found [here](https://gist.github.com/K024/4a100a0f4f4b07208958e0f3244da6ad).

	Usage:
	```python
	from transformers import (
	T5Tokenizer,
	MT5ForConditionalGeneration,
	Text2TextGenerationPipeline,
	)

	path = "K024/mt5-zh-ja-en-trimmed"
	pipe = Text2TextGenerationPipeline(
	model=MT5ForConditionalGeneration.from_pretrained(path),
	tokenizer=T5Tokenizer.from_pretrained(path),
	)

	sentence = "ja2zh: 吾輩は猫である。名前はまだ無い。"
	res = pipe(sentence, max_length=100, num_beams=4)
	res[0]['generated_text']
	```

	Training data:
	```
	wikimedia-en-ja
	wikimedia-en-zh
	wikimedia-ja-zh
	wikititles-ja-en
	wikititles-zh-en
	wikimatrix-ja-zh
	news-commentary-en-ja
	news-commentary-en-zh
	news-commentary-ja-zh
	ted2020-en-ja
	ted2020-en-zh
	ted2020-ja-zh
	```

	License: [![CC BY-NC-SA 4.0][cc-by-nc-sa-image]][cc-by-nc-sa]

	[cc-by-nc-sa]: http://creativecommons.org/licenses/by-nc-sa/4.0/
	[cc-by-nc-sa-image]: https://licensebuttons.net/l/by-nc-sa/4.0/88x31.png