---
language:
- en
- ja
license: mit
datasets:
- snow_simplified_japanese_corpus
tags:
- ja
- japanese
- tokenizer
widget:
- text: "誰が一番に着くか私には分かりません。"
---
|
|
|
# Japanese Dummy Tokenizer
|
|
|
Repository containing a dummy Japanese tokenizer trained on the `snow_simplified_japanese_corpus` dataset. The tokenizer was trained with Hugging Face `datasets` in streaming mode, so the corpus never needs to be fully downloaded.
|
|
|
## Intended uses & limitations
|
|
|
You can use this tokenizer to tokenize Japanese sentences.
|
|
|
## How to use it
|
|
|
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ybelkada/japanese-dummy-tokenizer")
```
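Once loaded, the tokenizer exposes the standard `transformers` tokenizer API. A minimal sketch using the widget sentence above (the exact subwords produced depend on the trained vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ybelkada/japanese-dummy-tokenizer")

sentence = "誰が一番に着くか私には分かりません。"
tokens = tokenizer.tokenize(sentence)   # list of subword strings
ids = tokenizer.encode(sentence)        # list of token ids
decoded = tokenizer.decode(ids, skip_special_tokens=True)
```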
|
|
|
## How to train the tokenizer
|
|
|
Check the file `tokenizer.py`; you can freely adapt it to other datasets. This tokenizer is based on the tokenizer from `csebuetnlp/mT5_multilingual_XLSum`.