---
language:
- bn
thumbnail: null
tags:
- null
license: apache-2.0
datasets:
- oscar
- wikipedia
metrics:
- null
---
# [WIP] Albert Bengali - dev version

## Model description
For the moment, only the tokenizer is available. The tokenizer is based on SentencePiece with the Unigram language model segmentation algorithm.
Taking into account certain characteristics of the language, we chose the following behaviour (illustrated in the sketch after this list):
- the tokenizer lowercases all text, because Bengali is written in a unicameral script (there is no distinction between upper and lower case);
- a sentence piece cannot cross a word boundary, because words are separated by whitespace in Bengali.
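
As a quick illustration of these two choices, the hypothetical snippet below (using the `SaulLu/albert-bn-dev` tokenizer from the usage example further down) tokenizes a mixed Bengali/Latin string: only the embedded Latin word is affected by lowercasing, and no piece spans the whitespace between two words.

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('SaulLu/albert-bn-dev')

# "Nintendo" is embedded Latin text: lowercasing only affects such text,
# since the Bengali script has no case.
pieces = tokenizer.tokenize("Nintendo নিনটেন্ডো ভিডিও গেম")
print(pieces)  # every piece belongs to exactly one whitespace-separated word
```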
## Intended uses & limitations
This tokenizer is adapted to the Bengali language. You can use it to pre-train an Albert model on Bengali text.
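
As a rough, hypothetical starting point (none of this is prescribed by the card), a fresh Albert model can be initialised with a vocabulary size matching this tokenizer before pre-training:

```python
from transformers import AlbertConfig, AlbertForMaskedLM, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('SaulLu/albert-bn-dev')

# Match the embedding matrix to the tokenizer's 32k-piece vocabulary;
# all other architecture hyper-parameters are left at the transformers defaults.
config = AlbertConfig(vocab_size=tokenizer.vocab_size)
model = AlbertForMaskedLM(config)
```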
### How to use
To tokenize:

```python
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('SaulLu/albert-bn-dev')
text = "পোকেমন জাপানী ভিডিও গেম কোম্পানি নিনটেন্ডো কর্তৃক প্রকাশিত একটি মিডিয়া ফ্র্যাঞ্চাইজি।"
encoded_input = tokenizer(text, return_tensors='pt')
```
### Limitations and bias
## Training data
The tokenizer was trained on a random subset of 4M sentences from Bengali OSCAR and Bengali Wikipedia.
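
The card does not show how the text files used below were produced; a plausible preparation step with the `datasets` library might look like this (the `unshuffled_deduplicated_bn` config name is an assumption, since the card only names the OSCAR dataset):

```python
import os
from datasets import load_dataset

os.makedirs("./dataset", exist_ok=True)

# Dump Bengali OSCAR to the plain-text file referenced by the SentencePiece config below.
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")
with open("./dataset/oscar_bn.txt", "w", encoding="utf-8") as f:
    for record in oscar_bn:
        f.write(record["text"].replace("\n", " ") + "\n")

# A Bengali Wikipedia dump can be exported the same way to ./dataset/wikipedia_bn.txt.
```

The random 4M-sentence subset itself is drawn by SentencePiece via the `input_sentence_size` and `shuffle_input_sentence` options in the configuration below.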
## Training procedure

### Tokenizer
The tokenizer was trained with SentencePiece on an 8-thread Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz with 16 GB of RAM and 36 GB of swap.
```python
import sentencepiece as spm

config = {
    "input": "./dataset/oscar_bn.txt,./dataset/wikipedia_bn.txt",
    "input_format": "text",
    "model_type": "unigram",
    "vocab_size": 32000,
    "self_test_sample_size": 0,
    "character_coverage": 0.9995,
    "shuffle_input_sentence": True,
    "seed_sentencepiece_size": 1000000,
    "shrinking_factor": 0.75,
    "num_threads": 8,
    "num_sub_iterations": 2,
    "max_sentencepiece_length": 16,
    "max_sentence_length": 4192,
    "split_by_unicode_script": True,
    "split_by_number": True,
    "split_digits": True,
    "control_symbols": "[MASK]",
    "byte_fallback": False,
    "vocabulary_output_piece_score": True,
    "normalization_rule_name": "nmt_nfkc_cf",  # NFKC normalisation + case folding (lowercasing)
    "add_dummy_prefix": True,
    "remove_extra_whitespaces": True,
    "hard_vocab_limit": True,
    "unk_id": 1,
    "bos_id": 2,
    "eos_id": 3,
    "pad_id": 0,
    "bos_piece": "[CLS]",
    "eos_piece": "[SEP]",
    "train_extremely_large_corpus": True,
    "split_by_whitespace": True,  # pieces never cross word boundaries
    "model_prefix": "./spiece",
    "input_sentence_size": 4000000,  # random 4M-sentence subset
    "user_defined_symbols": "(,),-,.,–,£,।",
}

spm.SentencePieceTrainer.train(**config)
```
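
The card stops after training. As a hypothetical follow-up step (not shown in the card), the resulting `./spiece.model` file can be wrapped in an `AlbertTokenizer`, whose default `[CLS]`/`[SEP]`/`[MASK]` special tokens match the `bos_piece`, `eos_piece` and `control_symbols` values set above:

```python
from transformers import AlbertTokenizer

# keep_accents=True is an assumption: the default (False) strips combining marks,
# which would damage Bengali vowel signs.
tokenizer = AlbertTokenizer(
    vocab_file="./spiece.model",
    do_lower_case=True,
    keep_accents=True,
)
tokenizer.save_pretrained("./albert-bn-dev")
```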