# Malayalam Unigram Tokenizer
A Unigram (SentencePiece-style) tokenizer trained on a Malayalam text corpus using the Hugging Face `tokenizers` library, with Metaspace pre-tokenization and NFC normalization for correct handling of Malayalam Unicode conjuncts.
## Details
| Property | Value |
|---|---|
| Algorithm | Unigram (SentencePiece) |
| Vocabulary size | 16,000 |
| Pre-tokenizer | Metaspace (▁) |
| Normalizer | NFC + Strip |
| Special tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` |
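The Metaspace pre-tokenizer replaces spaces with the ▁ (U+2581) marker and segments on word boundaries before the Unigram model runs. A minimal standard-library sketch of that behaviour (illustrative only, not the library's actual implementation):

```python
def metaspace_pretokenize(text: str, replacement: str = "\u2581") -> list[str]:
    """Simplified Metaspace: mark word starts with U+2581, then split into words."""
    # Mirror Metaspace's default of treating the start of text as word-initial,
    # then map every space to the replacement marker.
    marked = replacement + text.replace(" ", replacement)
    # Split on the marker and re-attach it, so each piece begins with U+2581.
    return [replacement + piece for piece in marked.split(replacement) if piece]

print(metaspace_pretokenize("മലയാളം ഒരു ഭാഷ"))
# → ['▁മലയാളം', '▁ഒരു', '▁ഭാഷ']
```

Because the marker survives into the vocabulary, detokenization is lossless: concatenate the pieces and map ▁ back to a space.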
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-unigram-tokenizer")

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"

# Tokenize into subword strings
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to input IDs and attention mask as PyTorch tensors
encoded = tokenizer(text, return_tensors="pt")
print(encoded)
```
## Notes
- Use Metaspace (not ByteLevel) pre-tokenization — ByteLevel splits Malayalam's multibyte UTF-8 characters into individual bytes, so intermediate tokens are often not valid Malayalam text.
- NFC normalization ensures Malayalam conjuncts formed via ZWJ/ZWNJ are handled consistently.
- Trained and published from the `smc/malayalam-tokenizer` repository.
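Both of these points can be checked with the standard library alone. The snippet below is a small illustration using ordinary Malayalam code points: each Malayalam character is 3 bytes in UTF-8, and NFC composes canonically equivalent vowel-sign sequences into a single code point.

```python
import unicodedata

# Every Malayalam code point (U+0D00–U+0D7F) encodes to 3 bytes in UTF-8,
# so byte-level pre-tokenization can emit up to 3 tokens per character.
ma = "\u0d2e"  # മ MALAYALAM LETTER MA
print(len(ma.encode("utf-8")))  # 3

# NFC composes canonically equivalent sequences: VOWEL SIGN E (U+0D46)
# followed by VOWEL SIGN AA (U+0D3E) becomes VOWEL SIGN O (U+0D4A).
decomposed = "\u0d15\u0d46\u0d3e"  # ക + െ + ാ
composed = unicodedata.normalize("NFC", decomposed)
print(composed == "\u0d15\u0d4a")  # True: ക + ൊ
```

Running NFC before training and before inference guarantees both sides see the same canonical byte sequence for visually identical text.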