Malayalam Unigram Tokenizer

A Unigram (SentencePiece-style) tokenizer trained on a Malayalam text corpus. It was trained with the HuggingFace tokenizers library, using Metaspace pre-tokenization and NFC normalization for correct handling of Malayalam Unicode sequences such as conjuncts.

Details

Property         Value
Algorithm        Unigram (SentencePiece)
Vocabulary size  16,000
Pre-tokenizer    Metaspace (replacement character ▁)
Normalizer       NFC + Strip
Special tokens   <s>, </s>, <unk>, <pad>, <mask>
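The configuration above can be reproduced with a short training script. This is a minimal sketch using the tokenizers library: the inline sample corpus and file names are stand-ins for illustration, not the actual training data.

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import Unigram

# Tiny stand-in corpus for illustration only; the published
# tokenizer was trained on a full Malayalam text corpus.
with open("sample_corpus.txt", "w", encoding="utf-8") as f:
    f.write("മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്\n" * 100)

tokenizer = Tokenizer(Unigram())

# NFC + Strip normalization, Metaspace pre-tokenization,
# matching the Details table above.
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFC(), normalizers.Strip()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=16000,  # published vocabulary size
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
    unk_token="<unk>",
)
tokenizer.train(["sample_corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```

On a tiny corpus like this the trained vocabulary will be far smaller than 16,000; the parameter only sets an upper bound.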

Usage

from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-unigram-tokenizer")

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"

# Split into subword pieces
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to input IDs as PyTorch tensors
encoded = tokenizer(text, return_tensors="pt")
print(encoded)

Notes

  • Use Metaspace (not ByteLevel) pre-tokenization: ByteLevel splits Malayalam's multibyte UTF-8 characters into individual byte tokens, producing fragments that are not valid Malayalam text.
  • NFC normalization composes canonically equivalent Malayalam sequences (such as two-part vowel signs) into a single form, so visually identical text always tokenizes identically; ZWJ/ZWNJ used in conjuncts pass through normalization unchanged.
  • Trained and published from the smc/malayalam-tokenizer repository.
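The first two notes can be verified with the standard library alone. The snippet below shows why byte-level splitting is harmful for Malayalam (every letter is three UTF-8 bytes) and what NFC composition does to a canonically decomposed vowel sign; it is a self-contained illustration, independent of the tokenizer itself.

```python
import unicodedata

# Each Malayalam letter occupies 3 bytes in UTF-8, so byte-level
# pre-tokenization would shatter one letter into 3 byte tokens.
assert len("മ".encode("utf-8")) == 3

# The two-part vowel sign O can be written as U+0D46 + U+0D3E;
# NFC composes it into the single code point U+0D4A, so both
# spellings tokenize identically after normalization.
decomposed = "ക\u0d46\u0d3e"   # 3 code points
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "ക\u0d4a"   # 2 code points
assert len(decomposed) == 3 and len(composed) == 2
```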