# Malayalam Unigram Tokenizer
A Unigram (SentencePiece-style) tokenizer trained on a Malayalam text corpus using the Hugging Face `tokenizers` library, with Metaspace pre-tokenization and NFC normalization for correct handling of Malayalam Unicode conjuncts.
## Details
| Property | Value |
|---|---|
| Algorithm | Unigram (SentencePiece) |
| Vocabulary size | 16,000 |
| Pre-tokenizer | Metaspace (▁) |
| Normalizer | NFC + Strip |
| Special tokens | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` |
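The Metaspace pre-tokenizer replaces spaces with the ▁ (U+2581) marker and segments on word boundaries before the Unigram model runs. A minimal standard-library sketch of that behaviour (illustrative only, not the library's actual implementation):

```python
def metaspace_pretokenize(text: str, replacement: str = "\u2581") -> list[str]:
    """Simplified Metaspace: mark word starts with U+2581, then split into words."""
    # Mirror Metaspace's default of treating the start of text as word-initial,
    # then map every space to the replacement marker.
    marked = replacement + text.replace(" ", replacement)
    # Split on the marker and re-attach it, so each piece begins with U+2581.
    return [replacement + piece for piece in marked.split(replacement) if piece]

print(metaspace_pretokenize("മലയാളം ഒരു ഭാഷ"))
# → ['▁മലയാളം', '▁ഒരു', '▁ഭാഷ']
```

Because the marker survives into the vocabulary, detokenization is lossless: concatenate the pieces and map ▁ back to a space.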
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("smc/malayalam-unigram-tokenizer")

text = "മലയാളം ഒരു ദ്രാവിഡ ഭാഷയാണ്"

# Tokenize into subword strings
tokens = tokenizer.tokenize(text)
print(tokens)

# Encode to input IDs and attention mask as PyTorch tensors
encoded = tokenizer(text, return_tensors="pt")
print(encoded)
```
## Notes
- Use Metaspace (not ByteLevel) pre-tokenization — ByteLevel splits Malayalam's multibyte UTF-8 characters into individual bytes, so intermediate tokens are often not valid Malayalam text.
- NFC normalization ensures Malayalam conjuncts formed via ZWJ/ZWNJ are handled consistently.
- Trained and published from the `smc/malayalam-tokenizer` repository.
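Both of these points can be checked with the standard library alone. The snippet below is a small illustration using ordinary Malayalam code points: each Malayalam character is 3 bytes in UTF-8, and NFC composes canonically equivalent vowel-sign sequences into a single code point.

```python
import unicodedata

# Every Malayalam code point (U+0D00–U+0D7F) encodes to 3 bytes in UTF-8,
# so byte-level pre-tokenization can emit up to 3 tokens per character.
ma = "\u0d2e"  # മ MALAYALAM LETTER MA
print(len(ma.encode("utf-8")))  # 3

# NFC composes canonically equivalent sequences: VOWEL SIGN E (U+0D46)
# followed by VOWEL SIGN AA (U+0D3E) becomes VOWEL SIGN O (U+0D4A).
decomposed = "\u0d15\u0d46\u0d3e"  # ക + െ + ാ
composed = unicodedata.normalize("NFC", decomposed)
print(composed == "\u0d15\u0d4a")  # True: ക + ൊ
```

Running NFC before training and before inference guarantees both sides see the same canonical byte sequence for visually identical text.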