Romanized Sinhala Tokenizer

This tokenizer is specifically trained for Romanized Sinhala text (Sinhala written in the Latin alphabet).

Details

  • Based on mBART's tokenization approach (BPE)
  • Trained on the Swabhasha Romanized Sinhala Dataset
  • Includes custom language code "si_rom" for Romanized Sinhala (see the sanity check after this list)
  • Compatible with sequence-to-sequence models
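
As a quick sanity check that the custom language code is registered, you can look up its id in the tokenizer's vocabulary. This is a minimal sketch that assumes the code is stored as the literal token "si_rom"; pass your Hugging Face access token only if the repository requires it.

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "deshanksuman/romanized-sinhala-tokenizer"
    # pass token="YOUR_HF_TOKEN" here if the repository is gated
)

# A registered token maps to an id different from the unknown-token id.
print(tokenizer.convert_tokens_to_ids("si_rom"))
print(tokenizer.unk_token_id)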

Usage

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "deshanksuman/romanized-sinhala-tokenizer",
    token="hf Token"
)

# Just tokenize and get tensors
encoded = tokenizer("api ada mkda krnne", return_tensors="pt")
print(encoded)

# To see tokens in text form
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
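
The ids round-trip back to text with decode, and the tokenizer can be paired with a sequence-to-sequence model in the usual way. The checkpoint name in the commented lines below is a placeholder assumption, not a released model.

# Decode the ids back to Romanized Sinhala text
print(tokenizer.decode(encoded["input_ids"][0], skip_special_tokens=True))

# Hedged sketch: plug the tokenizer into a seq2seq model of your choice.
# "path/to/your-seq2seq-model" is a placeholder checkpoint name.
# from transformers import AutoModelForSeq2SeqLM
# model = AutoModelForSeq2SeqLM.from_pretrained("path/to/your-seq2seq-model")
# outputs = model.generate(**encoded)
# print(tokenizer.decode(outputs[0], skip_special_tokens=True))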

Citation

@article{sumanathilaka2025swa,
  title={Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources},
  author={Sumanathilaka, Deshan and Perera, Sameera and Dharmasiri, Sachithya and Athukorala, Maneesha and Herath, Anuja Dilrukshi and Dias, Rukshan and Gamage, Pasindu and Weerasinghe, Ruvan and Priyadarshana, YHPP},
  journal={arXiv preprint arXiv:2507.09245},
  year={2025}
}