Xenova/the-tokenizer-playground · Bug in token display for complex scripts

Feb 22

The color-based token display has a bug for complex scripts (Indic scripts, Arabic, etc.).
For example:

I analysed the text "സാങ്കേതികമായി നന്ദി വിജയിച്ചെങ്കിലും സാമ്പത്തിക കാരണങ്ങളാൽ പദ്ധതി നിർത്തിവെച്ചു." using the Gemma model. The content is in Malayalam, an Indic language.

As you can observe, the number of tokens rendered does not match the actual number of tokens. Here is the simplest case that shows the problem:

The input text is "സേ"

The two tokens are "സ" and "േ", but rendered together they combine into a single form, "സേ". Probably because the color highlighting does not account for this, only one color is applied, and both tokens appear in green.
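To make the mismatch concrete, here is a minimal Python sketch that inspects the two tokens above. It is not the playground's code: `describe` and `attaches_to_previous` are hypothetical helper names, and the check relies only on the standard `unicodedata` module. It shows that the text is two separate characters, a base letter followed by a dependent vowel sign in a combining-mark category, which is why a renderer draws them as one unit even though the tokenizer splits them.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]


def attaches_to_previous(token):
    """True if the token starts with a combining mark (general category M*).

    Such a token is drawn as part of the preceding character, so giving it
    its own colored box will not show up as a separate visual segment.
    """
    return bool(token) and unicodedata.category(token[0]).startswith("M")


tokens = ["സ", "േ"]  # the two tokens from the report

for row in describe("".join(tokens)):
    print(row)

# One flag per token: does it visually merge into the token before it?
print([attaches_to_previous(t) for t in tokens])
```

A display tool could use a check like this to detect tokens that will merge visually with their neighbour and handle them specially, though how the actual fix works is not stated in this thread.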

Xenova

Owner Feb 22

Thanks for the report! This should be fixed by this commit. PR should be merged soon.

Xenova

Owner Mar 4

Sorry for the delay - I forgot to respond with the update! It's fixed now :)

Xenova changed discussion status to closed Mar 4