Bug in token display for complex scripts
#5
by
santhosh
- opened
The color-based token display has a bug for complex scripts (Indic scripts, Arabic, etc.).
For example,
I analysed the text "സാങ്കേതികമായി നന്ദി വിജയിച്ചെങ്കിലും സാമ്പത്തിക കാരണങ്ങളാൽ പദ്ധതി നിർത്തിവെച്ചു." using the Gemma model. The content is in Malayalam, an Indic language.
As you can observe, the number of tokens rendered does not match the actual number of tokens. Here is the simplest case that shows the problem:
The input text is "സേ"
The two tokens are "സ" and "േ", but together they render as a single ligature, "സേ". Probably because of a bug in the color highlighting tool, only one color is used, so both tokens appear in green.
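To make the two-token case concrete, here is a minimal Python sketch that inspects the code points of "സേ" using only the standard library. It does not call the Gemma tokenizer or the highlighting tool; the `tokens` list simply hard-codes the two tokens reported above, and `starts_with_mark` is a hypothetical helper of mine, not part of any library. It checks whether a token begins with a combining mark, which is my guess at when a renderer would attach the token to the preceding one instead of drawing it separately.

```python
import unicodedata

# The input from this report, written as escapes so the code points are explicit.
text = "\u0d38\u0d47"            # "സേ"
tokens = ["\u0d38", "\u0d47"]    # the two tokens reported above (hard-coded, not tokenizer output)


def starts_with_mark(token: str) -> bool:
    """Hypothetical helper: True if the token's first code point is a
    combining mark (Unicode general category starting with 'M')."""
    return bool(token) and unicodedata.category(token[0]).startswith("M")


# Print each code point with its Unicode name and general category.
for tok in tokens:
    for ch in tok:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  {unicodedata.category(ch)}")

# One flag per token: does it visually attach to the previous token?
attaches = [starts_with_mark(t) for t in tokens]
print(attaches)
```

If the second token is flagged as starting with a combining mark, a highlighter that wraps each token in its own colored element may need to handle that token specially, since the shaping engine draws it together with the preceding consonant.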
Xenova
changed discussion status to
closed