Different behaviour of tokenizer

#5
by mizoru - opened

Hello, have you made changes to the tokenizer?
image.png
image.png
As you can see, the same strings result in longer token sequences with your tokenizer. The same thing happens if I use WhisperTokenizer. I'm looking for the reason behind this.

Okay, it's the timestamp tokens that broke.
image.png
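
For anyone hitting the same symptom, here is a minimal sketch of how the length difference can be narrowed down. It does not use the real tokenizers: `encode_with_timestamps` and `encode_without_timestamps` are hypothetical toy encoders I made up for illustration, and the `<|0.00|>` marker format is assumed. The point is only to show how a tokenizer that no longer treats timestamp markers as single tokens produces longer sequences, and how to find the first position where two encodings diverge.

```python
import re

# Assumed marker format for timestamps, e.g. <|0.00|> (illustration only).
TIMESTAMP = re.compile(r"<\|\d+\.\d{2}\|>")


def encode_with_timestamps(text):
    """Toy encoder (hypothetical): keeps each timestamp marker as one token."""
    tokens = []
    pos = 0
    for match in TIMESTAMP.finditer(text):
        # Plain text before the marker is split on whitespace.
        tokens.extend(text[pos:match.start()].split())
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens


def encode_without_timestamps(text):
    """Toy encoder (hypothetical): does not know the markers, so they get split up."""
    return re.findall(r"\w+|[^\w\s]", text)


def compare(text, enc_a, enc_b):
    """Return (len_a, len_b, index of first differing token or None)."""
    a, b = enc_a(text), enc_b(text)
    first_diff = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), None)
    if first_diff is None and len(a) != len(b):
        # One sequence is a strict prefix of the other.
        first_diff = min(len(a), len(b))
    return len(a), len(b), first_diff


if __name__ == "__main__":
    sample = "<|0.00|> hello world<|1.50|>"
    print(encode_with_timestamps(sample))
    print(encode_without_timestamps(sample))
    print(compare(sample, encode_with_timestamps, encode_without_timestamps))
```

With real tokenizers, the same `compare` helper should work if each encoder is wrapped to return a list of token strings or ids (for example via the tokenizer's `encode` method). Checking the first differing position on a string that contains timestamps is what points at the timestamp tokens rather than the ordinary text.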
