Bloom's tokenizer vocab looks like garbled characters
#216 by ShaneSue · opened
The tokenizer operates on bytes, so it's normal for the tokens to contain weird-looking characters. If your goal is to manually inspect individual tokens, you can convert them back to strings using the tokenizer's convert_tokens_to_string method.
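A minimal sketch of that workflow, assuming the transformers library and the bigscience/bloom-560m checkpoint as an example model id:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

text = "Bloom's tokenizer operates on bytes."
tokens = tokenizer.tokenize(text)

# Raw byte-level tokens often contain placeholder characters
# (e.g. 'Ġ' standing in for a leading space), so they look garbled.
print(tokens)

# convert_tokens_to_string maps the byte-level tokens back to readable text.
print(tokenizer.convert_tokens_to_string(tokens))
```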
I got it, thanks a lot
christopher changed discussion status to closed