Why add_prefix_space=false?
#5 · opened by hankcs
Hi, thank you for sharing your work, it's great!
I have a question regarding the BPE tokenizer. I noticed that its `add_prefix_space` is set to `false`, which means the same word will be tokenized differently depending on its position in the text. For example, consider the word "Hello" in the text:
```
Hello everyone. Hello world.
```

It will be tokenized as:

```
['[CLS]', 'Hello', 'Ġeveryone', '.', 'ĠHello', 'Ġworld', '.', '[SEP]']
```
This leads to a redundant vocabulary and conflicting semantics: a token without a prefix space usually indicates a subword continuation, yet here the sentence-initial "Hello" has no prefix space.
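For reference, the behavior is easy to reproduce with Hugging Face `transformers`. Here is a minimal sketch; I'm using `roberta-base` purely as a stand-in checkpoint (my assumption), since it also uses a byte-level BPE with `Ġ` space markers:

```python
from transformers import AutoTokenizer

# add_prefix_space=False: the sentence-initial word gets no Ġ marker,
# so it maps to a different vocabulary entry than later occurrences.
tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=False)
print(tok.tokenize("Hello everyone. Hello world."))
# ['Hello', 'Ġeveryone', '.', 'ĠHello', 'Ġworld', '.']

# add_prefix_space=True: a space is prepended before tokenization,
# so both occurrences of "Hello" become the same token, ĠHello.
tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
print(tok.tokenize("Hello everyone. Hello world."))
# ['ĠHello', 'Ġeveryone', '.', 'ĠHello', 'Ġworld', '.']
```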
How did you tokenize your pretraining corpus? Was setting `add_prefix_space` to `false` during tokenizer conversion a mistake?