Why `add_prefix_space=false`?

#5
by hankcs - opened

Hi, thank you for sharing your work, it's great!

I have a question regarding the BPE tokenizer. I saw that its `add_prefix_space` option is set to `false`, which means the same word in a text will be tokenized differently depending on its position. For example, consider the word "Hello" in the following text:

Hello everyone. Hello world.

This text will be tokenized to:

['[CLS]', 'Hello', 'Ġeveryone', '.', 'ĠHello', 'Ġworld', '.', '[SEP]']

This leads to a redundant vocabulary and conflicting semantics: a token without a prefix space (`Ġ`) is usually a subword continuation, yet here the sentence-initial "Hello" has no prefix space either.
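For reference, the asymmetry is easy to reproduce with any GPT-2-style BPE tokenizer. Below is a minimal sketch contrasting the two settings; it uses `roberta-base` purely as a stand-in checkpoint (an assumption, since this thread concerns a different model), and the printed tokens are what I'd expect from GPT-2-style BPE:

```python
from transformers import AutoTokenizer

text = "Hello everyone. Hello world."

# add_prefix_space=False: the first word gets no leading-space marker,
# so the same word is mapped to two different vocabulary entries.
tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=False)
print(tok.tokenize(text))
# expected: ['Hello', 'Ġeveryone', '.', 'ĠHello', 'Ġworld', '.']

# add_prefix_space=True: a space is prepended before tokenization,
# so every occurrence of "Hello" is tokenized consistently as 'ĠHello'.
tok = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
print(tok.tokenize(text))
# expected: ['ĠHello', 'Ġeveryone', '.', 'ĠHello', 'Ġworld', '.']
```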

How did you tokenize your pretraining corpus? Was setting `add_prefix_space` to `false` a mistake made during the conversion of the tokenizer?
