Fast and slow tokenizers mismatch due to custom normalization
#16 opened by Xenova
Probably not worth fixing since the fast tokenizer is used by default, but just noting it here since it's something I ran into.
from transformers import AutoTokenizer

# Input containing a NUL character and the Unicode replacement character,
# which the slow and fast tokenizers normalize differently.
text = '1\u00002\uFFFD3'
model_id = 'jinaai/jina-embeddings-v2-base-zh'

slow_tok = AutoTokenizer.from_pretrained(model_id, use_fast=False)
fast_tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)

slow_encoded = slow_tok.encode(text)
fast_encoded = fast_tok.encode(text)

assert slow_encoded == fast_encoded, f'{slow_encoded} != {fast_encoded}'
which results in:
AssertionError: [0, 47, 3, 48, 179, 161, 159, 49, 2] != [0, 47, 5, 48, 20321, 49, 2]
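To see where the divergence comes from, a minimal sketch (reusing text, slow_tok, fast_tok, and the encoded outputs from above; note that backend_tokenizer.normalizer can be None for some fast tokenizers) is to compare the token strings and the fast tokenizer's normalized text:

# Compare the token strings each tokenizer produces for the same input
print(slow_tok.convert_ids_to_tokens(slow_encoded))
print(fast_tok.convert_ids_to_tokens(fast_encoded))

# Inspect how the fast tokenizer's backend normalizer rewrites the raw text
if fast_tok.backend_tokenizer.normalizer is not None:
    print(fast_tok.backend_tokenizer.normalizer.normalize_str(text))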