Fast and slow tokenizers mismatch due to custom normalization

#16 opened by Xenova (HF staff)

Probably not worth fixing since we use the fast tokenizer by default, but I'm adding it here since it's something I ran into.

```python
from transformers import AutoTokenizer

text = '1\u00002\uFFFD3'
model_id = 'jinaai/jina-embeddings-v2-base-zh'
slow_tok = AutoTokenizer.from_pretrained(model_id, use_fast=False)
fast_tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)

slow_encoded = slow_tok.encode(text)
fast_encoded = fast_tok.encode(text)

assert slow_encoded == fast_encoded, f'{slow_encoded} != {fast_encoded}'
```

results in:

```
AssertionError: [0, 47, 3, 48, 179, 161, 159, 49, 2] != [0, 47, 5, 48, 20321, 49, 2]
```
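For context, the two encodings diverge exactly where the test string contains a NUL character (U+0000) and the Unicode replacement character (U+FFFD). These are the kinds of characters that BERT-style text cleanup typically strips before tokenization, so two tokenizers that implement that cleanup differently can easily disagree on this input. A minimal, model-free check of what the string actually contains (pure standard library, no assumptions about either tokenizer's internals):

```python
import unicodedata

text = '1\u00002\uFFFD3'

# Inspect each character's code point and Unicode category.
# U+0000 is a control character (category Cc); U+FFFD is the
# replacement character (category So, "Symbol, other").
for ch in text:
    print(f'U+{ord(ch):04X} {unicodedata.category(ch)}')
```

Whether a tokenizer drops, replaces, or keeps these characters is a normalization decision, which is why the slow and fast paths can produce different token IDs for the same input.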
