Gemma tokenizer issue

#37
by Akshayextreme - opened

from transformers import AutoTokenizer

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.decode(106)
# '<start_of_turn>'

tokenizer.encode("<start_of_turn>", add_special_tokens=False)
# [235322, 2997, 235298, 559, 235298, 15508, 235313]

tokenizer.encode(tokenizer.decode(106), add_special_tokens=False)
# [235322, 2997, 235298, 559, 235298, 15508, 235313]

What am I missing here?
Ideally the round trip should give back the single token id, as it does here:

tokenizer.encode(tokenizer.decode(1), add_special_tokens=False)
# [1]
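For reference, a minimal sketch (assuming the same google/gemma-2b tokenizer as above) of checking the id/token mapping directly with the token-level converters, which bypass encode/decode and any special-token handling:

tokenizer.convert_ids_to_tokens(106)
# expected: '<start_of_turn>', since decode(106) already returns that string
tokenizer.convert_tokens_to_ids("<start_of_turn>")
# expected: 106 if the marker is in the vocabulary; if this returns 106 while
# encode() still splits the string into several ids, the marker exists in the
# vocab but is not being matched as an added special token during encoding.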

Google org
edited Jul 5

Hi @Akshayextreme, sorry for the late response. Please try again with add_special_tokens=True. Thank you.
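For illustration, a minimal sketch of the suggested call, assuming the same tokenizer as in the question; this is only the suggested change, not verified output:

tokenizer.encode("<start_of_turn>", add_special_tokens=True)
# With add_special_tokens=True the Gemma tokenizer normally also prepends its
# BOS id, so the result should start with the BOS token id followed by the
# ids produced for the marker.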
