GemmaTokenizerFast and GemmaTokenizer

#74
by PedramR - opened

Hi,
I'm curious to know if there's any difference between AutoTokenizer and GemmaTokenizer. I ran the following code and noticed a slight difference between their vocabularies:

from transformers import GemmaTokenizer, AutoTokenizer

model_name = "google/gemma-7b"

# Load the same checkpoint via AutoTokenizer and via GemmaTokenizer directly
auto_tokenizer = AutoTokenizer.from_pretrained(model_name)
gemma_tokenizer = GemmaTokenizer.from_pretrained(model_name)

# Tokens present in one vocabulary but missing from the other
print(set(auto_tokenizer.get_vocab().keys()) - set(gemma_tokenizer.get_vocab().keys()))
print(set(gemma_tokenizer.get_vocab().keys()) - set(auto_tokenizer.get_vocab().keys()))

I got the following results by running the code on Colab:

{'\t', '<start_of_image>'}
{'<0x09>', '<unused99>'}

**Update:** I think the issue is related to GemmaTokenizerFast, while GemmaTokenizer works fine. Judging from the output, the fast tokenizer's vocabulary carries `\t` and `<start_of_image>` in place of the slow tokenizer's `<0x09>` (the SentencePiece byte-fallback token for the tab character) and `<unused99>`.
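
For reference, here is a minimal way to check which concrete class AutoTokenizer actually instantiates; by default it returns the fast variant, and passing use_fast=False gives the slow SentencePiece-based one:

from transformers import AutoTokenizer

model_name = "google/gemma-7b"

# AutoTokenizer returns the fast (Rust-based) tokenizer by default
fast_tok = AutoTokenizer.from_pretrained(model_name)
# use_fast=False falls back to the slow SentencePiece-based tokenizer
slow_tok = AutoTokenizer.from_pretrained(model_name, use_fast=False)

print(type(fast_tok).__name__)  # GemmaTokenizerFast
print(type(slow_tok).__name__)  # GemmaTokenizer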

PedramR changed discussion title from AutoTokenizer and GemmaTokenizer to GemmaTokenizerFast and GemmaTokenizer

I think this was fixed, right?

Yes, it was fixed in newer versions of transformers.
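
For anyone who wants to verify on their own setup, a quick sanity check (assuming a transformers version that includes the fix) is that both vocabulary differences now come out empty:

from transformers import AutoTokenizer, GemmaTokenizer

model_name = "google/gemma-7b"

fast_vocab = set(AutoTokenizer.from_pretrained(model_name).get_vocab())
slow_vocab = set(GemmaTokenizer.from_pretrained(model_name).get_vocab())

# Both directions should print empty sets on a fixed version
print(fast_vocab - slow_vocab)
print(slow_vocab - fast_vocab)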

PedramR changed discussion status to closed
