GemmaTokenizerFast and GemmaTokenizer
#74 · opened by PedramR
Hi,

I'm curious to know whether there's any difference between AutoTokenizer and GemmaTokenizer. I ran the following code and noticed a slight difference between their vocabularies:
```python
from transformers import GemmaTokenizer, AutoTokenizer

model_name = "google/gemma-7b"
auto_tokenizer = AutoTokenizer.from_pretrained(model_name)
gemma_tokenizer = GemmaTokenizer.from_pretrained(model_name)

# Tokens present in one vocabulary but missing from the other
print(set(auto_tokenizer.get_vocab().keys()) - set(gemma_tokenizer.get_vocab().keys()))
print(set(gemma_tokenizer.get_vocab().keys()) - set(auto_tokenizer.get_vocab().keys()))
```
Running this on Colab, I got the following results:

```
{'\t', '<start_of_image>'}
{'<0x09>', '<unused99>'}
```
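Side note: the `'\t'` / `'<0x09>'` pair looks like the same character represented two ways, a literal tab token in the fast vocab versus sentencepiece's byte-fallback token for byte 0x09 in the slow one. Here's a minimal sketch to check how each side actually handles a tab; the expected outputs in the comments are my assumption, not verified:

```python
from transformers import AutoTokenizer, GemmaTokenizer

model_name = "google/gemma-7b"
fast = AutoTokenizer.from_pretrained(model_name)   # fast (Rust-backed) tokenizer
slow = GemmaTokenizer.from_pretrained(model_name)  # slow (sentencepiece) tokenizer

text = "a\tb"
# Print the token sequence each tokenizer produces for a string containing a tab.
# Expectation: the fast one may emit a literal '\t' token, the slow one '<0x09>'.
print(fast.tokenize(text))
print(slow.tokenize(text))

# Either way, decoding should round-trip to the same string.
print(fast.decode(fast.encode(text), skip_special_tokens=True))
print(slow.decode(slow.encode(text), skip_special_tokens=True))
```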
**Update:** I think the discrepancy comes from GemmaTokenizerFast (which is what AutoTokenizer loads by default), while GemmaTokenizer works fine.
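For completeness, you can confirm which class AutoTokenizer actually instantiates; a minimal sketch using the standard `use_fast` flag of `from_pretrained` (the expected values in the comments are my assumption):

```python
from transformers import AutoTokenizer, GemmaTokenizer

model_name = "google/gemma-7b"

# AutoTokenizer prefers the Rust-backed "fast" tokenizer when one exists
fast = AutoTokenizer.from_pretrained(model_name)
print(type(fast).__name__)  # expected: GemmaTokenizerFast

# Opting out with use_fast=False should give the sentencepiece-based slow class
slow = AutoTokenizer.from_pretrained(model_name, use_fast=False)
print(type(slow).__name__)  # expected: GemmaTokenizer

# The slow tokenizer's vocab should then match GemmaTokenizer exactly
gemma = GemmaTokenizer.from_pretrained(model_name)
print(slow.get_vocab() == gemma.get_vocab())  # expected: True
```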
PedramR changed discussion title from "AutoTokenizer and GemmaTokenizer" to "GemmaTokenizerFast and GemmaTokenizer"
I think this was fixed, right?
PedramR changed discussion status to closed