Tokenizer vocab size doesn't match model vocab size

#151
opened by frreiss

The gpt-oss-20b model configuration specifies a vocab size of 201088. The tokenizer in this repository has 199998 tokens in its data file and 21 additional special tokens in tokenizer_config.json, for a total of 200019 tokens.

Code to replicate these numbers:

>>> import transformers
>>> model = transformers.AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
>>> print(f"{model.vocab_size=}")
model.vocab_size=201088
>>> print(f"{tokenizer.vocab_size=}")
tokenizer.vocab_size=199998
>>> print(f"{len(tokenizer)=}")
len(tokenizer)=200019

Users who, for example, configure constrained decoding around the tokenizer's size of 200019 will encounter errors when the model emits token IDs of 200019 or higher, since those IDs have no entry in the tokenizer.
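
To make the failure mode concrete, here is a hypothetical constrained-decoding sketch (the mask construction is invented for illustration; it reuses the model and tokenizer objects from the snippet above):

import torch

# Hypothetical allow-mask sized from the tokenizer; it cannot be applied to
# logits sized from the model's vocab.
allowed = torch.zeros(len(tokenizer), dtype=torch.bool)   # 200019 entries
logits = torch.randn(1, model.vocab_size)                 # 201088 logits
logits.masked_fill(~allowed, float("-inf"))               # raises: sizes differ by 1069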

Can you please upload an updated tokenizer config that adds the missing 1069 reserved tokens?
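
Until then, one possible local workaround (a sketch, not an official fix) is to pad the tokenizer with placeholder special tokens until it matches the model's vocab size. The <|reserved_*|> names below are invented, and the sketch assumes the unmapped IDs form a contiguous block at the top of the range, so appended tokens land on IDs the model can actually emit:

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("openai/gpt-oss-20b")
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

# Invent placeholder names for the 1069 unmapped IDs; any unique strings work.
n_missing = config.vocab_size - len(tokenizer)
placeholders = [f"<|reserved_{len(tokenizer) + i}|>" for i in range(n_missing)]
tokenizer.add_tokens(placeholders, special_tokens=True)

print(f"{len(tokenizer)=}")  # now 201088, matching the model's vocab size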

Hi @frreiss, here is code to convert the tokenizer (tiktoken) in gpt-oss to an HF-compatible format.

Tested with Python 3.10+

from pathlib import Path
from transformers.integrations.tiktoken import convert_tiktoken_to_fast
from tiktoken import encoding_name_for_model
from transformers import PreTrainedTokenizerFast

model_name = "gpt-oss-20b"

# Look up the name of the tiktoken encoding registered for this model.
encoding = encoding_name_for_model(model_name)

outdir = Path(f"tokenizer/{encoding}")
outdir.mkdir(parents=True, exist_ok=True)

# Convert the tiktoken encoding into an HF fast-tokenizer file in outdir.
convert_tiktoken_to_fast(encoding, outdir)

tokenizer = PreTrainedTokenizerFast.from_pretrained(outdir)

print(f"{len(tokenizer)=}")

# Base vocab vs. tokens added on top of it (special/reserved tokens).
vocab_size = tokenizer.vocab_size
added_vocab_size = len(tokenizer.get_added_vocab())

print(f"{vocab_size=}")
print(f"{added_vocab_size=}")
print(f"{(vocab_size+added_vocab_size)=}")

Output

len(tokenizer)=201089
vocab_size=199998
added_vocab_size=1091
(vocab_size+added_vocab_size)=201089
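
If the converted tokenizer is to be reused, it can be written back out with save_pretrained; a short sketch continuing from the script above, with the model's declared vocab size printed for comparison:

from transformers import AutoConfig

# save_pretrained writes the full set of tokenizer files (tokenizer.json,
# tokenizer_config.json, special_tokens_map.json) into outdir.
tokenizer.save_pretrained(outdir)

config = AutoConfig.from_pretrained("openai/gpt-oss-20b")
print(f"{config.vocab_size=}")  # 201088 per the original report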
