Tokenizer vocab size doesn't match model vocab size
#151 by frreiss
The gpt-oss-20b model configuration specifies a vocab size of 201088. The tokenizer in this repository has 199998 tokens in its data file and 21 additional special tokens in tokenizer_config.json, for a total of 200019 tokens.
Code to replicate these numbers:
>>> import transformers
>>> model = transformers.AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b")
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
>>> print(f"{model.vocab_size=}")
model.vocab_size=201088
>>> print(f"{tokenizer.vocab_size=}")
tokenizer.vocab_size=199998
>>> print(f"{len(tokenizer)=}")
len(tokenizer)=200019
Users who, for example, configure constrained decoding around the tokenizer's vocab size of 200019 will encounter errors when the model emits token IDs of 200019 or higher.
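To illustrate the failure mode, here is a minimal sketch of a constrained-decoding mask sized to the tokenizer (the mask and the sampled ID below are made up for the example):
import torch

# A mask with one slot per tokenizer entry (200019) cannot index token IDs
# up to model.vocab_size - 1 = 201087, which the model is free to emit.
allowed = torch.zeros(200019, dtype=torch.bool)
try:
    allowed[200020] = True  # an ID inside the model's vocab but outside the tokenizer's
except IndexError as err:
    print(f"constrained decoding mask lookup failed: {err}")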
Can you please upload an updated tokenizer config that adds the missing 1069 reserved tokens?
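In the meantime, one possible stopgap is to pad the loaded tokenizer up to the model's vocab size; this is only a sketch, and the <|reserved_N|> strings are hypothetical placeholder names, not the actual reserved tokens:
import transformers

config = transformers.AutoConfig.from_pretrained("openai/gpt-oss-20b")
tokenizer = transformers.AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

# Add placeholder special tokens until the tokenizer covers every model token ID.
# The "<|reserved_N|>" names are made-up stand-ins for the missing entries.
missing = config.vocab_size - len(tokenizer)  # 201088 - 200019 = 1069
tokenizer.add_tokens([f"<|reserved_{i}|>" for i in range(missing)], special_tokens=True)
print(f"{len(tokenizer)=}")  # should reach 201088 if the existing token IDs are contiguous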
Hi @frreiss, here is code to convert the tokenizer (tiktoken) in gpt-oss to an HF-compatible format.
Tested with Python 3.10+
from pathlib import Path

from tiktoken import encoding_name_for_model
from transformers import PreTrainedTokenizerFast
from transformers.integrations.tiktoken import convert_tiktoken_to_fast

model_name = "gpt-oss-20b"

# Look up the tiktoken encoding used by gpt-oss and convert it to a
# Hugging Face fast-tokenizer tokenizer.json.
encoding = encoding_name_for_model(model_name)
outdir = Path(f"tokenizer/{encoding}")
outdir.mkdir(parents=True, exist_ok=True)
convert_tiktoken_to_fast(encoding, outdir)

# Reload the converted tokenizer and compare base vs. added vocabulary sizes.
tokenizer = PreTrainedTokenizerFast.from_pretrained(outdir)
print(f"{len(tokenizer)=}")

vocab_size = tokenizer.vocab_size
added_vocab_size = len(tokenizer.get_added_vocab())
print(f"{vocab_size=}")
print(f"{added_vocab_size=}")
print(f"{(vocab_size+added_vocab_size)=}")
Output:
len(tokenizer)=201089
vocab_size=199998
added_vocab_size=1091
(vocab_size+added_vocab_size)=201089
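Continuing from the tokenizer variable above, one way to double-check coverage against the model config (just a suggested check, not part of the original script):
from transformers import AutoConfig

config = AutoConfig.from_pretrained("openai/gpt-oss-20b")
# The model samples IDs in range(config.vocab_size); the converted tokenizer
# should be able to decode all of them without an out-of-range lookup.
print(f"{config.vocab_size=}")  # 201088 per the model config
print(f"{len(tokenizer) >= config.vocab_size=}")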