`vocab_size` mismatch?
Hey!
The model card says "A vocabulary size of 250,680", and `len(tokenizer)` returns 250680. However, the config has `"vocab_size": 250880`. Also, the docstring of `BloomConfig` still has `vocab_size (int, *optional*, defaults to 50257):`, which I believe comes from GPT-2.
Is there a reason for these mismatches? And at the same time, the `word_embeddings` matrix is of size `Embedding(250880, hidden_size)`. I am missing something 😅
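For reference, here is a minimal sketch of how I'm comparing the two numbers (assuming the `bigscience/bloom` checkpoint; the smaller variants should behave the same way):

```python
from transformers import AutoConfig, AutoTokenizer

# Minimal reproduction of the mismatch (checkpoint name is an assumption).
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
config = AutoConfig.from_pretrained("bigscience/bloom")

print(len(tokenizer))     # 250680 -- number of tokens the tokenizer knows
print(config.vocab_size)  # 250880 -- size used for word_embeddings / lm_head
```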
Hello!
There is indeed an explanation for this difference in numbers. In the `config.json` file, the variable `vocab_size` is only used to define the size of the `word_embeddings` and `lm_head` matrices. The constraint we have is that the size of these matrices must be greater than or equal to the number of tokens known by the tokenizer; the difference of 200 corresponds to "dummy" tokens that are not used.
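To make the "greater than or equal" constraint concrete, here is a small sketch (again assuming the `bigscience/bloom` tokenizer) showing that the extra rows are simply never indexed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

# The largest ID the tokenizer can ever produce stays below len(tokenizer),
# so the last 200 rows of word_embeddings / lm_head are never looked up.
max_id = max(tokenizer.get_vocab().values())
print(max_id)                   # 250679
print(max_id < len(tokenizer))  # True
```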
Several things in the development of BLOOM led to this difference. The size of the `word_embeddings` and `lm_head` matrices had to be divisible by a certain number (4 * 128, from memory) so that the model could be parallelized with tensor parallelism. In addition, the tokenizer was produced before the model design was finalized, and it was safer to leave some tokens available in case we needed to add special tokens for training (for PII, for example).
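Concretely, rounding the tokenizer size up to the next multiple of 4 * 128 = 512 lands exactly on the number in the config. A small sketch of the arithmetic, taking the 4 * 128 figure above at face value:

```python
import math

tokenizer_size = 250_680   # len(tokenizer)
multiple = 4 * 128         # 512, required for tensor parallelism

padded_vocab_size = math.ceil(tokenizer_size / multiple) * multiple
print(padded_vocab_size)                   # 250880 -> the config's vocab_size
print(padded_vocab_size - tokenizer_size)  # 200    -> the "dummy" tokens
```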
Thanks @VictorSanh & @SaulLu for the explanation!
I agree that the docstring of `BloomConfig` is slightly confusing; I propose to address this in https://github.com/huggingface/transformers/pull/19336/files !