`vocab_size` mismatch?
Hey!
The model card says "A vocabulary size of 250,680", and `len(tokenizer)` returns 250680. However, the config has `"vocab_size": 250880`. Also, the docstring of `BloomConfig` still has `vocab_size (int, *optional*, defaults to 50257):`, which I believe comes from GPT-2.
Is there a reason for these mismatches? And at the same time, the `word_embeddings` matrix is of size `Embedding(250880, hidden_size)`. I am missing something 😅
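For reference, here is a minimal sketch of how I'm comparing the two numbers (assuming the `bigscience/bloom` checkpoint; the smaller variants should behave the same way):

```python
from transformers import AutoConfig, AutoTokenizer

# Minimal reproduction of the mismatch (checkpoint name is an assumption).
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
config = AutoConfig.from_pretrained("bigscience/bloom")

print(len(tokenizer))     # 250680 -- number of tokens the tokenizer knows
print(config.vocab_size)  # 250880 -- size used for word_embeddings / lm_head
```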
Hello!
There is indeed an explanation for this difference in numbers. In the `config.json` file, the variable `vocab_size` is only used to define the size of the `word_embeddings` and `lm_head` matrices. The constraint we have is that the size of these matrices must be greater than or equal to the number of tokens known by the tokenizer; the difference of 200 corresponds to "dummy" tokens that are not used.
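To make the "greater than or equal" constraint concrete, here is a small sketch (again assuming the `bigscience/bloom` tokenizer) showing that the extra rows are simply never indexed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

# The largest ID the tokenizer can ever produce stays below len(tokenizer),
# so the last 200 rows of word_embeddings / lm_head are never looked up.
max_id = max(tokenizer.get_vocab().values())
print(max_id)                   # 250679
print(max_id < len(tokenizer))  # True
```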
Several things in the development of BLOOM led to this difference. The size of the `word_embeddings` and `lm_head` matrices had to be divisible by a certain number (4 * 128, from memory) so that the model could be parallelized with tensor parallelism. In addition, the tokenizer was produced before the model design was finalized, and it was safer to leave some tokens available in case we needed to add special tokens for training (for PII, for example).
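Concretely, rounding the tokenizer size up to the next multiple of 4 * 128 = 512 lands exactly on the number in the config. A small sketch of the arithmetic, taking the 4 * 128 figure above at face value:

```python
import math

tokenizer_size = 250_680   # len(tokenizer)
multiple = 4 * 128         # 512, required for tensor parallelism

padded_vocab_size = math.ceil(tokenizer_size / multiple) * multiple
print(padded_vocab_size)                   # 250880 -> the config's vocab_size
print(padded_vocab_size - tokenizer_size)  # 200    -> the "dummy" tokens
```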
Thanks @VictorSanh & @SaulLu for the explanation!
I agree that the docstring of `BloomConfig` is slightly confusing; I propose to address this in https://github.com/huggingface/transformers/pull/19336/files !