Model vocab size is bigger than tokenizer vocab size?

#97
by fahadh4ilyas - opened

The tokenizer vocab size is 50295, but the embedding and head size is 51200. Is this intentional?
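For reference, a minimal sketch (assuming the transformers library and the microsoft/phi-2 checkpoint) that prints both sizes and reproduces the mismatch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

print(len(tokenizer))                                 # 50295 tokens in the tokenizer
print(model.get_input_embeddings().weight.shape[0])   # 51200 rows in the embedding matrix
print(model.get_output_embeddings().weight.shape[0])  # 51200 rows in the LM head
```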

This is a good reference: https://huggingface.co/microsoft/phi-2/discussions/22#659d8ba950c1bbee5be6f179

We ended up setting 51200 as the vocabulary size just to accommodate any new tokens that we might need in the future. You can follow @Deepakvictor's answer and it should fix the issue.
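In practical terms (a hedged sketch, not the linked answer verbatim): the spare embedding rows mean you can add new tokens to the tokenizer without touching the model, as long as the total stays at or below 51200; only beyond that would you need to resize the embeddings. The token name below is just a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# "<my_token>" is a hypothetical new special token, purely for illustration.
tokenizer.add_special_tokens({"additional_special_tokens": ["<my_token>"]})

# Still fits inside the 51200 rows that were reserved, so no resize is needed.
assert len(tokenizer) <= model.get_input_embeddings().weight.shape[0]

# Only if the tokenizer ever grows past 51200 would you resize:
# model.resize_token_embeddings(len(tokenizer))
```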

As far as I know, no tokens from 50295 onwards should be generated, because those embeddings were not trained. Depending on the generation parameters, though, they could still appear (with low probabilities, however).
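If you want to rule that out entirely, one option (a sketch, assuming the standard transformers generate API) is to mask the untrained ids at generation time, e.g. with bad_words_ids:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

# Every id from the end of the tokenizer vocab up to the embedding size is untrained.
untrained_ids = [[i] for i in range(len(tokenizer), model.config.vocab_size)]

inputs = tokenizer("Write a haiku about padding tokens:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True,
                         bad_words_ids=untrained_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```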

Microsoft org

Thanks for the answer @iliyaML !

gugarosa changed discussion status to closed
