Discrepancy between vocabulary size in model and tokenizer leading to bugs
Hi! I had a quick question about a discrepancy between the size of the input embeddings and the tokenizer vocabulary:
from transformers import AutoModel
model = AutoModel.from_pretrained('UFNLP/gatortron-base')
model.embeddings.word_embeddings.weight.shape  # the nn.Embedding module itself has no .shape; inspect its weight tensor
The embedding matrix has 50176 rows, but the tokenizer has 50101 vocabulary items (https://huggingface.co/UFNLP/gatortron-base/raw/main/vocab.txt).
Is there a reason for this discrepancy? It forces us to hard-code the vocabulary size as a workaround, and we want to make sure we are initializing correctly from GatorTron.
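For context, here is a minimal sketch of the hard-coding workaround we are using right now (names like pretrained_embeddings are just illustrative):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('UFNLP/gatortron-base')
model = AutoModel.from_pretrained('UFNLP/gatortron-base')

print(len(tokenizer))                                    # 50101 tokenizer entries
print(model.embeddings.word_embeddings.weight.shape[0])  # 50176 embedding rows

# Current workaround: keep only the first len(tokenizer) rows when copying
# the pretrained input embeddings into our own model.
pretrained_embeddings = model.embeddings.word_embeddings.weight[:len(tokenizer)].detach().clone()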
Otherwise, thank you so much for open sourcing this! It is extremely helpful :)
NVIDIA implements this by padding the vocabulary to a multiple of 8 so that tensor cores are used effectively during training, particularly with mixed precision. Please see the documentation: https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensor-core-shape.
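Roughly, the idea from that document is to round the vocabulary up to the next multiple before building the embedding matrix. A small sketch of the pattern (not the actual GatorTron training code):

def pad_vocab_size(vocab_size: int, multiple: int = 8) -> int:
    # Round vocab_size up to the next multiple so the embedding and
    # output-projection dimensions stay tensor-core friendly.
    return ((vocab_size + multiple - 1) // multiple) * multiple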
Understood -- super helpful! The discrepancy is 75 tokens, though; is that expected, and is there code showing where the vocabulary size gets rounded to a multiple of 8?
Asking because other standard codebases choose the nearest multiple of 8 for the vocabulary size: https://github.com/pytorch-labs/gpt-fast/blob/main/model.py#L61
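For what it's worth, rounding 50101 up to a multiple of 8 would only add 3 tokens. If GatorTron was trained with Megatron-LM (just my assumption), I believe its default is to pad the vocabulary to a multiple of 128 (the --make-vocab-size-divisible-by argument), which would give exactly 50176:

# Padding 50101 up to different multiples (plain arithmetic):
print(((50101 + 7) // 8) * 8)        # 50104 -> only 3 padding tokens
print(((50101 + 127) // 128) * 128)  # 50176 -> matches the checkpoint (75 padding tokens)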
It sounds like we are safe using the first 50101 vocabulary items then. Appreciate the help!
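In case it helps anyone else, a quick sanity check along these lines (a sketch assuming the standard transformers tokenizer API) confirms the padded rows are never indexed:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('UFNLP/gatortron-base')
model = AutoModel.from_pretrained('UFNLP/gatortron-base')

# The largest id the tokenizer can emit stays below its vocabulary size,
# so the padded rows at the end of the embedding matrix are never looked up.
assert max(tokenizer.get_vocab().values()) < len(tokenizer)
assert model.embeddings.word_embeddings.weight.shape[0] >= len(tokenizer)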