---
datasets:
- cerebras/SlimPajama-627B
language:
- en
---
The pre-trained 3B model with a 43K vocabulary from the paper [Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies](https://huggingface.co/papers/2407.13623). In this paper, we investigate how vocabulary size impacts language model scaling laws.
Based on our approach, we predict that the optimal vocabulary size for a 3B model is about 43K.
We then train a Llama-based 3B model on a sampled version of the SlimPajama dataset. The model with the 43K vocabulary outperforms the model with the common 32K vocabulary, despite using fewer training tokens.
Note that the proposed approach can be applied to models of different sizes.
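
For reference, a minimal loading and generation sketch with the `transformers` library is shown below. The repository id is a placeholder, not the model's confirmed path; substitute this model's actual id on the Hugging Face Hub.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- replace with this model's actual Hub path.
repo_id = "<org>/<3B-43K-vocab-model>"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short continuation to check that the checkpoint and
# its 43K-entry tokenizer load and run end to end.
prompt = "Scaling laws suggest that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```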