Vocab size 32001 causes problems for quantisation
Hi there
Thanks for the model, looks good.
I'm doing quantisations which will be uploaded at:
- https://huggingface.co/TheBloke/UltraLM-13B-GGML
- https://huggingface.co/TheBloke/UltraLM-13B-GPTQ
I just wanted to let you know that increasing the vocab size to 32001 breaks compatibility with the latest GGML quantisation methods, known as k-quants: they pack weights into blocks of 256, and 32001 isn't a multiple of 256.
This may be resolved some time in the future, but for now it means I can only release the older formats.
My understanding is that the extra 32001st token, the PAD token, was added as something of a hack very early in the history of open-source Llama models. One particular model creator used it, I think, only because they forgot to set up special_tokens_map.json correctly :) Since then it's stuck around, copied from model to model despite not being needed. Unfortunately WizardLM inherited it, for example, and a number of models have used their code since.
I'm starting a campaign to try and get it phased out, because it causes tons of problems for developers outside the sphere of Python inference.
Just thought I'd let you know for your next model - and also so I can point people to this post when they inevitably ask me why I've not released the latest GGML quantisation formats for your model :)
Thanks
PS. You can read about the issue with k-quants here: https://github.com/ggerganov/llama.cpp/issues/1919
Hi there,
Thanks for the message! We will eliminate the token in the next models. Thanks again!
Dumb question: is there any way to manually eliminate the extra PAD token ourselves? If so, what would be some pointers we could chase? If not, what's the reasoning? Curious to know, and thanks for the great model!
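For concreteness, here's roughly the kind of thing I was imagining with `transformers` - completely untested on this checkpoint, and the paths are placeholders, so please treat it as a sketch rather than a recipe:

```python
# Sketch: trim the appended PAD row so the embedding matrix and LM head go
# back to 32000 x hidden. Untested on this model; paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/UltraLM-13b"            # placeholder: original checkpoint
out_path = "path/to/UltraLM-13b-vocab32000"   # placeholder: trimmed copy

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# The PAD token was appended at index 32000, so keeping only the first 32000
# rows drops it. resize_token_embeddings truncates both the input embeddings
# and the LM head, and updates config.vocab_size accordingly.
model.resize_token_embeddings(32000)

# Make sure the config no longer references the removed id.
model.config.pad_token_id = None

model.save_pretrained(out_path)
tokenizer.save_pretrained(out_path)

# The saved tokenizer files (special_tokens_map.json / added_tokens.json) will
# probably still list the old <pad> entry and would need cleaning up by hand.
```

Presumably the k-quant issue goes away once both matrices are back to 32000 rows, but I'd love to hear whether this is safe to do, or whether the model actually relies on that token.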