GGUF files have tokenizer issues

#1
by JohannesGaessler - opened

The models in this repository appear to have tokenizer issues (see https://github.com/ggerganov/llama.cpp/pull/6936#issuecomment-2107368738), causing degraded results. This is indicated by the following warning when running the models:

llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:                                             
llm_load_vocab: ************************************        
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!        
llm_load_vocab: CONSIDER REGENERATING THE MODEL             
llm_load_vocab: ************************************        
llm_load_vocab:                                             
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
Quant Factory org

@JohannesGaessler I’ll look into this and get back. V1 was converted before the BPE fix, and this version was converted after it.
