unknown pre-tokenizer type: 'smaug-bpe'
Love your models! I'm having an issue with the Llama-3 70B Instruct abliterated v3.5 GGUFs that I didn't have with the v3 variants. I'm using oobabooga.
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'smaug-bpe'
Any thoughts on what the issue may be?
I might have answered my own question. It seems Smaug support was added to llama.cpp 3 days ago, in release b3001: https://github.com/ggerganov/llama.cpp/releases/tag/b3001
llama-cpp-python is just a few days behind, so I guess I'll wait for the oobabooga update.
But it shouldn't be smaug-bpe for plain Llama-3. How on earth did it get smaug-bpe?
llama.cpp/gguf-py/scripts$ python gguf-dump.py Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf --json
"tokenizer.ggml.pre": {
    "index": 17,
    "offset": 619,
    "type": "STRING",
    "value": "smaug-bpe"
},
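If you want to confirm the bad value programmatically, here's a minimal sketch using the gguf package from llama.cpp's gguf-py; the field-access pattern is my reading of GGUFReader's internals, so treat it as a starting point rather than a blessed API:

```python
# Sketch: read tokenizer.ggml.pre straight out of a GGUF file.
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf")
field = reader.fields["tokenizer.ggml.pre"]
# For a STRING field, field.data holds the index of the value bytes within field.parts.
value = bytes(field.parts[field.data[0]]).decode("utf-8")
print(value)  # "smaug-bpe" on the broken quant
```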
So I fixed it in a super hacky way. In gguf-py/scripts, first strip the bad key:
python gguf-new-metadata.py --remove-metadata tokenizer.ggml.pre Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5-ptf.gguf
Then open the script (nano gguf-new-metadata.py) and, in between the if statements that follow the argument handling (whitespace lined up with the ifs above and below it), add:
new_metadata[gguf.Keys.Tokenizer.PRE] = MetadataDetails(gguf.GGUFValueType.STRING, "llama-bpe")
Save it, delete the original model, and run:
python gguf-new-metadata.py --verbose Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5-ptf.gguf Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf
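For what it's worth, since "smaug-bpe" and "llama-bpe" are both exactly 9 bytes, you could probably skip the two-pass remove/re-add dance and patch the string in place. A sketch, assuming GGUFReader's writable mode behaves as I expect (its parts are numpy views over a memory-mapped file, so assigning through them writes back to disk):

```python
# Sketch: overwrite the pre-tokenizer string in place. Only safe because the
# replacement has exactly the same byte length as the original.
import numpy as np
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf", mode="r+")
field = reader.fields["tokenizer.ggml.pre"]
part = field.parts[field.data[0]]  # uint8 view of the string's bytes
assert bytes(part).decode("utf-8") == "smaug-bpe"
part[:] = np.frombuffer(b"llama-bpe", dtype=np.uint8)  # same length, 9 bytes
```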
@failspy
are you sure this is the right model and not Smaug? As far as I can see, the convert script checks hashes of a tokenized string: https://github.com/ggerganov/llama.cpp/blob/975ec63ff26cdf96156d1126d86f75a395fdc43a/convert-hf-to-gguf.py#L476 so the only way I can see it ending up as smaug-bpe is if the model was indeed Smaug :)
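For context, this is roughly what that check does, paraphrased from get_vocab_base_pre() in convert-hf-to-gguf.py (the real chktxt probe string is much longer, mixing whitespace runs, emoji, CJK and other edge cases):

```python
# Paraphrased sketch of the pre-tokenizer detection in convert-hf-to-gguf.py.
from hashlib import sha256
from transformers import AutoTokenizer

chktxt = "..."  # stand-in; the real script uses a long fixed probe string
tokenizer = AutoTokenizer.from_pretrained("failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5")
chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
# chkhsh is then matched against a table of known hashes. llama-bpe and
# smaug-bpe tokenize the probe differently because of ignore_merges, so a
# tokenizer missing that key hashes to the smaug-bpe entry.
```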
The difference between those two tokenizers is that the original Llama has "ignore_merges": true and Smaug has "ignore_merges": false. In your model there is no such key at all: https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5/raw/main/tokenizer.json so it probably defaults to false, and that's why the convert script recognizes it as smaug-bpe. But it is defined in the previous version: https://huggingface.co/failspy/Smaug-Llama-3-70B-Instruct-abliterated-v3/raw/main/tokenizer.json So it looks like something is wrong with your safetensors model.
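Concretely, the BPE section of the v3 tokenizer.json contains the key (abridged):

```json
"model": {
  "type": "BPE",
  "ignore_merges": true,
  ...
}
```

while the v3.5 file has no "ignore_merges" at all, and the HF tokenizers library defaults it to false.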
Goes to show me for trying to fix the tokenizer manually. Dammit Meta-Llama for not giving me access to the original repo. This is based on the correct model. Thanks for doing the detailed investigation @kurnevsky
Sorry but I think I got confused somewhere in the middle -- can you tell me if I should redownload this or requantize it or is the hack good enough? Thanks again for the wonderful work.
The hack is good enough!
They added a new arg to fix tokenizers: https://github.com/ggerganov/llama.cpp/pull/7627
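If I'm reading that PR right, it adds a --pre-tokenizer option to gguf-new-metadata.py, so the whole fix should collapse into one command (flag name taken from the PR; double-check it against your llama.cpp checkout):

```
python gguf-new-metadata.py --pre-tokenizer "llama-bpe" Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5-fixed.gguf
```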
@failspy will you fix the safetensors model as well? Because it will also have wrong tokenization with those missing tokenizer configs.
Okay, well I finally managed to get my hands on the fixed tokenizer_config.json as it appears in the meta-llama repo. I've published it to the safetensors repos, and fixed Llama-3-8B-Instruct-abliterated-v3s. Meta-Llama-3-70B-Instruct-abliterated-v3.5-GGUF is presently being uploaded. Sorry about this, y'all. Thanks for your patience.
But the problem is not with tokenizer_config.json, it's with tokenizer.json.
@kurnevsky
You're right. By fixed tokenizer_config.json, I mean the config that fixed the original EOS token issues that many faced with Llama-3.
I've uploaded an updated tokenizer.json to address the ignore_merges issue.