unknown pre-tokenizer type: 'smaug-bpe'
Love your models! I'm having an issue with the Llama-3 70B Instruct abliterated v3.5 GGUFs that I didn't have with the v3 variants. I'm using oobabooga.
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'smaug-bpe'
Any thoughts on what the issue may be?
I might have answered my own question. It seems Smaug support was added to llama.cpp 3 days ago, in release b3001: https://github.com/ggerganov/llama.cpp/releases/tag/b3001
llama-cpp-python is just a few days behind, so I guess I'll wait for the oobabooga update.
But it shouldn't be smaug-bpe for plain Llama-3. How on earth did it get smaug-bpe?
llama.cpp/gguf-py/scripts$ python gguf-dump.py Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf --json
"tokenizer.ggml.pre": {
    "index": 17,
    "offset": 619,
    "type": "STRING",
    "value": "smaug-bpe"
},
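If you want to confirm the bad value programmatically, here's a minimal sketch using the gguf package from llama.cpp's gguf-py; the field-access pattern is my reading of GGUFReader's internals, so treat it as a starting point rather than a blessed API:

```python
# Sketch: read tokenizer.ggml.pre straight out of a GGUF file.
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf")
field = reader.fields["tokenizer.ggml.pre"]
# For a STRING field, field.data holds the index of the value bytes within field.parts.
value = bytes(field.parts[field.data[0]]).decode("utf-8")
print(value)  # "smaug-bpe" on the broken quant
```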
So I fixed it in a super hacky way. In gguf-py/scripts, first strip the bad key:
python gguf-new-metadata.py --remove-metadata tokenizer.ggml.pre Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5-ptf.gguf
Then open the script (nano gguf-new-metadata.py) and, in between the if statements that follow the argument handling (whitespace lined up with the ifs above and below it), add:
new_metadata[gguf.Keys.Tokenizer.PRE] = MetadataDetails(gguf.GGUFValueType.STRING, "llama-bpe")
Save it, delete the original model, and run:
python gguf-new-metadata.py --verbose Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5-ptf.gguf Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf
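For what it's worth, since "smaug-bpe" and "llama-bpe" are both exactly 9 bytes, you could probably skip the two-pass remove/re-add dance and patch the string in place. A sketch, assuming GGUFReader's writable mode behaves as I expect (its parts are numpy views over a memory-mapped file, so assigning through them writes back to disk):

```python
# Sketch: overwrite the pre-tokenizer string in place. Only safe because the
# replacement has exactly the same byte length as the original.
import numpy as np
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf", mode="r+")
field = reader.fields["tokenizer.ggml.pre"]
part = field.parts[field.data[0]]  # uint8 view of the string's bytes
assert bytes(part).decode("utf-8") == "smaug-bpe"
part[:] = np.frombuffer(b"llama-bpe", dtype=np.uint8)  # same length, 9 bytes
```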
@failspy
are you sure this is the right model and not Smaug? As far as I can see, the convert script checks hashes of a tokenized string: https://github.com/ggerganov/llama.cpp/blob/975ec63ff26cdf96156d1126d86f75a395fdc43a/convert-hf-to-gguf.py#L476 so the only way I can see it ending up as smaug-bpe is if the model was indeed Smaug :)
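For context, this is roughly what that check does, paraphrased from get_vocab_base_pre() in convert-hf-to-gguf.py (the real chktxt probe string is much longer, mixing whitespace runs, emoji, CJK and other edge cases):

```python
# Paraphrased sketch of the pre-tokenizer detection in convert-hf-to-gguf.py.
from hashlib import sha256
from transformers import AutoTokenizer

chktxt = "..."  # stand-in; the real script uses a long fixed probe string
tokenizer = AutoTokenizer.from_pretrained("failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5")
chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
# chkhsh is then matched against a table of known hashes. llama-bpe and
# smaug-bpe tokenize the probe differently because of ignore_merges, so a
# tokenizer missing that key hashes to the smaug-bpe entry.
```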
The difference between those two tokenizers is that the original Llama has "ignore_merges": true and Smaug has "ignore_merges": false. In your model there is no such key at all: https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5/raw/main/tokenizer.json so it probably defaults to false, and that's why the convert script recognizes it as smaug-bpe. But it is defined in the previous version: https://huggingface.co/failspy/Smaug-Llama-3-70B-Instruct-abliterated-v3/raw/main/tokenizer.json So it looks like something is wrong with your safetensors model.
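Concretely, the BPE section of the v3 tokenizer.json contains the key (abridged):

```json
"model": {
  "type": "BPE",
  "ignore_merges": true,
  ...
}
```

while the v3.5 file has no "ignore_merges" at all, and the HF tokenizers library defaults it to false.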
Goes to show me for trying to fix the tokenizer manually. Dammit Meta-Llama for not giving me access to the original repo. This is based on the correct model. Thanks for doing the detailed investigation @kurnevsky
Sorry but I think I got confused somewhere in the middle -- can you tell me if I should redownload this or requantize it or is the hack good enough? Thanks again for the wonderful work.
The hack is good enough!
They added a new arg to fix tokenizers: https://github.com/ggerganov/llama.cpp/pull/7627
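If I'm reading that PR right, it adds a --pre-tokenizer option to gguf-new-metadata.py, so the whole fix should collapse into one command (flag name taken from the PR; double-check it against your llama.cpp checkout):

```
python gguf-new-metadata.py --pre-tokenizer "llama-bpe" Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5-fixed.gguf
```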
@failspy will you fix the safetensors model as well? Because it will also have wrong tokenization with those missing tokenizer configs.
Okay, well I finally managed to get my hands on the fixed tokenizer_config.json as it appears in the meta-llama repo. I've published it to the safetensors repos, and fixed Llama-3-8B-Instruct-abliterated-v3s. Meta-Llama-3-70B-Instruct-abliterated-v3.5-GGUF is presently being uploaded. Sorry about this, y'all. Thanks for your patience.
But the problem is not with tokenizer_config.json, it's with tokenizer.json.
@kurnevsky
You're right. By fixed tokenizer_config.json, I mean the config that fixed the original EOS token issues that many faced with Llama-3.
I've uploaded an updated tokenizer.json to address the ignore_merges issue.