Doesn't work - should wait for PR #19283

#2 opened by ddh0

These GGUFs no longer work - we should wait for the PR to be officially merged, and then the model will need to be reconverted: https://github.com/ggml-org/llama.cpp/pull/19283

llama_model_load: error loading model: error loading model hyperparameters: key not found in model: step35.swiglu_clamp_exp
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/media/T9/gguf/Step-3.5-Flash-Q4_K_S-00001-of-00003.gguf'
srv    load_model: failed to load model, '/media/T9/gguf/Step-3.5-Flash-Q4_K_S-00001-of-00003.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
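
If anyone wants to check their local splits, the missing key can be inspected straight from the GGUF metadata without loading the model. A minimal sketch, assuming the gguf Python package from llama.cpp's gguf-py is installed (`pip install gguf`); the path is just a placeholder:

```python
# Check whether a GGUF split already carries the metadata key the loader wants.
from gguf import GGUFReader

# Placeholder path; point this at the first split of the model.
path = "Step-3.5-Flash-Q4_K_S-00001-of-00003.gguf"

reader = GGUFReader(path)
keys = list(reader.fields.keys())
print(f"{len(keys)} metadata keys found")

missing = "step35.swiglu_clamp_exp"
print(f"{missing} present: {missing in reader.fields}")

# Dump all architecture-specific keys for comparison against the new converter output.
for key in keys:
    if key.startswith("step35."):
        print(" ", key)
```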

Hey, I tested Q4_K_M and it's working here, building llama.cpp from the "fork" in the official repo: https://github.com/stepfun-ai/Step-3.5-Flash/tree/main/llama.cpp.

slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  1 | task 1421 | processing task
slot update_slots: id  1 | task 1421 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 19
slot update_slots: id  1 | task 1421 | need to evaluate at least 1 token for each active slot (n_past = 19, task.n_tokens() = 19)
slot update_slots: id  1 | task 1421 | n_past was set to 18
slot update_slots: id  1 | task 1421 | n_tokens = 18, memory_seq_rm [18, end)
slot update_slots: id  1 | task 1421 | prompt processing progress, n_tokens = 19, batch.n_tokens = 1, progress = 1.000000
slot update_slots: id  1 | task 1421 | prompt done, n_tokens = 19, batch.n_tokens = 1
slot print_timing: id  1 | task 1421 | 
prompt eval time =      31.16 ms /     1 tokens (   31.16 ms per token,    32.10 tokens per second)
       eval time =   14613.19 ms /   574 tokens (   25.46 ms per token,    39.28 tokens per second)
      total time =   14644.35 ms /   575 tokens
slot      release: id  1 | task 1421 | stop processing: n_tokens = 592, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 10.10.0.114 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 1996 | processing task
slot update_slots: id  0 | task 1996 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 469
slot update_slots: id  0 | task 1996 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 1996 | prompt processing progress, n_tokens = 405, batch.n_tokens = 405, progress = 0.863539
slot update_slots: id  0 | task 1996 | n_tokens = 405, memory_seq_rm [405, end)
slot update_slots: id  0 | task 1996 | prompt processing progress, n_tokens = 469, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  0 | task 1996 | prompt done, n_tokens = 469, batch.n_tokens = 64
slot update_slots: id  0 | task 1996 | created context checkpoint 1 of 8 (pos_min = 0, pos_max = 404, size = 52.212 MiB)

Really clever model btw, it one-shotted a Tetris Python game without issues.

[image attached]

That's cool, but that's not what's going to get merged into llama.cpp. The PR I linked will become the official llama.cpp / GGUF implementation, and it no longer works with these files. They will need to be regenerated when support lands officially.

Thank you for telling me. I'll regenerate the files when the PR is merged, unless someone else makes a quantization too, in which case I'll remove this repo.

I remade the Q6_K locally from scratch with the latest changes from the PR, and I find it works better than my original Q6_K.

The PR is already merged.

I started the script to convert and upload the model; the quantized files will be uploaded soon (about 1 h for the Q2_K, and I think about 15-30 minutes between each quantization type).
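
For anyone reproducing the requants, the process is roughly one HF-to-GGUF conversion followed by one llama-quantize pass per type. A rough sketch under those assumptions: script names and flags follow the usual llama.cpp workflow and may differ between versions, and the paths and quant list are placeholders:

```python
# Convert-and-quantize outline, assuming a llama.cpp checkout that already
# contains the merged Step-3.5-Flash support and a built llama-quantize binary.
import subprocess

hf_model_dir = "Step-3.5-Flash"          # local HF checkout (placeholder)
base_gguf = "Step-3.5-Flash-BF16.gguf"   # full-precision intermediate
quant_types = ["Q2_K", "Q4_K_S", "Q4_K_M", "Q6_K"]

# 1. Convert the HF safetensors to a full-precision GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", hf_model_dir,
     "--outfile", base_gguf, "--outtype", "bf16"],
    check=True,
)

# 2. Quantize the base GGUF once per target type.
for qtype in quant_types:
    out = f"Step-3.5-Flash-{qtype}.gguf"
    subprocess.run(["./llama-quantize", base_gguf, out, qtype], check=True)
```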

Thank you for this.
