Doesn't work - should wait for PR #19283

#2 opened by ddh0

These GGUFs no longer work - we should wait for the PR to be officially merged, and then the model will need to be reconverted: https://github.com/ggml-org/llama.cpp/pull/19283

llama_model_load: error loading model: error loading model hyperparameters: key not found in model: step35.swiglu_clamp_exp
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/media/T9/gguf/Step-3.5-Flash-Q4_K_S-00001-of-00003.gguf'
srv    load_model: failed to load model, '/media/T9/gguf/Step-3.5-Flash-Q4_K_S-00001-of-00003.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
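
If anyone wants to check their local splits, the missing key can be inspected straight from the GGUF metadata without loading the model. A minimal sketch, assuming the gguf Python package from llama.cpp's gguf-py is installed (`pip install gguf`); the path is just a placeholder:

```python
# Check whether a GGUF split already carries the metadata key the loader wants.
from gguf import GGUFReader

# Placeholder path; point this at the first split of the model.
path = "Step-3.5-Flash-Q4_K_S-00001-of-00003.gguf"

reader = GGUFReader(path)
keys = list(reader.fields.keys())
print(f"{len(keys)} metadata keys found")

missing = "step35.swiglu_clamp_exp"
print(f"{missing} present: {missing in reader.fields}")

# Dump all architecture-specific keys for comparison against the new converter output.
for key in keys:
    if key.startswith("step35."):
        print(" ", key)
```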

Hey, I tested Q4_K_M and it's working here, building llama.cpp from the "fork" in the official repo: https://github.com/stepfun-ai/Step-3.5-Flash/tree/main/llama.cpp.

slot launch_slot_: id  1 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  1 | task 1421 | processing task
slot update_slots: id  1 | task 1421 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 19
slot update_slots: id  1 | task 1421 | need to evaluate at least 1 token for each active slot (n_past = 19, task.n_tokens() = 19)
slot update_slots: id  1 | task 1421 | n_past was set to 18
slot update_slots: id  1 | task 1421 | n_tokens = 18, memory_seq_rm [18, end)
slot update_slots: id  1 | task 1421 | prompt processing progress, n_tokens = 19, batch.n_tokens = 1, progress = 1.000000
slot update_slots: id  1 | task 1421 | prompt done, n_tokens = 19, batch.n_tokens = 1
slot print_timing: id  1 | task 1421 | 
prompt eval time =      31.16 ms /     1 tokens (   31.16 ms per token,    32.10 tokens per second)
       eval time =   14613.19 ms /   574 tokens (   25.46 ms per token,    39.28 tokens per second)
      total time =   14644.35 ms /   575 tokens
slot      release: id  1 | task 1421 | stop processing: n_tokens = 592, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 10.10.0.114 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist 
slot launch_slot_: id  0 | task 1996 | processing task
slot update_slots: id  0 | task 1996 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 469
slot update_slots: id  0 | task 1996 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 1996 | prompt processing progress, n_tokens = 405, batch.n_tokens = 405, progress = 0.863539
slot update_slots: id  0 | task 1996 | n_tokens = 405, memory_seq_rm [405, end)
slot update_slots: id  0 | task 1996 | prompt processing progress, n_tokens = 469, batch.n_tokens = 64, progress = 1.000000
slot update_slots: id  0 | task 1996 | prompt done, n_tokens = 469, batch.n_tokens = 64
slot update_slots: id  0 | task 1996 | created context checkpoint 1 of 8 (pos_min = 0, pos_max = 404, size = 52.212 MiB)

Really clever model btw, it one-shotted a Tetris Python game without issues.

[image attached]

That's cool, but that's not what's going to get merged into llama.cpp. The PR I linked will become the official llama.cpp / GGUF implementation, and it no longer works with these files. They will need to be regenerated when support lands officially.

Thank you for telling me. I'll regenerate the files when the PR is merged, unless someone else makes a quantization too, in which case I'll remove this repo.

I remade the Q6_K locally from scratch with the latest changes from the PR, and I find it works better than my original Q6_K.

The PR is already merged.

I started the script to convert and upload the model; the quantized files will be uploaded soon (about 1 h for the Q2_K, and I think about 15-30 minutes between each quantization type).
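
For anyone reproducing the requants, the process is roughly one HF-to-GGUF conversion followed by one llama-quantize pass per type. A rough sketch under those assumptions: script names and flags follow the usual llama.cpp workflow and may differ between versions, and the paths and quant list are placeholders:

```python
# Convert-and-quantize outline, assuming a llama.cpp checkout that already
# contains the merged Step-3.5-Flash support and a built llama-quantize binary.
import subprocess

hf_model_dir = "Step-3.5-Flash"          # local HF checkout (placeholder)
base_gguf = "Step-3.5-Flash-BF16.gguf"   # full-precision intermediate
quant_types = ["Q2_K", "Q4_K_S", "Q4_K_M", "Q6_K"]

# 1. Convert the HF safetensors to a full-precision GGUF.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", hf_model_dir,
     "--outfile", base_gguf, "--outtype", "bf16"],
    check=True,
)

# 2. Quantize the base GGUF once per target type.
for qtype in quant_types:
    out = f"Step-3.5-Flash-{qtype}.gguf"
    subprocess.run(["./llama-quantize", base_gguf, out, qtype], check=True)
```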

Thank you for this.
