Cannot run on llama.cpp and koboldcpp
I downloaded the q5_1.bin and tried to run it in llama.cpp and koboldcpp, but it does not work.
I have checked the file's SHA256 and it matches.
Here is llama.cpp's error output:
```
main -m ./models/starcoder-ggml-q5_1.bin -t 12 -n -1 -c 2048 --keep -1 --repeat_last_n 2048 --top_k 160 --top_p 0.95 --color -ins -r "User:" --keep -1 --interactive-first
main: build = 536 (cdd5350)
main: seed = 1684312164
llama.cpp: loading model from ./models/starcoder-ggml-q5_1.bin
error loading model: missing tok_embeddings.weight
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/starcoder-ggml-q5_1.bin'
main: error: unable to load model
```
The SHA256 check:
```
certutil -hashfile starcoder-ggml-q5_1.bin SHA256
SHA256 hash of starcoder-ggml-q5_1.bin:
c52e0cd23878c3373a8a7f6adb484a00dbae11b1d6bbd84aa20e82378cbb4bfa
CertUtil: -hashfile command completed successfully.
```
And here is koboldcpp's output:
```
Welcome to KoboldCpp - Version 1.21.1
For command line arguments, please refer to --help
Otherwise, please manually select ggml file:
Attempting to use OpenBLAS library for faster prompt ingestion. A compatible libopenblas will be required.
Initializing dynamic library: koboldcpp_openblas.dll
Loading model: D:\program\koboldcpp\starcoder-ggml-q5_1.bin
[Threads: 12, BlasThreads: 12, SmartContext: True]
Identified as GPT-NEO-X model: (ver 401)
Attempting to Load...
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
stablelm_model_load: loading model from 'D:\program\koboldcpp\starcoder-ggml-q5_1.bin' - please wait ...
stablelm_model_load: n_vocab = 49152
stablelm_model_load: n_ctx = 8192
stablelm_model_load: n_embd = 6144
stablelm_model_load: n_head = 48
stablelm_model_load: n_layer = 40
stablelm_model_load: n_rot = 1009
stablelm_model_load: ftype = 49152
GGML_ASSERT: ggml.c:3446: wtype != GGML_TYPE_COUNT
```
There is an issue about this in the llama.cpp repo: https://github.com/ggerganov/llama.cpp/issues/1441. llama.cpp only loads LLaMA-architecture models, which is why it fails on the missing tok_embeddings.weight tensor; StarCoder uses a different, GPT-2-style architecture that llama.cpp does not recognize.
For now, the only example code is here: https://github.com/ggerganov/ggml/tree/master/examples/starcoder
This code works, but it is not very useful: it loads the model, generates a reply to a single prompt, and shuts down. I keep experimenting with this code to get a conversation loop, but I am having trouble with it; it looks like I have not figured out how to correctly manage memory. It breaks after a single iteration of the loop with "not enough memory in context". I will see if I can do better.
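Here is roughly the shape of the loop I am experimenting with. This is only a sketch against the helpers from the ggml starcoder example (`starcoder_model_load`, `starcoder_eval`, `gpt_tokenize`, `gpt_sample_top_k_top_p` from `common.h`); I am assuming their signatures from the repo, and the reset-on-full-context guard is my own workaround attempt, not a confirmed fix:

```cpp
// Conversation-loop sketch around the ggml starcoder example.
// Assumes the types/helpers defined in examples/starcoder/main.cpp and
// examples/common.h (starcoder_model, starcoder_model_load, starcoder_eval,
// gpt_vocab, gpt_tokenize, gpt_sample_top_k_top_p); signatures may drift.
#include "common.h"

#include <cstdio>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main() {
    gpt_vocab vocab;
    starcoder_model model;
    if (!starcoder_model_load("starcoder-ggml-q5_1.bin", model, vocab)) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    std::mt19937 rng(1234);
    std::vector<float> logits;

    // Warm-up eval: the example does this once to measure mem_per_token,
    // which later calls use to size the eval scratch buffer.
    size_t mem_per_token = 0;
    starcoder_eval(model, 12, 0, {0, 1, 2, 3}, logits, mem_per_token);

    int n_past = 0; // tokens already stored in the model's KV cache

    for (;;) {
        printf("User: ");
        fflush(stdout);
        std::string line;
        if (!std::getline(std::cin, line)) break;

        std::vector<gpt_vocab::id> embd = gpt_tokenize(vocab, line + "\n");

        // The KV cache holds at most n_ctx tokens; evaluating past that is
        // one way to blow up the ggml context. Resetting is crude (the
        // model forgets the conversation) but keeps the loop alive.
        if (n_past + (int) embd.size() >= model.hparams.n_ctx) {
            fprintf(stderr, "[context full, resetting]\n");
            n_past = 0;
        }

        for (int i = 0; i < 256 && n_past < model.hparams.n_ctx; i++) {
            if (!starcoder_eval(model, 12, n_past, embd, logits, mem_per_token)) {
                fprintf(stderr, "eval failed\n");
                return 1;
            }
            n_past += embd.size();

            // The example only returns logits for the last evaluated token.
            const int n_vocab = model.hparams.n_vocab;
            const gpt_vocab::id id = gpt_sample_top_k_top_p(
                vocab, logits.data() + logits.size() - n_vocab,
                40, 0.95, 0.80, rng);
            if (id == 0) break; // 0 is <|endoftext|> in the StarCoder vocab

            printf("%s", vocab.id_to_token[id].c_str());
            fflush(stdout);
            embd = {id}; // next iteration feeds back only the sampled token
        }
        printf("\n");
    }
    return 0;
}
```

My guess is that the single-prompt example never has to track `n_past` against `n_ctx`, and overflowing the KV cache is at least one way to hit that memory error; the scratch-buffer sizing inside `starcoder_eval` may also need adjusting.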
Also related: https://github.com/LostRuins/koboldcpp/issues/181
Koboldcpp now supports StarCoder ggml models.