FIXED: Error with llama-server `unknown pre-tokenizer type: 'deepseek-r1-qwen'`
FIXED
Turns out that a few months ago, when llama.cpp switched from Make to cmake, I kept using my stale llama-server instead of the fresh one in build/bin/llama-server. I noticed my git sha didn't line up with the binary. Haha.. Ooops.
Everything is working well now!
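For anyone else who hits this: a quick sanity check is to compare the sha from git with the one the server prints in its build: line at startup. A minimal sketch, assuming a standard cmake checkout layout (the --version flag is my assumption; the startup log prints the same build info either way):
$ git rev-parse --short HEAD             # sha of the checked-out source
$ ./build/bin/llama-server --version     # should report the same sha in its build info
# if the two don't match, you're launching a stale binary from an old Make build or elsewhere on your PATH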
Original Post
Anyone else getting this error on llama-server? I just pulled latest, rebuilt, and ran the following command.
I'm currently downloading the unsloth version to see if that one works.
Will keep trawling r/LocalLLaMA for any other tips. Thanks!
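For completeness, "rebuilt" here means roughly the standard cmake flow from the llama.cpp docs (the GGML_CUDA flag is an assumption based on my single-GPU setup); the resulting binary lands in build/bin/:
$ git pull
$ cmake -B build -DGGML_CUDA=ON
$ cmake --build build --config Release -j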
Command
$ git rev-parse --short HEAD
80d0d6b4
$ ./llama-server \
--model "../models/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf" \
--n-gpu-layers 99 \
--ctx-size 8192 \
--parallel 1 \
--cache-type-k f16 \
--cache-type-v f16 \
--threads 16 \
--flash-attn \
--mlock \
--host 127.0.0.1 \
--port 8080
Trace
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
build: 3985 (524afeec) with cc (GCC) 14.2.1 20240910 for x86_64-pc-linux-gnu
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 31
main: loading model
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090 Ti) - 23083 MiB free
llama_model_loader: loaded meta data with 30 key-value pairs and 771 tensors from ../models/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 4: general.size_label str = 32B
llama_model_loader: - kv 5: qwen2.block_count u32 = 64
llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
llama_model_loader: - kv 7: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 22: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 23: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 24: general.quantization_version u32 = 2
llama_model_loader: - kv 25: general.file_type u32 = 15
llama_model_loader: - kv 26: quantize.imatrix.file str = /models_out/DeepSeek-R1-Distill-Qwen-...
llama_model_loader: - kv 27: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 28: quantize.imatrix.entries_count i32 = 448
llama_model_loader: - kv 29: quantize.imatrix.chunks_count i32 = 128
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q6_K: 65 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'deepseek-r1-qwen'
llama_load_model_from_file: failed to load model
common_init_from_params: failed to load model '../models/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf'
srv load_model: failed to load model, '../models/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf'
main: exiting due to model loading error
It says build: 3985 (524afeec) in your output.. can you double-check that you're building properly?
oh just saw your update, glad you sorted it :)
Wow, latest llama.cpp with your quants is blowing away vllm with bnb-4bit at the moment on my hardware. Really appreciate your efforts in the community!
llama.cpp w/ Q4_K_M with 16k context gives ~38 tok/sec
vllm w/ bnb-4bit with 8k context gives ~23 tok/sec
./llama-server \
--model "../models/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF/DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf" \
--n-gpu-layers 65 \
--ctx-size 16384 \
--parallel 1 \
--cache-type-k f16 \
--cache-type-v f16 \
--threads 16 \
--flash-attn \
--mlock \
--host 127.0.0.1 \
--port 8080
prompt eval time = 166.66 ms / 224 tokens ( 0.74 ms per token, 1344.03 tokens per second)
eval time = 105753.57 ms / 4096 tokens ( 25.82 ms per token, 38.73 tokens per second)
total time = 105920.24 ms / 4320 tokens
woah that's quite surprising/impressive :O