Is it the same architecture as GLM 4.5?

#3 opened by AliceThirty

I can load GLM-4.5 perfectly with the latest version of koboldcpp (a fork of llama.cpp), but I can't load GLM-4.6: it says a tensor is missing. I noticed you fixed the chat template; is that related? I verified the hash of each part file before merging them with llama-gguf-split.exe, and I tried both UD-Q3_K_XL and UD-Q4_K_XL, but I get the same error. Should I raise this issue on the koboldcpp GitHub instead?
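In case anyone wants to repeat the hash check on the split parts before merging, here is a rough Python sketch; the folder and file-name pattern are just placeholders from my setup, and the digests have to be compared against the checksums listed on the model page:

```python
# Rough sketch: SHA-256 each split part before merging, then compare the
# digests against the checksums published on the model page.
# The folder and glob pattern below are placeholders from my setup.
import hashlib
from pathlib import Path

parts = sorted(Path(r"C:\models").glob("GLM-4.6-UD-Q3_K_XL-*-of-*.gguf"))
for part in parts:
    h = hashlib.sha256()
    with open(part, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 24), b""):  # read in 16 MiB chunks
            h.update(chunk)
    print(part.name, h.hexdigest())
```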

***
Welcome to KoboldCpp - Version 1.99.4
Loading .kcpps configuration file...
Overriding Config Value: gpulayers
Overriding Config Value: quiet
System: Windows 10.0.26100 AMD64 AMD64 Family 26 Model 68 Stepping 0, AuthenticAMD
Detected Available GPU Memory: 32607 MB
Detected Available RAM: 183799 MB
Initializing dynamic library: koboldcpp_cublas.dll
Loading Text Model: C:\models\GLM-4.6-UD-Q3_K_XL.gguf

The reported GGUF Arch is: glm4moe
Arch Category: 9

---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True

Applying Tensor Split...
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
---
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 30841 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 5090) (0000:03:00.0) - 30841 MiB free
llama_model_loader: loaded meta data with 57 key-value pairs and 1759 tensors from C:\models\GLM-4.6-UD-Q3_K_XL.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size   = 147.21 GiB (3.54 BPW)
init_tokenizer: initializing tokenizer for type 2
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
load:   - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch             = glm4moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 202752
print_info: n_embd           = 5120
print_info: n_layer          = 93
print_info: n_head           = 96
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 12
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 160
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 202752
print_info: rope_finetuned   = unknown
print_info: model type       = 355B.A32B
print_info: model params     = 356.79 B
print_info: general.name     = Glm-4.6
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151331 '[gMASK]'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: EOM token        = 151338 '<|observation|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151330 '[MASK]'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151347 '<|code_prefix|>'
print_info: FIM SUF token    = 151349 '<|code_suffix|>'
print_info: FIM MID token    = 151348 '<|code_middle|>'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: EOG token        = 151338 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false)
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 35389440 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 2949120 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 2949120 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 35389440 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_exps.weight (size = 540672000 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_exps.weight (size = 540672000 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_exps.weight (size = 540672000 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 4423680 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 5406720 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 4423680 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 22528000 bytes) -- ignoring
llama_model_load: error loading model: missing tensor 'blk.92.nextn.embed_tokens.weight'
llama_model_load_from_file_impl: failed to load model
Process Process-2:
Traceback (most recent call last):
  File "multiprocessing\process.py", line 315, in _bootstrap
  File "multiprocessing\process.py", line 108, in run
  File "koboldcpp.py", line 7230, in kcpp_main_process
  File "koboldcpp.py", line 1445, in load_model
OSError: exception: access violation reading 0x0000000000000004
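
For what it's worth, the merged file can also be inspected independently of koboldcpp, e.g. with the `gguf` Python package (`pip install gguf`). This is only a sketch with the path from my setup, but it helps tell a truncated or bad merge apart from a loader/arch mismatch:

```python
# Sketch: inspect the merged GGUF independently of the loader.
# Path is from my setup; requires `pip install gguf`.
from gguf import GGUFReader

reader = GGUFReader(r"C:\models\GLM-4.6-UD-Q3_K_XL.gguf")
names = [t.name for t in reader.tensors]

print("tensor count:", len(names))  # the log above also reports 1759 tensors
print([n for n in names if n.startswith("blk.92.")])  # what layer 92 actually contains
```

If the tensor count matches what the loader reported, the merge itself is almost certainly fine and the error is on the loader side.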

I am receiving the same error: "llama_model_load: error loading model: missing tensor 'blk.92.nextn.embed_tokens.weight'" when trying to load it via llama.cpp/llama-server. This is with UD-Q2_K_XL.

OK, my version of llama.cpp produced the same error, but the latest version of llama.cpp works. I suppose koboldcpp will be updated soon to include the recent changes from llama.cpp.

Yup, looks like the issue has been addressed by https://github.com/ggml-org/llama.cpp/pull/16359
Running the latest llama.cpp fixes it for me as well.

@AliceThirty

Upgrading to the newest version of llama.cpp seems to have fixed it for me, too.

Unsloth AI org

Yep, please rebuild llama.cpp from source!
