Is it the same architecture as GLM 4.5?

#3 opened by AliceThirty

I can load GLM-4.5 perfectly with the latest version of koboldcpp (a fork of llama.cpp), but I can't load GLM-4.6: it says a tensor is missing. I noticed you fixed the chat template; is that related? I verified the hash of each part file before merging them with llama-gguf-split.exe, and I tried both UD-Q3_K_XL and UD-Q4_K_XL, but I get the same error. Should I raise this issue on the koboldcpp GitHub instead?
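In case anyone wants to repeat the hash check on the split parts before merging, here is a rough Python sketch; the folder and file-name pattern are just placeholders from my setup, and the digests have to be compared against the checksums listed on the model page:

```python
# Rough sketch: SHA-256 each split part before merging, then compare the
# digests against the checksums published on the model page.
# The folder and glob pattern below are placeholders from my setup.
import hashlib
from pathlib import Path

parts = sorted(Path(r"C:\models").glob("GLM-4.6-UD-Q3_K_XL-*-of-*.gguf"))
for part in parts:
    h = hashlib.sha256()
    with open(part, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 24), b""):  # read in 16 MiB chunks
            h.update(chunk)
    print(part.name, h.hexdigest())
```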

***
Welcome to KoboldCpp - Version 1.99.4
Loading .kcpps configuration file...
Overriding Config Value: gpulayers
Overriding Config Value: quiet
System: Windows 10.0.26100 AMD64 AMD64 Family 26 Model 68 Stepping 0, AuthenticAMD
Detected Available GPU Memory: 32607 MB
Detected Available RAM: 183799 MB
Initializing dynamic library: koboldcpp_cublas.dll
Loading Text Model: C:\models\GLM-4.6-UD-Q3_K_XL.gguf

The reported GGUF Arch is: glm4moe
Arch Category: 9

---
Identified as GGUF model.
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
CUDA MMQ: True

Applying Tensor Split...
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes (only for first launch)...
---
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
  Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 30841 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 5090) (0000:03:00.0) - 30841 MiB free
llama_model_loader: loaded meta data with 57 key-value pairs and 1759 tensors from C:\models\GLM-4.6-UD-Q3_K_XL.gguf (version GGUF V3 (latest))
print_info: file format = GGUF V3 (latest)
print_info: file size   = 147.21 GiB (3.54 BPW)
init_tokenizer: initializing tokenizer for type 2
load: special_eot_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special_eom_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 151329 ('<|endoftext|>')
load:   - 151336 ('<|user|>')
load:   - 151338 ('<|observation|>')
load: special tokens cache size = 36
load: token to piece cache size = 0.9713 MB
print_info: arch             = glm4moe
print_info: vocab_only       = 0
print_info: n_ctx_train      = 202752
print_info: n_embd           = 5120
print_info: n_layer          = 93
print_info: n_head           = 96
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 12
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 160
print_info: n_expert_used    = 8
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 202752
print_info: rope_finetuned   = unknown
print_info: model type       = 355B.A32B
print_info: model params     = 356.79 B
print_info: general.name     = Glm-4.6
print_info: vocab type       = BPE
print_info: n_vocab          = 151552
print_info: n_merges         = 318088
print_info: BOS token        = 151331 '[gMASK]'
print_info: EOS token        = 151329 '<|endoftext|>'
print_info: EOT token        = 151336 '<|user|>'
print_info: EOM token        = 151338 '<|observation|>'
print_info: UNK token        = 151329 '<|endoftext|>'
print_info: PAD token        = 151330 '[MASK]'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151347 '<|code_prefix|>'
print_info: FIM SUF token    = 151349 '<|code_suffix|>'
print_info: FIM MID token    = 151348 '<|code_middle|>'
print_info: EOG token        = 151329 '<|endoftext|>'
print_info: EOG token        = 151336 '<|user|>'
print_info: EOG token        = 151338 '<|observation|>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = false)
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 35389440 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 2949120 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 2949120 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 35389440 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_exps.weight (size = 540672000 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_exps.weight (size = 540672000 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_exps.weight (size = 540672000 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 4423680 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 5406720 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 4423680 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 22528000 bytes) -- ignoring
llama_model_load: error loading model: missing tensor 'blk.92.nextn.embed_tokens.weight'
llama_model_load_from_file_impl: failed to load model
Process Process-2:
Traceback (most recent call last):
  File "multiprocessing\process.py", line 315, in _bootstrap
  File "multiprocessing\process.py", line 108, in run
  File "koboldcpp.py", line 7230, in kcpp_main_process
  File "koboldcpp.py", line 1445, in load_model
OSError: exception: access violation reading 0x0000000000000004
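
For what it's worth, the merged file can also be inspected independently of koboldcpp, e.g. with the `gguf` Python package (`pip install gguf`). This is only a sketch with the path from my setup, but it helps tell a truncated or bad merge apart from a loader/arch mismatch:

```python
# Sketch: inspect the merged GGUF independently of the loader.
# Path is from my setup; requires `pip install gguf`.
from gguf import GGUFReader

reader = GGUFReader(r"C:\models\GLM-4.6-UD-Q3_K_XL.gguf")
names = [t.name for t in reader.tensors]

print("tensor count:", len(names))  # the log above also reports 1759 tensors
print([n for n in names if n.startswith("blk.92.")])  # what layer 92 actually contains
```

If the tensor count matches what the loader reported, the merge itself is almost certainly fine and the error is on the loader side.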

I am receiving the same error: "llama_model_load: error loading model: missing tensor 'blk.92.nextn.embed_tokens.weight'" when trying to load it via llama.cpp/llama-server. This is with UD-Q2_K_XL.

OK, my version of llama.cpp produced the same error, but the latest version of llama.cpp works. I suppose koboldcpp will be updated soon to include the recent changes from llama.cpp.

Yup, looks like the issue has been addressed by https://github.com/ggml-org/llama.cpp/pull/16359
Running the latest llama.cpp fixes it for me as well.

@AliceThirty

Upgrading to the newest version of llama.cpp seems to have fixed it for me, too.

Unsloth AI org

Yep, please rebuild llama.cpp from source!
