Awesome!
Can you share how you ran the repo's script? What parameters? I assume it’s a 25% cull?
What hardware did it require?
Is there any other data you can share, like data used to select the experts to cull?
I assume it needs hardware that can run the full/FP8 PyTorch model. I'd love a GGUF of any of these REAPs at least: higher quants / faster speeds. EXL3 in 96 GB is looking good too at the smaller size.
The method itself, I think, uses inference to figure out which parameters to cut, so it isn't blind.
Yeah.
I’m asking because even a 10-15% prune would be perfect for me (depending on how lossy it actually is), and if the “list” of experts to cull is already made, perhaps other sizes could be made too.
I’m also interested in the reference dataset. Was it coding, which seems to be the suggested default in the scripts? Perhaps optimal culls for different tasks are different.
Yeah, I'd hate for it to be coding when I'm trying to do conversations. Without quants I have no way of trying them out; it's too much to download. They could be perfect or completely awful.
I made some real scuffed modifications to the code via fork: https://github.com/AesSedai/reap
Then I rented some cloud compute to perform the REAP; here are the approximate steps I recorded. The whole NVMe formatting bit can be ignored, it was just something to do with how the instance was set up with attached unformatted storage:
git clone https://github.com/AesSedai/reap.git
# python.h header files needed for zstandard, needed for helm (https://github.com/stanford-crfm/helm.git)
sudo apt-get update
sudo apt-get install python3.12-dev
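# (assumes you've already cd'd into the cloned repo and created a virtualenv first,
#  e.g. cd reap && python3 -m venv .venv, with dependencies installed per the REAP README)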
source .venv/bin/activate
# format nvme storage, 3.5TB per disk so one disk is fine
lsblk
sudo parted /dev/nvme1n1
sudo parted /dev/nvme1n1 mklabel gpt
sudo parted -a opt /dev/nvme1n1 mkpart primary ext4 0% 100%
sudo mkfs.ext4 /dev/nvme1n1p1
sudo mkdir /mnt/slush
sudo mount /dev/nvme1n1p1 /mnt/slush
lsblk -f
df -h /mnt/slush
sudo mkdir -p /mnt/slush/pruned_models
sudo chown -R user:user /mnt/slush/
ln -s /mnt/slush/pruned_models /home/user/reap/artifacts/GLM-4.6/evol-codealpaca-v1/pruned_models
python ./scripts/patch_glm.py
bash experiments/pruning-cli.sh 0,1,2,3,4,5,6,7 zai-org/GLM-4.6 reap 42 0.25 theblackcat102/evol-codealpaca-v1 true true true false false
The `lm_eval` worked, but `evalplus` ran into issues cloning the dataset down from GitHub, so I didn't have time to finish troubleshooting the issue there.
Is there a way to get an AWQ version of this? That would be awesome, but I cannot find one.
Awesome! Thanks for the steps.
bash experiments/pruning-cli.sh 0,1,2,3,4,5,6,7 zai-org/GLM-4.6 reap 42 0.25 theblackcat102/evol-codealpaca-v1 true true true false false
The `lm_eval` worked, but `evalplus` ran into issues cloning the dataset down from GitHub, so I didn't have time to finish troubleshooting the issue there.
What do you mean by this? Do you change 'theblackcat102/evol-codealpaca-v1' to some lm_eval benchmark, and it will use that for the pruning?
theblackcat102/evol-codealpaca-v1 is the dataset used for producing the list of experts to prune, if I understand the REAP code correctly. That is the default dataset arg they provided in their repo's README, so I just went with that.
lm_eval uses its own defaults; those five `true true true false false` flags at the end are what tell it which benchmarks to run, in order:
- lm_eval
- evalplus
- livecodebench
- math
- wildbench
It ran the lm_eval benchmark successfully, but it failed on the evalplus benchmark because the host I was running on couldn't set up the evalplus dataset: some HTTP error prevented it from downloading from GitHub. Not sure if it was an odd, possibly region- or IP-specific issue, since I was able to git clone from GitHub to set up REAP on the same host. But I kept getting an HTTP connection reset and it wasn't able to download.
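For reference, here's my reading of how the positional arguments in that invocation map; the middle ones are my best guess from the script, so treat them as approximate:
# positional args, as I read them:
#   1: 0,1,2,3,4,5,6,7                   -> GPU/device ids
#   2: zai-org/GLM-4.6                   -> model to prune
#   3: reap                              -> pruning method
#   4: 42                                -> presumably a random seed
#   5: 0.25                              -> prune ratio (the 25% cull)
#   6: theblackcat102/evol-codealpaca-v1 -> calibration dataset used to pick which experts to cull
#   7-11: true true true false false     -> lm_eval, evalplus, livecodebench, math, wildbench
bash experiments/pruning-cli.sh 0,1,2,3,4,5,6,7 zai-org/GLM-4.6 reap 42 0.25 theblackcat102/evol-codealpaca-v1 true true true false false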
Hah, sounds like one of those “works on our lab machine at this point in time” kind of repo things.
Thanks.
I want to try this with tiny MoEs and see if other datasets work (and prune a notably different set of experts), then rent something to try on 4.6.
Judging from the way the git submodules are configured in the original repo, pointing to a private user's forks of the upstream dependencies, that's probably true. One of the first changes I had to make was to point to the public upstream git repos for lm_eval, evalplus, helm, etc.
I am stumbling my way around preserving MTP tensors during REAP here: ddh0/reap:mtp. AesSedai now has push access to this fork as well, so hopefully we can get something working soon. 🤞
then rent something to try on 4.6
I haven't looked at the code yet, but I wasn't able to create an exllamav3 quant of this one, due to the number of experts (120) not being divisible by 32.
So that might be something to consider when you prune GLM-4.6 (e.g. keeping 128 experts instead of 120).
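Quick arithmetic, assuming GLM-4.6 has 160 routed experts per layer (which matches the 120 left after a 25% prune):
160 × (1 − 0.25) = 120 experts  (not divisible by 32)
160 × (1 − 0.20) = 128 experts  (divisible by 32, exl3-friendly)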
I am stumbling my way around preserving MTP tensors during REAP
Are these necessary for a quant, or more for future use when MTP gets implemented in one of the popular inference engines?
I am stumbling my way around preserving MTP tensors during REAP
Are these necessary for a quant, or more for future use when MTP gets implemented in one of the popular inference engines?
Yes, llama.cpp expects them to be present and won't load without them. We could maybe patch llama.cpp, but I think it's better to preserve the MTP tensors to avoid having to re-quant later when MTP becomes supported.
Yes, llama.cpp expects them to be present and won't load without them.
Oh, I managed to create a gguf without them and it seems coherent.
That's just a quick Q2_K with no imatrix ^
Just swap this from 1 -> 0 before running the conversion script:
https://huggingface.co/AesSedai/GLM-4.6-REAP-266B-A32B/blob/main/config.json#L32
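Roughly, for anyone following along (a sketch; it assumes the field on that config.json line is num_nextn_predict_layers, the local directory name is a placeholder, and the standard llama.cpp convert script is being used):
# set the MTP/nextn layer count to 0 so the converter skips those tensors
python3 -c "import json; p='GLM-4.6-REAP-266B-A32B/config.json'; c=json.load(open(p)); c['num_nextn_predict_layers']=0; json.dump(c, open(p,'w'), indent=2)"
# then convert as usual, e.g.:
python3 convert_hf_to_gguf.py GLM-4.6-REAP-266B-A32B --outtype bf16 --outfile GLM-4.6-REAP-266B-A32B-bf16.gguf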
Oh, that's good to know. But I think for a proper release it's better to keep the tensors included to support MTP once it's implemented
Yes, llama.cpp expects them to be present and won't load without them.
Oh, I managed to create a gguf without them and it seems coherent.
That's just a quick Q2_K with no imatrix ^
Just swap this from 1 -> 0 before running the conversion script:
https://huggingface.co/AesSedai/GLM-4.6-REAP-266B-A32B/blob/main/config.json#L32
Interesting, good to know that's a workaround at least!
Oh, that's good to know. But I think for a proper release it's better to keep the tensors included to support MTP once it's implemented
MTP isn't going to help you on hybrid inference. The PR in llama.cpp is already proving that out. Same as how nobody got much juice from DeepSeek MTP. Since they never pruned the MTP layer (prolly how this got started), it's not likely to be functional even if you hack it back in.
Kudos on seeing the Q2K.. just holding out for larger Q3/Q4 quants. I'm using the big one at Q3K_XL, so IQ4_NL or one of them.. whatever is around 120-130 GB with imatrix. The EXL3 is probably going to be lit as well. No more 2.01 bpw, maybe I get into the 3s. This would be done already, but I dunno who I have to shank for better internet. So... the question is...
How is it?
I'm using the big one at Q3K_XL, so IQ4_NL or one of them..
Check out this quant if you haven't already: Downtown-Case/GLM-4.6-128GB-RAM-IK-GGUF.
It seems like the best bang-for-buck if you can run it.
Kudos on seeing the Q2K..
just holding out for larger Q3/Q4 quants.
I'll put the Q4_K up for a little while then (until I get close to the HF public storage limit again), but these guys are probably going to make proper imatrix'd quants once they get MTP re-implemented.
How is it?
I only tested it briefly; it seemed "normal" to me. I didn't imatrix it. I doubt a pruned model like this, without retraining, will be better than the big one at Q3K_XL.
Well, the interesting question is where the “crossover” is.
If one can only run a Q2KL mix, would pruning 12% and jumping up above 3 bpw be better? The paper certainly suggests so. There’s a steep cliff between IQ2KL and IQ3KS.
Same with exl3, as dropping from 3 bpw to 2 is painful.
Even if the losses are really domain-specific (with the default being Alpaca code-style questions), that's still interesting, as prunes could be “specialized” with no retraining.
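Rough napkin math on the crossover, assuming ~355B total params for full GLM-4.6 and ~266B after the 25% REAP (the bpw values are illustrative):
full:      355B params × 2.5 bpw / 8 ≈ 111 GB
25% REAP:  266B params × 3.3 bpw / 8 ≈ 110 GB
So at a roughly fixed ~110 GB footprint, the prune buys about 0.8 bpw of quantization headroom, which is exactly the IQ2-to-IQ3 neighborhood in question.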
Oh, that's good to know. But I think for a proper release it's better to keep the tensors included to support MTP once it's implemented
MTP isn't going to help you on hybrid inference. The PR in llama.cpp is already proving that out. Same as how nobody got much juice from DeepSeek MTP. Since they never pruned the MTP layer (prolly how this got started), it's not likely to be functional even if you hack it back in.
I was arguing with someone about this the other day and suspected this, as the CPU doesn't have the “extra” compute to make MTP as cheap as it is on GPU.
Kudos on seeing the Q2K.. just holding out for larger Q3/Q4 quants. I'm using the big one at Q3K_XL, so IQ4_NL or one of them.. whatever is around 120-130 GB with imatrix. The EXL3 is probably going to be lit as well. No more 2.01 bpw, maybe I get into the 3s. This would be done already, but I dunno who I have to shank for better internet.
I can't make the imatrix with 128GB RAM (can I?), but I can make quants once someone else does.
So... the question is...
How is it?
That is an excellent question.
Is KLD testing this vs full GLM valid? Or are benchmarks the only reliable way?
I don't know if KLD is valid, but it's certainly going to be enlightening: KLD against the full model and KLD against the pruned model. Guess I better start downloading; it's probably going to take overnight. Q3K_XL is 158 GB and this is 162 GB, so I keep the speed the same but in theory gain fidelity.
Dataset was English, right? So that means we're shedding the CN experts. Something that was already done with Qwen in a more brutal way. IIRC experts only specialize in series of tokens like punctuation, etc.
Dataset was English, right? So that means we're shedding the CN experts.
No
Yeah I’m not sure I buy that. Don’t LLMs abstract language away, hence they pick up stuff even if trained in another language?
The testing will be enlightening. I am AFK, but will try some stuff with smaller MoEs.
It's not something to buy; it's the result of a similar experiment that kalomaze did: https://huggingface.co/kalomaze/Qwen3-16B-A3B/discussions/14
Using the model at Q4_K, it surprised me by getting some logic tests right that the full version (IQ3_KS and Z.AI API) get wrong...?
I can't make the imatrix with 128GB RAM (can I?), but I can make quants once someone else does.
I've created an imatrix from the Q8_0 using ubergarm's calibration data.
Note: That Q4_K quant in that repo is not calibrated though.
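For anyone wanting to reproduce that, a rough sketch with llama.cpp's tools (the file names are placeholders):
# compute the importance matrix from a Q8_0 of the pruned model over the calibration text
./llama-imatrix -m GLM-4.6-REAP-266B-A32B-Q8_0.gguf -f calibration_data.txt -o imatrix.dat
# then requantize smaller sizes using it
./llama-quantize --imatrix imatrix.dat GLM-4.6-REAP-266B-A32B-bf16.gguf GLM-4.6-REAP-266B-A32B-Q4_K_M.gguf Q4_K_M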
similar experiment that kalomaze did
That model lost a lot more than just non-English though lol.
I used the model a bit more now at Q4_K.
It seemed good for 1-shot translations, throw-away app generation, etc. But it seems to degrade quickly with multi-turn conversations.
Still coherent, but it misses information randomly (the full version at IQ2_KS doesn't do this).
There’s a lot of testing to be done (which I can only start in like two days), but I’m thinking:
- SimpleQA subset on full GLM and 75% REAP GLM, at the same quant, for general world knowledge.
- KLD between them and the Q8.
- A sanity check of the same thing with Cerebras's official releases. It's possible something tiny went wrong with 4.6 specifically.
Surprisingly little info on the Cerebras releases. You'd think people would give some feedback. I kinda like what happened to this model's writing and personality, especially OOD. Feels like some of the positivity bias got pruned. Wonder how the smaller one fares; will find out tomorrow.
Their claim that it's worse for creative writing while coding stays intact appears flipped.
They did some testing and in fact showed gains with GLM Air on WildBench (a more creative, general benchmark), and small losses for tool use.
Oh I get what you’re saying. The low community feedback is kinda par, heh.
The GLM 4.6 Air “distill” was a total placebo (a carbon copy of Air 4.5), yet it seemed to get a lot of vLLM deployments and such with no one even noticing for weeks. And this release is still new.
One thing I found is that full GLM knows who filian is, while the pruned one thinks it's a dude who plays Minecraft. A couple of other VTubers are intact. I checked lower temp with top-n-sigma and it persists; looks like that fact got pruned. My guess is that Alpaca-style coding is probably not the best dataset to use; a mix of creative/tools/coding might be better.
Thankful that a quant/prune was made at all; this method, however, needs some refinement. Make a dataset of slop and prune the inverse somehow?
I'm going to go with the "set nextn to 0" approach and quant a Q8, plus a q8/q5/q5/q6 mix, and run KLD on them compared to the rest of the GLM-4.6 data I've collected. Should have those results in a few hours.
With this text?
https://huggingface.co/ddh0/imatrices/blob/main/ddh0_imat_calibration_data_v2.txt
(For the reference of others in this thread)
For calibration and producing the imatrix, I use Bartowski's v3 data from here: https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8
For KLD and testing the reference logits, I use ddh0's imat calibration data yes.
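For reference, the KLD comparison itself can be done with llama.cpp's perplexity tool; roughly (model and file names are placeholders, with a Q8_0 as the example baseline):
# 1) save reference logits from the baseline model over the test text
./llama-perplexity -m GLM-4.6-REAP-266B-A32B-Q8_0.gguf -f ddh0_imat_calibration_data_v2.txt --kl-divergence-base logits.kld
# 2) score a smaller quant against those saved logits
./llama-perplexity -m GLM-4.6-REAP-266B-A32B-Q3_K.gguf --kl-divergence-base logits.kld --kl-divergence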
I did an AWQ conversion (https://huggingface.co/vkerkez/GLM-4.6-REAP-266B-A32B-awq-sym) and it’s working really well for me.
So logits are madly different. The high perplexity is a bit worrisome.
new challengers have appeared: https://huggingface.co/cerebras/GLM-4.6-REAP-252B-A32B-FP8
So I finally got home and got around to testing reap with Qwen 30B locally. Took me hours to troubleshoot random dependency issues, ugh.
But thanks for the fixes ddh0.
Anyway, I noticed they have several presets for pruning datasets:
--dataset_name {
m-a-p/CodeFeedback-Filtered-Instruction,
ise-uiuc/Magicoder-Evol-Instruct-110K,
allenai/c4,
theblackcat102/evol-codealpaca-v1,
euclaise/WritingPrompts_curated,
allenai/tulu-3-sft-personas-math,
combined
}
I tried it with euclaise/WritingPrompts_curated and will see how an exl3 quant evals.
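For anyone curious, the invocation is the same shape as the GLM one above, just with the dataset swapped; the model id and single-GPU arg here are my guesses, so adjust to taste:
bash experiments/pruning-cli.sh 0 Qwen/Qwen3-30B-A3B reap 42 0.25 euclaise/WritingPrompts_curated true false false false false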
Oh, it seems there are some other prunes floating around.
Is there any discussion on the effects of the dataset used to prune?
The 218B is shit. It literally forgot parts of English. I'm probably going to try the 268B; there is one quant of it in Q3. Nobody else has run PPL tests, only coding benchmarks. What datasets were used to do this one? I think ideally you want to do creative writing + instruction following + code together in even amounts. Did Cerebras use theblackcat102/evol-codealpaca-v1?
There are several dataset presets in the pruning script, including a short writing prompt dataset and a “combined” option. It’s also (seemingly) very quick to run, hence I’m going to run a few experiments on Qwen 30B.
My guess is the result is a quite specialized model.