Awesome!
Can you share how you ran the repo's script? What parameters? I assume it’s a 25% cull?
What hardware did it require?
Is there any other data you can share, like data used to select the experts to cull?
I assume it needs hardware that can run the full/FP8 PyTorch model. I'd love a GGUF of any of these REAPs at least: higher quants / faster speeds. EXL3 in 96 GB is looking good too at the smaller size.
The method itself, I think, uses inference to figure out which parameters to cut, so it isn't blind.
Yeah.
I’m asking because even a 10-15% prune would be perfect for me (depending on how lossy it actually is), and if the “list” of experts to cull is already made, perhaps other sizes could be made too.
I’m also interested in the reference dataset. Was it coding, which seems to be the suggested default in the scripts? Perhaps optimal culls for different tasks are different.
Yeah, I'd hate for it to be coding when I'm trying to do conversations. Without quants I have no way of trying them out; it's too much to download. They could be perfect or completely awful.
I made some real scuffed modifications to the code via fork: https://github.com/AesSedai/reap
Then I rented some cloud compute to perform the REAP; here are the approximate steps I recorded. The whole NVMe formatting bit can be ignored, it was just something to do with how the instance was set up with attached unformatted storage:
git clone https://github.com/AesSedai/reap.git
# python.h header files needed for zstandard, needed for helm (https://github.com/stanford-crfm/helm.git)
sudo apt-get update
sudo apt-get install python3.12-dev
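# (assumes you've already cd'd into the cloned repo and created a virtualenv first,
#  e.g. cd reap && python3 -m venv .venv, with dependencies installed per the REAP README)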
source .venv/bin/activate
# format nvme storage, 3.5TB per disk so one disk is fine
lsblk
sudo parted /dev/nvme1n1
sudo parted /dev/nvme1n1 mklabel gpt
sudo parted -a opt /dev/nvme1n1 mkpart primary ext4 0% 100%
sudo mkfs.ext4 /dev/nvme1n1p1
sudo mkdir /mnt/slush
sudo mount /dev/nvme1n1p1 /mnt/slush
lsblk -f
df -h /mnt/slush
sudo mkdir -p /mnt/slush/pruned_models
sudo chown -R user:user /mnt/slush/
ln -s /mnt/slush/pruned_models /home/user/reap/artifacts/GLM-4.6/evol-codealpaca-v1/pruned_models
python ./scripts/patch_glm.py
bash experiments/pruning-cli.sh 0,1,2,3,4,5,6,7 zai-org/GLM-4.6 reap 42 0.25 theblackcat102/evol-codealpaca-v1 true true true false false
The `lm_eval` worked, but `evalplus` ran into issues cloning the dataset down from GitHub, so I didn't have time to finish troubleshooting the issue there.
Is there a way to get an AWQ version of this? That would be awesome, but I cannot find one.
Awesome! Thanks for the steps.
bash experiments/pruning-cli.sh 0,1,2,3,4,5,6,7 zai-org/GLM-4.6 reap 42 0.25 theblackcat102/evol-codealpaca-v1 true true true false false
The `lm_eval` worked, but `evalplus` ran into issues cloning the dataset down from GitHub, so I didn't have time to finish troubleshooting the issue there.
What do you mean by this? Do you change 'theblackcat102/evol-codealpaca-v1' to some lm_eval benchmark, and it will use that for the pruning?
theblackcat102/evol-codealpaca-v1 is the dataset used for producing the list of experts to prune, if I understand the REAP code correctly. That is the default dataset arg they provided in their repo's README, so I just went with that.
lm_eval uses its own defaults; those five `true true true false false` flags at the end are what tell it which benchmarks to run, in order:
- lm_eval
- evalplus
- livecodebench
- math
- wildbench
It ran the lm_eval benchmark successfully, but it failed on the evalplus benchmark because the host I was running on couldn't set up the evalplus dataset: some HTTP error prevented it from downloading from GitHub. Not sure if it was an odd, possibly region- or IP-specific issue, since I was able to git clone from GitHub to set up REAP on the same host. But I kept getting an HTTP connection reset and it wasn't able to download.
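For reference, here's my reading of how the positional arguments in that invocation map; the middle ones are my best guess from the script, so treat them as approximate:
# positional args, as I read them:
#   1: 0,1,2,3,4,5,6,7                   -> GPU/device ids
#   2: zai-org/GLM-4.6                   -> model to prune
#   3: reap                              -> pruning method
#   4: 42                                -> presumably a random seed
#   5: 0.25                              -> prune ratio (the 25% cull)
#   6: theblackcat102/evol-codealpaca-v1 -> calibration dataset used to pick which experts to cull
#   7-11: true true true false false     -> lm_eval, evalplus, livecodebench, math, wildbench
bash experiments/pruning-cli.sh 0,1,2,3,4,5,6,7 zai-org/GLM-4.6 reap 42 0.25 theblackcat102/evol-codealpaca-v1 true true true false false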
Hah, sounds like one of those “works on our lab machine at this point in time” kind of repo things.
Thanks.
I want to try this with tiny MoEs and see if other datasets work (and prune a notably different set of experts), then rent something to try on 4.6.
Judging from the way the git submodules are configured in the original repo, pointing to a private user's forks of the upstream dependencies, that's probably true. One of the first changes I had to make was to point to the public upstream git repos for lm_eval, evalplus, helm, etc.
I am stumbling my way around preserving MTP tensors during REAP here: ddh0/reap:mtp. AesSedai now has push access to this fork as well, so hopefully we can get something working soon. 🤞
then rent something to try on 4.6
I haven't looked at the code yet, but I wasn't able to create an exllamav3 quant of this one, due to the number of experts (120) not being divisible by 32.
So that might be something to consider when you prune GLM-4.6 (e.g. keeping 128 experts instead of 120).
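Quick arithmetic, assuming GLM-4.6 has 160 routed experts per layer (which matches the 120 left after a 25% prune):
160 × (1 − 0.25) = 120 experts  (not divisible by 32)
160 × (1 − 0.20) = 128 experts  (divisible by 32, exl3-friendly)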
I am stumbling my way around preserving MTP tensors during REAP
Are these necessary for a quant, or more for future use when MTP gets implemented in one of the popular inference engines?
I am stumbling my way around preserving MTP tensors during REAP
Are these necessary for a quant, or more for future use when MTP gets implemented in one of the popular inference engines?
Yes, llama.cpp expects them to be present and won't load without them. We could maybe patch llama.cpp, but I think it's better to preserve the MTP tensors to avoid having to re-quant later when MTP becomes supported.
Yes, llama.cpp expects them to be present and won't load without them.
Oh, I managed to create a gguf without them and it seems coherent.
That's just a quick Q2_K with no imatrix ^
Just swap this from 1 -> 0 before running the conversion script:
https://huggingface.co/AesSedai/GLM-4.6-REAP-266B-A32B/blob/main/config.json#L32
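Roughly, for anyone following along (a sketch; it assumes the field on that config.json line is num_nextn_predict_layers, the local directory name is a placeholder, and the standard llama.cpp convert script is being used):
# set the MTP/nextn layer count to 0 so the converter skips those tensors
python3 -c "import json; p='GLM-4.6-REAP-266B-A32B/config.json'; c=json.load(open(p)); c['num_nextn_predict_layers']=0; json.dump(c, open(p,'w'), indent=2)"
# then convert as usual, e.g.:
python3 convert_hf_to_gguf.py GLM-4.6-REAP-266B-A32B --outtype bf16 --outfile GLM-4.6-REAP-266B-A32B-bf16.gguf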
Oh, that's good to know. But I think for a proper release it's better to keep the tensors included to support MTP once it's implemented
Yes, llama.cpp expects them to be present and won't load without them.
Oh, I managed to create a gguf without them and it seems coherent.
That's just a quick Q2_K with no imatrix ^
Just swap this from 1 -> 0 before running the conversion script:
https://huggingface.co/AesSedai/GLM-4.6-REAP-266B-A32B/blob/main/config.json#L32
Interesting, good to know that's a workaround at least!
Oh, that's good to know. But I think for a proper release it's better to keep the tensors included to support MTP once it's implemented
MTP isn't going to help you on hybrid inference. The PR in llama.cpp is already proving that out. Same as how nobody got much juice from DeepSeek MTP. Since they never pruned the MTP layer (prolly how this got started), it's not likely to be functional even if you hack it back in.
Kudos on seeing the Q2K.. just holding out for larger Q3/Q4 quants. I'm using the big one at Q3K_XL, so IQ4_NL or one of them.. whatever is around 120-130 GB with imatrix. The EXL3 is probably going to be lit as well. No more 2.01 bpw, maybe I get into the 3s. This would be done already, but I dunno who I have to shank for better internet. So... the question is...
How is it?
I'm using the big one at Q3K_XL, so IQ4_NL or one of them..
Check out this quant if you haven't already: Downtown-Case/GLM-4.6-128GB-RAM-IK-GGUF.
It seems like the best bang-for-buck if you can run it.
Kudos on seeing the Q2K..
just holding out for larger Q3/Q4 quants.
I'll put the Q4_K up for a little while then (until I get close to the HF public storage limit again), but these guys are probably going to make proper imatrix'd quants once they get MTP re-implemented.
How is it?
I only tested it briefly; it seemed "normal" to me. I didn't imatrix it. I doubt a pruned model like this, without retraining, will be better than the big one at Q3K_XL.
Well, the interesting question is where the “crossover” is.
If one can only run a Q2KL mix, would pruning 12% and jumping up above 3 bpw be better? The paper certainly suggests so. There’s a steep cliff between IQ2KL and IQ3KS.
Same with exl3, as dropping from 3 bpw to 2 is painful.
Even if the losses are really domain-specific (with the default being Alpaca code-style questions), that's still interesting, as prunes could be “specialized” with no retraining.
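Rough napkin math on the crossover, assuming ~355B total params for full GLM-4.6 and ~266B after the 25% REAP (the bpw values are illustrative):
full:      355B params × 2.5 bpw / 8 ≈ 111 GB
25% REAP:  266B params × 3.3 bpw / 8 ≈ 110 GB
So at a roughly fixed ~110 GB footprint, the prune buys about 0.8 bpw of quantization headroom, which is exactly the IQ2-to-IQ3 neighborhood in question.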
Oh, that's good to know. But I think for a proper release it's better to keep the tensors included to support MTP once it's implemented
MTP isn't going to help you on hybrid inference. The PR in llama.cpp is already proving that out. Same as how nobody got much juice from DeepSeek MTP. Since they never pruned the MTP layer (prolly how this got started), it's not likely to be functional even if you hack it back in.
I was arguing with someone about this the other day and suspected this, as the CPU doesn't have the “extra” compute to make MTP as cheap as it is on GPU.
Kudos on seeing the Q2K.. just holding out for larger Q3/Q4 quants. I'm using the big one at Q3K_XL, so IQ4_NL or one of them.. whatever is around 120-130 GB with imatrix. The EXL3 is probably going to be lit as well. No more 2.01 bpw, maybe I get into the 3s. This would be done already, but I dunno who I have to shank for better internet.
I can't make the imatrix with 128GB RAM (can I?), but I can make quants once someone else does.
So... the question is...
How is it?
That is an excellent question.
Is KLD testing this vs full GLM valid? Or are benchmarks the only reliable way?
I don't know if KLD is valid, but it's certainly going to be enlightening: KLD against the full model and KLD against the pruned model. Guess I better start downloading; it's probably going to take overnight. Q3K_XL is 158 GB and this is 162 GB, so I keep the speed the same but in theory gain fidelity.
Dataset was English, right? So that means we're shedding the CN experts. Something that was already done with Qwen in a more brutal way. IIRC experts only specialize in series of tokens like punctuation, etc.
Dataset was English, right? So that means we're shedding the CN experts.
No
Yeah I’m not sure I buy that. Don’t LLMs abstract language away, hence they pick up stuff even if trained in another language?
The testing will be enlightening. I am AFK, but will try some stuff with smaller MoEs.
It's not something to buy; it's the result of a similar experiment that kalomaze did: https://huggingface.co/kalomaze/Qwen3-16B-A3B/discussions/14
Using the model at Q4_K, it surprised me by getting some logic tests right that the full version (IQ3_KS and Z.AI API) get wrong...?
I can't make the imatrix with 128GB RAM (can I?), but I can make quants once someone else does.
I've created an imatrix from the Q8_0 using ubergarm's calibration data.
Note: That Q4_K quant in that repo is not calibrated though.
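For anyone wanting to reproduce that, a rough sketch with llama.cpp's tools (the file names are placeholders):
# compute the importance matrix from a Q8_0 of the pruned model over the calibration text
./llama-imatrix -m GLM-4.6-REAP-266B-A32B-Q8_0.gguf -f calibration_data.txt -o imatrix.dat
# then requantize smaller sizes using it
./llama-quantize --imatrix imatrix.dat GLM-4.6-REAP-266B-A32B-bf16.gguf GLM-4.6-REAP-266B-A32B-Q4_K_M.gguf Q4_K_M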
similar experiment that kalomaze did
That model lost a lot more than just non-English though lol.
I used the model a bit more now at Q4_K.
It seemed good for 1-shot translations, throw-away app generation, etc. But it seems to degrade quickly with multi-turn conversations.
Still coherent, but it misses information randomly (the full version at IQ2_KS doesn't do this).
There’s a lot of testing to be done (which I can only start in like two days), but I’m thinking:
- SimpleQA subset on full GLM and 75% REAP GLM, at the same quant, for general world knowledge.
- KLD between them and the Q8.
- A sanity check of the same thing with Cerebras's official releases. It's possible something tiny went wrong with 4.6 specifically.
Surprisingly little info on the Cerebras releases. You'd think people would give some feedback. I kinda like what happened to this model's writing and personality, especially OOD. Feels like some of the positivity bias got pruned. Wonder how the smaller one fares; will find out tomorrow.
Their claim that it's worse for creative writing while coding stays intact appears flipped.
They did some testing and in fact showed gains with GLM Air on WildBench (a more creative, general benchmark), and small losses for tool use.
Oh I get what you’re saying. The low community feedback is kinda par, heh.
The GLM 4.6 Air “distill” was a total placebo (a carbon copy of Air 4.5), yet it seemed to get a lot of vLLM deployments and such with no one even noticing for weeks. And this release is still new.
One thing I found is that full GLM knows who filian is, while the pruned one thinks it's a dude who plays Minecraft. A couple of other VTubers are intact. I checked lower temp with top-n-sigma and it persists; looks like that fact got pruned. My guess is that Alpaca-style coding is probably not the best dataset to use; a mix of creative/tools/coding might be better.
Thankful that a quant/prune was made at all; this method, however, needs some refinement. Make a dataset of slop and prune the inverse somehow?
I'm going to go with the "set nextn to 0" approach and quant a Q8, plus a q8/q5/q5/q6 mix, and run KLD on them compared to the rest of the GLM-4.6 data I've collected. Should have those results in a few hours.
With this text?
https://huggingface.co/ddh0/imatrices/blob/main/ddh0_imat_calibration_data_v2.txt
(For the reference of others in this thread)
For calibration and producing the imatrix, I use Bartowski's v3 data from here: https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8
For KLD and testing the reference logits, I use ddh0's imat calibration data yes.
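For reference, the KLD comparison itself can be done with llama.cpp's perplexity tool; roughly (model and file names are placeholders, with a Q8_0 as the example baseline):
# 1) save reference logits from the baseline model over the test text
./llama-perplexity -m GLM-4.6-REAP-266B-A32B-Q8_0.gguf -f ddh0_imat_calibration_data_v2.txt --kl-divergence-base logits.kld
# 2) score a smaller quant against those saved logits
./llama-perplexity -m GLM-4.6-REAP-266B-A32B-Q3_K.gguf --kl-divergence-base logits.kld --kl-divergence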
I did an AWQ conversion (https://huggingface.co/vkerkez/GLM-4.6-REAP-266B-A32B-awq-sym) and it’s working really well for me.
So logits are madly different. The high perplexity is a bit worrisome.
new challengers have appeared: https://huggingface.co/cerebras/GLM-4.6-REAP-252B-A32B-FP8
So I finally got home and got around to testing reap with Qwen 30B locally. Took me hours to troubleshoot random dependency issues, ugh.
But thanks for the fixes ddh0.
Anyway, I noticed they have several presets for pruning datasets:
--dataset_name {
m-a-p/CodeFeedback-Filtered-Instruction,
ise-uiuc/Magicoder-Evol-Instruct-110K,
allenai/c4,
theblackcat102/evol-codealpaca-v1,
euclaise/WritingPrompts_curated,
allenai/tulu-3-sft-personas-math,
combined
}
I tried it with euclaise/WritingPrompts_curated and will see how an exl3 quant evals.
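For anyone curious, the invocation is the same shape as the GLM one above, just with the dataset swapped; the model id and single-GPU arg here are my guesses, so adjust to taste:
bash experiments/pruning-cli.sh 0 Qwen/Qwen3-30B-A3B reap 42 0.25 euclaise/WritingPrompts_curated true false false false false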
Oh, it seems there are some other prunes floating around.
Is there any discussion on the effects of the dataset used to prune?
The 218B is shit. It literally forgot parts of English. I'm probably going to try the 268B; there is one quant of it in Q3. Nobody else has run PPL tests, only coding benchmarks. What datasets were used to do this one? I think ideally you want to do creative writing + instruction following + code together in even amounts. Did Cerebras use theblackcat102/evol-codealpaca-v1?
There are several dataset presets in the pruning script, including a short writing prompt dataset and a “combined” option. It’s also (seemingly) very quick to run, hence I’m going to run a few experiments on Qwen 30B.
My guess is the result is a quite specialized model.