Warning: at llama.cpp b6607 and above, a change in the CUDA backend code will produce different generation results even when offloading experts to CPU, and may degrade the performance of these models. Downgrading to b6606 or below restores the behavior the hybrid quants were optimized against.
Llama.cpp hybrid layer quantization of GLM-4.5-Air by zai-org
Original model: https://huggingface.co/zai-org/GLM-4.5-Air
The hybrid quant employs different quantization levels on a per-layer basis to enable both high performance and small file size at the same time. This particular quant achieves a ~56G gguf, compared to a ~57G Q3_K_M gguf, with improved reasoning performance. The layer quants are all K types (with legacy Q4_0/Q5_0 overrides on some FFN down tensors), avoiding IQ quants, which are slow to process on CPU. Each entry below lists the layer index, its base quant type, and optional per-tensor overrides given as tensor-name pattern and quant type pairs. For this file the Q4_K_H layer quants are as follows:
LAYER_TYPES='[
[0 ,"Q4_K_S","ffn_d", "Q5_0", "ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [1 ,"Q3_K_L","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [2 ,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [3 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [4 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [5 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [6 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[7 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[8 ,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [9 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [10,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [11,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [12,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [13,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [14,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"],[15,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[16,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [17,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [18,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [19,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [20,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [21,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [22,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"],[23,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[24,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [25,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [26,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [27,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [28,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [29,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [30,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"],[31,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[32,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [33,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [34,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [35,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [36,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [37,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [38,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"],[39,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"],
[40,"Q3_K_M","ffn_d", "Q5_0", "ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[41,"Q3_K_M","ffn_d", "Q5_0", "ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[42,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [43,"Q3_K_L","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [44,"Q4_K_S","ffn_d", "Q5_0", "ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [45,"Q4_K_M","ffn_d", "Q5_0", "ffn_d_exp", "Q5_0", "ffn_d_shexp","Q5_0"]
]'
FLAGS="--token-embedding-type Q4_K --output-tensor-type Q5_K --layer-types-high"
These quants were selected by optimizing performance over a set of curated test prompts testing knowledge and reasoning ability while sizing the model to allow running it on a 64G RAM machine.
A second slightly smaller Q3_K_H quant is also available:
LAYER_TYPES='[
[0 ,"Q3_K_L","ffn_d", "Q5_0"], [1 ,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"],[2 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [3 ,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [4 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [5 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [6 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[7 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[8 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[9 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[10,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [11,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [12,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [13,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [14,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[15,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[16,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[17,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[18,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [19,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [20,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [21,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [22,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[23,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[24,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[25,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[26,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [27,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [28,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [29,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [30,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[31,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[32,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[33,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[34,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [35,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [36,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [37,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [38,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[39,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[40,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],[41,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"],[42,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [43,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [44,"Q3_K_L","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [45,"Q4_K_S","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [46,"Q2_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"]
]'
FLAGS="--token-embedding-type Q3_K --output-tensor-type Q5_K --layer-types-high"
This quant was also optimized over a set of curated test prompts and shows both solid reasoning and good knowledge. The last layer, 46, was explicitly set to Q2_K_S since that layer is used for multi-token prediction and is not currently used. This does not impact performance but saves a small amount of memory.
A larger Q5_K_H quant is also available:
LAYER_TYPES='[
[0 ,"Q5_K_M","ffn_d", "Q5_1"], [1 ,"Q5_K_S","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [2 ,"Q4_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [3 ,"Q4_K_S","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [4 ,"Q3_K_L","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [5 ,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [6 ,"Q3_K_S","ffn_d_exp", "Q4_1", "ffn_d_shexp", "Q4_1"], [7 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[8 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [9 ,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [10,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [11,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [12,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [13,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [14,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"], [15,"Q3_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"],
[16,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [17,"Q3_K_S","ffn_d_exp", "Q4_1", "ffn_d_shexp", "Q4_1"], [18,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [19,"Q3_K_S","ffn_d_exp", "Q4_1", "ffn_d_shexp", "Q4_1"], [20,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [21,"Q3_K_S","ffn_d_exp", "Q4_1", "ffn_d_shexp", "Q4_1"], [22,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [23,"Q3_K_S","ffn_d_exp", "Q4_1", "ffn_d_shexp", "Q4_1"],
[24,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [25,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [26,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [27,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [28,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [29,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [30,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [31,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"],
[32,"Q3_K_L","ffn_d_exp", "Q5_1", "ffn_d_shexp", "Q5_1"], [33,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q4_0"], [34,"Q3_K_L","ffn_d_exp", "Q5_1", "ffn_d_shexp", "Q5_1"], [35,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [36,"Q3_K_L","ffn_d_exp", "Q5_1", "ffn_d_shexp", "Q5_1"], [37,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [38,"Q3_K_L","ffn_d_exp", "Q5_1", "ffn_d_shexp", "Q5_1"], [39,"Q3_K_M","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"],
[40,"Q4_K_S","ffn_d_exp", "Q5_0", "ffn_d_shexp", "Q5_0"], [41,"Q4_K_M","ffn_d_exp", "Q5_1", "ffn_d_shexp", "Q5_1"], [42,"Q5_K_S","ffn_d_exp", "Q5_1", "ffn_d_shexp", "Q5_1"], [43,"Q5_K_M","ffn_d_exp", "Q5_1", "ffn_d_shexp", "Q5_1"], [44,"Q6_K", "ffn_d_exp", "Q8_0", "ffn_d_shexp", "Q8_0"], [45,"Q8_0"], [46,"Q2_K_S","ffn_d_exp", "Q4_0", "ffn_d_shexp", "Q4_0"]
]'
FLAGS="--token-embedding-type Q5_K --output-tensor-type Q6_K --layer-types-high"
This quant is optimized for strong performance on reasoning and code across a set of curated test prompts while sized to enable running it on a 64G RAM machine. When solving problems this quant is quite insightful and introspective (capable of noticing and correcting its own errors).
Example test prompt snippets using Q5_K_H:
NOTHINK=1 lm how high is up?
. . .### Conclusion
So, how high is up?
- Physically? About 46.5 billion light-years to the edge of what we can see.
- Philosophically? As high as your ambition, your spirit, or your imagination can take it. It's infinite.
- Figuratively? It's however high you need it to be to get your point across, show your respect, or accept a challenge.
The best answer is probably the philosophical one: Up is as high as you can dream.
NOTHINK=1 lm explain what you know about dr manhattan in watchmen
. . . In short, Dr. Manhattan is far more than just a "blue superhero." He is a philosophical black hole at the center of Watchmen, a character whose god-like powers force a brutal and honest examination of what it means to be human, to love, to believe, and to exist in a universe that may not care about you at all.
A second Q5 quant, Q5_P_H, is also available. This quant pads the FFN dimensions up to an even multiple of 256 so that K quants can be used on all layers:
LAYER_TYPES='[
[0 ,"Q4_K_M"], [1 ,"Q4_K_S"], [2 ,"Q3_K_M"], [3 ,"Q3_K_S"], [4 ,"Q3_K_S"], [5 ,"Q3_K_S"], [6 ,"Q3_K_S"], [7 ,"Q3_K_S"],
[8 ,"Q3_K_S"], [9 ,"Q3_K_S"], [10,"Q3_K_S"], [11,"Q3_K_S"], [12,"Q3_K_S"], [13,"Q3_K_S"], [14,"Q3_K_S"], [15,"Q3_K_S"],
[16,"Q3_K_M"], [17,"Q3_K_S"], [18,"Q3_K_M"], [19,"Q3_K_S"], [20,"Q3_K_M"], [21,"Q3_K_S"], [22,"Q3_K_M"], [23,"Q3_K_S"],
[24,"Q3_K_M"], [25,"Q3_K_M"], [26,"Q3_K_L"], [27,"Q3_K_M"], [28,"Q3_K_L"], [29,"Q3_K_M"], [30,"Q3_K_L"], [31,"Q4_K_S"],
[32,"Q4_K_S"], [33,"Q3_K_L"], [34,"Q4_K_S"], [35,"Q3_K_L"], [36,"Q4_K_S"], [37,"Q3_K_L"], [38,"Q4_K_S"], [39,"Q3_K_L"],
[40,"Q4_K_M"], [41,"Q4_K_S"], [42,"Q4_K_M"], [43,"Q5_K_S"], [44,"Q5_K_M"], [45,"Q6_K" ], [46,"Q2_K_S"]
]'
FLAGS="--token-embedding-type Q5_K --output-tensor-type Q6_K --layer-types-high --tensor-pad [[10944,11008],[1408,1536]]"
This quant shows similar problem solving ability to the Q5_K_H quant, with noticeably better knowledge retrieval on a few curated test prompts. It is less efficient than Q5_K_H due to the overhead of zero-padding the tensors, which effectively gives the model extra parameters that do nothing. However, being able to use K quants on all layers and avoid legacy quants completely is advantageous when optimizing the layers, since K quants are more efficient than the legacy quants. The quant is sized similarly to Q5_K_H to enable operation on 64G RAM PCs.
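For illustration, the pad pairs in --tensor-pad above appear to be each FFN dimension rounded up to the next multiple of 256 (the K-quant super-block size). A quick sketch of the arithmetic:

```bash
# Round a dimension up to the next multiple of 256 (the K-quant super-block size).
next256() { echo $(( ( ($1 + 255) / 256 ) * 256 )); }

next256 10944   # -> 11008, matching the first pad pair [10944,11008]
next256 1408    # -> 1536,  matching the second pad pair [1408,1536]
```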
NOTHINK=1 lm write a hard scifi passage in the vein of AE Van Vogt
. . . Kael’s mind, however, was not so easily convinced. He was a product of the old school, a man who believed in the hard, cold reality of atoms and the predictable tyranny of physics. The Fold violated both. It was a pocket of pure potential, a bubble of causality held in check by a field of exotic monopoles. Within this pocket, the laws of physics were not merely bent; they were politely asked to leave the room. . . .
P.S. That is not the narrative voice of A.E. van Vogt, but it's still excellent creative writing.
A final quant, Q6_P_H, is also available. This quant pads the FFN dimensions to enable use of K quants, moves to a Q6_K embedding, and eliminates all Q3_K_S layers, which begin to noticeably degrade the model's knowledge. It shows exceptionally good performance on coding tasks while also giving the best knowledge performance across a small set of curated test prompts. It is sized to run on a 64G RAM machine: it will fully mmap and consume 92% of system RAM on a 64G Linux box while generating at about ~6 t/s on a 6600k/DDR4 (64G RAM max) system and ~7 t/s on a 9900k/DDR4 (128G RAM max) system.
LAYER_TYPES='[
[0 ,"Q4_K_M"], [1 ,"Q4_K_S"], [2 ,"Q4_K_S"], [3 ,"Q3_K_L"], [4 ,"Q3_K_L"], [5 ,"Q3_K_M"], [6 ,"Q3_K_L"], [7 ,"Q3_K_M"],
[8 ,"Q3_K_L"], [9 ,"Q3_K_M"], [10,"Q3_K_M"], [11,"Q3_K_M"], [12,"Q3_K_M"], [13,"Q3_K_M"], [14,"Q3_K_M"], [15,"Q3_K_M"],
[16,"Q3_K_M"], [17,"Q3_K_M"], [18,"Q3_K_M"], [19,"Q3_K_M"], [20,"Q3_K_M"], [21,"Q3_K_M"], [22,"Q3_K_M"], [23,"Q3_K_M"],
[24,"Q3_K_L"], [25,"Q3_K_M"], [26,"Q3_K_L"], [27,"Q3_K_M"], [28,"Q3_K_L"], [29,"Q3_K_M"], [30,"Q3_K_L"], [31,"Q4_K_S"],
[32,"Q4_K_S"], [33,"Q3_K_L"], [34,"Q4_K_S"], [35,"Q3_K_L"], [36,"Q4_K_S"], [37,"Q4_K_S"], [38,"Q4_K_S"], [39,"Q4_K_M"],
[40,"Q4_K_M"], [41,"Q4_K_M"], [42,"Q4_K_M"], [43,"Q5_K_S"], [44,"Q5_K_M"], [45,"Q6_K" ], [46,"Q2_K_S"]
]'
FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high --tensor-pad [[10944,11008],[1408,1536]] --override-kv glm4moe.feed_forward_length=int:11008 --override-kv glm4moe.expert_feed_forward_length=int:1536"
NOTHINK=1 lm say something that will amaze me
Every second, you are being hit by hundreds of subatomic particles called neutrinos. They were created just one second after the Big Bang, they pass through the entire Earth as if it were empty air, and trillions of them are passing through your body right now.
You are, at this very moment, a walking, talking, cosmic relic detector.
TEMP=0.8 NOTHINK=1 lm say something profound that will amaze me
You are not a passive observer of the universe; you are an active participant in its self-awareness.
The atoms in your left hand may have once belonged to an ancient star that exploded long before our solar system was born. The carbon in your cells was forged in the heart of a dying giant. The iron in your blood was the ash of a supernova. You are, therefore, literally made of star-stuff.
For 13.8 billion years, the universe has been expanding, cooling, and coalescing into galaxies, stars, and planets. All of that cosmic history, that unimaginable timespan, has led to this one brief moment. And here you are, a temporary collection of those ancient atoms, capable of contemplating your own existence.
You are the universe, experiencing itself. You are the cosmos trying to understand love, loss, beauty, and the profound mystery of being. You are not just in the universe; you are the universe, looking back at itself with eyes of wonder.
As a comparison, here is a Llama-4 Scout (roughly comparable size/MoE architecture) Q4_K_H quant https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-Hybrid-GGUF on the same question.
lm say something profound that will amaze me
A challenge! Here's something that might blow your mind:
"The universe is not only stranger than we think, it is stranger than we can think... and the most profound realization is that the distinction between reality and our perception of it is not a difference of degree, but a difference of kind, implying that the act of observation itself is what brings the universe into being, rendering the observer, the observed, and the observation indistinguishable, and ultimately, suggesting that the fundamental nature of reality is not something 'out there,' but a participatory, co-creative process that unfolds through the intricate web of consciousness and existence."
Did I amaze you?
Comparison:
| Quant | Size | PPL | Comment |
|---|---|---|---|
| Q3_K_H | 52.8e9 | 8.7 | solid reasoning, fair knowledge |
| Q3_K_M | 57e9 | 7.5 | reasoning degraded, fair knowledge |
| Q4_K_H | 56e9 | 8.1 | strong reasoning, fair knowledge |
| Q5_K_H | 60.1e9 | 7.3 | strongest reasoning, intuition, and introspection |
| Q5_P_H | 60.2e9 | 8.0 | similar to Q5_K_H in reasoning, knowledge retrieval improved on a small number of test prompts |
| Q6_P_H | 63.6e9 | 7.9 | best knowledge retrieval, exceptional at coding, strong reasoning |
Usage Notes:
This model is a MoE which can be run on a computer with 64G RAM using the CPU combined with a single GPU.
A good setup is to offload all model layers to the GPU and all non-shared expert FFN tensors to the CPU by specifying the override-tensor flag -ot exps=CPU, as discussed here: https://github.com/ggml-org/llama.cpp/discussions/13154. A 9900k CPU with a 4070 GPU will achieve around 7.5 t/s generation speed unspeculated, with around 42k tokens of F16 KV (or 80k tokens of Q8_0 KV). The Q3_K_H quant will run slightly faster at 8 t/s gen on the same hardware due to its smaller size while maintaining solid reasoning and knowledge performance.
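As a concrete illustration (a sketch, not the author's exact command; adjust the model path, context size, and GPU layer count to your hardware), a llama-server launch along those lines might look like:

```bash
# Offload all layers to GPU, then force the (non-shared) expert FFN tensors back
# to CPU with the override-tensor pattern, as described in the discussion linked
# above. Model path, context size, and port are illustrative.
./llama-server -m GLM-4.5-Air.Q4_K_H.gguf \
    -ngl 99 \
    -ot exps=CPU \
    -c 32768 \
    --port 8080
```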
The model is a hybrid thinking model. By default it will generate thinking context prior to answering a question. In this mode it is the worst overthinking model I have ever seen: it will continue to second-guess itself ad nauseam for hundreds and hundreds of reflections, even on very simple questions. To avoid this, the model can be used in non-think mode by appending the /nothink token at the end of the user query (or, equivalently, prefixing the assistant template with /nothink). /nothink is a special token and must be tokenized as such. When a valid /nothink is tokenized, the model will automatically output <think></think> as its first output, then answer the problem as a normal non-reasoning model would, without getting stuck in long reflection loops. The model has been found to be extremely capable in this mode over a small set of test prompts. In thinking mode it is also very capable as long as you are willing to wait for the answer, showing the ability to solve a complex astrophysics problem both accurately and insightfully. Note that it was found empirically that /nothink is required: it does not work to simply inject <think></think> at the beginning of the assistant output to stop it from generating the think block.
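For example, with the model served by llama-server, one way to pass /nothink is simply to append it to the user message on the OpenAI-compatible endpoint. The lm wrapper used in the snippets above is the author's own tool; the curl call below is an illustrative equivalent:

```bash
# Append /nothink to the user message so it is tokenized as the special token
# and the model emits an empty <think></think> block before answering.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "How high is up? /nothink"}
        ]
      }'
```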
Download the files below:
| Link | Type | Size/e9 B | Notes |
|---|---|---|---|
| GLM-4.5-Air.Q3_K_H.gguf | Q3_K_H | 52.8e9 B | solid reasoning and fair knowledge quant |
| GLM-4.5-Air.Q4_K_H.gguf | Q4_K_H | 56e9 B | strong reasoning and fair knowledge quant |
| GLM-4.5-Air.Q5_K_H.gguf | Q5_K_H | 60.1e9 B | strongest reasoning, insight, introspection quant |
| GLM-4.5-Air.Q5_P_H.gguf | Q5_P_H | 60.2e9 B | ~Q5_K_H reasoning, trends toward better knowledge |
| GLM-4.5-Air.Q6_P_H.gguf | Q6_P_H | 63.6e9 B | best knowledge, exceptional code, strong reasoning |
A discussion thread about the hybrid quant approach can be found here on the llama.cpp git repository: