4x24GB VRAM Spread Results.
My GPU fans ramped up in excitement when they heard me mutter to myself "another ubergarm quant has arrived"...
3x3090 // 1x4090 VRAM Only Test
I didn't realize this was a ~100b dense model until I was halfway through downloading it, so the lower token generation speed is to be expected.
I had it create a simple dynamic HTML landing page; it generated a little under 2k tokens at 10.5 t/s. Not bad! I think the 4-bit quant offers a good enough perplexity tradeoff to enable full GPU offload, so I'll play with this model more in the coming days.
As always, thanks again! I'm excited for the new DeepSeek model quants once support is added to llama.cpp!
sweeet! Great seein' you here for the new release! Yeah I tried to size this one to work well in ~96GB VRAM full offload with enough context to be useful without sacrificing much perplexity (kept attn just a touch bigger at ~6.7BPW also).
If you're not already, when running fully offloaded you can set -t 1 since you're not using CPU threads; it can give a few percent boost from having less to synchronize.
Also if you want to experiment, the new ik_llama.cpp split mode -sm graph "tensor parallel" has given big boosts to me when using exactly 2xGPUs (or 2xGPUs plus CPU hybrid). I'm not sure it is working perfectly yet for 4x GPUs but easy to test.
Just did a post on r/LocalLLaMA with some more info showing ~40% increase in tok/sec across the board testing with 2x CUDA GPUs on Devstral-Small
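Roughly speaking, for a full-offload run that just means adding a couple of flags to the usual server command. Purely an illustrative sketch using the paths and sizes from this thread (and the -sm graph part is only what I've verified on exactly 2 GPUs, adjust to taste):
# illustrative only: full offload (-ngl 99), no CPU compute threads needed (-t 1),
# plus the experimental graph split mode (-sm graph) on a 2-GPU box
llama-server \
  --model Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
  --ctx-size 20000 \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99 \
  -sm graph \
  -t 1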
UPDATE
ik added a feature to support -sm graph with more than 2 GPUs, showing promising early results here if you want to test: https://github.com/ikawrakow/ik_llama.cpp/pull/1051
Compiling now; I have a bad habit of using weeks-old builds of ik_llama (if it ain't broke, don't fix it!)
I didn't realize tensor parallel support was finally starting to be implemented. I'll get some numbers and results soon in an edit to this post. The results from the pull request thread look promising!
edit
Unfortunately I'm not seeing the expected improvements; in fact it's performing worse, and that's with both the smaller 24b and this one, the 123b. I noticed one of my 3090s has significantly higher usage than the other cards, which as I understand it isn't really the expected outcome; I thought each card was supposed to be utilized more evenly. See photos for details. I'm going to chalk this up to the parallelism needing the same GPU architecture across cards, with the slight differences between the one 4090 and the three 3090s making it perform worse than expected.
/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-server \
--model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
--alias ubergarm/Devstral-2-123B-Instruct-2512-GGUF \
--ctx-size 20000 \
-ctk q8_0 -ctv q8_0 \
-ngl 99 \
-sm graph \
--max-gpu 4 \
--host 0.0.0.0 \
--port 8081 \
--jinja
Thanks for testing, don't lose heart yet! My impression from reading PR 1051 more closely is that you actually want to set --max-gpu to something LESS than the number of GPUs in your system. Your best bet to max out TG is probably to use exactly 2 or 3. You might be able to play some games with -mg to set your 4090 as the "main gpu", and maybe some other tricks to re-order the GPUs, but that is beyond anything I've tested myself. Read here for more details: https://github.com/ikawrakow/ik_llama.cpp/pull/1051#issuecomment-3642668828
So give this a try:
/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-server \
--model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
--alias ubergarm/Devstral-2-123B-Instruct-2512-GGUF \
--ctx-size 20000 \
-ctk q8_0 -ctv q8_0 \
-ngl 99 \
-sm graph \
--max-gpu 2 \
--host 0.0.0.0 \
--port 8081 \
--jinja \
--threads 1
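And if you want to try the -mg / re-ordering games mentioned above, here's a rough sketch I haven't tested myself (the device order is hypothetical, check nvidia-smi -L for your actual indices):
# hypothetical re-ordering so the 4090 shows up first, then pin it as main GPU with -mg 0
CUDA_VISIBLE_DEVICES=3,0,1,2 \
/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-server \
  --model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
  --ctx-size 20000 \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99 \
  -sm graph \
  --max-gpu 2 \
  -mg 0 \
  --host 0.0.0.0 \
  --port 8081 \
  --jinja \
  --threads 1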
UPDATE: plus there were a couple more PRs merged overnight that may improve multi-GPU performance as well (essentially leaving P2P enabled instead of adding overhead to enable/disable it)
UPDATE 2: I'm trying to test it myself on exactly 2x GPUs but hit a snag. Anyway, exciting times haha...
I was trying various configs; I just attached the latest one I was trying. I'll rebuild and try again today since I have some free time. It is indeed annoying seeing one GPU do so much work while the others slack! lol
edit: the pull request has been merged, and I am compiling mainline ik_llama right now. The following are my build options... I've used these for so long; are there any updated arguments to use nowadays?
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DCMAKE_CUDA_ARCHITECTURES="86;89"
Results of the smaller 24b version, as it's quicker to load after changing the config:
before: (default config, no new graph parameters)
after: (--max-gpu 3)
So there is a change; I just didn't notice it as much as I thought... Looking at the graphs from the PR, it seems to really benefit token generation over long contexts, not so much initial generation speed as I expected!
(Sidenote: the -sm graph and --max-gpu options seem to help bigger models partially offloaded to system RAM as well. I noticed Kimi K2 Thinking was a little snappier and more stable this morning after applying the parameters!)
second edit: setting the main GPU to a 3090 rather than the 4090 actually provided the most stable and best token gen speeds; same prompt as above, 60 t/s generation... I'll keep playing around.
final edit for now:
The 123b model started generating at 17 t/s, which is a great increase! However, it then had a big fall-off to 11 t/s, which can be visualized with the following chart. Again, GPU2 seems to be a bad apple; this looks like it could be a hardware issue at this point.
Great, seems like you're seeing some uplift!
are there any updated arguments to use nowadays?
That is pretty much the same; you don't have to explicitly specify -DGGML_SCHED_MAX_COPIES=1 anymore as it is the default, but it hurts nothing (I still leave it on hah). For some KT quants I've seen improvement using an explicit -DGGML_CUDA_F16=ON, but I haven't looked into it and don't think it matters for most quants.
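For reference, folding that into the build line from above would look something like this (the F16 flag is optional and I'm not sure it matters outside KT quants):
# same build as before, with the optional FP16 CUDA flag added
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_CUDA_F16=ON -DCMAKE_CUDA_ARCHITECTURES="86;89"
cmake --build ./build --config Release -j $(nproc)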
looking at the graphs from the PR it seems to really benefit token generation over long contexts, not so much initial generation speed as I expected!
Totes! If you run llama-sweep-bench it makes the differences more obvious across the entire kv-cache depth, e.g.
/home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-sweep-bench \
--model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
--ctx-size 16896 \
-ctk q8_0 -ctv q8_0 \
-ngl 99 \
-sm graph \
--max-gpu 2 \
--threads 1
(Sidenote: the -sm graph and --max-gpu options seem to help bigger models partially offloaded to system RAM as well. I noticed Kimi K2 Thinking was a little snappier and more stable this morning after applying the parameters!)
Yes, this can also help with hybrid-CPU big MoEs, especially at longer context lengths. More info on that here: https://github.com/ikawrakow/ik_llama.cpp/pull/1040#issuecomment-3620696408
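For the hybrid case it's the usual pattern of keeping attention and shared tensors on GPU while kicking the routed experts back to system RAM. A rough sketch only (placeholder model path and thread count, and -ot exps=CPU is just the typical override-tensor regex, so tune it for your rig):
# rough hybrid sketch: offload all layers, then override routed experts back to CPU
llama-server \
  --model /path/to/your-big-moe.gguf \
  --ctx-size 32768 \
  -ngl 99 \
  -ot exps=CPU \
  -sm graph \
  --max-gpu 2 \
  --threads 16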
Thanks for testing out all the options!
And just for completeness' sake, here are the results of the sweep-bench script above.
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 1.886 | 271.47 | 9.972 | 12.84 |
| 512 | 128 | 512 | 1.849 | 276.90 | 11.611 | 11.02 |
| 512 | 128 | 1024 | 1.807 | 283.35 | 11.447 | 11.18 |
| 512 | 128 | 1536 | 1.865 | 274.52 | 10.760 | 11.90 |
| 512 | 128 | 2048 | 1.663 | 307.92 | 10.392 | 12.32 |
| 512 | 128 | 2560 | 1.732 | 295.56 | 10.214 | 12.53 |
| 512 | 128 | 3072 | 1.602 | 319.60 | 9.919 | 12.90 |
| 512 | 128 | 3584 | 1.675 | 305.72 | 10.079 | 12.70 |
I... I think my setup might be a little confused; it seems to be improving as context depth increases?
Haha yeah, that seems odd, but it's maybe just some noise since 512-token steps aren't too big; if you let it run all the way out to the full 16k it might show a clearer downward trend.
My approach is to take one command, vary only a single experimental value, and then graph the runs all together to see the difference, e.g. run the exact same command but with --max-gpu 2 on one run and --max-gpu 3 on the other, and graph them.
If you decide to do that, I can graph it for you if you paste it in here, or I could share the python script etc.
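e.g. something like this little loop around the sweep-bench command from above, with each run teed to its own log so it's easy to graph afterwards (log names are arbitrary):
# identical command per run, varying only --max-gpu
for n in 2 3; do
  /home/phone/Documents/ik_llama_graph_pull/ik_llama.cpp/build/bin/llama-sweep-bench \
    --model /run/media/phone/SharedData/LocalModelsBIG/Devstral-2-123B-Instruct-2512-IQ4_KSS.gguf \
    --ctx-size 16896 \
    -ctk q8_0 -ctv q8_0 \
    -ngl 99 \
    -sm graph \
    --max-gpu "$n" \
    --threads 1 | tee "sweep-bench-max-gpu-${n}.log"
done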
Cool, didn't know you did dense models.
Unless you need llama.cpp-specific features, you're probably better off using exllamav3 with full CUDA/VRAM.
172.17.0.2:43166 - "GET /v1/models HTTP/1.1" 200
172.17.0.2:43182 - "POST /v1/chat/completions HTTP/1.1" 200
Received chat completion streaming request 40c2da9978fd4db1a47b469279713dea
Finished chat completion streaming request 40c2da9978fd4db1a47b469279713dea
Metrics (ID: 40c2da9978fd4db1a47b469279713dea): 561 tokens generated in 26.02 seconds (Queue: 0.0 s,
Process: 10388 cached tokens and 503 new tokens at 303.01 T/s, Generate: 23.03 T/s, Context: 10891 tokens)
(Random last gen log in the console) - that's the 4.0bpw on 3090s.
P.S. holy shit, we're getting tensor parallel in ik_llama!
17 t/s at zero context. Nice free performance boost. Also cool that it manages to detect/select the best peer connections (NVLinked pairs).
Cool, didn't know you did dense models.
I'll try anything once! lol... Dense models are not quite as exciting, since the best recipes generally keep all the tensors at a similar level throughout, with just a little bump here or there.
you're probably better off using exllamav3 with full CUDA/VRAM
Right, in general if a model/quant can fit fully into GPU VRAM it makes sense to check out exllamav3 EXL3 quants, and also stuff like vLLM (especially for the multi-user / batched-throughput case), assuming the available quants are good for your hardware.
P.S. holy shit, we're getting tensor parallel in ik_llama!
Yes it definitely helps with exactly 2x GPUs and some more recent PRs are improving it for more as well. This is for the 4.709 BPW quant available in this repo:
A big speed-up while still allowing a lot of existing GGUFs at just the right quantization. Also, the big win I see is that this new -sm graph "graph parallel" can still give a nice boost on hybrid CPU + multi-GPU setups, especially at longer context depths.
Yes it definitely helps with exactly 2x GPUs and some more recent PRs are improving it for more as well.
I just tested limiting it to 4 GPUs (had to quantize the KV cache) and got 20 t/s!
Control-vectors still work, so this is a huge win for me (probably won't bother keeping my hacked-together exl3 implementation working now!)
I'll have to try command-a as well; that's another "about 12 t/s if I want to use control-vectors" dense model.
edit:
=======================================================
Split mode 'graph' is not supported for this model
=> changing split mode to 'layer'
=======================================================
I guess it depends on the model.
Ahh I see these are supported so far:
// src/llama.cpp line ~1726
static bool is_model_split_supported(const llama_model & model) {
    static std::unordered_set<llm_arch> k_supported = {
        LLM_ARCH_LLAMA,
        LLM_ARCH_QWEN3MOE,
        LLM_ARCH_GLM4_MOE,
        LLM_ARCH_MISTRAL3,
    };
    auto it = k_supported.find(model.arch);
    return it != k_supported.end();
}
If you wanted to go wild west, just add LLM_ARCH_COHERE2 or whatever and see what happens, but zero guarantees as this is all very new as you know.
Or you can just comment this out to experiment too. I did this to try the RPC backend with -sm graph, which did at least start up right, but it ended up giving gibberish out of a tiny 0.6B test model.
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -1758,7 +1758,7 @@ static bool llm_load_tensors(
LLAMA_LOG_WARN("Split mode 'graph' is not supported for this model\n");
LLAMA_LOG_WARN(" => changing split mode to 'layer'\n");
LLAMA_LOG_WARN("=======================================================\n\n");
- split_mode = LLAMA_SPLIT_MODE_LAYER;
+ // split_mode = LLAMA_SPLIT_MODE_LAYER;
}
}
Anyway, lots of exciting toys to try out haha...







