VRAM usage of each?
What's the minimum and average VRAM usage of each?
Hard drive space?
RAM?
Is 64GB of RAM enough?
You probably need 2-4 GB more than the raw size of the file, depending on context size.
Assuming you are using the llama.cpp inference engine, pick the largest model that fits into your CPU RAM so you can use mlock, then offload as many layers as will fit into your VRAM. Keep in mind Hugging Face lists sizes in GB, not GiB: GB = 1000 * 1000 * 1000 bytes vs GiB = 1024 * 1024 * 1024 bytes...
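As a quick sanity check, something like the awk one-liner below converts a listed GB size into GiB so you can compare it directly against installed RAM; the 90.85 GB figure is just the example size from the next paragraph.

# Convert a size listed in GB (base 10) to GiB (base 2); 90.85 GB is only an example value.
awk 'BEGIN { gb = 90.85; printf "%.2f GiB\n", gb * 1000^3 / 1024^3 }'
# prints ~84.61 GiB, which leaves some headroom in a 96 GiB system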
For example, since I have 96GiB of CPU RAM I picked the IQ3_XXS quant, which is sized 90.85GB, to leave a little overhead for my browser and tiling window manager... Then with trial and error I managed to get 14 layers offloaded into VRAM (15 would OOM depending on KV cache size).
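If you want to automate that trial and error, a rough sketch like the one below (using llama-cli from the same build; the model path and candidate layer counts are just placeholders) steps --n-gpu-layers down until a short test generation stops OOMing:

# Rough sketch: find the highest --n-gpu-layers that survives a short test generation.
# Model path and candidate layer counts are placeholders; adjust for your setup.
for ngl in 16 15 14 13 12; do
  if ./llama-cli \
    --model "../models/bartowski/DeepSeek-V2.5-GGUF/DeepSeek-V2.5-IQ3_XXS-00001-of-00003.gguf" \
    --n-gpu-layers "$ngl" \
    --ctx-size 1024 \
    --n-predict 16 \
    --prompt "test" > /dev/null 2>&1; then
    echo "highest working --n-gpu-layers: $ngl"
    break
  fi
done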
Surprisingly fast inference for me, ~7 tok/sec with DeepSeek-V2.5-IQ3_XXS in 96GiB RAM, offloading ~22GiB into VRAM (mmap'd) with llama.cpp@a249843d. I expected much slower, like Mistral-Large at ~2.5 tok/sec...
./llama-server \
--model "../models/bartowski/DeepSeek-V2.5-GGUF/DeepSeek-V2.5-IQ3_XXS-00001-of-00003.gguf" \
--n-gpu-layers 14 \
--ctx-size 1024 \
--cache-type-k f16 \
--cache-type-v f16 \
--threads 16 \
--flash-attn \
--mlock \
--n-predict -1 \
--host 127.0.0.1 \
--port 8080
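Once the server is up, you can verify everything works by hitting its OpenAI-compatible chat endpoint; this is just an illustrative request against the host/port used above:

# Quick smoke test against the server started above (OpenAI-compatible endpoint).
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32}'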
I have 2x48GiB DDR5 DIMMs running at uclk=mclk=3200MHz (DDR5-6400) with the fabric overclocked to 2133MHz to squeeze the most out of RAM bandwidth... My R9 9950X supports AVX512 and I compile llama.cpp with it enabled, but RAM I/O is usually the bottleneck anyway... Using 1x 3090 Ti FE GPU.
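For reference, a build along these lines (CMake option names from memory, so double-check them against the current llama.cpp docs) should pick up AVX512 through the native flags and enable CUDA for the 3090 Ti:

# Sketch of a native (AVX512-capable) CUDA build; option names may vary between llama.cpp versions.
cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON
cmake --build build --config Release -j 16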
It probably depends on what inference engine you are using too; I'd like to try ktransformers at some point and compare it with the latest llama.cpp.
Good luck and thanks for all the quants!