High memory use for context with EXAONE
EXAONE uses a lot more memory for context compared to Qwen 2.5. Is this inherent to the model or is it something wrong with llama.cpp?
Hi, electroglyph.
Could you give us more information (e.g., GGUF type and llama-cli parameters) for testing?
When we compared EXAONE-3.5-2.4B-Instruct-BF16.gguf and qwen2.5-3b-instruct-fp16.gguf with the same parameters (llama-cli -cnv -m '...' -p '...') on CPU, EXAONE used less memory.
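For anyone who wants to reproduce the comparison, something along these lines should work (the model paths below are placeholders, not the exact invocation used above; -c sets the context size):

```sh
# hypothetical paths -- substitute wherever your GGUF files live
llama-cli -cnv -c 8192 -m models/EXAONE-3.5-2.4B-Instruct-BF16.gguf -p "You are a helpful assistant."
llama-cli -cnv -c 8192 -m models/qwen2.5-3b-instruct-fp16.gguf -p "You are a helpful assistant."
```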
okay, thanks. I've tested using some of the GPU backends, e.g. SYCL and Vulkan. My context limit is around 50% of what it is with Qwen 2.5 3B. I've tested several versions of llama.cpp so far. I'm going to do some more testing and I'll be back with more detailed information.
...my context limit is somewhere around 60K with EXAONE 2.4B, but I can hit 120K with Qwen 2.5 3B (no quantization). These small models are great for running in parallel, but the total context gets divided across however many parallel tasks I'm running, so the lower context limit forces me to reduce how many I run in parallel.
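For what it's worth, a ~2x gap is roughly what back-of-the-envelope KV-cache math predicts, assuming the published model configs. The layer/head counts below are from memory, so double-check them against each model's GGUF metadata:

```sh
# f16 KV cache bytes per token = 2 (K and V) * n_layers * n_kv_heads * head_dim * 2 bytes/elem
echo $(( 2 * 30 * 8 * 80 * 2 ))   # EXAONE 3.5 2.4B: 30 layers, 8 KV heads, head_dim 80  -> 76800
echo $(( 2 * 36 * 2 * 128 * 2 ))  # Qwen 2.5 3B:     36 layers, 2 KV heads, head_dim 128 -> 36864
```

If those numbers hold, EXAONE needs about 2.1x the KV-cache memory per token, which lines up with the 60K vs 120K limits above and would suggest this is inherent to the architecture (4x the KV heads, only partly offset by fewer layers and a smaller head_dim) rather than a llama.cpp bug.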