High memory use for context with EXAONE
EXAONE uses a lot more memory for context compared to Qwen 2.5. Is this inherent to the model or is it something wrong with llama.cpp?
Hi, electroglyph.
Could you give us more information (e.g., GGUF type and llama-cli parameters) for testing?
When we compared EXAONE-3.5-2.4B-Instruct-BF16.gguf and qwen2.5-3b-instruct-fp16.gguf with the same parameters (llama-cli -cnv -m '...' -p '...') on CPU, EXAONE used less memory.
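For anyone who wants to reproduce the comparison, something along these lines should work (the model paths below are placeholders, not the exact invocation used above; -c sets the context size):

```sh
# hypothetical paths -- substitute wherever your GGUF files live
llama-cli -cnv -c 8192 -m models/EXAONE-3.5-2.4B-Instruct-BF16.gguf -p "You are a helpful assistant."
llama-cli -cnv -c 8192 -m models/qwen2.5-3b-instruct-fp16.gguf -p "You are a helpful assistant."
```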
okay, thanks. I've tested using some of the GPU backends, e.g. SYCL and Vulkan. My context limit is around 50% of what it is with Qwen 2.5 3B. I've tested several versions of llama.cpp so far. I'm going to do some more testing and I'll be back with more detailed information.
...my context limit is somewhere around 60K with EXAONE 2.4B, but I can hit 120K with Qwen 2.5 3B (no quantization). These small models are great for running in parallel, but the total context gets divided across however many parallel tasks I'm running, so the lower context limit forces me to reduce how many I run in parallel.
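For what it's worth, a ~2x gap is roughly what back-of-the-envelope KV-cache math predicts, assuming the published model configs. The layer/head counts below are from memory, so double-check them against each model's GGUF metadata:

```sh
# f16 KV cache bytes per token = 2 (K and V) * n_layers * n_kv_heads * head_dim * 2 bytes/elem
echo $(( 2 * 30 * 8 * 80 * 2 ))   # EXAONE 3.5 2.4B: 30 layers, 8 KV heads, head_dim 80  -> 76800
echo $(( 2 * 36 * 2 * 128 * 2 ))  # Qwen 2.5 3B:     36 layers, 2 KV heads, head_dim 128 -> 36864
```

If those numbers hold, EXAONE needs about 2.1x the KV-cache memory per token, which lines up with the 60K vs 120K limits above and would suggest this is inherent to the architecture (4x the KV heads, only partly offset by fewer layers and a smaller head_dim) rather than a llama.cpp bug.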