OOM with 24 GB VRAM

#1
by Klopez - opened

Anyone else experiencing this? I have a 3090 with 24 GB of VRAM, and I tried loading this via vLLM and got an OOM even with --max-model-len set to 1000. Is it possible to do INT8 rather than FP8?

Neural Magic org

Try also setting --max-num-seqs=1. Unfortunately, the KV cache required to run this model is very large at the moment because of how vision models are profiled.
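Roughly, combined with a small --max-model-len, that looks like the sketch below using the offline Python API (the kwargs mirror the CLI flags; the model id is just a placeholder for this checkpoint, and defaults can differ between vLLM versions):

```python
from vllm import LLM

# Minimal sketch, assuming a 24 GB GPU and this FP8 checkpoint.
# The repo id below is a placeholder: substitute the actual model name.
llm = LLM(
    model="neuralmagic/<this-fp8-checkpoint>",  # placeholder repo id
    max_model_len=1000,           # cap the context length so less KV cache is reserved
    max_num_seqs=1,               # profile memory for a single concurrent sequence
    gpu_memory_utilization=0.90,  # fraction of the 24 GB the engine may claim
)

# Generation then works as usual.
outputs = llm.generate("Hello")
print(outputs[0].outputs[0].text)
```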

Thank you for that. It seems to have helped, but wow, I didn't expect that with such a small model. Could you link me to where I can read more about this?

Neural Magic org

We have an issue tracking this here: https://github.com/vllm-project/vllm/issues/8826, so maybe you could add your experience there?
