What GPU size is required to run this? Is a 4090 possible, and does it support Ollama?

#5
by sminbb - opened

> What GPU size is required to run this? Is a 4090 possible, and does it support Ollama?

A 4090 should be good enough. Yes, Ollama would be helpful since these are GGUF files; however, you will have to import the GGUF into Ollama.
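A minimal sketch of importing a local GGUF into Ollama via a `Modelfile` (the filename below is hypothetical; for a split GGUF, point `FROM` at the first shard):

```
# Modelfile -- hypothetical path to the downloaded quant
FROM ./DeepSeek-V3-Q2_K_XL-00001-of-00005.gguf
```

Then register and run it with `ollama create deepseek-v3 -f Modelfile` followed by `ollama run deepseek-v3`.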

Unsloth AI org

> What GPU size is required to run this? Is a 4090 possible, and does it support Ollama?

Yes, a 4090 is enough. You don't even need a GPU; a CPU with 48GB of RAM will be enough.

At the moment Ollama does not support it as far as I'm aware, so you will need to use llama.cpp.
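A CPU-only llama.cpp invocation might look like this (the model path is hypothetical; `llama-cli` is built from the llama.cpp repo):

```
# With --n-gpu-layers 0 everything runs on CPU; the weights are
# mmap'd from disk and paged through RAM as needed.
./llama-cli -m DeepSeek-V3-Q2_K_XL-00001-of-00005.gguf \
    -p "Hello" --n-gpu-layers 0
```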

Wait... you're telling me that if I have, say:

- AMD Ryzen 7 5700X (8 cores, 2 threads per core)
- RTX 4090
- 64GB DDR4 RAM

...I could run a DeepSeek V3 quant?

Unsloth AI org

> Wait... you're telling me that if I have, say:
>
> - AMD Ryzen 7 5700X (8 cores, 2 threads per core)
> - RTX 4090
> - 64GB DDR4 RAM
>
> ...I could run a DeepSeek V3 quant?

Yes, that is correct, but it will probably be slow.

But you need enough RAM to load the whole model into memory, no?

Unsloth AI org

> But you need enough RAM to load the whole model into memory, no?

Nope, you actually don't, but it will be slow. With a GPU and layer offloading it will be faster.
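With a 24GB 4090 you can offload some layers to VRAM via llama.cpp's `--n-gpu-layers` (`-ngl`); a sketch, where both the model path and the layer count are illustrative guesses rather than tuned values:

```
# Offload a subset of layers to the GPU; raise -ngl until VRAM is full.
./llama-cli -m DeepSeek-V3-Q2_K_XL-00001-of-00005.gguf -p "Hello" -ngl 8
```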

You mean if I have a Mac mini with 64GB, is that enough to run this model? And when you say slow, how much slowness are we talking about, taking the Mac mini as an example?

Unsloth AI org

> You mean if I have a Mac mini with 64GB, is that enough to run this model? And when you say slow, how much slowness are we talking about, taking the Mac mini as an example?

Yes, you can definitely run the model if you use the 2-bit quant. As for slowness, you might get 1.5 tokens or fewer per second.

In some Discord channel (I think LM Studio's), this was discussed extensively, and everyone said that you definitely need that much RAM to load the model, and that inference then needs less memory since it's MoE. So it's quite interesting that you don't need that much RAM even to load it. But if the complete model isn't loaded, how does it decide which params to activate during inference? (I'm a bit of a newbie, so I might be confused in asking these questions.)
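A rough back-of-the-envelope on why this works, assuming DeepSeek V3's published 671B total / 37B active parameters and a uniform 2-bit quant (an assumption for simplicity; real dynamic quants keep some layers at higher precision, so actual files are larger):

```python
def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF size estimate: parameter count times bits per weight.

    params_billion is in billions of parameters, so dividing bits by 8
    yields gigabytes directly.
    """
    return params_billion * bits_per_weight / 8

# Whole model on disk vs. the MoE expert weights actually touched per token.
total_gb = quant_size_gb(671, 2)   # full file, ~168 GB
active_gb = quant_size_gb(37, 2)   # weights read per token, ~9 GB

print(f"total = {total_gb} GB, active per token = {active_gb} GB")
```

Because llama.cpp memory-maps the file, pages for experts that a given token doesn't activate can stay on disk; that is how a 48-64GB machine can run it at all, just slowly, since cold experts must be paged in from storage.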
