70B with multiple A5000s
Are two A5000s with 24GB each enough for handling the 70B model?
Yes, that will work. I recommend using ExLlama for maximum performance. You need to load less of the model on GPU 1 - a recommended split is 17.2GB on GPU 1 and 24GB on GPU 2. This leaves room for context on GPU 1.
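For reference, a minimal sketch of setting that split with the standalone exllama repo (this assumes its ExLlamaConfig.set_auto_map API, is run from inside the exllama checkout, and uses a placeholder model directory):

import os, glob
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_directory = "/path/to/Llama-2-70B-GPTQ"  # placeholder: local download of the GPTQ model
config = ExLlamaConfig(os.path.join(model_directory, "config.json"))
config.model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]
config.set_auto_map("17.2,24")  # ~17.2GB on GPU 1, 24GB on GPU 2, leaving GPU 1 headroom for context

model = ExLlama(config)
tokenizer = ExLlamaTokenizer(os.path.join(model_directory, "tokenizer.model"))
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=32))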
@TheBloke how do I spread the workload across multiple GPUs? The default example is:
from auto_gptq import AutoGPTQForCausalLM

# model_name_or_path, model_basename and use_triton are defined earlier in the README example
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        inject_fused_attention=False,  # Required for Llama 2 70B model at this time.
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)
How do I define this split?
This is probably a dumb question, but using ExLlama or ExLlama HF isn't enough to run this on a 4090, is it?
Maybe it would work if I could split it with my 11900K (offloading part of it to the CPU), but I don't know how to do that.
@TheBloke can you please help with this?
@neo-benjamin Add the max_memory parameter.
Reference: https://huggingface.co/TheBloke/Llama-2-70B-GPTQ/discussions/9
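A minimal sketch of what that looks like (the memory caps below are assumptions, not tuned values - leave headroom on GPU 0 for context, and the "cpu" entry is only needed if you want to offload the remainder to system RAM):

from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-70B-GPTQ"
model_basename = "model"  # adjust to match the actual .safetensors basename in the branch you use

# Per-device caps that accelerate uses when building the device map.
max_memory = {
    0: "17GiB",      # GPU 0: keep headroom for context/activations
    1: "23GiB",      # GPU 1
    "cpu": "64GiB",  # optional: spill any remainder to system RAM
}

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        inject_fused_attention=False,  # Required for Llama 2 70B at this time.
        use_safetensors=True,
        trust_remote_code=False,
        max_memory=max_memory,         # replaces device="cuda:0" from the single-GPU example
        use_triton=False,
        quantize_config=None)

With max_memory set, the weights are sharded across both GPUs (and onto CPU RAM if the caps are too small), so the single-GPU device argument is no longer needed.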