Can't run model
Hey, so I've got the model loaded onto a node with 4x A100 80GB. It loads into memory OK, but crashes out with a CUDA error when I try to generate:
lots of:
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [32,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
and finally:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Let me know if you'd like the full error output.
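I'll re-run with CUDA_LAUNCH_BLOCKING=1 as the message suggests. A minimal sketch of the wrapper I'd use (the actual load/generate code is a placeholder):

```python
# Sketch: force synchronous kernel launches so the traceback points at the real
# failing op. The load/generate part is a placeholder for whatever triggers the crash.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # imported only after the env var is set

# ... load the model and call generate() here; with blocking launches the
# traceback should point at the indexing op that trips the assert rather than
# a later API call.
```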
Tried to load the model into text-generation-webui on RunPod with 2x A100 80GB. Indeed, it crashes with the above error.
The model needs only about 2x 77GB of memory to load at float16 precision, so 2x 80GB GPUs should be enough (unless I've overlooked something really bad).
I tried various settings; none of them worked at float16 precision. Loading in 8-bit, which fits on a single 80GB GPU, works, but then the model is quantized.
I'm interested in trying this model at full precision to know how good it actually is without quantization.
It does not have to be the Web UI; a script would do. I just need to get it working somehow for testing purposes.
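Something along these lines would be fine as a test script (a minimal sketch assuming transformers and accelerate are installed; the model id below is just a placeholder for the model being discussed):

```python
# Minimal standalone script (no web UI) for float16 inference sharded across
# the visible GPUs via accelerate's device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-65b"  # placeholder: substitute the model being tested

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # full float16, no quantization
    device_map="auto",          # shard layers across all visible GPUs
)

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

device_map="auto" is what lets accelerate place the float16 shards across both 80GB cards instead of trying to fit everything on one.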
+1. Tried with 4x A100 40GB and 8x 3090 24GB.
Yeah, I have recreated the problem, and have just confirmed it is not specific to this model. It also happens with huggyllama/llama-65B, for example.
I am doing my own testing and will report back soon. It may be an issue in transformers or accelerate.
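If anyone hitting this can post their versions, a quick dump like this would help compare the working and failing environments (just library and CUDA versions, nothing model-specific):

```python
# Quick environment dump to compare working vs. failing machines.
import torch
import transformers
import accelerate

print("torch       :", torch.__version__)
print("transformers:", transformers.__version__)
print("accelerate  :", accelerate.__version__)
print("CUDA runtime:", torch.version.cuda)
print("GPUs        :", [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])
```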
An update that might help: I have two RunPods running, both with 2x A100s.
Curiously, my notebook script works on one of them, but the exact same script reproduces the error above on the other.
The only difference I can ascertain is that the one that works is running an older NVIDIA driver, 525.105.17, whereas the one on the newer driver, 530.41.03, doesn't work and produces the error above.
EDIT: Yes, there's definitely a system config issue playing into this, causing problems with multi-GPU inference across a bunch of models. I have VMs that reliably work and VMs that reliably don't. Let me know if you want me to log the OS and system config to assist with diagnosis.
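For anyone comparing setups in the meantime, this is roughly what I'd capture on each VM (standard nvidia-smi queries only):

```python
# Record driver version and GPU interconnect topology for each host.
import subprocess

for cmd in (
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv"],
    ["nvidia-smi", "topo", "-m"],  # GPU-to-GPU interconnect / P2P topology matrix
):
    print("$", " ".join(cmd))
    print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```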
I rented 4x A6000 with a Xeon® Gold 6248, which works, while 4x A6000 with an AMD CPU or a Xeon® Silver fails. I'm not sure whether the CPU matters.
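Not sure if it's related, but one thing that could differ between these platforms is GPU peer-to-peer access; a speculative check (plain PyTorch, nothing model-specific):

```python
# Speculative check, not a confirmed cause: if the difference is platform-level
# (CPU / PCIe topology / driver), the GPU peer-to-peer access matrix is one
# thing that can differ between hosts. This only prints what each pair reports.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```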