Finetune 35B Model

#6 opened by amitbcp

The notebook is for finetuning the 8B model, not the 35B one.
On changing the model name to the 35B checkpoint, it throws an error because the device map is not set up correctly.

Kindly provide the correct device mapping, or please update the finetuning notebook.

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
```

This error is caused by tensors living on different GPUs: some are on device cuda:0 (the first GPU) while others are on cuda:1 (the second GPU). PyTorch requires all tensors involved in a single operation to be on the same device. With a checkpoint as large as 35B, this typically happens when the model gets sharded across several GPUs while other tensors sit on just one of them.
To solve this, make sure the model and all of its input tensors are placed on the same GPU.

Here is the loading code made self-contained (the quantization and attention settings are assumed from earlier cells in the notebook):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_NAME = "CohereForAI/aya-23-35B"

# Determine the GPU to use
gpu_id = 0  # use the first GPU (cuda:0); change to 1, 2, ... as needed
device = torch.device(f"cuda:{gpu_id}" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    torch.cuda.set_device(device)  # set the active GPU

# Assumed from earlier in the notebook: 4-bit quantization for QLoRA-style finetuning
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
attn_implementation = "flash_attention_2"  # or "sdpa" if flash-attn isn't installed

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    attn_implementation=attn_implementation,
    torch_dtype=torch.bfloat16,
    device_map=device,  # <<== load the whole model onto the single active GPU
)
```

These changes resolve the cross-device error by ensuring that all operations occur on the same GPU.
If you have more than one GPU in your system and are still experiencing problems, you can check the available GPUs and their utilization with the `nvidia-smi` command. You can also control which GPUs are visible to the process by setting the `CUDA_VISIBLE_DEVICES` environment variable.
