torch.cuda.OutOfMemoryError: CUDA out of memory
I am getting a "torch.cuda.OutOfMemoryError: CUDA out of memory" error when I deploy using the AWS SageMaker sample code. Can anyone help? Thanks in advance.
Code:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub model configuration: https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'defog/sqlcoder-70b-alpha',
    'SM_NUM_GPUS': json.dumps(1),
}

# Create the Hugging Face Model class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.4.2"),
    env=hub,
    role=role,
)

# Deploy the model to a SageMaker inference endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

# Send a request
predictor.predict({
    "inputs": "My name is Julien and I like to",
})
Your GPU instance doesn't have enough VRAM to load the model directly.
defog/sqlcoder-70b-alpha has 70B parameters, so in fp16 the weights alone are about 70B × 2 bytes ≈ 140 GB. To load the model directly, your instance needs at least 140 GB of GPU memory.
Your code selects ml.g5.2xlarge. Per https://aws.amazon.com/sagemaker/pricing/, a g5.2xlarge has only 24 GB of VRAM (a single A10G) plus 32 GiB of system RAM.
You may try quantizing the model to int4, which shrinks the weights to roughly 35 GB, and then use offloading to spread the model across disk, RAM, and VRAM during deployment.
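For example, the Hugging Face LLM (TGI) container can quantize on the fly via an environment variable. A minimal sketch, assuming HF_MODEL_QUANTIZE maps to TGI's --quantize flag in this image version and that 'bitsandbytes-nf4' (4-bit NF4) is supported by it:

# Sketch: request on-the-fly int4 quantization from the TGI container.
# HF_MODEL_QUANTIZE and the 'bitsandbytes-nf4' value are assumptions to
# verify against the container version you deploy.
hub = {
    'HF_MODEL_ID': 'defog/sqlcoder-70b-alpha',
    'SM_NUM_GPUS': json.dumps(1),
    'HF_MODEL_QUANTIZE': 'bitsandbytes-nf4',
}

Note that even at ~35 GB, the quantized weights plus KV cache still exceed the 24 GB of VRAM on a g5.2xlarge, so quantization alone isn't enough; you would combine it with offloading or a bigger instance.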
You may also offload the majority of the model weights onto disk, but this will dramatically increase inference latency. You can ask GPT-4o how to use Hugging Face to do offloading.
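For illustration, here is a minimal offloading sketch with transformers and accelerate, run outside the TGI container (e.g. in a notebook or a custom inference script). The max_memory limits and the offload folder path are assumptions you would tune to your hardware:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "defog/sqlcoder-70b-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate fill the GPU first, then CPU RAM,
# and spill the remaining layers to disk under offload_folder.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "28GiB"},  # assumed limits for a g5.2xlarge
    offload_folder="/tmp/offload",            # needs enough free disk for the weights
)

inputs = tokenizer("My name is Julien and I like to", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Expect generation to be very slow with most layers on disk; this is a way to get the model running at all, not a production setup.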
Alternatively, you can use a bigger instance that provides more VRAM.
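For example (check availability, quotas, and pricing for your region first), an ml.p4d.24xlarge has 8× A100 40 GB = 320 GB of total VRAM, enough for TGI to shard the fp16 weights across GPUs. A sketch reusing the variables from your code:

# Sketch: redeploy on a multi-GPU instance and shard across all 8 GPUs.
# ml.p4d.24xlarge is one option; SM_NUM_GPUS tells TGI how many shards to use.
hub = {
    'HF_MODEL_ID': 'defog/sqlcoder-70b-alpha',
    'SM_NUM_GPUS': json.dumps(8),
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.4.2"),
    env=hub,
    role=role,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    container_startup_health_check_timeout=600,  # 140 GB of weights takes a while to load
)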