SageMaker Deployment Failing on ml.g5.2xlarge Instance
I am getting the error below in CloudWatch. We are trying to deploy the model on an ml.g5.2xlarge instance. Is there a resolution for this, or do we need to deploy it on a bigger instance?
```
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 20.61 GiB
Requested : 172.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 15.12 MiB
PyTorch limit (set by user-supplied memory fraction): 22.20 GiB

The above exception was the direct cause of the following exception:
```
The model can be deployed on g5.xlarge with `torch.bfloat16`.
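For reference, here is a minimal sketch of what loading the model in `torch.bfloat16` looks like with the plain `transformers` API (outside SageMaker); the prompt is just a placeholder:

```python
# Minimal sketch: load NumbersStation/nsql-llama-2-7B in bfloat16 with plain
# transformers. Assumes a single GPU with enough VRAM for the bf16 weights
# (roughly half of what float32 needs).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NumbersStation/nsql-llama-2-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights instead of float32
).to("cuda")

inputs = tokenizer("SELECT", return_tensors="pt").to("cuda")  # placeholder prompt
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```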
Thanks @senwu. Can you please tell me how to set the `torch.bfloat16` configuration in the deployment script? Sorry, I am new to this and don't know many of these configs. Below is the deployment script I am using:
```python
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20230723T133694')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'NumbersStation/nsql-llama-2-7B',
    'SM_NUM_GPUS': json.dumps(1),
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

predictor.predict({
    "inputs": "Can you please let us know more details about your ",
})
```
Hi @rishisaraf11,
We haven't used SageMaker to deploy the model, and from the docs it doesn't seem like there is much flexibility. The model prefers `torch.bfloat16`, but you can still use other dtypes.
Hi @senwu, I tried different variations of passing `SM_FRAMEWORK_PARAMS` into `env` for the `HuggingFaceModel` class in the script shared by @rishisaraf11, but no luck:
```python
hub = {
    'HF_MODEL_ID': 'NumbersStation/nsql-llama-2-7B',
    'SM_NUM_GPUS': json.dumps(1),
    'SM_FRAMEWORK_PARAMS': "{'torch_dtype': 'bfloat16'}",
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="0.9.3"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)
```
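As a side note, I'm not sure the TGI-based LLM container reads `SM_FRAMEWORK_PARAMS` at all. Newer versions of text-generation-inference accept a `--dtype` launcher option which, as far as I know, can also be set through a `DTYPE` environment variable; I haven't verified that the 0.9.3 image supports it, so treat this as an untested sketch:

```python
# Untested sketch: pass the dtype to the TGI launcher via an environment
# variable. Assumption: the container version forwards DTYPE to --dtype.
hub = {
    'HF_MODEL_ID': 'NumbersStation/nsql-llama-2-7B',
    'SM_NUM_GPUS': json.dumps(1),
    'DTYPE': 'bfloat16',  # assumption: honored by this container version
}
```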
It seems like SageMaker doesn't have full Transformers support yet. You can use the default config for the model as well. You can also use a g5.2xlarge machine, or `low_cpu_mem_usage=True` from https://huggingface.co/docs/transformers/main_classes/model to reduce the RAM usage when loading the model.
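If you load the model with plain `transformers`, that flag goes straight into `from_pretrained`; a minimal sketch combining it with the bfloat16 suggestion above:

```python
# Minimal sketch: low_cpu_mem_usage=True keeps peak host RAM close to one copy
# of the weights while loading, instead of materializing the model twice.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "NumbersStation/nsql-llama-2-7B",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
```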
Thank you for the reply @senwu. The problem seems to be an overflow of the GPU VRAM limit, which is ~22.2 GiB for ml.g5.2xlarge (NVIDIA A10G with 24 GB of GPU memory).
Error: SageMaker deployment failed due to memory error

```
torch.cuda.OutOfMemoryError: Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 20.61 GiB
Requested : 172.00 MiB
Device limit : 22.20 GiB
Free (according to CUDA): 15.12 MiB
PyTorch limit (set by user-supplied memory fraction): 22.20 GiB
```
The `torch.float32` version of the model requires around 26 GB of VRAM. We will adjust the default model dtype this week.
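For a rough sanity check on those numbers (back-of-envelope only, ignoring activations, the KV cache, and CUDA overhead):

```python
# Back-of-envelope weight memory for a ~7B-parameter model.
params = 7e9
print(f"float32 : {params * 4 / 2**30:.1f} GiB")  # ~26.1 GiB -> over the 22.2 GiB limit
print(f"bfloat16: {params * 2 / 2**30:.1f} GiB")  # ~13.0 GiB -> fits on an A10G
```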