Deploying on Amazon SageMaker
I'm trying to run this model with Amazon SageMaker and am able to deploy it successfully on an ml.m5.xlarge instance.
Unfortunately, when invoking the endpoint I get a PredictionException which says: Could not load model /.sagemaker/mms/models/google__flan-t5-xxl with any of the following classes: (transformers.models.auto.modeling_auto.AutoModelForSeq2SeqLM, transformers.models.t5.modeling_t5.T5ForConditionalGeneration).
Has anyone had this issue or, even better, been able to deploy and use this model via SageMaker?
Interestingly, when I deploy and invoke an endpoint for the flan-t5-large model (on the same ml.m5.xlarge instance), I don't face any issues. I strongly suspect this is related to the model size, so I tried the biggest instance SageMaker offers (and that is available to me), an ml.p3.8xlarge, yet I still face the same issue.
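For context, the difference between the two checkpoints is easy to see on the Hub: flan-t5-large ships as a single weight file, while flan-t5-xxl is split into several multi-GB shards. A quick sketch for comparing them with huggingface_hub (the API calls are the standard ones, but the reported counts and sizes are whatever the Hub returns when you run it):

from huggingface_hub import HfApi

api = HfApi()
for repo in ["google/flan-t5-large", "google/flan-t5-xxl"]:
    info = api.model_info(repo, files_metadata=True)
    # Count the PyTorch weight shards and the total size of all files in the repo.
    weight_files = [f.rfilename for f in info.siblings if f.rfilename.endswith(".bin")]
    total_gb = sum((f.size or 0) for f in info.siblings) / 1e9
    print(f"{repo}: {len(weight_files)} weight file(s), ~{total_gb:.1f} GB total")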
My code for deployment looks like this (very similar to what HuggingFace provides, but not exactly the same):
from sagemaker.huggingface import HuggingFaceModel
import boto3
import os

AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'HF_TASK': 'text2text-generation'
}

iam_client = boto3.client('iam')
# IAM role
role = iam_client.get_role(RoleName='my-role-with-sagemaker-access')['Role']['Arn']

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,      # number of instances
    instance_type='ml.m5.xlarge'   # ec2 instance type
)
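One thing worth noting: flan-t5-xxl is stored as multiple sharded weight files, and to the best of my knowledge sharded-checkpoint loading only landed in transformers after 4.17, so the 4.17.0 container pinned above may be unable to assemble the xxl weights even on a big instance (flan-t5-large, being a single file, would be unaffected, which matches what I see). A sketch of the same deployment against a newer inference DLC follows; the version strings and instance type are assumptions, so check which combinations the SageMaker SDK actually supports:

from sagemaker.huggingface import HuggingFaceModel
import boto3

# Same hub configuration as above.
hub = {
    'HF_MODEL_ID': 'google/flan-t5-xxl',
    'HF_TASK': 'text2text-generation'
}

role = boto3.client('iam').get_role(RoleName='my-role-with-sagemaker-access')['Role']['Arn']

# Assumed newer container versions (with sharded-checkpoint support); the exact
# supported combination may differ, so treat this as a sketch.
huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role,
)

# ml.m5.xlarge has only 16 GB of RAM, which is unlikely to hold the xxl weights
# in full precision, so a larger instance is probably needed regardless
# (the instance type below is an assumption, not a tested configuration).
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.12xlarge'
)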
And my invocation looks like this:
import boto3
from sagemaker.serializers import JSONSerializer
import json

client = boto3.client('sagemaker-runtime')
endpoint_name = "huggingface-pytorch-inference-XXXXXXXXX"

# The MIME type of the input data in the request body.
content_type = "application/json"
# The desired MIME type of the inference in the response.
accept = "application/json"

# Payload for inference.
payload = {
    "inputs": "The capital of Germany is",
    "parameters": {
        "temperature": 0.7,
    },
    "options": {
        "use_cache": False,
    },
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Accept=accept,
    Body=JSONSerializer().serialize(payload)
)

pred = json.loads(response['Body'].read())
print(pred)

prediction = pred[0]['generated_text']
print(prediction)
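For completeness, the same request can also be sent through the Predictor object returned by deploy(), which handles JSON serialization and deserialization itself; this assumes the predictor variable from the deployment snippet above is still in scope:

# Uses the `predictor` returned by huggingface_model.deploy() above.
result = predictor.predict({
    "inputs": "The capital of Germany is",
    "parameters": {"temperature": 0.7},
})
print(result[0]['generated_text'])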
I'm encountering the same issue. Did you find a workaround?
Facing the same issue. I have raised a GitHub issue here:
https://github.com/huggingface/transformers/issues/21402