Deploying Mistral-7B-Instruct-v0.2 on AWS SageMaker.
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'mistralai/Mistral-7B-Instruct-v0.2',
    'SM_NUM_GPUS': json.dumps(1)
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.1.0"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=300,
)

# send request
predictor.predict({
    "inputs": "My name is Julien and I like to",
})
This code is giving an error.
Update the LLM image version passed to get_huggingface_llm_image_uri from "1.1.0" to "1.3.3".
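That is, the only line that changes is the image URI:

# pull the newer Hugging Face LLM (TGI) container image
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.3.3")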
Hey, does it seem to be working so far? I'm trying to figure out a way to run it locally without sacrificing too much quality with a quantized version. My computer is an 8 GB MacBook... what would you suggest?
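One route that tends to fit an 8 GB machine is a 4-bit GGUF quantization of the model run through llama-cpp-python. A minimal sketch, assuming the package is installed (pip install llama-cpp-python) and a GGUF file has already been downloaded; the file name below is an assumption based on common community uploads:

from llama_cpp import Llama

# load a 4-bit quantized Mistral; this local path is hypothetical
llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=2048,  # smaller context windows use less memory
)

out = llm("My name is Julien and I like to", max_tokens=64)
print(out["choices"][0]["text"])

A Q4_K_M file for a 7B model is roughly 4-5 GB, which leaves some headroom on 8 GB of RAM.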
I had been loading the model with AutoModelForCausalLM using this code:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.float16, attn_implementation="flash_attention_2")
I want to deploy the model on SageMaker. Is this the right way to load the model with flash attention?
# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'mistralai/Mistral-7B-Instruct-v0.2',
    'SM_NUM_GPUS': json.dumps(1),
    'HF_TASK': 'text-generation',
    'attn_implementation': "flash_attention_2",
    'torch_dtype': 'torch.float16'
}
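For what it's worth, a sketch rather than a confirmed answer: attn_implementation and torch_dtype are keyword arguments to transformers' from_pretrained, not environment variables the TGI container reads, so the last two entries in that hub dict would most likely be ignored; the LLM container ships its own attention kernels and selects them automatically. A config that sticks to env vars the container documents would look more like this (the MAX_* values are assumptions to tune per workload):

# sketch: keep only env vars the Hugging Face LLM (TGI) container reads;
# MAX_INPUT_LENGTH / MAX_TOTAL_TOKENS map to text-generation-inference
# launcher arguments
hub = {
    'HF_MODEL_ID': 'mistralai/Mistral-7B-Instruct-v0.2',
    'SM_NUM_GPUS': json.dumps(1),
    'MAX_INPUT_LENGTH': json.dumps(1024),   # assumption: adjust to your prompts
    'MAX_TOTAL_TOKENS': json.dumps(2048),   # assumption: prompt + generation budget
}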