Prompt style documentation
Hi, I've been wanting to try out your model for a personal project I'm working on, and I was wondering if you could provide some more documentation regarding the prompt style that needs to be used.
I've been using LlamaIndex, and its default chat message roles look like this:
from enum import Enum

class MessageRole(str, Enum):
    """Message role."""

    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
    FUNCTION = "function"
    TOOL = "tool"
I wanted to know what the equivalents are for this model, and how prompts should be structured in general, since even following the example in the README I cannot get it to run properly: sometimes it keeps generating even after the first message, and sometimes the generated response is missing the final vertical bar "|".
See below.
Sorry, I didn't include a prompt format specification in the README. I have updated it: https://huggingface.co/galatolo/cerbero-7b#prompt-format
The prompt is:
[|Umano|] First human message
[|Assistente|] First AI reply
[|Umano|] Second human message
[|Assistente|] Second AI reply
When crafting prompts, make sure to conclude with the [|Assistente|] tag, which signals the AI to generate a response. Use [|Umano|] as the stop word. For example:
[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]
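If you want to map LlamaIndex-style roles onto this format, something like the following should work. This is only a rough sketch: the build_prompt helper is my own illustration, turns are joined with newlines as in the example above, and since the format defines no dedicated system tag I simply fold any system message into a human turn.

ROLE_TAGS = {
    "user": "[|Umano|]",
    "assistant": "[|Assistente|]",
}

def build_prompt(messages):
    # messages: list of (role, content) pairs, e.g. from your chat history
    parts = []
    for role, content in messages:
        if role == "system":
            # No dedicated system tag in this format: treat it as a human turn (assumption)
            parts.append(f"[|Umano|] {content}")
        else:
            parts.append(f"{ROLE_TAGS[role]} {content}")
    # Always end with the assistant tag so the model knows it has to answer
    parts.append("[|Assistente|]")
    return "\n".join(parts)

prompt = build_prompt([("user", "Come posso distinguere un AI da un umano?")])
# -> "[|Umano|] Come posso distinguere un AI da un umano?\n[|Assistente|]"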
Hi, I typically use vLLM for inference, which supports this feature by default.
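For reference, with vLLM the stop word can simply be passed through SamplingParams. A rough, untested sketch (the sampling settings and prompt are just examples):

from vllm import LLM, SamplingParams

# Stop generation as soon as the model starts a new [|Umano|] turn
llm = LLM(model="galatolo/cerbero-7b")
params = SamplingParams(max_tokens=128, stop=["[|Umano|]"])

prompt = "[|Umano|] Come posso distinguere un AI da un umano?\n[|Assistente|]"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)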
To halt generation at a multi-token word such as [|Umano|] using Hugging Face Transformers, you can follow the approach outlined in this discussion and define a custom stopping criterion:
from transformers import StoppingCriteria

class MyStoppingCriteria(StoppingCriteria):
    def __init__(self, target_sequence, prompt):
        self.target_sequence = target_sequence
        self.prompt = prompt

    def __call__(self, input_ids, scores, **kwargs):
        # Convert the generated token IDs to text and remove the initial prompt
        # (assumes a `tokenizer` is already defined in the surrounding scope)
        generated_text = tokenizer.decode(input_ids[0]).replace(self.prompt, '')
        # Halt generation if the target sequence is found
        return self.target_sequence in generated_text

    def __len__(self):
        return 1

    def __iter__(self):
        yield self
Then, incorporate this criterion into the generate function:
model.generate(input_ids, max_new_tokens=128, stopping_criteria=MyStoppingCriteria("[|Umano|]", prompt))
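The __len__ and __iter__ methods are only there so that the single criterion can be passed where generate expects a list-like object; equivalently, you can wrap it in a StoppingCriteriaList yourself:

from transformers import StoppingCriteriaList

stopping_criteria = StoppingCriteriaList([MyStoppingCriteria("[|Umano|]", prompt)])
model.generate(input_ids, max_new_tokens=128, stopping_criteria=stopping_criteria)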
While I haven't tested this myself, the logic appears to be sound.
It worked perfectly! Thank you, you're amazing.
I'm posting the code below for completeness. I'm running the whole thing in an HF Inference Endpoint with this custom handler:
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria
import torch
from typing import Dict

class MyStoppingCriteria(StoppingCriteria):
    """Necessary for multi-token EOS words."""

    def __init__(self, target_sequence, prompt, tokenizer):
        self.target_sequence = target_sequence
        self.prompt = prompt
        self.tokenizer = tokenizer

    def __call__(self, input_ids, scores, **kwargs):
        # Convert the generated token IDs to text and remove the initial prompt
        generated_text = self.tokenizer.decode(input_ids[0]).replace(self.prompt, '')
        # Halt generation if the target sequence is found
        return self.target_sequence in generated_text

    def __len__(self):
        return 1

    def __iter__(self):
        yield self
class EndpointHandler():
    def __init__(self, path=""):
        # Variables
        model_id = "galatolo/cerbero-7b-openchat"
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

        # Model (GPU)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch_dtype,
            low_cpu_mem_usage=True
        )
        self.model.to(self.device)

        # Tokenizer (CPU)
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def __call__(self, data: Dict[str, str]) -> str:
        # Read the input
        prompt = data.pop("inputs", data)
        stopping_criteria = MyStoppingCriteria("[|Umano|]", prompt, self.tokenizer)

        # Encode
        input_ids = self.tokenizer(prompt, return_tensors='pt').input_ids
        input_ids = input_ids.to(self.device)

        # Generate
        with torch.no_grad():
            output_ids = self.model.generate(input_ids, max_new_tokens=2048, stopping_criteria=stopping_criteria)

        # Decode
        generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return generated_text
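For reference, the handler can be smoke-tested locally like this (the prompt is just an example; the deployed endpoint receives the same {"inputs": ...} payload):

handler = EndpointHandler()
prompt = "[|Umano|] Come posso distinguere un AI da un umano?\n[|Assistente|]"
print(handler({"inputs": prompt}))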
Thank you for providing the full code!