Why is the input prompt part of the output?

#25
by neo-benjamin - opened

When I ran the code:

prompt = "Tell me about AI"
system_message = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

The output contains the prompt along with the model's response.
So how is the token budget calculated? Is the response limited to max_new_tokens, and do the prompt tokens plus max_new_tokens need to be less than 4096?

That's how model.generate() works: the returned sequence includes the prompt tokens. You could use a transformers text-generation pipeline() instead with return_full_text=False.
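
For example, here is a minimal sketch using the transformers pipeline with the model and tokenizer already loaded above (the generation settings are just illustrative):

from transformers import pipeline

# Build a text-generation pipeline around the already-loaded model and tokenizer
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    return_full_text=False,  # drop the prompt from the returned text
)

result = pipe(prompt_template)
print(result[0]['generated_text'])  # response only, without the prompt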

max_new_tokens is the number of tokens the model will generate in response to your prompt. The prompt tokens plus max_new_tokens must be less than 4096 (the model's context length).
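
A quick sanity check along those lines, reusing input_ids from the snippet above (the 4096 figure is Llama 2's context window):

# The prompt length in tokens plus the generation budget must fit in the context window
prompt_tokens = input_ids.shape[1]
max_new = 512
print(prompt_tokens, prompt_tokens + max_new)  # prompt size and total budget
assert prompt_tokens + max_new <= 4096, "prompt + max_new_tokens exceeds the 4096-token context"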

Is there a way to remove the input prompt from the output? I use the line below, but it still gives me the input prompt.
out_obj[0]['generated_text']

# Generate, then slice off the prompt tokens before decoding
output_ids = model.generate(input_ids)
output = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
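
model.generate() returns the prompt tokens followed by the newly generated ones, so slicing at input_ids.shape[1] keeps just the response before decoding.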
