Why is the input prompt part of the output?
#25 · opened by neo-benjamin
When I ran the code:
prompt = "Tell me about AI"
system_message = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
prompt_template = f'''[INST] <<SYS>>
{system_message}
<</SYS>>
{prompt} [/INST]
'''
print("\n\n*** Generate:")
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))
The output contains the prompt along with the model's response.
So how is the token count calculated? Is the response limited to max_new_tokens, and do the prompt tokens plus max_new_tokens need to be less than 4096?
That's how model.generate() works: it returns the prompt tokens together with the generated tokens. You could use a transformers text-generation pipeline instead, with return_full_text=False, so that only the newly generated text is returned.
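A minimal sketch of that pipeline approach, assuming model and tokenizer are already loaded and prompt_template is built as in the question (the sampling values simply mirror the earlier snippet):

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
)

# return_full_text=False drops the prompt, so generated_text is only the model's response
result = pipe(prompt_template, return_full_text=False)
print(result[0]["generated_text"])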
max_new_tokens is the number of tokens it will generate in response to your prompt. The prompt tokens plus max_new_tokens must fit within the 4096-token context window.
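For example, you can count the prompt tokens and cap max_new_tokens so the total stays inside the window (a sketch, assuming the tokenizer and prompt_template from the question):

context_window = 4096
prompt_tokens = tokenizer(prompt_template, return_tensors='pt').input_ids.shape[1]
# The new tokens must fit in whatever room the prompt leaves in the context window.
max_new = min(512, context_window - prompt_tokens)
print(f"prompt tokens: {prompt_tokens}, max_new_tokens to use: {max_new}")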
Is there a way to remove the input prompt from the output? I used the below, but it still gives me the input prompt.
out_obj[0]['generated_text']
output_ids = model.generate(input_ids)
# generate() returns the prompt ids followed by the newly generated ids,
# so slice off the first input_ids.shape[1] tokens before decoding.
output = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
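Slicing at input_ids.shape[1] removes the echoed prompt because generate() prepends the input ids to its output; skip_special_tokens additionally strips markers such as </s> from the decoded string.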