Why is the input prompt part of the output?

#25
by neo-benjamin - opened

When I ran the code:

prompt = "Tell me about AI"
system_message = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]
'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

The output contains the prompt along with the model's response.
So how is the token budget calculated? Is the response limited to max_new_tokens, and do the prompt tokens plus max_new_tokens need to be less than 4096?

That's how model.generate() works: the returned sequence includes the prompt tokens. You could use a transformers text-generation pipeline() instead with return_full_text=False.
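
For example, here is a minimal sketch using the transformers pipeline with the model and tokenizer already loaded above (the generation settings are just illustrative):

from transformers import pipeline

# Build a text-generation pipeline around the already-loaded model and tokenizer
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    return_full_text=False,  # drop the prompt from the returned text
)

result = pipe(prompt_template)
print(result[0]['generated_text'])  # response only, without the prompt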

max_new_tokens is the number of tokens the model will generate in response to your prompt. The prompt tokens plus max_new_tokens must be less than 4096 (the model's context length).
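
A quick sanity check along those lines, reusing input_ids from the snippet above (the 4096 figure is Llama 2's context window):

# The prompt length in tokens plus the generation budget must fit in the context window
prompt_tokens = input_ids.shape[1]
max_new = 512
print(prompt_tokens, prompt_tokens + max_new)  # prompt size and total budget
assert prompt_tokens + max_new <= 4096, "prompt + max_new_tokens exceeds the 4096-token context"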

Is there a way to remove the input prompt from the output? I use the line below, but it still gives me the input prompt.
out_obj[0]['generated_text']

# Generate, then slice off the prompt tokens before decoding
output_ids = model.generate(input_ids)
output = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
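
model.generate() returns the prompt tokens followed by the newly generated ones, so slicing at input_ids.shape[1] keeps just the response before decoding.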
