Confusion about sampling for `flan-t5` models
I added some breakpoints to the following code:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small", device_map="auto")

input_ids = tokenizer("summarize: studies have shown that owning a dog is good for you", return_tensors="pt").input_ids  # Batch size 1
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```
I saw that this activates greedy generation mode in this function. However, the paper https://openreview.net/pdf?id=gEZrGCozdqR uses top-k sampling far more often than greedy decoding. Shouldn't the default be top-k sampling?
(More generally, I'm curious what the best practices for sampling from SOTA LLMs are. Top-p nucleus sampling seems to be the most common, but beam search and greedy decoding are also used, and the articles online rarely say which models use which strategy.)
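As a sanity check (a rough sketch only, assuming a transformers version recent enough to expose `model.generation_config`), the defaults that `generate()` falls back to can be read off directly:

```python
# Rough check of the defaults generate() uses when no decoding arguments are passed.
# With do_sample=False and num_beams=1, generate() runs plain greedy search.
print(model.generation_config.do_sample)  # False -> no stochastic sampling
print(model.generation_config.num_beams)  # 1 -> greedy search, not beam search
```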
I think greedy decoding is the simplest strategy that works across the widest variety of tasks. It covers summarisation and open-ended generation, but also things like generative classification or question answering. For the latter class of tasks you usually don't want stochastic sampling, especially if you have finetuned the model on a specific task, because you would then need to somehow choose one of the sampled hypotheses. So the defaults the library authors set cover all tasks, and people interested in summarisation or other tasks where diversity matters can consult the literature and set the parameters accordingly (see the sketch below). A table listing the major works and the parameters they use would be a great addition to the docs!
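As an illustration only (the specific values below are placeholders, not taken from the Flan paper), switching from greedy search to top-k / nucleus sampling is just a matter of passing the corresponding arguments to `generate()`:

```python
# Sketch: override the greedy default for a task where output diversity matters.
# The values here are illustrative; tune them per task / per paper.
outputs = model.generate(
    input_ids,
    do_sample=True,   # switch from greedy search to stochastic sampling
    top_k=50,         # keep only the 50 most likely tokens at each step
    top_p=0.95,       # nucleus sampling: smallest token set covering 95% probability
    temperature=0.7,  # soften/sharpen the distribution before sampling
    max_new_tokens=60,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```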
@ArthurConmy, not related to your question, but it may be useful for you to know nevertheless: could you check whether `model.generate(input_ids, min_new_tokens=30)` actually gives you 30 new tokens? This is related to the issue here or #38.
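A quick way to check (a sketch only; for an encoder-decoder model like T5, the whole output sequence is newly generated apart from the leading decoder start token):

```python
# Sketch: count how many tokens generate() actually produced with min_new_tokens set.
outputs = model.generate(input_ids, min_new_tokens=30)
new_tokens = outputs.shape[-1] - 1  # subtract the leading decoder start (pad) token
print(new_tokens)  # should be >= 30 if min_new_tokens is respected
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```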