T0 vs T5
I'd imagine that T0 has better cross-task transfer learning because its instructions are more verbose and it doesn't rely on memorizing task names to perform certain tasks, such as "cola sentence: John made Bill master of himself." In other words, beyond zero-shot generalization, I think T0 might even do better on the individual tasks that the T5 models were trained on.
Is there any comparison between T0 and T5 on the benchmarks that T5 reports on? I'm surprised that I can't find these results anywhere.
Hi @wise-east!
In the T0 paper, we compared against T5+LM, which is T5 further pre-trained with a standard language modeling loss. This is a much more sensible baseline given T5's MLM loss (IIRC, I did run some comparisons with T5 and it did not perform as well as T5+LM).
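For illustration, here is a rough sketch of the kind of zero-shot comparison being discussed, using the public Hugging Face checkpoints as stand-ins for T0 and T5+LM. This is not the paper's evaluation harness, and the prompt wording below is made up for the example:

```python
# Rough zero-shot comparison sketch, not the T0 paper's evaluation code.
# Checkpoint names are the public HF releases (assumptions for this example):
#   bigscience/T0_3B       -> T0 (3B variant)
#   google/t5-xl-lm-adapt  -> T5 further trained with an LM objective ("T5+LM")
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# A natural-language version of the CoLA example from the question above,
# instead of T5's "cola sentence:" task prefix.
prompt = (
    "Is the following sentence grammatically acceptable? "
    "\"John made Bill master of himself.\" Answer yes or no."
)

for name in ["bigscience/T0_3B", "google/t5-xl-lm-adapt"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=5)
    print(name, "->", tokenizer.decode(output_ids[0], skip_special_tokens=True))
```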
Hey @VictorSanh, thanks for the quick response! I'm sorry, I think my question was not clear. It seems clear that T0 is better than the T5 variants for zero-shot generalization (i.e., without being fine-tuned on any of the training sets of the task being evaluated).
I'm wondering whether fine-tuning T0 on the training set of the downstream evaluation task attains better performance than a T5 model that is also fine-tuned on that training set, and how big the difference is. More specifically, where would T0's performance land if it were included in Table 14 of the T5 paper (shown in the image)? Or has T0 been trained on the test sets of the datasets in these benchmarks, so that it isn't comparable? That didn't seem to be the case based on my read of the T0 paper.
Oooh, I see.
I think this is an experimental question, and I don't have a strong intuition about what the exact conclusion would be: papers that do "intermediate fine-tuning" of this sort observe either negative or positive transfer depending on the task.
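If someone wants to try that experiment, a minimal fine-tuning sketch might look like the following. The checkpoint, the CoLA verbalization, and the hyperparameters here are assumptions for illustration, not the setup from either paper:

```python
# Sketch: fine-tune T0 on a downstream task's training set, so it can be
# compared against a T5 model fine-tuned the same way.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "bigscience/T0_3B"  # assumed checkpoint for the example
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

raw = load_dataset("glue", "cola")

def preprocess(batch):
    # Verbalize CoLA as a natural-language prompt rather than T5's "cola sentence:" prefix.
    inputs = [f"Is this sentence grammatically acceptable? {s}" for s in batch["sentence"]]
    targets = ["yes" if label == 1 else "no" for label in batch["label"]]
    model_inputs = tokenizer(inputs, truncation=True)
    model_inputs["labels"] = tokenizer(targets, truncation=True)["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t0-cola-finetune",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=1e-4,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```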
Does anyone know if T0pp can only handle 512 tokens? I need a model that can handle 2048 tokens. When I print `print(tokenizer.model_max_length)` it shows 512, while my input training data is 1024 tokens long. I am wondering why?
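Not a definitive answer, but the 512 appears to come from the tokenizer config rather than a hard architectural limit, since T0pp inherits T5's relative position embeddings. A minimal sketch of checking and overriding it (2048 below is just the target length from the question):

```python
# Sketch of the check described above, assuming the Hugging Face "bigscience/T0pp" checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/T0pp")
print(tokenizer.model_max_length)  # 512, taken from the tokenizer config

# The advertised limit can be overridden, or max_length passed explicitly at call time.
# T5-style relative position embeddings don't impose a hard 512-token cap, but memory
# use and quality on inputs longer than the pre-training length are a separate question.
tokenizer.model_max_length = 2048
encoded = tokenizer("a very long document ...", truncation=True, max_length=2048)
print(len(encoded["input_ids"]))
```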