Difference between MT0 and BLOOMZ
My understanding is that mT0 is based on an encoder-decoder transformer similar to T5/BART, while BLOOMZ is a decoder-only model like GPT-3. Are there any performance/output differences to be aware of between these two models?
You kindly provide a 13B mt0-xxl model and a 7B BLOOMZ model. I tried comparing the outputs of these models on a variety of tasks and did not notice much difference between the two.
Yes that's correct.
There are output/performance differences between the models. You may want to take a look at the paper *Crosslingual Generalization through Multitask Finetuning*, where the graphs and tables compare the performance of mT0 and BLOOMZ models. Generally:
- mT0 models are stronger on most benchmarks if we compare models with roughly the same number of parameters.
- mT0 models are biased towards shorter answers; you can check the examples in the appendix of the paper for that if you want 😇
- Performance by language will vary. E.g. mT0 models are much stronger on many languages BLOOMZ has not extensively seen, such as Russian or Japanese.
In addition, some specs differ, such as numerical precision and certain architecture components.
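One practical consequence of the architectural difference: in the `transformers` library the two families load through different Auto classes (mT0 via `AutoModelForSeq2SeqLM`, BLOOMZ via `AutoModelForCausalLM`), and a decoder-only model's `generate()` output includes the prompt, while an encoder-decoder model returns only the answer. A minimal sketch (the actual `from_pretrained` calls are shown only in comments, since the checkpoints are tens of GB; the helper function and its name are illustrative, not part of any API):

```python
def auto_class_for(model_id: str) -> str:
    """Illustrative helper: pick the appropriate transformers Auto class.

    mT0 (mT5-based) is encoder-decoder -> AutoModelForSeq2SeqLM.
    BLOOMZ (BLOOM-based) is decoder-only -> AutoModelForCausalLM.
    """
    return "AutoModelForSeq2SeqLM" if "mt0" in model_id else "AutoModelForCausalLM"

# Actual loading would look like (not executed here due to model size):
# from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM, AutoTokenizer
# mt0    = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-xxl")
# bloomz = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1")
#
# Note: bloomz.generate(...) echoes the prompt tokens at the start of its
# output sequence; mt0.generate(...) returns only the generated answer.

print(auto_class_for("bigscience/mt0-xxl"))     # encoder-decoder family
print(auto_class_for("bigscience/bloomz-7b1"))  # decoder-only family
```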
Let us know if you notice any other differences!