Is distilling an already distilled model superior to fine-tuning?
I'm currently exploring various techniques to optimize and improve our model's performance, and one question arose in our discussions: If we take a model that has already undergone distillation and then distill it again, would this lead to better results compared to just fine-tuning the original model?
I understand that distillation is a method to transfer knowledge from a larger model (teacher) to a smaller model (student), aiming to retain the generalization capabilities of the larger model while benefiting from the efficiency of the smaller one. However, it's unclear to me how this would work when applied repeatedly, especially when compared to fine-tuning.
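For reference, by distillation I mean the classic soft-target setup, where the student is trained against the teacher's softened outputs alongside the hard labels. A rough PyTorch sketch of that objective (the temperature and mixing weight are just illustrative defaults, not tuned values):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation loss: blend a KL term that pulls the
    student's softened distribution toward the teacher's with an
    ordinary cross-entropy term on the hard labels."""
    # KL divergence between temperature-softened distributions;
    # the T**2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    # Usual supervised loss on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * soft + (1.0 - alpha) * hard
```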
Has anyone here tried this approach before? If so, could you share your findings? Are there any studies or papers that discuss this topic in detail?
Thank you in advance for your insights!
Repeated distillation is probably not a good approach to take, as each round can introduce hidden quality loss that isn't apparent right away. Repeated (progressive) distillation does look promising when viewed through the lens of faster sampling, though, as shown by this paper.
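Roughly, the idea there is that each round trains a student to match two of the teacher's sampling steps with a single step of its own, halving the number of steps. A conceptual sketch only (not the paper's exact parameterization, which works in a particular prediction space with SNR weighting, and `ddim_step` is a stand-in for whatever deterministic sampler update you use, not a real API):

```python
import torch
import torch.nn.functional as F

def progressive_distill_loss(student, teacher, ddim_step, x_t, t, t_mid, t_next):
    """One conceptual progressive-distillation update: the student is trained
    so that a single step from t to t_next matches two teacher steps
    (t -> t_mid -> t_next)."""
    with torch.no_grad():
        # Two consecutive deterministic teacher steps define the target.
        x_mid = ddim_step(teacher, x_t, t, t_mid)
        target = ddim_step(teacher, x_mid, t_mid, t_next)

    # The student tries to reach the same point in a single step.
    pred = ddim_step(student, x_t, t, t_next)

    # Plain MSE here; the paper's weighting is omitted for clarity.
    return F.mse_loss(pred, target)
```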
Also, distillation is much more expensive in terms of compute. Fine-tuning, or even LoRA training, can instill powerful concepts into text-to-image models at a fraction of the cost, so if your goal is to improve quality, those methods are probably more suitable.
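If you go the LoRA route, the core mechanic is just a trainable low-rank update added on top of a frozen weight, which is why it is so cheap. A bare-bones sketch (rank and scaling are illustrative; in practice you'd use a library such as peft or the LoRA support in diffusers rather than rolling your own):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x. Only A and B are trained."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)        # freeze the original weights

        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```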
Thank you for your response. Fine-tuning the distilled model is quite challenging, but rather than distilling it again, I'll try to come up with some workaround.