After the supervised fine-tuning (SFT) phase, we observed a notable degradation in the instruction-following capabilities of the LLaVA multimodal large language model (MM-LLM). To address this, we introduced a 6K-entry VQA preference dataset and applied Direct Preference Optimization (DPO), and we also tested other algorithms such as Rejection Sampling and SteerLM. Our method not only fully restored LLaVA's language instruction-following capabilities on MT-Bench but also outperformed LLaVA-RLHF and Vicuna. The gains also carried over to visual tasks, with significant performance improvements on MM-Vet and LLaVA-Bench. Interestingly, compared to models using distilled SFT, our method showed substantial out-of-distribution improvements.
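For readers unfamiliar with DPO, here is a minimal sketch of the standard per-example DPO loss in plain Python. It is not taken from our training code: the function name, the argument names, and the default `beta=0.1` are assumptions chosen for illustration, and each argument is assumed to be the summed log-probability of a full response under the policy or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) preference pair.

    All names and the beta default are illustrative assumptions.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-4.0, -9.0, -5.0, -7.0))
```

The loss falls as the policy raises the likelihood of the preferred answer relative to the reference model and lowers that of the rejected one, which is the signal a preference dataset like the one described above provides.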
https://arxiv.org/abs/2402.10884
Model available:
alexshengzhili/llava-v1.5-13b-dpo
GitHub:
https://github.com/findalexli/mllm-dpo