Why not use the Plackett-Luce Model version of DPO when K=4 ranked responses are present?
#18
by
MasterGodzilla
- opened
The original DPO paper includes a version of the objective that can handle multiple ranked responses.
Since you guys are ranking responses from 4 models using the UltraFeedback framework, the Plackett-Luce version would likely give the instruction tuning process more information at only about twice the computation cost.
Why did you guys decide against it and instead keep only "the highest scoring response as yw and a random lower scoring prompt as yl" from the four responses?
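To make the comparison concrete, here is a minimal sketch of the two objectives as I understand them from the DPO paper, written in plain Python on scalar sequence log-probabilities. The function names, the `beta=0.1` default, and the example numbers are my own placeholders and not anything taken from your training code.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def pairwise_dpo_loss(chosen_logp, rejected_logp,
                      ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard (Bradley-Terry) DPO loss for one chosen/rejected pair."""
    margin = beta * ((chosen_logp - ref_chosen_logp)
                     - (rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


def plackett_luce_dpo_loss(policy_logps, ref_logps, beta=0.1):
    """Plackett-Luce DPO loss for K responses ordered best to worst.

    policy_logps[k] and ref_logps[k] are the sequence log-probabilities
    of the k-th ranked response under the policy and the reference model.
    """
    # Implicit reward of each response: beta * log(pi / pi_ref)
    rewards = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    loss = 0.0
    for k in range(len(rewards)):
        # Log-probability that response k is picked first among
        # the responses ranked k, k+1, ..., K-1.
        loss -= rewards[k] - logsumexp(rewards[k:])
    return loss


# Hypothetical log-probabilities for 4 ranked responses (best first).
policy = [-12.0, -15.0, -14.0, -20.0]
reference = [-13.0, -14.5, -13.0, -18.0]

full_ranking_loss = plackett_luce_dpo_loss(policy, reference)
best_vs_worst_loss = pairwise_dpo_loss(policy[0], policy[3],
                                       reference[0], reference[3])
```

With K=2 the Plackett-Luce loss reduces to the usual pairwise loss, so the ranked version is a strict generalization. The difference with K=4 is that every response contributes a gradient term, whereas the pairwise setup only uses two of the four.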
MasterGodzilla
changed discussion title from
Why not use the Plackett-Luce Model version of DPO since K=4 ranked responses are present
to Why not use the Plackett-Luce Model version of DPO when K=4 ranked responses are present?