Why not use the Plackett-Luce Model version of DPO when K=4 ranked responses are present?

#18

by MasterGodzilla - opened Nov 3, 2023


The original DPO paper includes a version of the objective, based on the Plackett-Luce model, that can deal with multiple ranked responses.

Since you guys are ranking responses from 4 models using the UltraFeedback framework, the Plackett-Luce version would likely provide more information to the instruction tuning process at only twice the computation cost.

Why did you guys decide against it and instead save "the highest scoring response as yw and a random lower scoring prompt as yl" from the four responses?
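
To make the comparison concrete, here is a minimal sketch of the two objectives for a single prompt. It assumes the sequence log-probabilities under the policy and the reference model are already computed, uses plain Python instead of a training framework, and picks beta = 0.1 arbitrarily. The function names (`implicit_rewards`, `plackett_luce_dpo_loss`, `pairwise_dpo_loss`) are made up for illustration and do not come from any library or from the Zephyr training code.

```python
import math

def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    # DPO's implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    # beta=0.1 is an assumed value, not taken from the thread.
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))).
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def plackett_luce_dpo_loss(rewards_ranked):
    # rewards_ranked: implicit rewards ordered from best to worst response.
    # Negative log-likelihood of the full ranking under Plackett-Luce:
    # at each position k, the response at rank k must "win" against
    # every response ranked at or below it.
    loss = 0.0
    for k in range(len(rewards_ranked) - 1):  # the last position contributes 0
        loss -= rewards_ranked[k] - logsumexp(rewards_ranked[k:])
    return loss

def pairwise_dpo_loss(reward_w, reward_l):
    # Standard (Bradley-Terry) DPO loss: -log sigmoid(reward_w - reward_l).
    return logsumexp([0.0, -(reward_w - reward_l)])

# Hypothetical log-probs for K=4 responses, already sorted best to worst.
policy_logps = [-10.0, -12.0, -11.0, -15.0]
ref_logps = [-11.0, -12.0, -10.5, -13.0]
rewards = implicit_rewards(policy_logps, ref_logps)

pl_loss = plackett_luce_dpo_loss(rewards)               # uses all four responses
pair_loss = pairwise_dpo_loss(rewards[0], rewards[-1])  # uses only yw and one yl
```

With K=2 the Plackett-Luce loss reduces exactly to the pairwise loss. The extra cost with K=4 comes from needing log-probabilities for four responses per prompt instead of two, under both the policy and the reference model.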

MasterGodzilla changed discussion title from Why not use the Plackett-Luce Model version of DPO since K=4 ranked responses are present to Why not use the Plackett-Luce Model version of DPO when K=4 ranked responses are present? Nov 3, 2023
