singhsidhukuldeep posted an update Jun 9
Here is a thought: instead of telling LLMs what to do, show them! 🎭

Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. 🗣️🌍

DITTO from Stanford University shows that LLMs can be tuned with fewer than 10 samples! 🤯

What's DITTO? Demonstration ITerated Task Optimization (they definitely came up with the acronym first! 😂)

Here is the step-by-step implementation: 🛠️

Initialization: Start with a reference language model (LM), a set of expert demonstrations, a sample size, and a sampling frequency. 🏁
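
To make this concrete, here is a minimal sketch of how those inputs could be bundled together. All names and default values are mine (illustrative only), not from the paper:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DittoConfig:
    """Illustrative bundle of the inputs DITTO starts from (hypothetical names, not the paper's)."""
    ref_model_name: str = "gpt2"          # reference LM checkpoint (placeholder)
    demonstrations: List[dict] = field(default_factory=list)  # ~10 expert {"prompt": ..., "completion": ...} pairs
    num_samples: int = 4                   # completions sampled per demonstration prompt
    num_iterations: int = 5                # outer loops of sample -> compare -> update
    beta: float = 0.1                      # DPO temperature on the implicit reward
```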

Supervised Fine-Tuning (SFT): Begin by fine-tuning the reference LM on the set of expert demonstrations to create an initial policy P0. 🎚️
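
A rough sketch of that SFT step, assuming PyTorch + Hugging Face transformers (this is my own toy loop, not the authors' code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sft_on_demonstrations(model_name, demonstrations, epochs=3, lr=1e-5, device="cpu"):
    """Fine-tune the reference LM on the handful of expert demos to get the initial policy P0."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for demo in demonstrations:  # each demo is {"prompt": ..., "completion": ...}
            text = demo["prompt"] + demo["completion"] + (tok.eos_token or "")
            batch = tok(text, return_tensors="pt").to(device)
            # standard causal-LM loss, using the inputs themselves as labels
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model, tok
```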

Iterative Comparison Sampling: For each iteration t (sketched below):
- Sample multiple completions from the policy Pt for each demonstration prompt to create a new dataset Dt.
- Construct a batch of comparisons in which the demonstrations are ranked higher than all sampled model outputs from the current and previous iterations. 🔄
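
Here is what that sampling + comparison-building step could look like. Again, a hedged sketch with my own function names, assuming the transformers `generate` API:

```python
import torch

def sample_completions(model, tok, prompts, num_samples, max_new_tokens=128, device="cpu"):
    """Draw several completions from the current policy P_t for each demonstration prompt."""
    model.eval()
    outputs = {}
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt").to(device)
            gen = model.generate(
                **ids,
                do_sample=True,
                num_return_sequences=num_samples,
                max_new_tokens=max_new_tokens,
                pad_token_id=tok.eos_token_id,
            )
            prompt_len = ids["input_ids"].shape[1]
            outputs[p] = [tok.decode(g[prompt_len:], skip_special_tokens=True) for g in gen]
    return outputs

def build_comparisons(demonstrations, samples_by_iteration):
    """Rank each expert demo above every model sample drawn so far (current + earlier iterations)."""
    comparisons = []  # (prompt, chosen, rejected) triples to feed DPO
    for demo in demonstrations:
        for iteration_samples in samples_by_iteration:   # one dict of samples per past iteration
            for rejected in iteration_samples.get(demo["prompt"], []):
                comparisons.append((demo["prompt"], demo["completion"], rejected))
    return comparisons
```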

Policy Update:
- Update the policy Pt with a Direct Preference Optimization (DPO) step, which incorporates the feedback from the batch of comparisons (a minimal loss sketch follows below).
- Increment the iteration and repeat the sampling and updating process until convergence. ⏭️
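
For reference, a minimal sketch of the standard DPO loss used for that update, assuming the tokenization of prompt + completion splits cleanly at the prompt boundary (helper names are mine):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, tok, prompt, completion, device="cpu"):
    """Sum of token log-probs the model assigns to `completion` given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids.to(device)
    logits = model(full_ids).logits[:, :-1, :]          # position t predicts token t+1
    targets = full_ids[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logprobs[:, prompt_ids.shape[1] - 1:].sum()  # keep only the completion tokens

def dpo_loss(policy, ref, tok, prompt, chosen, rejected, beta=0.1, device="cpu"):
    """Standard DPO objective: push the policy's (chosen > rejected) margin above the reference's."""
    pi_c = sequence_logprob(policy, tok, prompt, chosen, device)
    pi_r = sequence_logprob(policy, tok, prompt, rejected, device)
    with torch.no_grad():
        ref_c = sequence_logprob(ref, tok, prompt, chosen, device)
        ref_r = sequence_logprob(ref, tok, prompt, rejected, device)
    return -F.logsigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
```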

Result: The final policy P after sufficient iterations aligns more closely with the expert demonstrations, effectively tuning the LM to reflect user-specific preferences and behaviors. 🎯
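
Putting it together, an end-to-end outer loop might look roughly like this (gluing the sketches above together; hypothetical, not the authors' released implementation):

```python
import copy
import torch

def ditto_loop(cfg, demonstrations, device="cpu"):
    """Sketch of the full DITTO loop: SFT, then repeated sample -> compare -> DPO update."""
    policy, tok = sft_on_demonstrations(cfg.ref_model_name, demonstrations, device=device)
    reference = copy.deepcopy(policy)                    # frozen SFT copy as the DPO reference
    for p in reference.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)
    prompts = [d["prompt"] for d in demonstrations]
    samples_by_iteration = []
    for t in range(cfg.num_iterations):
        samples_by_iteration.append(
            sample_completions(policy, tok, prompts, cfg.num_samples, device=device)
        )
        policy.train()
        for prompt, chosen, rejected in build_comparisons(demonstrations, samples_by_iteration):
            loss = dpo_loss(policy, reference, tok, prompt, chosen, rejected, beta=cfg.beta, device=device)
            loss.backward()
            opt.step()
            opt.zero_grad()
    return policy
```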

DITTO outperforms few-shot prompting. 🚀

Paper: Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (2406.00888) 📄