snorkelai/Snorkel-Mistral-PairRM-DPO

@@ -4,8 +4,6 @@ datasets:
 - snorkelai/Snorkel-Mistral-Self-Improvement
 ---
-Original post: [Snorkel link]
 ### Dataset:
 Training dataset: [snorkelai/Snorkel-Mistral-Self-Improvement](link)
@@ -23,16 +21,15 @@ We plan to release more detailed results and findings in the coming weeks on the
 ### Key Premises:
 - **Specialization Requirement**: For most enterprise use cases, using LLMs "off-the-shelf" falls short of production quality, necessitating additional fine-tuning and alignment.
 - **Ease of Model Building**: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
-- **Programmatic Alignment**: Using smaller but specialized teacher models (reward models) can incrementally align LLMs towards specific axes. We call this **Programmatic Alignment** - capturing domain knowledge in programmatic forms that can be used to guide LLM improvement.
 ### Applications:
 Unlike our customers, who have very specific use cases to align LLMs to,
-the AlpacaEval 2.0 leaderboard measures the ability of LLMS to follow general user instructions.
-Thus, for this demonstration, we use a general-purpose reward model - the performant [PairRM model](https://huggingface.co/llm-blender/PairRM).
 We use the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model as our base LLM.
-With this demonstration, we focus on the general approach of programmatic alignment.
 For interest in building your **specialized internal reward models
 that reflect your enterprises' needs**, please contact the Snorkel AI team or consider attending our
 [**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
@@ -43,17 +40,21 @@ On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
 - The base model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) scored **14.72**.
 After applying the above methodology:
 - This model scored **30.2** - ranked 3rd and the highest for an open-source base model at the time of publication.
-- When post-processing the model outputs with PairRM-best-of-16, which involved generating 16 responses and select the highest-scoring response by PairRM, we scored **34.86** - ranked 2nd.
 The best model on the leaderboard is "gpt-4-turbo", which is also the judge of optimal responses.
 We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the full range of capabilities and performances of LLMs.
 However, in our current work, where the goal is to align with general "human preferences," Alpaca-Eval 2.0 serves as a suitable and representative benchmark.
 Moving forward, we anticipate further contributions from the community regarding new alignment axes, and conduct evaluations using other appropriate benchmarks.
 ### Limitations:
 The model is a quick demonstration that the LLMs can be programmatically aligned using smaller specialized reward models.
 It does not have any moderation mechanisms.
-We look forward to continuing to engage with the research community and our customers exploring optimal methods for gettings models to respect guardrails,
 allowing for deployment in environments requiring moderated outputs.
 ### Contemporary Work and Acknowledgements:

 - snorkelai/Snorkel-Mistral-Self-Improvement
 ---
 ### Dataset:
 Training dataset: [snorkelai/Snorkel-Mistral-Self-Improvement](link)
 ### Key Premises:
 - **Specialization Requirement**: For most enterprise use cases, using LLMs "off-the-shelf" falls short of production quality, necessitating additional fine-tuning and alignment.
 - **Ease of Model Building**: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
+- **Programmatic Alignment**: Using smaller but specialized teacher models (reward models) can incrementally align LLMs towards specific axes.
 ### Applications:
 Unlike our customers, who have very specific use cases to align LLMs to,
+the AlpacaEval 2.0 leaderboard measures the ability of LLMs to follow user instructions.
+With this demonstration, we focus on the general approach to alignment.
+Thus, we use a general-purpose reward model - the performant [PairRM model](https://huggingface.co/llm-blender/PairRM).
 We use the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model as our base LLM.
 For interest in building your **specialized internal reward models
 that reflect your enterprises' needs**, please contact the Snorkel AI team or consider attending our
 [**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
 - The base model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) scored **14.72**.
 After applying the above methodology:
 - This model scored **30.2** - ranked 3rd and the highest for an open-source base model at the time of publication.
+- When post-processing the model outputs with PairRM-best-of-16, which involved generating 16 responses and selecting the highest-scoring response by PairRM, we scored **34.86** - ranked 2nd.
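The best-of-16 post-processing described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `best_of_n`, `make_toy_generator`, and `toy_score` are hypothetical names, and the toy generator and scorer stand in for the aligned LLM and for PairRM so that the sketch runs without model weights. In practice the scorer would presumably wrap a PairRM ranking call.

```python
import itertools
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, List[str]], List[float]],
              n: int = 16) -> str:
    """Sample n candidate responses and return the one the scorer rates highest."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = score(prompt, candidates)
    # max() returns the first index with the highest score, so ties go to the earliest candidate.
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]


def make_toy_generator() -> Callable[[str], str]:
    """Hypothetical stand-in for sampling from the LLM: returns numbered drafts."""
    counter = itertools.count()

    def generate(prompt: str) -> str:
        return f"{prompt} -> draft {next(counter)}"

    return generate


def toy_score(prompt: str, candidates: List[str]) -> List[float]:
    """Hypothetical stand-in for a reward model: scores each draft by (draft number mod 7)."""
    return [float(int(c.rsplit(" ", 1)[1]) % 7) for c in candidates]


if __name__ == "__main__":
    print(best_of_n("Hi", make_toy_generator(), toy_score, n=16))
```

Passing the generator and scorer as callables keeps the selection logic independent of any particular model or ranking library, so the same function can be reused with a different reward model.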
 The best model on the leaderboard is "gpt-4-turbo", which is also the judge of optimal responses.
 We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the full range of capabilities and performances of LLMs.
 However, in our current work, where the goal is to align with general "human preferences," Alpaca-Eval 2.0 serves as a suitable and representative benchmark.
 Moving forward, we anticipate further contributions from the community regarding new alignment axes, and conduct evaluations using other appropriate benchmarks.
+The Alpaca-Eval 2.0 evaluator, "gpt-4-turbo," exhibits a bias towards longer responses.
+This tendency might also be present in our chosen reward model, resulting in our model producing lengthier responses after DPO iterations.
+Future work could include measures to control response length and other relevant metrics.
 ### Limitations:
 The model is a quick demonstration that the LLMs can be programmatically aligned using smaller specialized reward models.
 It does not have any moderation mechanisms.
+We look forward to continuing to engage with the research community and our customers exploring optimal methods for getting models to respect guardrails,
 allowing for deployment in environments requiring moderated outputs.
 ### Contemporary Work and Acknowledgements: