---
license: apache-2.0
datasets:
- snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset
pipeline_tag: text-generation
model-index:
- name: Snorkel-Mistral-PairRM-DPO
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: AI2 Reasoning Challenge (25-Shot)
type: ai2_arc
config: ARC-Challenge
split: test
args:
num_few_shot: 25
metrics:
- type: acc_norm
value: 66.04
name: normalized accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: HellaSwag (10-Shot)
type: hellaswag
split: validation
args:
num_few_shot: 10
metrics:
- type: acc_norm
value: 85.64
name: normalized accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU (5-Shot)
type: cais/mmlu
config: all
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 60.83
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: TruthfulQA (0-shot)
type: truthful_qa
config: multiple_choice
split: validation
args:
num_few_shot: 0
metrics:
- type: mc2
value: 70.86
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: Winogrande (5-shot)
type: winogrande
config: winogrande_xl
split: validation
args:
num_few_shot: 5
metrics:
- type: acc
value: 77.74
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GSM8k (5-shot)
type: gsm8k
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 36.77
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
name: Open LLM Leaderboard
---
Read our release blog here: [Snorkel AI Blog](https://snorkel.ai/new-benchmark-results-demonstrate-value-of-snorkel-ai-approach-to-llm-alignment/)
You can try the model in the [Together AI playground](https://api.together.xyz/playground/chat/snorkelai/Snorkel-Mistral-PairRM-DPO).
This model is optimized for chat purposes. Have fun!
Our model is also available through the [Together AI API](https://www.together.ai/solutions#what-we-offer) with the following model API string: `snorkelai/Snorkel-Mistral-PairRM-DPO`.
Special thanks to the [Together AI](https://www.together.ai/) team for adding our model to their endpoints.
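For reference, a minimal sketch of calling the model through the Together AI API with the model string above. This assumes Together's OpenAI-compatible chat completions endpoint and an API key in a `TOGETHER_API_KEY` environment variable; the endpoint path and response shape are assumptions, so check Together's documentation:
```python
# Hedged sketch: assumes Together's OpenAI-compatible chat completions endpoint and a
# TOGETHER_API_KEY environment variable; not official Together documentation.
import os
import requests

response = requests.post(
    "https://api.together.xyz/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    json={
        "model": "snorkelai/Snorkel-Mistral-PairRM-DPO",
        "messages": [{"role": "user", "content": "Recommend me some Hollywood movies"}],
        "max_tokens": 512,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```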
We also provide an HF inference endpoint for everyone to test the model.
It may take a few minutes to activate initially, but it eventually operates at the standard speed of HF's text inference endpoints for 7B models.
Inference speed depends on HF endpoint performance and is unrelated to Snorkel offerings.
This endpoint is designed for initial trials, not for ongoing production use.
```python
import requests

API_URL = "https://t1q6ks6fusyg1qq7.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json"
}

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

# Prompts follow the Mistral instruction format: [INST] {prompt} [/INST]
output = query({
    "inputs": "[INST] Recommend me some Hollywood movies [/INST]",
    "parameters": {}
})
print(output)
```
### Dataset:
Training dataset: [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset)
We use ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized); **no external LLM responses are used**.
### Methodology:
1. Generate five response variations for each prompt in a subset of 20,000 prompts, using the current LLM; for the first iteration we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
2. Apply [PairRM](https://huggingface.co/llm-blender/PairRM) to rerank the responses.
3. Update the LLM by applying Direct Preference Optimization (DPO) on the top (chosen) and bottom (rejected) responses.
4. Use the resulting LLM as the base model for the next iteration, repeating three times in total (steps 1-3 are sketched below).
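A minimal sketch of one iteration (steps 1-3), not our exact training setup: it assumes the `llm-blender` package's `Blender.rank` API as documented on the PairRM model card, and the sampling parameters, prompt list, and helper names are illustrative; the final DPO update itself (e.g., via trl's `DPOTrainer`) is only indicated in a comment.
```python
# Hedged sketch of one alignment iteration; sampling parameters and helper names are
# illustrative assumptions, not the exact training configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
import llm_blender  # pip install llm-blender; provides the PairRM ranker

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # base LLM for the first iteration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

def generate_candidates(prompt, n=5):
    """Step 1: sample n response variations for a single prompt."""
    text = f"[INST] {prompt} [/INST]"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, do_sample=True, temperature=0.7, top_p=0.9,
        max_new_tokens=512, num_return_sequences=n,
    )
    completions = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(completions, skip_special_tokens=True)

prompts = ["Recommend me some Hollywood movies"]  # in practice: ~20,000 UltraFeedback prompts
preference_pairs = []
for prompt in prompts:
    candidates = generate_candidates(prompt)
    # Step 2: PairRM ranks the candidates (rank 1 = most preferred, per the PairRM card).
    ranks = list(blender.rank([prompt], [candidates])[0])
    chosen = candidates[ranks.index(min(ranks))]
    rejected = candidates[ranks.index(max(ranks))]
    # Step 3: these (chosen, rejected) pairs feed a DPO update, e.g. with trl's DPOTrainer.
    preference_pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
```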
This overview provides a high-level summary of our approach.
We plan to release more detailed results and findings in the coming weeks on the [Snorkel blog.](https://snorkel.ai/blog/)
The prompt format follows the Mistral model:
```[INST] {prompt} [/INST]```
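If you prefer not to build the string by hand, the tokenizer's chat template should produce the same format. This is a hedged example that assumes this repository ships Mistral's chat template, as its base model does:
```python
# Hedged example: assumes the tokenizer includes Mistral's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("snorkelai/Snorkel-Mistral-PairRM-DPO")
messages = [{"role": "user", "content": "Recommend me some Hollywood movies"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # expected to contain "[INST] Recommend me some Hollywood movies [/INST]"
```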
### Training recipe:
- The provided data is formatted to be compatible with Hugging Face's [Zephyr recipe](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta).
We executed the n-th DPO iteration using the "train/test_iteration_{n}" splits.
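As a rough illustration, the iteration splits can be loaded with the `datasets` library; the exact split names below are an assumption based on the naming above, so check the dataset card:
```python
# Hedged sketch: split names follow the "train/test_iteration_{n}" naming described above.
from datasets import load_dataset

n = 1  # DPO iteration number (1, 2, or 3)
train_split = load_dataset(
    "snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset",
    split=f"train_iteration_{n}",
)
print(train_split.column_names)  # expected DPO-style columns, e.g. prompt / chosen / rejected
```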
### Key Premises:
- **Specialization Requirement**: For most enterprise use cases, using LLMs "off-the-shelf" falls short of production quality, necessitating additional fine-tuning and alignment.
- **Ease of Model Building**: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
- **Alignment Recipe**: Using smaller but specialized teacher models (reward models) can incrementally align LLMs towards specific axes.
### Applications:
Unlike our customers, who have very specific use cases to align LLMs to,
the AlpacaEval 2.0 leaderboard measures the ability of LLMs to follow general user instructions.
With this demonstration, we focus on the general approach to alignment.
Thus, we use a general-purpose reward model - the performant [PairRM model](https://huggingface.co/llm-blender/PairRM).
We use the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model as our base LLM.
If you are interested in building **specialized internal reward models
that reflect your enterprise's needs**, please contact the Snorkel AI team or consider attending our
[**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
to learn more about "Programmatically scaling human preferences and alignment in GenAI".
### Results:
On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
- The base model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) scored **14.72**.
After applying the above methodology:
- This model scored **30.22** - ranked 3rd and the highest for an open-source base model at the time of publication.
- When post-processing the model outputs with PairRM best-of-16, i.e., generating 16 responses per prompt and selecting the one ranked highest by PairRM (a sketch of this selection step follows), we scored **34.86** - ranked 2nd.
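A minimal sketch of this best-of-n selection step, again assuming the `llm-blender` rank API from the PairRM model card; generation of the 16 candidates is omitted:
```python
# Hedged sketch of PairRM best-of-n selection; candidate generation is omitted.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

def best_of_n(prompt, candidates):
    """Return the candidate PairRM ranks first (rank 1 = most preferred)."""
    ranks = list(blender.rank([prompt], [candidates])[0])
    return candidates[ranks.index(min(ranks))]
```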
The best model on the leaderboard is "gpt-4-turbo", which is also the judge that scores the responses.
We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the full range of capabilities and performances of LLMs.
However, in our current work, where the goal is to align with general "human preferences," Alpaca-Eval 2.0 serves as a suitable and representative benchmark.
Moving forward, we anticipate further contributions from the community regarding new alignment axes, and we will conduct evaluations using other appropriate benchmarks.
The Alpaca-Eval 2.0 evaluator, "gpt-4-turbo", exhibits a bias towards longer responses.
This tendency may also be present in our chosen reward model, causing our model to produce lengthier responses after DPO iterations,
which may be among the factors contributing to our higher rank on the leaderboard.
Future work could include measures to control response length and other relevant metrics.
### Limitations:
The model is a quick demonstration that LLMs can be programmatically aligned using smaller, specialized reward models.
It does not have any moderation mechanisms.
We look forward to continuing to engage with the research community and our customers as we explore optimal methods for getting models to respect guardrails,
allowing for deployment in environments that require moderated outputs.
### Contemporary Work and Acknowledgements:
- The Mistral AI Team for developing and releasing the advanced Mistral-7B-Instruct-v0.2 model.
- The authors of the [Direct Preference Optimization paper](https://arxiv.org/abs/2305.18290) for the innovative approach.
- The authors of the [Pairwise Reward Model for LLMs paper](https://arxiv.org/abs/2306.02561) for the powerful general-purpose reward model.
- The Hugging Face team for the DPO implementation under [The Alignment Handbook](https://github.com/huggingface/alignment-handbook).
- We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan et al.), [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020),
which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but uses the LLM itself as the reward model.
While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.
- We would also like to acknowledge another concurrent work with a similar approach that focuses more on the theoretical aspects of the iterative DPO process: [Iterative Preference Learning from Human Feedback: Bridging Theory and
Practice for RLHF under KL-Constraint](https://arxiv.org/pdf/2312.11456.pdf), published on 2024-01-28 (Xiong et al.).
### GGUF version
GGUF versions of Snorkel-Mistral-PairRM-DPO are available from [andrew-cartwheel](https://huggingface.co/andrew-cartwheel/snorkel-mistral-pairRM-DPO-q8_0.gguf) or [brittlewis12](https://huggingface.co/brittlewis12/Snorkel-Mistral-PairRM-DPO-GGUF).
Thanks to these community members for providing the GGUF model versions.
### The Snorkel AI Team
Hoang Tran, Chris Glaze, Braden Hancock
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_snorkelai__Snorkel-Mistral-PairRM-DPO).
| Metric |Value|
|---------------------------------|----:|
|Avg. |66.31|
|AI2 Reasoning Challenge (25-Shot)|66.04|
|HellaSwag (10-Shot) |85.64|
|MMLU (5-Shot) |60.83|
|TruthfulQA (0-shot) |70.86|
|Winogrande (5-shot) |77.74|
|GSM8k (5-shot) |36.77|