---
license: apache-2.0
datasets:
- snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset
pipeline_tag: text-generation
model-index:
- name: Snorkel-Mistral-PairRM-DPO
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 66.04
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 85.64
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 60.83
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 70.86
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 77.74
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 36.77
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=snorkelai/Snorkel-Mistral-PairRM-DPO
      name: Open LLM Leaderboard
---

Read our release blog here: [Snorkel AI Blog](https://snorkel.ai/new-benchmark-results-demonstrate-value-of-snorkel-ai-approach-to-llm-alignment/)

You can try the model in the [Together AI playground](https://api.together.xyz/playground/chat/snorkelai/Snorkel-Mistral-PairRM-DPO).
This model is optimized for chat purposes. Have fun!


Our model is also available through [Together AI API](https://www.together.ai/solutions#what-we-offer) with the following model API string: `snorkelai/Snorkel-Mistral-PairRM-DPO`.
Special thanks to the [Together AI](https://www.together.ai/) team for adding our model to their endpoints.

We also provide an HF inference endpoint for everyone to test the model.
It may take a few minutes to activate initially, but it eventually operates at the standard speed of HF's 7B-model text inference endpoints.
Inference speed depends on HF endpoint performance and is unrelated to Snorkel offerings.
This endpoint is designed for initial trials, not for ongoing production use.

```
import requests

# Public HF inference endpoint for trying the model (see the note above).
API_URL = "https://t1q6ks6fusyg1qq7.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
}

def query(payload):
    # Send a generation request and return the decoded JSON response.
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()

# The model expects the Mistral instruction format: [INST] ... [/INST]
output = query({
    "inputs": "[INST] Recommend me some Hollywood movies [/INST]",
    "parameters": {},
})
print(output)
```

### Dataset:
Training dataset: [snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset](https://huggingface.co/datasets/snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset)

We use ONLY the prompts from [UltraFeedback](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized); **no external LLM responses are used**.

### Methodology:
  1. Generate five response variations for each prompt in a subset of 20,000 prompts, using the current LLM; for the first iteration, we used [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
  2. Apply [PairRM](https://huggingface.co/llm-blender/PairRM) for response reranking.
  3. Update the LLM by applying Direct Preference Optimization (DPO) on the top (chosen) and bottom (rejected) responses.
  4. Use this LLM as the base model for the next iteration, repeating three times in total.
 
This overview provides a high-level summary of our approach; a schematic sketch of the loop appears below.
We plan to release more detailed results and findings in the coming weeks on the [Snorkel blog](https://snorkel.ai/blog/).
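
What follows is a minimal sketch of this iterative loop, not our actual training code: `generate_responses`, `pairrm_rank`, and `dpo_train` are hypothetical helpers standing in for response sampling, PairRM reranking, and alignment-handbook DPO training, respectively.

```
# Schematic sketch of the iterative PairRM + DPO loop described above.
# generate_responses, pairrm_rank, and dpo_train are hypothetical helpers
# standing in for sampling, PairRM reranking, and DPO training.
model = "mistralai/Mistral-7B-Instruct-v0.2"  # base LLM for iteration 1

for iteration in range(3):  # three DPO iterations in total
    preference_pairs = []
    for prompt in prompts:  # subset of 20,000 UltraFeedback prompts
        candidates = generate_responses(model, prompt, n=5)  # step 1
        ranked = pairrm_rank(prompt, candidates)             # step 2, best first
        preference_pairs.append({
            "prompt": prompt,
            "chosen": ranked[0],     # top-ranked response
            "rejected": ranked[-1],  # bottom-ranked response
        })
    model = dpo_train(model, preference_pairs)  # step 3; next iteration's base (step 4)
```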

The prompt format follows the Mistral model:

```[INST] {prompt} [/INST]```

### Training recipe:
- The provided data is formatted to be compatible with Hugging Face's [Zephyr recipe](https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta).
We executed the n-th DPO iteration using the `train/test_iteration_{n}` splits.
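
As a quick illustration, the splits for a given iteration can be loaded with the `datasets` library; the split names below assume the `train/test_iteration_{n}` pattern quoted above.

```
from datasets import load_dataset

# Load the preference pairs for DPO iteration 1; split names assume the
# train/test_iteration_{n} pattern described above.
ds = load_dataset("snorkelai/Snorkel-Mistral-PairRM-DPO-Dataset")
train_pairs = ds["train_iteration_1"]
test_pairs = ds["test_iteration_1"]
print(train_pairs)
```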

### Key Premises:
- **Specialization Requirement**: For most enterprise use cases, using LLMs "off-the-shelf" falls short of production quality, necessitating additional fine-tuning and alignment.
- **Ease of Model Building**: Creating ranking/scoring/classification models is simpler than developing high-quality, manually annotated datasets for long-form responses.
- **Alignment Recipe**: Using smaller but specialized teacher models (reward models) can incrementally align LLMs towards specific axes.

### Applications:
Unlike our customers, who have very specific use cases to align LLMs to,
the AlpacaEval 2.0 leaderboard measures the ability of LLMs to follow user instructions.
With this demonstration, we focus on the general approach to alignment.
Thus, we use a general-purpose reward model: the performant [PairRM model](https://huggingface.co/llm-blender/PairRM).
We use the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model as our base LLM.
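
For reference, here is how candidate responses can be scored with PairRM through the `llm-blender` package. This is a sketch based on the usage documented in the PairRM model card, not part of our training code; treat the exact API as an assumption.

```
import llm_blender

# Load the general-purpose PairRM reward model.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

prompts = ["Recommend me some Hollywood movies"]
candidates = [[
    "Sure! Here are five classics: The Godfather, Casablanca, ...",
    "I don't watch movies.",
]]
# ranks[i][j] is the rank of candidate j for prompt i; rank 1 = most preferred.
ranks = blender.rank(prompts, candidates, return_scores=False, batch_size=1)
print(ranks)
```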

If you are interested in building **specialized internal reward models
that reflect your enterprise's needs**, please contact the Snorkel AI team or consider attending our
[**Enterprise LLM Summit: Building GenAI with Your Data on January 25, 2024**](https://snorkel.ai/event/enterprise-llm-summit/)
to learn more about "Programmatically scaling human preferences and alignment in GenAI".

### Result:
On [**Alpaca-Eval 2.0**](https://tatsu-lab.github.io/alpaca_eval/):
- The base model: [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) scored **14.72**.

After applying the above methodology:
- This model scored **30.22**, ranking 3rd overall and highest among open-source base models at the time of publication.
- When post-processing the model outputs with PairRM best-of-16 (generating 16 responses and selecting the highest-scoring one according to PairRM), we scored **34.86**, ranking 2nd.
The best model on the leaderboard is "gpt-4-turbo", which also serves as the judge of optimal responses.
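
A minimal sketch of this best-of-n procedure, under the same assumptions as the PairRM snippet above (`generate` is a hypothetical sampling function):

```
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")

def best_of_n(prompt, generate, n=16):
    # Sample n candidate responses and keep the one PairRM ranks best.
    candidates = [generate(prompt) for _ in range(n)]
    ranks = blender.rank([prompt], [candidates], batch_size=1)[0]
    best_idx = min(range(n), key=lambda i: ranks[i])  # rank 1 = preferred
    return candidates[best_idx]
```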

We recognize that the Alpaca-Eval 2.0 benchmark does not entirely capture the full range of capabilities and performances of LLMs.
However, in our current work, where the goal is to align with general "human preferences," Alpaca-Eval 2.0 serves as a suitable and representative benchmark.
Moving forward, we anticipate further contributions from the community regarding new alignment axes, and we will conduct evaluations using other appropriate benchmarks.

The Alpaca-Eval 2.0 evaluator, "gpt-4-turbo," exhibits a bias towards longer responses.
This tendency may also be present in our chosen reward model, leading our model to produce lengthier responses after DPO iterations,
which may be among the factors contributing to our higher ranks on the leaderboard.
Future work could include measures to control response length and other relevant metrics.

### Limitations:
The model is a quick demonstration that LLMs can be programmatically aligned using smaller, specialized reward models.
It does not have any moderation mechanisms.
We look forward to continuing to engage with the research community and our customers to explore optimal methods for getting models to respect guardrails,
allowing for deployment in environments requiring moderated outputs.

### Contemporary Work and Acknowledgements:
- The Mistral AI team for developing and releasing the advanced Mistral-7B-Instruct-v0.2 model.
- The authors of the [Direct Preference Optimization paper](https://arxiv.org/abs/2305.18290) for the innovative approach.
- The authors of the [Pairwise Reward Model for LLMs paper](https://arxiv.org/abs/2306.02561) for the powerful general-purpose reward model.
- The HuggingFace team for the DPO implementation in [The Alignment Handbook](https://github.com/huggingface/alignment-handbook).
- We would also like to acknowledge contemporary work published independently on arXiv on 2024-01-18 by Meta & NYU (Yuan et al.), [Self-Rewarding Language Models](https://arxiv.org/abs/2401.10020),
which proposes a similar general approach for creating alignment pairs from a larger set of candidate responses, but uses the LLM itself as the reward model.
While this may work for general-purpose models, our experience has shown that task-specific reward models guided by SMEs are necessary for most
enterprise applications of LLMs for specific use cases, which is why we focus on the use of external reward models.
- We would also like to acknowledge concurrent work that takes a similar approach but focuses more on the theoretical aspects of the iterative DPO process: [Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint](https://arxiv.org/pdf/2312.11456.pdf), posted on 2024-01-28 (Xiong et al.).

### GGUF version
GGUF versions of Snorkel-Mistral-PairRM-DPO are available from [andrew-cartwheel](https://huggingface.co/andrew-cartwheel/snorkel-mistral-pairRM-DPO-q8_0.gguf) or [brittlewis12](https://huggingface.co/brittlewis12/Snorkel-Mistral-PairRM-DPO-GGUF).
Thanks to these community members for providing the GGUF model versions.

### The Snorkel AI Team
Hoang Tran, Chris Glaze, Braden Hancock
### [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_snorkelai__Snorkel-Mistral-PairRM-DPO).

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |66.31|
|AI2 Reasoning Challenge (25-Shot)|66.04|
|HellaSwag (10-Shot)              |85.64|
|MMLU (5-Shot)                    |60.83|
|TruthfulQA (0-shot)              |70.86|
|Winogrande (5-shot)              |77.74|
|GSM8k (5-shot)                   |36.77|