lewtun committed
Commit b0b6916
1 Parent(s): 3a027ab
Files changed (1)
  app/src/index.html +3 -4
app/src/index.html CHANGED
@@ -110,9 +110,7 @@
  <li><strong>Process reward model: </strong>To guide our search strategies, we used <code>RLHFlow/Llama3.1-8B-PRM-Deepseek-Data</code>, an 8B reward model that has been trained using <em>process supervision</em>. Process supervision is a training approach where models receive feedback on each step of their reasoning process, not just the final outcome. We picked this model since it belongs to the same model family as our policy and gave better results than other PRMs we tested in this weight class, such as <a href="https://huggingface.co/peiyi9979/math-shepherd-mistral-7b-prm">Math-Shepherd</a>.</li>
  <li><strong>Dataset: </strong>We evaluated on the <a href="https://huggingface.co/datasets/HuggingFaceH4/MATH-500">MATH-500 subset</a> of the <a href="https://huggingface.co/papers/2103.03874">MATH benchmark</a>, a dataset released by OpenAI as part of their <a href="https://huggingface.co/papers/2305.20050">research</a> on process supervision. These math problems span seven subjects and are challenging for both humans and most LLMs. Take a look at the dataset viewer below to get a taste of the problem difficulty!</li>
 
-
- <iframe src="https://huggingface.co/datasets/HuggingFaceH4/MATH-500/embed/viewer/default/test" frameborder="0"
- width="100%" height="560px"></iframe>
+ <iframe src="https://huggingface.co/datasets/HuggingFaceH4/MATH-500/embed/viewer/default/test" frameborder="0" width="100%" height="560px"></iframe>
 
  <p>We tested each search strategy across compute budgets ranging from 1 to 256 generations per prompt and ran the data-generation pipeline with five random seeds to estimate variance across runs. You can find the models and datasets from our analysis in this <a href="https://huggingface.co/collections/HuggingFaceH4/scaling-test-time-compute-with-open-models-675c3b475a0d6eb4528fec23">Hugging Face collection</a>.</p>
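To make the process-supervision idea above concrete, here is a minimal sketch of how per-step PRM scores can be reduced to a single solution-level reward that a search strategy can rank on. The `score_steps` stub and the product reduction are illustrative assumptions, not the actual interface of `RLHFlow/Llama3.1-8B-PRM-Deepseek-Data`.

```python
from typing import List

def score_steps(problem: str, steps: List[str]) -> List[float]:
    """Stand-in for a process reward model (PRM).

    A real PRM such as RLHFlow/Llama3.1-8B-PRM-Deepseek-Data returns one score
    per reasoning step; this stub fakes plausible values so the sketch runs
    without loading an 8B model.
    """
    return [max(0.1, 0.9 - 0.1 * i) for i, _ in enumerate(steps)]

def solution_reward(problem: str, solution: str) -> float:
    """Reduce per-step PRM scores to one number a search strategy can rank on.

    Multiplying the step scores is one common reduction; taking only the final
    step's score is another. The choice here is an assumption for illustration.
    """
    steps = [s for s in solution.split("\n\n") if s.strip()]
    reward = 1.0
    for score in score_steps(problem, steps):
        reward *= score
    return reward

# Toy usage: a two-step solution gets the product of its step scores.
print(solution_reward("What is 1+1?", "Add the numbers.\n\nThe answer is 2."))
```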
@@ -179,6 +177,7 @@ def get_canonical_form(expression: str) -&gt; str:
  <li><strong>Weighted Best-of-N:</strong> Aggregate scores across all identical responses and select the answer with the <em>highest total reward</em>. This approach prioritises high-quality answers by boosting their scores through repeated occurrences. Mathematically, the weighting across answers \(a_i\) is performed as follows:
 
  $$ a_\mathrm{weighted} = \arg\max_{a} \sum_{i=1}^{N} \mathbb{I}(a_i = a) \cdot \mathrm{RM}(p, s_i) \,,$$
+
  where \(\mathrm{RM}(p, s_i)\) is the reward model score of the \(i\)-th solution \(s_i\) to problem \(p\).</li>
  </ul>
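Read as code, the formula above groups the \(N\) sampled solutions by their (canonicalised) final answer, sums the reward-model scores within each group, and returns the answer with the largest total. A minimal sketch, assuming the candidates have already been reduced upstream to (answer, reward) pairs:

```python
from collections import defaultdict
from typing import List, Tuple

def weighted_best_of_n(candidates: List[Tuple[str, float]]) -> str:
    """Pick the answer with the highest total reward across identical answers.

    `candidates` holds (answer, reward) pairs, where each answer is assumed to
    be in canonical form (e.g. via get_canonical_form) and each reward is the
    reward-model score RM(p, s_i) of the full solution s_i.
    """
    totals = defaultdict(float)
    for answer, reward in candidates:
        totals[answer] += reward  # accumulates sum_i I(a_i = a) * RM(p, s_i)
    return max(totals, key=totals.get)

# "42" wins on total reward (1.1) even though the single best-scoring
# solution (0.9) answers "41" -- this is the weighting at work.
assert weighted_best_of_n([("41", 0.9), ("42", 0.6), ("42", 0.5)]) == "42"
```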
@@ -238,7 +237,7 @@ def get_canonical_form(expression: str) -&gt; str:
 
  <ul>
  <li>Majority voting is the worst performer for all compute budgets, except for \(N=256\), where beam search is worst.</li>
- <li style="list-style-type:disc">Beam search is best for \(N=[4,16,64]\), but Best-of-N is best for \(N=256\).</li>
+ <li>Beam search is best for \(N=[4,16,64]\), but Best-of-N is best for \(N=256\).</li>
  </ul>
 
  <p id="15a1384e-bcac-80d4-af98-eaebf5fcf84e" class="">Although we see that beam search gives consistent gains on the medium and hard problems (levels 3-5), it tends to do worse than Best-of-N (and even majority voting!) on the simpler problems, especially at large compute budgets.</p><p id="15a1384e-bcac-805b-9949-f0cdc44c9e3c" class="">We realized from looking at the resulting trees produced by beam search that if a single step is assigned a high reward, the whole tree collapses to that trace and thus diversity is impacted. This prompted us to explore an extension to beam search that maximises diversity - let’s take a look!</p>
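The collapse described above is easy to reproduce in miniature: when vanilla beam selection keeps only the highest-scoring prefixes at every step, a single step with an outsized reward can become the shared prefix of every surviving beam. The toy sketch below only illustrates that failure mode; it is not the diversity-preserving extension explored next, and the scores and expansion rule are made up for the example.

```python
from typing import Dict, List

def select_beams(scored_prefixes: Dict[str, float], beam_width: int) -> List[str]:
    """Vanilla beam selection: keep the `beam_width` highest-scoring prefixes."""
    return sorted(scored_prefixes, key=scored_prefixes.get, reverse=True)[:beam_width]

# One first step receives a much higher PRM score than its siblings...
first_steps = {"step A": 0.95, "step B": 0.40, "step C": 0.35, "step D": 0.30}

# ...and each prefix expands into four continuations that inherit its score.
continuations = {
    f"{prefix} -> continuation {i}": score * (0.90 + 0.02 * i)
    for prefix, score in first_steps.items()
    for i in range(4)
}

# With beam_width=4, every surviving trace descends from "step A": the tree
# has collapsed onto a single first step and diversity is lost.
print(select_beams(continuations, beam_width=4))
```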
 