lewtun committed
Commit b0b6916
1 Parent(s): 3a027ab
Files changed (1)
  app/src/index.html +3 -4
app/src/index.html CHANGED
@@ -110,9 +110,7 @@
  <li><strong>Process reward model: </strong>To guide our search strategies, we used <code>RLHFlow/Llama3.1-8B-PRM-Deepseek-Data</code>, an 8B reward model that has been trained using <em>process supervision</em>. Process supervision is a training approach where models receive feedback on each step of their reasoning process, not just the final outcome. We picked this model since it belongs to the same model family as our policy and gave better results than other PRMs we tested in this weight class, such as <a href="https://huggingface.co/peiyi9979/math-shepherd-mistral-7b-prm">Math-Shepherd</a>.</li>
  <li><strong>Dataset: </strong>We evaluated on the <a href="https://huggingface.co/datasets/HuggingFaceH4/MATH-500">MATH-500 subset</a> of the <a href="https://huggingface.co/papers/2103.03874">MATH benchmark</a>, a dataset released by OpenAI as part of their <a href="https://huggingface.co/papers/2305.20050">research</a> on process supervision. These math problems span seven subjects and are challenging for both humans and most LLMs. Take a look at the dataset viewer below to get a taste of the problem difficulty!</li>
 
-
- <iframe src="https://huggingface.co/datasets/HuggingFaceH4/MATH-500/embed/viewer/default/test" frameborder="0"
- width="100%" height="560px"></iframe>
+ <iframe src="https://huggingface.co/datasets/HuggingFaceH4/MATH-500/embed/viewer/default/test" frameborder="0" width="100%" height="560px"></iframe>
 
  <p>We tested each search strategy across compute budgets ranging from 1 to 256 generations per prompt and ran the data-generation pipeline with five random seeds to estimate variance across runs. You can find the models and datasets from our analysis in this <a href="https://huggingface.co/collections/HuggingFaceH4/scaling-test-time-compute-with-open-models-675c3b475a0d6eb4528fec23">Hugging Face collection</a>.</p>
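To make the process-supervision idea above concrete, here is a minimal sketch of how per-step PRM scores can be reduced to a single solution-level reward that a search strategy can rank on. The `score_steps` stub and the product reduction are illustrative assumptions, not the actual interface of `RLHFlow/Llama3.1-8B-PRM-Deepseek-Data`.

```python
from typing import List

def score_steps(problem: str, steps: List[str]) -> List[float]:
    """Stand-in for a process reward model (PRM).

    A real PRM such as RLHFlow/Llama3.1-8B-PRM-Deepseek-Data returns one score
    per reasoning step; this stub fakes plausible values so the sketch runs
    without loading an 8B model.
    """
    return [max(0.1, 0.9 - 0.1 * i) for i, _ in enumerate(steps)]

def solution_reward(problem: str, solution: str) -> float:
    """Reduce per-step PRM scores to one number a search strategy can rank on.

    Multiplying the step scores is one common reduction; taking only the final
    step's score is another. The choice here is an assumption for illustration.
    """
    steps = [s for s in solution.split("\n\n") if s.strip()]
    reward = 1.0
    for score in score_steps(problem, steps):
        reward *= score
    return reward

# Toy usage: a two-step solution gets the product of its step scores.
print(solution_reward("What is 1+1?", "Add the numbers.\n\nThe answer is 2."))
```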
@@ -179,6 +177,7 @@ def get_canonical_form(expression: str) -&gt; str:
  <li><strong>Weighted Best-of-N:</strong> Aggregate scores across all identical responses and select the answer with the <em>highest total reward</em>. This approach prioritises high-quality answers by boosting their scores through repeated occurrences. Mathematically, the weighting across answers \(a_i\) is performed as follows:
 
  $$ a_\mathrm{weighted} = \arg\max_{a} \sum_{i=1}^{N} \mathbb{I}(a_i = a) \cdot \mathrm{RM}(p, s_i) \,,$$
+
  where \(\mathrm{RM}(p, s_i)\) is the reward model score of the \(i\)-th solution \(s_i\) to problem \(p\).</li>
  </ul>
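Read as code, the formula above groups the \(N\) sampled solutions by their (canonicalised) final answer, sums the reward-model scores within each group, and returns the answer with the largest total. A minimal sketch, assuming the candidates have already been reduced upstream to (answer, reward) pairs:

```python
from collections import defaultdict
from typing import List, Tuple

def weighted_best_of_n(candidates: List[Tuple[str, float]]) -> str:
    """Pick the answer with the highest total reward across identical answers.

    `candidates` holds (answer, reward) pairs, where each answer is assumed to
    be in canonical form (e.g. via get_canonical_form) and each reward is the
    reward-model score RM(p, s_i) of the full solution s_i.
    """
    totals = defaultdict(float)
    for answer, reward in candidates:
        totals[answer] += reward  # accumulates sum_i I(a_i = a) * RM(p, s_i)
    return max(totals, key=totals.get)

# "42" wins on total reward (1.1) even though the single best-scoring
# solution (0.9) answers "41" -- this is the weighting at work.
assert weighted_best_of_n([("41", 0.9), ("42", 0.6), ("42", 0.5)]) == "42"
```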
@@ -238,7 +237,7 @@ def get_canonical_form(expression: str) -&gt; str:
 
  <ul>
  <li>Majority voting is the worst performer for all compute budgets, except for \(N=256\), where beam search is worst.</li>
- <li style="list-style-type:disc">Beam search is best for \(N=[4,16,64]\), but Best-of-N is best for \(N=256\).</li>
+ <li>Beam search is best for \(N=[4,16,64]\), but Best-of-N is best for \(N=256\).</li>
  </ul>
 
  <p id="15a1384e-bcac-80d4-af98-eaebf5fcf84e" class="">Although we see that beam search gives consistent gains on the medium and hard problems (levels 3-5), it tends to do worse than Best-of-N (and even majority voting!) on the simpler problems, especially at large compute budgets.</p><p id="15a1384e-bcac-805b-9949-f0cdc44c9e3c" class="">We realized from looking at the resulting trees produced by beam search that if a single step is assigned a high reward, the whole tree collapses to that trace and thus diversity is impacted. This prompted us to explore an extension to beam search that maximises diversity - let’s take a look!</p>
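The collapse described above is easy to reproduce in miniature: when vanilla beam selection keeps only the highest-scoring prefixes at every step, a single step with an outsized reward can become the shared prefix of every surviving beam. The toy sketch below only illustrates that failure mode; it is not the diversity-preserving extension explored next, and the scores and expansion rule are made up for the example.

```python
from typing import Dict, List

def select_beams(scored_prefixes: Dict[str, float], beam_width: int) -> List[str]:
    """Vanilla beam selection: keep the `beam_width` highest-scoring prefixes."""
    return sorted(scored_prefixes, key=scored_prefixes.get, reverse=True)[:beam_width]

# One first step receives a much higher PRM score than its siblings...
first_steps = {"step A": 0.95, "step B": 0.40, "step C": 0.35, "step D": 0.30}

# ...and each prefix expands into four continuations that inherit its score.
continuations = {
    f"{prefix} -> continuation {i}": score * (0.90 + 0.02 * i)
    for prefix, score in first_steps.items()
    for i in range(4)
}

# With beam_width=4, every surviving trace descends from "step A": the tree
# has collapsed onto a single first step and diversity is lost.
print(select_beams(continuations, beam_width=4))
```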
 