Fix numbered lists
app/src/index.html (+28 −6) CHANGED
@@ -97,9 +97,11 @@
 
 <p id="15a1384e-bcac-802a-b133-e279ff948bd0" class="">As illustrated in the diagram above, our experimental setup involves a pipeline with the following steps:</p>
 
-<ol
-
-
+<ol>
+<li>We begin by feeding a math problem to an LLM, which generates \(N\) <em><strong>partial solutions</strong></em>, e.g. an intermediate step in a derivation.</li>
+<li>Each step is scored by a PRM, which estimates the probability that each step will eventually reach the correct final answer. The steps and PRM scores are then used by the given search strategy to select which partial solutions should be explored further to generate the next round of intermediate steps.</li>
+<li>Once the search strategy terminates, the final candidate solutions are ranked by the PRM to produce the final answer.</li>
+</ol>
 
 <p id="15c1384e-bcac-80af-a31e-fd3330874674" class="">To compare various search strategies, we used the following open models and datasets:</p>
 
@@ -194,7 +196,7 @@ def get_canonical_form(expression: str) -> str:
 
 <h2 id="1591384e-bcac-8065-a02c-cd760ebd6cd1" class="">Beam search with process reward models</h2><p id="15a1384e-bcac-80e1-9e0e-c01f5f373805" class="">Beam search is a structured search method that systematically explores the solution space, making it a powerful tool for improving model outputs at test time. When combined with a PRM, beam search can optimize both the generation and evaluation of intermediate steps in problem-solving. It works as follows:</p>
 
-<ol
+<ol>
 <li>Generate multiple candidate solutions <em>iteratively</em> by maintaining a fixed number of "beams" or active paths \(N\).</li>
 <li>In the first iteration, sample \(N\) independent steps from the LLM with temperature \(T\) to introduce diversity in the responses. These steps are usually defined by a stopping criterion like terminating on a new line <code>\n</code> or double new line <code>\n\n</code>.</li>
 <li>Score each step with the PRM and select the top \(N/M\) steps as candidates for the next round of generation. Here \(M\) denotes the “beam width” of a given active path. As in Best-of-N, we used the “last” reduction to score the partial solutions at each iteration.</li>
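As a sketch of steps 1-3 above, a single beam-search iteration might look like the following. `sample_step` and `prm_score` are hypothetical placeholders for the LLM call and the PRM's "last"-reduction score, not the post's actual implementation.

```python
from typing import Callable, List

def beam_search_step(
    problem: str,
    beams: List[List[str]],                                # N active partial solutions
    sample_step: Callable[[str, List[str], float], str],   # hypothetical LLM call: next step at temperature T
    prm_score: Callable[[str, List[str]], float],          # hypothetical PRM "last" reduction score
    beam_width: int,                                        # M: expansions per selected candidate
    temperature: float = 0.8,
) -> List[List[str]]:
    """One iteration of PRM-guided beam search: expand, score, keep top N/M, re-branch."""
    n = len(beams)
    # expand each active path by one sampled step
    expanded = [trace + [sample_step(problem, trace, temperature)] for trace in beams]
    # score the partial solutions with the PRM and keep the top N/M
    ranked = sorted(expanded, key=lambda t: prm_score(problem, t), reverse=True)
    survivors = ranked[: max(1, n // beam_width)]
    # branch each survivor M times so the total number of active paths stays at N
    return [list(trace) for trace in survivors for _ in range(beam_width)]
```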
@@ -241,7 +243,16 @@
 
 <p id="15a1384e-bcac-80d4-af98-eaebf5fcf84e" class="">Although beam search gives consistent gains on the medium and hard problems (levels 3-5), it tends to do worse than Best-of-N (and even majority voting!) on the simpler problems, especially at large compute budgets.</p><p id="15a1384e-bcac-805b-9949-f0cdc44c9e3c" class="">Looking at the trees produced by beam search, we realized that if a single step is assigned a high reward, the whole tree collapses to that trace and diversity suffers. This prompted us to explore an extension to beam search that maximises diversity - let’s take a look!</p>
 
-<h2 id="1591384e-bcac-80d2-8234-fe0e9a4df59d" class="">DVTS: boosting performance with diversity</h2><p id="1591384e-bcac-8044-b7c5-cf39e4aed683" class="">As we saw above, beam search gives strong performance over Best-of-N, but tends to underperform on simpler problems and at large test-time compute budgets. To address this, we developed an extension we call Diverse Verifier Tree Search (DVTS) that is designed to maximise diversity at large \(N\).</p><p id="15a1384e-bcac-80ff-a97b-c7ccd88958e4" class="">DVTS works in a similar fashion to beam search, with the following modifications:</p
+<h2 id="1591384e-bcac-80d2-8234-fe0e9a4df59d" class="">DVTS: boosting performance with diversity</h2><p id="1591384e-bcac-8044-b7c5-cf39e4aed683" class="">As we saw above, beam search gives strong performance over Best-of-N, but tends to underperform on simpler problems and at large test-time compute budgets. To address this, we developed an extension we call Diverse Verifier Tree Search (DVTS) that is designed to maximise diversity at large \(N\).</p><p id="15a1384e-bcac-80ff-a97b-c7ccd88958e4" class="">DVTS works in a similar fashion to beam search, with the following modifications:</p>
+
+<ol>
+<li>For a given \(N\) and \(M\), expand the initial set of beams into \(N/M\) <em>independent</em> subtrees.</li>
+<li>For each subtree, select the step with the highest PRM score.</li>
+<li>Generate \(M\) new steps from the nodes selected in step (2) and select the step with the highest PRM score.</li>
+<li>Repeat step (3) until the EOS token or the maximum tree depth is reached.</li>
+</ol>
+
+<p id="15d1384e-bcac-8087-b916-d9603de035dd" class="">Here are the results from applying DVTS to Llama 1B:</p><figure id="15b1384e-bcac-801c-a1e7-d4e544826da3" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-all.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-all.png"/></a></figure><p id="15b1384e-bcac-80e1-bc9b-dbdb5738b9f1" class="">As we can see, DVTS provides a complementary strategy to beam search: at small \(N\) beam search is more effective at finding correct solutions, but at large \(N\) the diversity of DVTS candidates kicks in and we get better performance.</p><p id="15d1384e-bcac-80a7-8379-dca3c329c433" class="">We can also see this in the breakdown by problem difficulty, where DVTS enhances performance on the easy / medium problems at large \(N\), while beam search is best at small \(N\) across problem difficulties:</p><figure id="15b1384e-bcac-807a-8dca-f322077cc616" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/levels-all.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/levels-all.png"/></a></figure>
 
 <h2 id="1591384e-bcac-806b-9dd0-c80a250c7754" class="">The best of all worlds: compute-optimal scaling</h2><p id="1591384e-bcac-80e0-93e6-ceaacc131142" class="">Armed with various search strategies, a natural question is: which one is best? In the DeepMind paper, they proposed a <em><strong>compute-optimal scaling strategy</strong></em> where one selects the search method and hyperparameters \(\theta\) that achieve the <em><strong>best performance for a given compute budget</strong></em> \(N\):</p>
 
 $$\theta_{q,a^*(q)}^*(N) = \underset{\theta}{\arg\max} \left( \mathbb{E}_{y \sim \text{Target}(\theta, N, q)} \left[ \mathbb{1}_{y = y^*(q)} \right] \right),$$
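Read operationally, the selection rule above compares candidate configurations at a fixed budget and keeps the best one. Here is a minimal sketch of that selection, assuming a hypothetical `estimate_accuracy` helper that empirically estimates \(\mathbb{E}[\mathbb{1}_{y = y^*(q)}]\) for a configuration \(\theta\) at budget \(N\); the configuration names in the usage comment are placeholders, not the exact setup used in the post.

```python
from typing import Callable, Dict, List

def compute_optimal_config(
    budgets: List[int],
    configs: List[Dict],                              # candidate theta: search method + hyperparameters
    estimate_accuracy: Callable[[Dict, int], float],  # hypothetical: empirical accuracy of theta at budget N
) -> Dict[int, Dict]:
    """For each compute budget N, pick the configuration theta*(N) with the highest estimated accuracy."""
    best: Dict[int, Dict] = {}
    for n in budgets:
        best[n] = max(configs, key=lambda theta: estimate_accuracy(theta, n))
    return best

# Illustrative usage (hypothetical configuration names and accuracy-estimation stub):
# configs = [{"method": "best_of_n"}, {"method": "beam_search", "beam_width": 4}, {"method": "dvts", "beam_width": 4}]
# optimal = compute_optimal_config([4, 16, 64, 256], configs, estimate_accuracy)
```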
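Looping back to the DVTS procedure listed in the hunk above, here is a rough sketch of the subtree expansion it describes, again with hypothetical `sample_step`, `prm_score`, and `is_finished` helpers rather than the post's actual code.

```python
from typing import Callable, List

def dvts(
    problem: str,
    sample_step: Callable[[str, List[str]], str],   # hypothetical LLM call: propose the next step for a trace
    prm_score: Callable[[str, List[str]], float],   # hypothetical PRM score ("last" reduction) for a trace
    is_finished: Callable[[List[str]], bool],       # hypothetical: e.g. EOS token emitted
    n: int = 16,
    beam_width: int = 4,                            # M
    max_depth: int = 40,
) -> List[List[str]]:
    """Diverse Verifier Tree Search: N/M independent subtrees, each expanded greedily by PRM score."""
    num_subtrees = n // beam_width                   # step 1: split the budget into independent subtrees
    solutions: List[List[str]] = []
    for _ in range(num_subtrees):
        trace: List[str] = []
        for _ in range(max_depth):
            # steps 2-3: sample M candidate next steps and keep the one with the highest PRM score
            candidates = [trace + [sample_step(problem, trace)] for _ in range(beam_width)]
            trace = max(candidates, key=lambda t: prm_score(problem, t))
            if is_finished(trace):                   # step 4: stop on EOS or maximum depth
                break
        solutions.append(trace)
    return solutions
```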
<h2 id="1591384e-bcac-809a-96d2-e928398d159a" class="">Scaling up to larger models</h2><p id="15a1384e-bcac-8078-86d7-f48c2146444e" class="">We also explored scaling up the compute-optimal recipe to Llama 3.2 3B Instruct to see at what point the benefits of the PRM fade in comparison to the policy’s own capacity. To our surprise, compute-optimal scaling works remarkably well, with the 3B model surpassing the performance of Llama 3.1 70B Instruct (22x it's size!):</p><figure id="15b1384e-bcac-80b3-bc58-d20ba41d3950" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt-3b.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt-3b.png"/></a></figure>
|
262 |
|
263 |
+
<h2 id="15a1384e-bcac-809c-b5e7-eb92dadaebb4" class="">Where to go from here?</h2><p id="15b1384e-bcac-8052-91d7-d6e1f6f66e09" class="">This exploration of test-time compute scaling has revealed both the potential and the challenges of leveraging search-based methods. As we look ahead, several exciting directions emerge:</p>
|
264 |
+
|
265 |
+
<ol>
|
266 |
+
<li><strong>The Power of Strong Verifiers:</strong> Strong verifiers play a critical role in enhancing performance. However, their current limitations are apparent, as highlighted in benchmarks like <em>ProcessBench</em>. Improving the robustness and generalization of verifiers will be crucial for advancing these methods.</li>
|
267 |
+
<li><strong>The Challenge of Self-Verification:</strong> The ultimate goal—or "holy grail"—is achieving self-verification, where models can validate their own outputs autonomously. This approach appears to be what models like o1 are doing, but remains difficult to implement in practice. Unlike standard supervised fine-tuning (SFT), self-verification demands more nuanced strategies. The recent DeepMind paper on self-verification and <em>Score</em> sheds light on this challenge and offers a pathway for future research.</li>
|
268 |
+
<li><strong>Integrating “Thoughts” into the Process:</strong> Incorporating explicit intermediate steps or “thoughts” during generation could further enhance reasoning and decision-making. By integrating structured reasoning into the search process, we may unlock better performance on complex tasks.</li>
|
269 |
+
<li><strong>Search as a Data Generation Tool:</strong> This method can also serve as a powerful data generation process, creating high-quality training datasets. For example, fine-tuning models like Llama 1B on correct traces produced by search could yield significant gains. This on-policy approach resembles techniques like ReST or V-StaR but with the added benefits of search, offering a promising direction for iterative improvement.</li>
|
270 |
+
<li><strong>A Call for More PRMs:</strong> Open process reward models (PRMs) are relatively rare, limiting their broader application. Developing and sharing more PRMs for different domains is a critical area where the community can contribute significantly.</li>
|
271 |
+
<li><strong>Expanding Beyond Verifiable Domains:</strong> While current methods excel in domains like math and code, where solutions are inherently verifiable, extending these techniques to other areas remains a major challenge. How can we adapt these strategies for less structured or subjective tasks? This is a vital question for future exploration.</li>
|
272 |
+
</ol>
|
273 |
+
|
274 |
+
<h2 id="15b1384e-bcac-8093-82c6-d9c951dc0bab" class="">Acknowledgements</h2><p id="15c1384e-bcac-80d5-8ad4-d7f6874404c5" class="">We are grateful to Charlie Snell and Aviral Kumar for many discussions about test-time compute scaling and for sharing implementation details from their work. We thank Chun Te Lee for designing the lovely banner and Thomas Wolf, Leandro von Werra, Colin Raffel, and Quentin Gallouédec for many helpful suggestions to improve the blog post. We also thank Hugo Larcher and Mathieu Morlon for continually optimising the Hugging Face Science Cluster to make the GPUs go brrr 🔥!</p>
|
275 |
|
276 |
</d-article>
|
277 |
|