Fix lists
app/src/index.html (+24 -24)
@@ -56,9 +56,10 @@
 
 <p id="1591384e-bcac-80c1-82a7-cc9a41f9fdc4" class="">Over the past months we’ve been diving deep into trying to reverse engineer and reproduce several of these results and are finally happy to share some of our knowledge. More precisely, in this blog post we’ll cover:</p>
 
-<ul
-
-
+<ul>
+<li><strong>Compute-optimal scaling: </strong>How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test time.</li>
+<li><strong>Diverse Verifier Tree Search (DVTS):</strong> An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.</li>
+<li>🧭 <strong>Search and Learn: </strong>A lightweight toolkit for implementing search strategies with LLMs, built for speed with vLLM. You can check it out <a href="https://github.com/huggingface/search-and-learn">here</a>.</li>
 </ul>
 
 <p id="15a1384e-bcac-804b-9470-f64abc63b648" class="">So how well does compute-optimal scaling work in practice? Check out this plot, where the tiny 1B and 3B Llama Instruct models outperform their much larger 8B and 70B siblings on the challenging MATH-500 benchmark if you give them enough “time to think” 🤯:</p>
@@ -71,9 +72,10 @@
 
 <p id="1591384e-bcac-8021-a784-d3340af0adb4" class="">There are two main strategies for scaling test-time compute:</p>
 
-<ul
-
-
+<ul>
+<li><strong>Self-Refinement: </strong>Models iteratively refine their own outputs or “thoughts” by identifying and correcting errors in subsequent iterations. While effective on some tasks, this strategy usually requires models to have built-in mechanisms for self-refinement, which can limit its applicability.</li>
+<li><strong>Search Against a Verifier: </strong>This approach focuses on generating multiple candidate answers and using a verifier to select the best one. A verifier can be anything from a hard-coded heuristic to a learned reward model, but for the purposes of this blog post we will focus on learned verifiers. It includes techniques such as Best-of-N sampling<d-footnote>This is also known as rejection sampling.</d-footnote> and tree search. Search strategies are more flexible and can adapt to the difficulty of the problem, although their performance is constrained by the quality of the verifier.</li>
+</ul>
 
 <p id="1591384e-bcac-801d-82e3-d2dc50cf2b24" class="">In this blog post, we’ll concentrate on search-based methods as they represent a practical and scalable solution for test-time compute optimization. In particular, we’ll examine the three strategies illustrated below:</p>
 
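The “Search Against a Verifier” strategy added in this hunk boils down to a generate-then-score loop. Below is a minimal, hypothetical sketch of that pattern (plain Best-of-N selection), not the post's actual pipeline; `generate` and `score` are stand-ins for the policy model and the learned verifier.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # draws one candidate solution from the policy model
    score: Callable[[str, str], float],  # verifier: higher score means more likely correct
    n: int = 16,
) -> str:
    """Sample n candidate solutions and return the one the verifier rates highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [score(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```

With a process reward model as the verifier, `score` would typically collapse the step-level scores into a single number, for example by taking the last or the minimum step score.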
@@ -81,9 +83,11 @@
 <figcaption>Illustration of various search strategies used in test-time compute scaling. Figure adapted from the <a href="https://huggingface.co/papers/2408.03314">DeepMind paper.</a></figcaption>
 </figure>
 
-<ul
-
-
+<ul>
+<li><strong>Best-of-N: </strong>Generate multiple responses per problem and assign scores to each candidate answer, typically using a reward model. Then select the answer with the highest reward (or a weighted variant discussed later). This approach emphasizes answer quality over frequency.</li>
+<li><strong>Beam search: </strong>A systematic search method that explores the solution space, often combined with a <em><a href="https://huggingface.co/papers/2211.14275">process reward model (PRM)</a></em> to optimise both the sampling and evaluation of intermediate steps in problem-solving. Unlike conventional reward models that produce a single score on the final answer, PRMs provide a <em>sequence </em>of scores, one for each step of the reasoning process. This ability to provide fine-grained feedback makes PRMs a natural fit for search methods with LLMs.</li>
+<li><strong>Diverse verifier tree search (DVTS):</strong> An extension of beam search we developed that splits the initial beams into independent subtrees, which are then expanded greedily using a PRM.<d-footnote>DVTS is similar to <a href="https://huggingface.co/papers/1610.02424">diverse beam search (DBS)</a> with the main difference that beams share a common prefix in DBS and no sampling is used. DVTS is also similar to <a href="https://huggingface.co/papers/2306.09896">code repair trees</a>, although it is not restricted to code generation models and discrete verifiers.</d-footnote> This method improves solution diversity and overall performance, particularly with larger test-time compute budgets.</li>
+</ul>
 
 <p id="15a1384e-bcac-803c-bc89-ed15f18eafdc" class="">With an understanding of the key search strategies, let’s move on to how we evaluated them in practice.</p>
 
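The Best-of-N item above alludes to “a weighted variant discussed later”. One common way to realise it, sketched here as an assumption rather than the post's exact recipe, is to pool verifier scores across candidates that reach the same final answer, so an answer reached by several independent completions can beat a single high-scoring outlier. The `extract_answer` helper is hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def weighted_best_of_n(
    solutions: List[str],
    scores: List[float],
    extract_answer: Callable[[str], str],  # hypothetical helper: pulls the final answer out of a solution
) -> str:
    """Return the answer whose candidate solutions accumulate the highest total verifier score."""
    totals: Dict[str, float] = defaultdict(float)
    for solution, score in zip(solutions, scores):
        totals[extract_answer(solution)] += score
    return max(totals, key=totals.get)
```

Plain Best-of-N would instead return `extract_answer(solutions[scores.index(max(scores))])`, ignoring how often each answer occurs.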
@@ -98,21 +102,17 @@
 <ol type="1" id="15c1384e-bcac-803a-9772-ce27e91553c0" class="numbered-list" start="3"><li>Once the search strategy terminates, the final candidate solutions are ranked by the PRM to produce the final answer.</li></ol>
 
 <p id="15c1384e-bcac-80af-a31e-fd3330874674" class="">To compare various search strategies, we used the following open models and datasets:</p>
-
-<ul
-
-
-
-
-
-<iframe
-
-
-
-height="560px"
-></iframe>
-
-<p>We tested each search strategy across compute budgets ranging from 1 to 256 generations per prompt and ran the data-generation pipeline with five random seeds to estimate variance across runs. You can find the models and datasets from our analysis in this <a href="https://huggingface.co/collections/HuggingFaceH4/scaling-test-time-compute-with-open-models-675c3b475a0d6eb4528fec23">collection</a>.</p>
+
+<ul>
+<li><strong>Model:</strong> We used <code>meta-llama/Llama-3.2-1B-Instruct</code> as our primary model for scaling test-time compute. With 1B parameters, its lightweight nature enables fast iterations, and its unsaturated performance on math benchmarks makes it an ideal choice for highlighting the benefits of scaling.</li>
+<li><strong>Process reward model: </strong>To guide our search strategies, we used <code>RLHFlow/Llama3.1-8B-PRM-Deepseek-Data</code>, an 8B reward model that has been trained using <em>process supervision</em>. Process supervision is a training approach where models receive feedback on each step of their reasoning process, not just the final outcome. We picked this model since it belongs to the same model family as our policy and gave better results than other PRMs we tested in this weight class, such as <a href="https://huggingface.co/peiyi9979/math-shepherd-mistral-7b-prm">Math-Shepherd</a>.</li>
+<li><strong>Dataset: </strong>We evaluated on the <a href="https://huggingface.co/datasets/HuggingFaceH4/MATH-500">MATH-500 subset</a> of the <a href="https://huggingface.co/papers/2103.03874">MATH benchmark</a>, a dataset released by OpenAI as part of their <a href="https://huggingface.co/papers/2305.20050">research</a> on process supervision. These math problems span seven subjects and are challenging for both humans and most LLMs. Take a look at the dataset viewer below to get a taste of the problem difficulty!</li>
+</ul>
+
+<iframe src="https://huggingface.co/datasets/HuggingFaceH4/MATH-500/embed/viewer/default/test" frameborder="0"
+width="100%" height="560px"></iframe>
+
+<p>We tested each search strategy across compute budgets ranging from 1 to 256 generations per prompt and ran the data-generation pipeline with five random seeds to estimate variance across runs. You can find the models and datasets from our analysis in this <a href="https://huggingface.co/collections/HuggingFaceH4/scaling-test-time-compute-with-open-models-675c3b475a0d6eb4528fec23">Hugging Face collection</a>.</p>
 
 <p>To warm up, we’ll begin with a simple baseline and progressively incorporate additional techniques to improve performance.</p>
 
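To make the setup in the last hunk concrete, here is a rough sketch of the candidate-generation step using vLLM with the listed policy model and the MATH-500 test split. The sampling parameters are illustrative guesses, chat-template and system-prompt handling is omitted, and the PRM scoring step is only indicated in comments; none of this is taken from the search-and-learn code.

```python
from datasets import load_dataset
from vllm import LLM, SamplingParams

# MATH-500 test split, the evaluation set referenced above.
math500 = load_dataset("HuggingFaceH4/MATH-500", split="test")

# The 1B policy model used throughout the post.
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")

# One test-time compute budget: n completions per problem.
# The post sweeps budgets from 1 to 256 generations per prompt.
params = SamplingParams(n=16, temperature=0.8, max_tokens=2048)

outputs = llm.generate(math500["problem"], params)

# n candidate solutions per problem; these would then be scored with the PRM
# (RLHFlow/Llama3.1-8B-PRM-Deepseek-Data) and reduced to a final answer,
# e.g. with Best-of-N or one of the tree-search strategies above.
candidates = [[completion.text for completion in out.outputs] for out in outputs]
```

Varying n from 1 to 256 and repeating the run over five random seeds gives the compute-budget grid described in the paragraph above.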