lewtun HF staff commited on
Commit
62e9987
·
1 Parent(s): d7c48d3

Add footnotes

Browse files
Files changed (1) hide show
  1. app/src/index.html +26 -22
app/src/index.html CHANGED
@@ -115,41 +115,45 @@
115
 
116
  <iframe src="https://huggingface.co/datasets/HuggingFaceH4/MATH-500/embed/viewer/default/test" frameborder="0" width="100%" height="560px"></iframe>
117
 
 
 
118
  <p>We tested each search strategy across compute budgets ranging from 1 to 256 generations per prompt and ran the data-generation pipeline with five random seeds to estimate variance across runs. You can find the models and datasets from our analysis in this <a href="https://huggingface.co/collections/HuggingFaceH4/scaling-test-time-compute-with-open-models-675c3b475a0d6eb4528fec23">Hugging Face collection</a>.</p>
119
 
120
  <p>To warmup, we’ll begin with a simple baseline and progressively incorporate additional techniques to improve performance.</p>
121
 
122
  <!-- SECTION 2 -->
123
  <h2 id="1591384e-bcac-801a-9201-cd4f3b8dfe96" class="">Majority voting: a simple baseline</h2>
124
-
125
- <p>Majority voting—or <a href="https://huggingface.co/papers/2203.11171">self-consistency decoding</a> if you want to be fancy—is the most straightforward method<d-footnote>It’s also the most common sampling method used in the literature and is usually referred to as “maj@X” in tables and results.</d-footnote> to aggregate an LLM’s outputs. As the name suggests, for a given math problem we generate \(N\) candidate solutions and pick the most frequent answer. For all our experiments we sampled up to \(N=256\) candidates with temperature \(T=0.8\) and generated up to 2048 tokens per problem.<d-footnote>We found that sampling with \(T=1.0\) would cause the model to generate Chinese characters midway through a solution and hurt performance.</d-footnote></p>
126
 
127
  <p id="15c1384e-bcac-8086-a0e7-e0ca93b5ea94" class="">One quirk with the MATH benchmark is that answers must be formatted in a LaTeX box like <code>\boxed{answer}</code> . We initially tried the following simple system prompt for Llama 3.2 1B</p>
128
-
129
  <script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js" integrity="sha512-7Z9J3l1+EYfeaPKcGXu3MS/7T+w19WtKQY/n+xzmw4hZhJ9tyYmcUS+4QqAlzhicE5LAfMQSF3iFTK9bQdTxXg==" crossorigin="anonymous" referrerPolicy="no-referrer"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" integrity="sha512-tN7Ec6zAFaVSG3TpNAKtk4DOHNpSwKHxxrsiw4GHKESGPs5njn/0sMCUMl2svV4wo4BK/rCP7juYz+zx+l6oeQ==" crossorigin="anonymous" referrerPolicy="no-referrer"/><pre id="15c1384e-bcac-8042-9b96-ff3d615bb9f0" class="code"><code class="language-Python">Please think step by step and put your final answer within \boxed{}.</code></pre>
130
-
131
  <p id="15c1384e-bcac-80d0-bca6-ffed38482a37" class="">but found the resulting accuracy with greedy decoding (\(T=0\)) to be far worse than the 30.6% that Meta reported in their <a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">release</a>. Luckily, Meta also <a href="https://huggingface.co/datasets/meta-llama/Llama-3.2-1B-Instruct-evals/viewer/Llama-3.2-1B-Instruct-evals__math__details">published</a> the prompts they used for their evals and switching our system prompt to theirs made all the difference:</p>
132
 
 
 
133
  <script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js" integrity="sha512-7Z9J3l1+EYfeaPKcGXu3MS/7T+w19WtKQY/n+xzmw4hZhJ9tyYmcUS+4QqAlzhicE5LAfMQSF3iFTK9bQdTxXg==" crossorigin="anonymous" referrerPolicy="no-referrer"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" integrity="sha512-tN7Ec6zAFaVSG3TpNAKtk4DOHNpSwKHxxrsiw4GHKESGPs5njn/0sMCUMl2svV4wo4BK/rCP7juYz+zx+l6oeQ==" crossorigin="anonymous" referrerPolicy="no-referrer"/><pre id="15c1384e-bcac-8011-aee6-d7c433df8b5f" class="code"><code class="language-Python">Solve the following math problem efficiently and clearly:
134
 
135
  - For simple problems (2 steps or fewer):
136
  Provide a concise solution with minimal explanation.
137
-
138
  - For complex problems (3 steps or more):
139
  Use this step-by-step format:
140
-
141
  ## Step 1: [Concise description]
142
  [Brief explanation and calculations]
143
-
144
  ## Step 2: [Concise description]
145
  [Brief explanation and calculations]
146
-
147
  ...
148
-
149
  Regardless of the approach, always conclude with:
150
-
151
  Therefore, the final answer is: $\boxed{answer}$. I hope it is correct.
152
-
153
  Where [answer] is just the final number or expression that solves the problem.</code></pre>
154
 
155
  <p id="15c1384e-bcac-8076-9664-caa1b098c89c" class="">One subtlety with evaluating answers to math problems is that strings like \(1/\sqrt{3}\) and \(\sqrt{3}/3\) are distinct, but represent mathematically equivalent answers. The standard <a href="https://huggingface.co/papers/2206.14858">way</a> to handle this is to convert the a pair of answers to SymPy objects and then check whether subtracting the two objects and applying <code>sympy.simplify</code> gives zero. </p><p id="15c1384e-bcac-8043-a1d3-ffa0c5c3406e" class="">While this approach works well when comparing a small number of candidate answers, we found it was terribly slow when comparing many pairs in a list of \(N\) candidates; in some cases, slower than generating the candidates in the first place! To deal with this, we first reduced each answer to its <a href="https://en.wikipedia.org/wiki/Canonical_form">canonical form</a> and then computed the frequency of each form to determine the majority vote. Expand the detail below if you’re curious about how the code looks.</p>
@@ -182,7 +186,7 @@
182
  return canonical_to_original[canonical_form]</code></pre>
183
 
184
  <p id="15d1384e-bcac-804e-a99c-fe5e83313a3d" class="">This approach was significantly faster than checking each pair of solutions independently for equality.</p></div></details>
185
-
186
  <br><br>
187
 
188
  <p id="15b1384e-bcac-80f7-83e8-e1d6b360faa4" class="">Here’s how majority voting performs when applied to the generations from Llama 3.2 1B Instruct:</p><figure id="15b1384e-bcac-8072-9987-d80031b97793" class="image"><a href="Scaling%20test-time%20compute%20with%20open%20models%201531384ebcac800b9d73fca3503eb783/methods-maj.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-maj.png"/></a></figure><p id="15b1384e-bcac-8020-8688-fe1713e92c2b" class="">The results show that majority voting yields a significant improvement over the greedy decoding baseline, but its gains start to plateau after approximately \(N=64\) generations. This limitation arises because majority voting struggles with problems that require nuanced reasoning or tasks where errors are consistent across generations. If you’re also wondering why the majority voting accuracy is worse than the 0-shot CoT baseline for \(N=1\) and \(2\), that’s because we sample at \(T=0.8\), which makes it less likely we produce the correct answer among a handful of candidates.</p><p id="15b1384e-bcac-8075-8fef-f26f0b8e5559" class="">Building on the limitations of majority voting, let’s see how incorporating a reward model can enhance performance.</p>
@@ -214,7 +218,7 @@
214
 
215
  <!-- SECTION 4 -->
216
  <h2 id="1591384e-bcac-8065-a02c-cd760ebd6cd1" class="">Beam search with process reward models</h2>
217
-
218
  <p id="15a1384e-bcac-80e1-9e0e-c01f5f373805" class="">Beam search is a structured search method that systematically explores the solution space, making it a powerful tool for improving model outputs at test-time. When combined with a PRM, beam search can optimize both the generation and evaluation of intermediate steps in problem-solving. The way it works is as follows:</p>
219
 
220
  <ol>
@@ -228,19 +232,19 @@
228
  <p id="15a1384e-bcac-8003-a9d9-da7f3a4dc321" class="">By allowing the PRM to evaluate the correctness of intermediate steps, beam search can identify and prioritize promising paths early in the process. This step-by-step evaluation is particularly beneficial for complex reasoning tasks like mathematics, where verifying partial solutions can significantly improve final outcomes.</p>
229
 
230
  <details><summary style="font-weight:600;font-size:1.25em;line-height:1.3;margin:0">Implementation detail</summary><div class="indented">
231
- <p id="15b1384e-bcac-8065-a739-d24b699106be" class="">When we implemented beam search with process supervision, we encountered two major footguns with the Llama 3 chat template that are worth mentioning:</p>
232
 
233
  <ul>
234
  <li>By default, the chat template trims trailing new lines from every assistant turn. As a result, if one uses <code>\n</code> or <code>\n\n</code> to terminate a step, these tokens are lost on subsequent steps and force the model to produce peculiar outputs.</li>
235
- <li>The chat template is prefixed with Llama’s BOS token. When the formatted string is fed to vLLM a <em>second</em> BOS token is added which completely ruins performance, even though the generations look mostly coherent 🤯</li>
236
  </ul>
237
 
238
- <p>The solution is to overwrite the Llama 3 chat template to prevent trimming and exclude the BOS token prefix.</p>
239
  </div>
240
  </details>
241
  <br><br>
242
 
243
- <p id="15d1384e-bcac-80e9-8e65-e1b58080b94c" class="">In our experiments, we followed DeepMind’s hyperparameter choices and ran beam search with the following:</p>
244
 
245
  <ul>
246
  <li>\(N\) beams in compute scalings of 4, 16, 64, 256</li>
@@ -249,10 +253,10 @@
249
  <li style="list-style-type:disc">Up to 40 iterations, i.e. a tree of maximum depth with 40 steps.</li>
250
  </ul>
251
 
252
- <p id="15d1384e-bcac-8051-abe5-dc84c42a1b5f" class="">As shown below, the results are striking: with a test-time budget of \(N=4\), beam search achieves the same accuracy as Best-of-N for \(N=16\), i.e. it is 4x more compute efficient! Moreover, beam search matches the performance of Llama 3.1 8B with just \(N=32\) solutions per problem. The average performance on MATH by computer science PhD students is around 40%, so reaching nearly 55% isn’t too bad for a 1B model 💪!</p><figure id="15b1384e-bcac-80e9-97fa-fe50d1811f5b" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-maj-bon-beam.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-maj-bon-beam.png"/></a></figure>
253
-
254
  <h3 id="15a1384e-bcac-800c-baee-fb99b242ef87" class="">Which problems does beam search solve best?</h3>
255
-
256
  <p id="15d1384e-bcac-80e3-938a-c3f09db2e9ff" class="">Although in aggregate it is clear that beam search is a better search strategy than Best-of-N or majority voting, the DeepMind paper showed that <em><strong>each strategy has tradeoffs that depend on the problem difficulty</strong></em> and test-time compute budget. </p><p id="15d1384e-bcac-8015-a8f0-c2323b9e535f" class="">To see which problems are best suited for which strategy, DeepMind computed a distribution over estimated problem difficulty, and then binned the results into quintiles. In other words, each problem is assigned one of 5 levels, where level 1 indicates easier problems and level 5 indicates the hardest ones. To estimate problem difficulty, DeepMind generated 2048 candidate solutions with standard sampling per problem and then proposed the following heuristics:</p>
257
 
258
  <ul>
@@ -289,14 +293,14 @@
289
  $$\theta_{q,a^*(q)}^*(N) = \underset{\theta}{\arg\max} \left( \mathbb{E}_{y \sim \text{Target}(\theta, N, q)} \left[ \mathbb{1}_{y = y^*(q)} \right] \right),$$
290
 
291
  where \(y^*(q)\) is the ground-truth for question \(q\) and \(\theta_{q,a^*(q)}^*(N)\) denotes the compute-optimal scaling strategy. Since computing \(\theta_{q,a^*(q)}^*(N)\) directly is somewhat tricky, DeepMind proposed an approximation based on the <em><strong>problem difficulty</strong></em>, i.e. allocate test-time compute according to which search strategy achieves best performance for a given difficulty level.</p>
292
-
293
  <p id="15a1384e-bcac-80c9-a276-d5ea8974c543" class="">For example, on simpler problems and lower compute budgets, it is better to use strategies like Best-of-N, while on harder problems, beam search is the better choice. To implement this, for each method we compute the accuracy for a given difficulty level and test-time compute budget. And voila, we now have our compute-optimal curve!</p>
294
 
295
  <figure id="15b1384e-bcac-80b3-bc58-d20ba41d3950" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt.png"/></a></figure>
296
 
297
  <!-- SECTION 7 -->
298
  <h2 id="1591384e-bcac-809a-96d2-e928398d159a" class="">Scaling up to larger models</h2>
299
-
300
  <p id="15a1384e-bcac-8078-86d7-f48c2146444e" class="">We also explored scaling up the compute-optimal recipe to Llama 3.2 3B Instruct to see at what point the benefits of the PRM fade in comparison to the policy’s own capacity. To our surprise, compute-optimal scaling works remarkably well, with the 3B model surpassing the performance of Llama 3.1 70B Instruct (22x it's size!):</p><figure id="15b1384e-bcac-80b3-bc58-d20ba41d3950" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt-3b.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt-3b.png"/></a></figure>
301
 
302
  <h2 id="15a1384e-bcac-809c-b5e7-eb92dadaebb4" class="">Where to go from here?</h2><p id="15b1384e-bcac-8052-91d7-d6e1f6f66e09" class="">This exploration of test-time compute scaling has revealed both the potential and the challenges of leveraging search-based methods. As we look ahead, several exciting directions emerge:</p>
 
115
 
116
  <iframe src="https://huggingface.co/datasets/HuggingFaceH4/MATH-500/embed/viewer/default/test" frameborder="0" width="100%" height="560px"></iframe>
117
 
118
+ <aside>There are signs this benchmark is getting saturated by recent models like o1. This has prompted the creator of MATH (Dan Hendrycks) to start collecting a diverse set of extremely difficult problems as part of <a href="https://agi.safe.ai/submit">Humanity’s Last Exam.</a></aside>
119
+
120
  <p>We tested each search strategy across compute budgets ranging from 1 to 256 generations per prompt and ran the data-generation pipeline with five random seeds to estimate variance across runs. You can find the models and datasets from our analysis in this <a href="https://huggingface.co/collections/HuggingFaceH4/scaling-test-time-compute-with-open-models-675c3b475a0d6eb4528fec23">Hugging Face collection</a>.</p>
121
 
122
  <p>To warmup, we’ll begin with a simple baseline and progressively incorporate additional techniques to improve performance.</p>
123
 
124
  <!-- SECTION 2 -->
125
  <h2 id="1591384e-bcac-801a-9201-cd4f3b8dfe96" class="">Majority voting: a simple baseline</h2>
126
+
127
+ <p>Majority voting—or <a href="https://huggingface.co/papers/2203.11171">self-consistency decoding</a> if you want to be fancy—is the most straightforward method to aggregate an LLM’s outputs.<d-footnote>It’s also the most common sampling method used in the literature and is usually referred to as “maj@X” in tables and results.</d-footnote> As the name suggests, for a given math problem we generate \(N\) candidate solutions and pick the most frequent answer. For all our experiments we sampled up to \(N=256\) candidates with temperature \(T=0.8\) and generated up to 2048 tokens per problem.<d-footnote>We found that sampling with \(T=1.0\) would cause the model to generate Chinese characters midway through a solution and hurt performance.</d-footnote></p>
128
 
129
  <p id="15c1384e-bcac-8086-a0e7-e0ca93b5ea94" class="">One quirk with the MATH benchmark is that answers must be formatted in a LaTeX box like <code>\boxed{answer}</code> . We initially tried the following simple system prompt for Llama 3.2 1B</p>
130
+
131
  <script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js" integrity="sha512-7Z9J3l1+EYfeaPKcGXu3MS/7T+w19WtKQY/n+xzmw4hZhJ9tyYmcUS+4QqAlzhicE5LAfMQSF3iFTK9bQdTxXg==" crossorigin="anonymous" referrerPolicy="no-referrer"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" integrity="sha512-tN7Ec6zAFaVSG3TpNAKtk4DOHNpSwKHxxrsiw4GHKESGPs5njn/0sMCUMl2svV4wo4BK/rCP7juYz+zx+l6oeQ==" crossorigin="anonymous" referrerPolicy="no-referrer"/><pre id="15c1384e-bcac-8042-9b96-ff3d615bb9f0" class="code"><code class="language-Python">Please think step by step and put your final answer within \boxed{}.</code></pre>
132
+
133
  <p id="15c1384e-bcac-80d0-bca6-ffed38482a37" class="">but found the resulting accuracy with greedy decoding (\(T=0\)) to be far worse than the 30.6% that Meta reported in their <a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">release</a>. Luckily, Meta also <a href="https://huggingface.co/datasets/meta-llama/Llama-3.2-1B-Instruct-evals/viewer/Llama-3.2-1B-Instruct-evals__math__details">published</a> the prompts they used for their evals and switching our system prompt to theirs made all the difference:</p>
134
 
135
+ <aside>We wish more AI labs followed this practice - it helps enormously with reproducibility. Props to Meta!</aside>
136
+
137
  <script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js" integrity="sha512-7Z9J3l1+EYfeaPKcGXu3MS/7T+w19WtKQY/n+xzmw4hZhJ9tyYmcUS+4QqAlzhicE5LAfMQSF3iFTK9bQdTxXg==" crossorigin="anonymous" referrerPolicy="no-referrer"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" integrity="sha512-tN7Ec6zAFaVSG3TpNAKtk4DOHNpSwKHxxrsiw4GHKESGPs5njn/0sMCUMl2svV4wo4BK/rCP7juYz+zx+l6oeQ==" crossorigin="anonymous" referrerPolicy="no-referrer"/><pre id="15c1384e-bcac-8011-aee6-d7c433df8b5f" class="code"><code class="language-Python">Solve the following math problem efficiently and clearly:
138
 
139
  - For simple problems (2 steps or fewer):
140
  Provide a concise solution with minimal explanation.
141
+
142
  - For complex problems (3 steps or more):
143
  Use this step-by-step format:
144
+
145
  ## Step 1: [Concise description]
146
  [Brief explanation and calculations]
147
+
148
  ## Step 2: [Concise description]
149
  [Brief explanation and calculations]
150
+
151
  ...
152
+
153
  Regardless of the approach, always conclude with:
154
+
155
  Therefore, the final answer is: $\boxed{answer}$. I hope it is correct.
156
+
157
  Where [answer] is just the final number or expression that solves the problem.</code></pre>
158
 
159
  <p id="15c1384e-bcac-8076-9664-caa1b098c89c" class="">One subtlety with evaluating answers to math problems is that strings like \(1/\sqrt{3}\) and \(\sqrt{3}/3\) are distinct, but represent mathematically equivalent answers. The standard <a href="https://huggingface.co/papers/2206.14858">way</a> to handle this is to convert the a pair of answers to SymPy objects and then check whether subtracting the two objects and applying <code>sympy.simplify</code> gives zero. </p><p id="15c1384e-bcac-8043-a1d3-ffa0c5c3406e" class="">While this approach works well when comparing a small number of candidate answers, we found it was terribly slow when comparing many pairs in a list of \(N\) candidates; in some cases, slower than generating the candidates in the first place! To deal with this, we first reduced each answer to its <a href="https://en.wikipedia.org/wiki/Canonical_form">canonical form</a> and then computed the frequency of each form to determine the majority vote. Expand the detail below if you’re curious about how the code looks.</p>
 
186
  return canonical_to_original[canonical_form]</code></pre>
187
 
188
  <p id="15d1384e-bcac-804e-a99c-fe5e83313a3d" class="">This approach was significantly faster than checking each pair of solutions independently for equality.</p></div></details>
189
+
190
  <br><br>
191
 
192
  <p id="15b1384e-bcac-80f7-83e8-e1d6b360faa4" class="">Here’s how majority voting performs when applied to the generations from Llama 3.2 1B Instruct:</p><figure id="15b1384e-bcac-8072-9987-d80031b97793" class="image"><a href="Scaling%20test-time%20compute%20with%20open%20models%201531384ebcac800b9d73fca3503eb783/methods-maj.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-maj.png"/></a></figure><p id="15b1384e-bcac-8020-8688-fe1713e92c2b" class="">The results show that majority voting yields a significant improvement over the greedy decoding baseline, but its gains start to plateau after approximately \(N=64\) generations. This limitation arises because majority voting struggles with problems that require nuanced reasoning or tasks where errors are consistent across generations. If you’re also wondering why the majority voting accuracy is worse than the 0-shot CoT baseline for \(N=1\) and \(2\), that’s because we sample at \(T=0.8\), which makes it less likely we produce the correct answer among a handful of candidates.</p><p id="15b1384e-bcac-8075-8fef-f26f0b8e5559" class="">Building on the limitations of majority voting, let’s see how incorporating a reward model can enhance performance.</p>
 
218
 
219
  <!-- SECTION 4 -->
220
  <h2 id="1591384e-bcac-8065-a02c-cd760ebd6cd1" class="">Beam search with process reward models</h2>
221
+
222
  <p id="15a1384e-bcac-80e1-9e0e-c01f5f373805" class="">Beam search is a structured search method that systematically explores the solution space, making it a powerful tool for improving model outputs at test-time. When combined with a PRM, beam search can optimize both the generation and evaluation of intermediate steps in problem-solving. The way it works is as follows:</p>
223
 
224
  <ol>
 
232
  <p id="15a1384e-bcac-8003-a9d9-da7f3a4dc321" class="">By allowing the PRM to evaluate the correctness of intermediate steps, beam search can identify and prioritize promising paths early in the process. This step-by-step evaluation is particularly beneficial for complex reasoning tasks like mathematics, where verifying partial solutions can significantly improve final outcomes.</p>
233
 
234
  <details><summary style="font-weight:600;font-size:1.25em;line-height:1.3;margin:0">Implementation detail</summary><div class="indented">
235
+ <p id="15b1384e-bcac-8065-a739-d24b699106be" class="">When we implemented beam search with process supervision, we encountered two major footguns with the Llama 3 chat template that are worth mentioning:<d-footnote>These footguns cost us a few days of our lives, so we’re sharing them for the benefit of humanity.</d-footnote></p>
236
 
237
  <ul>
238
  <li>By default, the chat template trims trailing new lines from every assistant turn. As a result, if one uses <code>\n</code> or <code>\n\n</code> to terminate a step, these tokens are lost on subsequent steps and force the model to produce peculiar outputs.</li>
239
+ <li>The chat template is prefixed with Llama’s BOS token. When the formatted string is fed to vLLM a <em>second</em> BOS token is added which completely ruins performance, even though the generations look mostly coherent 🤯. See this <a href="https://github.com/vllm-project/vllm/issues/9519">vLLM issue</a> for more details.</li>
240
  </ul>
241
 
242
+ <p>The solution is to <a href="https://github.com/huggingface/search-and-learn/blob/27f273f7db648d6d3739f0a65a0f7ab1ce45888f/src/sal/config.py#L50">overwrite the Llama 3 chat template</a> to prevent trimming and exclude the BOS token prefix.</p>
243
  </div>
244
  </details>
245
  <br><br>
246
 
247
+ <p id="15d1384e-bcac-80e9-8e65-e1b58080b94c" class="">In our experiments, we followed <a href="https://huggingface.co/papers/2408.03314">DeepMind’s hyperparameter choices</a> and ran beam search with the following:</p>
248
 
249
  <ul>
250
  <li>\(N\) beams in compute scalings of 4, 16, 64, 256</li>
 
253
  <li style="list-style-type:disc">Up to 40 iterations, i.e. a tree of maximum depth with 40 steps.</li>
254
  </ul>
255
 
256
+ <p id="15d1384e-bcac-8051-abe5-dc84c42a1b5f" class="">As shown below, the results are striking: with a test-time budget of \(N=4\), beam search achieves the same accuracy as Best-of-N for \(N=16\), i.e. it is 4x more compute efficient! Similarly, with \(N=16\), beam search achieves the same accuracy as Best-of-N for \(N=256\), making it 16x more compute efficient at larger \(N\). Moreover, beam search matches the performance of Llama 3.1 8B with just \(N=32\) solutions per problem. The average performance on MATH by computer science PhD students is around 40%, so reaching nearly 55% isn’t too bad for a 1B model 💪!</p><figure id="15b1384e-bcac-80e9-97fa-fe50d1811f5b" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-maj-bon-beam.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-maj-bon-beam.png"/></a></figure>
257
+
258
  <h3 id="15a1384e-bcac-800c-baee-fb99b242ef87" class="">Which problems does beam search solve best?</h3>
259
+
260
  <p id="15d1384e-bcac-80e3-938a-c3f09db2e9ff" class="">Although in aggregate it is clear that beam search is a better search strategy than Best-of-N or majority voting, the DeepMind paper showed that <em><strong>each strategy has tradeoffs that depend on the problem difficulty</strong></em> and test-time compute budget. </p><p id="15d1384e-bcac-8015-a8f0-c2323b9e535f" class="">To see which problems are best suited for which strategy, DeepMind computed a distribution over estimated problem difficulty, and then binned the results into quintiles. In other words, each problem is assigned one of 5 levels, where level 1 indicates easier problems and level 5 indicates the hardest ones. To estimate problem difficulty, DeepMind generated 2048 candidate solutions with standard sampling per problem and then proposed the following heuristics:</p>
261
 
262
  <ul>
 
293
  $$\theta_{q,a^*(q)}^*(N) = \underset{\theta}{\arg\max} \left( \mathbb{E}_{y \sim \text{Target}(\theta, N, q)} \left[ \mathbb{1}_{y = y^*(q)} \right] \right),$$
294
 
295
  where \(y^*(q)\) is the ground-truth for question \(q\) and \(\theta_{q,a^*(q)}^*(N)\) denotes the compute-optimal scaling strategy. Since computing \(\theta_{q,a^*(q)}^*(N)\) directly is somewhat tricky, DeepMind proposed an approximation based on the <em><strong>problem difficulty</strong></em>, i.e. allocate test-time compute according to which search strategy achieves best performance for a given difficulty level.</p>
296
+
297
  <p id="15a1384e-bcac-80c9-a276-d5ea8974c543" class="">For example, on simpler problems and lower compute budgets, it is better to use strategies like Best-of-N, while on harder problems, beam search is the better choice. To implement this, for each method we compute the accuracy for a given difficulty level and test-time compute budget. And voila, we now have our compute-optimal curve!</p>
298
 
299
  <figure id="15b1384e-bcac-80b3-bc58-d20ba41d3950" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt.png"/></a></figure>
300
 
301
  <!-- SECTION 7 -->
302
  <h2 id="1591384e-bcac-809a-96d2-e928398d159a" class="">Scaling up to larger models</h2>
303
+
304
  <p id="15a1384e-bcac-8078-86d7-f48c2146444e" class="">We also explored scaling up the compute-optimal recipe to Llama 3.2 3B Instruct to see at what point the benefits of the PRM fade in comparison to the policy’s own capacity. To our surprise, compute-optimal scaling works remarkably well, with the 3B model surpassing the performance of Llama 3.1 70B Instruct (22x it's size!):</p><figure id="15b1384e-bcac-80b3-bc58-d20ba41d3950" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt-3b.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/methods-opt-3b.png"/></a></figure>
305
 
306
  <h2 id="15a1384e-bcac-809c-b5e7-eb92dadaebb4" class="">Where to go from here?</h2><p id="15b1384e-bcac-8052-91d7-d6e1f6f66e09" class="">This exploration of test-time compute scaling has revealed both the potential and the challenges of leveraging search-based methods. As we look ahead, several exciting directions emerge:</p>