blogpost-scaling-test-time-compute

Running

App Files Files Community

lewtun HF staff commited on 3 days ago

Commit

33c5e72

•

1 Parent(s): b0b6916

Add footnote

Browse files

Files changed (1) hide show

app/src/index.html +5 -1

app/src/index.html CHANGED Viewed

@@ -116,7 +116,11 @@
     <p>To warmup, we’ll begin with a simple baseline and progressively incorporate additional techniques to improve performance.</p>
-    <h2 id="1591384e-bcac-801a-9201-cd4f3b8dfe96" class="">Majority voting: a simple baseline</h2><p id="1591384e-bcac-8011-8710-deb8e5601cdb" class="">Majority voting—or <a href="https://huggingface.co/papers/2203.11171">self-consistency decoding</a> if you want to be fancy—is the most straightforward method to aggregate an LLM’s outputs. As the name suggests, for a given math problem we generate \(N\) candidate solutions and pick the most frequent answer. For all our experiments we sampled up to \(N=256\) candidates with temperature \(T=0.8\) and generated up to 2048 tokens per problem.</p><p id="15c1384e-bcac-8086-a0e7-e0ca93b5ea94" class="">One quirk with the MATH benchmark is that answers must be formatted in a LaTeX box like <code>\boxed{answer}</code> . We initially tried the following simple system prompt for Llama 3.2 1B</p><script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js" integrity="sha512-7Z9J3l1+EYfeaPKcGXu3MS/7T+w19WtKQY/n+xzmw4hZhJ9tyYmcUS+4QqAlzhicE5LAfMQSF3iFTK9bQdTxXg==" crossorigin="anonymous" referrerPolicy="no-referrer"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" integrity="sha512-tN7Ec6zAFaVSG3TpNAKtk4DOHNpSwKHxxrsiw4GHKESGPs5njn/0sMCUMl2svV4wo4BK/rCP7juYz+zx+l6oeQ==" crossorigin="anonymous" referrerPolicy="no-referrer"/><pre id="15c1384e-bcac-8042-9b96-ff3d615bb9f0" class="code"><code class="language-Python">Please think step by step and put your final answer within \boxed{}.</code></pre><p id="15c1384e-bcac-80d0-bca6-ffed38482a37" class="">but found the resulting accuracy with greedy decoding (\(T=0\)) to be far worse than the 30.6% that Meta reported in their <a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">release</a>. Luckily, Meta also <a href="https://huggingface.co/datasets/meta-llama/Llama-3.2-1B-Instruct-evals/viewer/Llama-3.2-1B-Instruct-evals__math__details">published</a> the prompts they used for their evals and switching our system prompt to theirs made all the difference:</p><script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js" integrity="sha512-7Z9J3l1+EYfeaPKcGXu3MS/7T+w19WtKQY/n+xzmw4hZhJ9tyYmcUS+4QqAlzhicE5LAfMQSF3iFTK9bQdTxXg==" crossorigin="anonymous" referrerPolicy="no-referrer"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" integrity="sha512-tN7Ec6zAFaVSG3TpNAKtk4DOHNpSwKHxxrsiw4GHKESGPs5njn/0sMCUMl2svV4wo4BK/rCP7juYz+zx+l6oeQ==" crossorigin="anonymous" referrerPolicy="no-referrer"/><pre id="15c1384e-bcac-8011-aee6-d7c433df8b5f" class="code"><code class="language-Python">Solve the following math problem efficiently and clearly:
 - For simple problems (2 steps or fewer):
 Provide a concise solution with minimal explanation.

     <p>To warmup, we’ll begin with a simple baseline and progressively incorporate additional techniques to improve performance.</p>
+    <h2 id="1591384e-bcac-801a-9201-cd4f3b8dfe96" class="">Majority voting: a simple baseline</h2><p id="1591384e-bcac-8011-8710-deb8e5601cdb" class="">Majority voting—or <a href="https://huggingface.co/papers/2203.11171">self-consistency decoding</a> if you want to be fancy—is the most straightforward method<d-footnote>It’s also the most common sampling method used in the literature and is usually referred to as “maj@X” in tables and results.</d-footnote> to aggregate an LLM’s outputs. As the name suggests, for a given math problem we generate \(N\) candidate solutions and pick the most frequent answer. For all our experiments we sampled up to \(N=256\) candidates with temperature \(T=0.8\) and generated up to 2048 tokens per problem.<d-footnote>We found that sampling with \(T=1.0\) would cause the model to generate Chinese characters midway through a solution and hurt performance.</d-footnote></p>
+    <p id="15c1384e-bcac-8086-a0e7-e0ca93b5ea94" class="">One quirk with the MATH benchmark is that answers must be formatted in a LaTeX box like <code>\boxed{answer}</code> . We initially tried the following simple system prompt for Llama 3.2 1B</p><script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js" integrity="sha512-7Z9J3l1+EYfeaPKcGXu3MS/7T+w19WtKQY/n+xzmw4hZhJ9tyYmcUS+4QqAlzhicE5LAfMQSF3iFTK9bQdTxXg==" crossorigin="anonymous" referrerPolicy="no-referrer"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" integrity="sha512-tN7Ec6zAFaVSG3TpNAKtk4DOHNpSwKHxxrsiw4GHKESGPs5njn/0sMCUMl2svV4wo4BK/rCP7juYz+zx+l6oeQ==" crossorigin="anonymous" referrerPolicy="no-referrer"/><pre id="15c1384e-bcac-8042-9b96-ff3d615bb9f0" class="code"><code class="language-Python">Please think step by step and put your final answer within \boxed{}.</code></pre><p id="15c1384e-bcac-80d0-bca6-ffed38482a37" class="">but found the resulting accuracy with greedy decoding (\(T=0\)) to be far worse than the 30.6% that Meta reported in their <a href="https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/">release</a>. Luckily, Meta also <a href="https://huggingface.co/datasets/meta-llama/Llama-3.2-1B-Instruct-evals/viewer/Llama-3.2-1B-Instruct-evals__math__details">published</a> the prompts they used for their evals and switching our system prompt to theirs made all the difference:</p>
+    <script src="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/prism.min.js" integrity="sha512-7Z9J3l1+EYfeaPKcGXu3MS/7T+w19WtKQY/n+xzmw4hZhJ9tyYmcUS+4QqAlzhicE5LAfMQSF3iFTK9bQdTxXg==" crossorigin="anonymous" referrerPolicy="no-referrer"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/prism/1.29.0/themes/prism.min.css" integrity="sha512-tN7Ec6zAFaVSG3TpNAKtk4DOHNpSwKHxxrsiw4GHKESGPs5njn/0sMCUMl2svV4wo4BK/rCP7juYz+zx+l6oeQ==" crossorigin="anonymous" referrerPolicy="no-referrer"/><pre id="15c1384e-bcac-8011-aee6-d7c433df8b5f" class="code"><code class="language-Python">Solve the following math problem efficiently and clearly:
 - For simple problems (2 steps or fewer):
 Provide a concise solution with minimal explanation.