Add pass@k
app/src/index.html (+33 -0)

@@ -265,6 +265,39 @@
<li><strong>Model: </strong>use the distribution of average PRM scores per problem to determine the quintiles. The intuition here is that harder problems will have lower scores.</li>
</ul>
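<p>For the model-based binning described in the last bullet above, a minimal sketch (an illustrative snippet with hypothetical variable names, assuming one average PRM score per problem) could look like this:</p>
<d-code block language="python">
import numpy as np

# Hypothetical data: one average PRM score per problem (lower = harder).
avg_prm_scores = np.random.default_rng(0).random(500)

# Quintile edges from the score distribution; each problem is assigned a
# bin from 0 (lowest scores, hardest) to 4 (highest scores, easiest).
edges = np.quantile(avg_prm_scores, [0.2, 0.4, 0.6, 0.8])
difficulty_quintile = np.digitize(avg_prm_scores, edges)
</d-code>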
<details><summary style="font-weight:600;font-size:1.25em;line-height:1.3;margin:0">Implementation detail</summary><div class="indented">
<p>The pass@k metric measures the probability, computed over a set of problems, that at least one of the top \(k\) generated outputs for each problem contains the correct solution. In practice, computing pass@k naively leads to high variance; for example, if we compute pass@1 from a single completion per problem, we can get significantly different values from repeated evaluations due to sampling. To combat this, OpenAI's <a href="https://huggingface.co/papers/2107.03374">Codex paper</a> introduced an unbiased estimator that accounts for the total number of generated samples \(n\), the number of correct samples \(c\), and the desired \(k\) value. The estimator is formulated as:
$$\text{pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}} \right]$$
This formula calculates the expected value over all problems and determines the likelihood that at least one of the top \(k\) samples is correct. The term \(\binom{n - c}{k}/\binom{n}{k}\) represents the probability of selecting \(k\) incorrect samples from the total, and subtracting from 1 gives the probability of having at least one correct sample among the top \(k\).<d-footnote>See the <a href="https://samuelalbanie.com/files/digest-slides/2022-07-codex.pdf">wonderful notes</a> from Samuel Albanie for many more details on pass@k.</d-footnote></p>
<p>However, computing the estimator directly suffers from numerical instabilities, so in practice one uses the following <a href="https://github.com/huggingface/search-and-learn/blob/27f273f7db648d6d3739f0a65a0f7ab1ce45888f/src/sal/utils/math.py#L230">simplified form:</a>
<d-code block language="python">
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """A numerically stable method for calculating an unbiased estimate of pass@k.

    Taken from OpenAI's Codex paper: https://arxiv.org/abs/2107.03374

    Args:
        n (`int`): total number of samples
        c (`int`): number of correct samples
        k (`int`): k in pass@$k$

    Returns:
        `float`: an unbiased estimate of pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
</d-code>
</p>
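<p>As a quick sanity check on how this is used (a toy illustration with made-up counts, not numbers from our experiments), the per-problem estimates are simply averaged to take the expectation over problems in the formula above:</p>
<d-code block language="python">
import numpy as np

# Toy example reusing pass_at_k() defined above: three problems,
# n = 16 graded completions each, c correct completions per problem.
n, k = 16, 4
correct_counts = [5, 0, 12]

per_problem = [pass_at_k(n, c, k) for c in correct_counts]
print(np.mean(per_problem))  # ≈ 0.606: the pass@4 estimate for this toy set
</d-code>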
</div>
</details>
<br><br>
<p id="15d1384e-bcac-80a3-af7c-f3497126ab1e" class="">Here’s the breakdown of the various methods according to the pass@1 scores and across four test-time compute budgets of \(N = [4,16,64, 256]\):</p><figure id="15b1384e-bcac-80ad-9cf3-cf5bcbd3f53b" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/levels-maj-bon-beam.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/levels-maj-bon-beam.png"/></a></figure><p id="15d1384e-bcac-80c3-93b3-fa4c071ac807" class="">In this plot, each bar denotes a test-time compute budget, and within each bar we show the relative accuracy of each method. For example, in the group of four bars on difficulty level 2 we see that:</p>
<ul>