lewtun committed
Commit 28c2bc9
1 Parent(s): 68d5169

Add pass@k

Files changed (1)
  1. app/src/index.html +33 -0
app/src/index.html CHANGED
@@ -265,6 +265,39 @@
  <li><strong>Model: </strong>use the distribution of average PRM scores per problem to determine the quintiles. The intuition here is that harder problems will have lower scores.</li>
  </ul>

+ <details><summary style="font-weight:600;font-size:1.25em;line-height:1.3;margin:0">Implementation detail</summary><div class="indented">
+ <p>The pass@k metric measures the probability, computed over a set of problems, that at least one of the top \(k\) generated outputs for each problem contains the correct solution. In practice, computing pass@k naively leads to high variance; for example, if we compute pass@1 from a single completion per problem, we can get significantly different values from repeated evaluations due to sampling. To combat this, OpenAI's <a href="https://huggingface.co/papers/2107.03374">Codex paper</a> introduced an unbiased estimator that accounts for the total number of generated samples \(n\), the number of correct samples \(c\), and the desired \(k\) value. The estimator is formulated as:
+
+ $$\text{pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}} \right]$$
+
+ This formula takes the expectation over all problems of the probability that at least one of the \(k\) selected samples is correct. The term \(\binom{n - c}{k}/\binom{n}{k}\) is the probability of drawing \(k\) incorrect samples from the \(n\) generated ones, and subtracting it from 1 gives the probability that at least one correct sample appears among the \(k\).<d-footnote>See the <a href="https://samuelalbanie.com/files/digest-slides/2022-07-codex.pdf">wonderful notes</a> from Samuel Albanie for many more details on pass@k.</d-footnote></p>
+
+ <p>However, computing the estimator directly suffers from numerical instabilities, so in practice one uses the following <a href="https://github.com/huggingface/search-and-learn/blob/27f273f7db648d6d3739f0a65a0f7ab1ce45888f/src/sal/utils/math.py#L230">simplified form</a>:
+ <d-code block language="python">
+ def pass_at_k(n: int, c: int, k: int) -> float:
+     """A numerically stable method for calculating an unbiased estimate of pass@k.
+
+     Taken from OpenAI's Codex paper: https://arxiv.org/abs/2107.03374
+
+     Args:
+         n (`int`): total number of samples
+         c (`int`): number of correct samples
+         k (`int`): k in pass@$k$
+
+     Returns:
+         `float`: an unbiased estimate of pass@k
+     """
+     if n - c < k:
+         return 1.0
+     return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
+ </d-code>
+ </p>
+
+
+ </div>
+ </details>
+ <br><br>
+
  <p id="15d1384e-bcac-80a3-af7c-f3497126ab1e" class="">Here’s the breakdown of the various methods according to the pass@1 scores and across four test-time compute budgets of \(N = [4, 16, 64, 256]\):</p><figure id="15b1384e-bcac-80ad-9cf3-cf5bcbd3f53b" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/levels-maj-bon-beam.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/levels-maj-bon-beam.png"/></a></figure><p id="15d1384e-bcac-80c3-93b3-fa4c071ac807" class="">In this plot, each bar denotes a test-time compute budget, and within each bar we show the relative accuracy of each method. For example, in the group of four bars on difficulty level 2 we see that:</p>

  <ul>
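
As a quick sanity check on the estimator added in the diff above, the numerically stable product form can be compared against the direct binomial formula. The snippet below is an illustrative sketch and not part of the commit: pass_at_k is copied from the diff, while pass_at_k_binomial is a hypothetical helper written here only for comparison, assuming numpy and Python's math.comb are available.

import math

import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Numerically stable unbiased estimate of pass@k (product form, as in the diff)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))


def pass_at_k_binomial(n: int, c: int, k: int) -> float:
    """Direct binomial form of the same estimator (comparison helper, not from the commit)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# The two forms agree: C(n - c, k) / C(n, k) equals the product of (1 - k / i)
# for i = n - c + 1, ..., n, which avoids forming large binomial coefficients.
for n, c, k in [(256, 3, 1), (256, 3, 4), (64, 10, 16), (16, 0, 4)]:
    assert math.isclose(pass_at_k(n, c, k), pass_at_k_binomial(n, c, k))
    print(f"n={n}, c={c}, k={k} -> pass@k = {pass_at_k(n, c, k):.4f}")

Both functions implement the same expectation term from the formula above; the product form simply sidesteps the large binomial coefficients, which is where the direct evaluation becomes numerically unstable.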