Add pass@k
app/src/index.html (+33 -0)

@@ -265,6 +265,39 @@
<li><strong>Model: </strong>use the distribution of average PRM scores per problem to determine the quintiles. The intuition here is that harder problems will have lower scores.</li>
</ul>
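<p>For the model-based binning described in the last bullet above, a minimal sketch (an illustrative snippet with hypothetical variable names, assuming one average PRM score per problem) could look like this:</p>
<d-code block language="python">
import numpy as np

# Hypothetical data: one average PRM score per problem (lower = harder).
avg_prm_scores = np.random.default_rng(0).random(500)

# Quintile edges from the score distribution; each problem is assigned a
# bin from 0 (lowest scores, hardest) to 4 (highest scores, easiest).
edges = np.quantile(avg_prm_scores, [0.2, 0.4, 0.6, 0.8])
difficulty_quintile = np.digitize(avg_prm_scores, edges)
</d-code>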
<details><summary style="font-weight:600;font-size:1.25em;line-height:1.3;margin:0">Implementation detail</summary><div class="indented">
<p>The pass@k metric measures the probability, computed over a set of problems, that at least one of the top \(k\) generated outputs for each problem contains the correct solution. In practice, computing pass@k naively leads to high variance; for example, if we compute pass@1 from a single completion per problem, we can get significantly different values from repeated evaluations due to sampling. To combat this, OpenAI's <a href="https://huggingface.co/papers/2107.03374">Codex paper</a> introduced an unbiased estimator that accounts for the total number of generated samples \(n\), the number of correct samples \(c\), and the desired \(k\) value. The estimator is formulated as:
$$\text{pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}} \right]$$
This formula calculates the expected value over all problems and determines the likelihood that at least one of the top \(k\) samples is correct. The term \(\binom{n - c}{k}/\binom{n}{k}\) represents the probability of selecting \(k\) incorrect samples from the total, and subtracting from 1 gives the probability of having at least one correct sample among the top \(k\).<d-footnote>See the <a href="https://samuelalbanie.com/files/digest-slides/2022-07-codex.pdf">wonderful notes</a> from Samuel Albanie for many more details on pass@k.</d-footnote></p>
<p>However, computing the estimator directly suffers from numerical instabilities, so in practice one uses the following <a href="https://github.com/huggingface/search-and-learn/blob/27f273f7db648d6d3739f0a65a0f7ab1ce45888f/src/sal/utils/math.py#L230">simplified form:</a>
<d-code block language="python">
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """A numerically stable method for calculating an unbiased estimate of pass@k.

    Taken from OpenAI's Codex paper: https://arxiv.org/abs/2107.03374

    Args:
        n (`int`): total number of samples
        c (`int`): number of correct samples
        k (`int`): k in pass@$k$

    Returns:
        `float`: an unbiased estimate of pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
</d-code>
</p>
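<p>As a quick sanity check on how this is used (a toy illustration with made-up counts, not numbers from our experiments), the per-problem estimates are simply averaged to take the expectation over problems in the formula above:</p>
<d-code block language="python">
import numpy as np

# Toy example reusing pass_at_k() defined above: three problems,
# n = 16 graded completions each, c correct completions per problem.
n, k = 16, 4
correct_counts = [5, 0, 12]

per_problem = [pass_at_k(n, c, k) for c in correct_counts]
print(np.mean(per_problem))  # ≈ 0.606: the pass@4 estimate for this toy set
</d-code>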
</div>
</details>
<br><br>
<p id="15d1384e-bcac-80a3-af7c-f3497126ab1e" class="">Here’s the breakdown of the various methods according to the pass@1 scores and across four test-time compute budgets of \(N = [4,16,64, 256]\):</p><figure id="15b1384e-bcac-80ad-9cf3-cf5bcbd3f53b" class="image"><a href="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/levels-maj-bon-beam.png"><img style="width:707.9891357421875px" src="https://huggingface.co/datasets/HuggingFaceH4/blogpost-images/resolve/main/levels-maj-bon-beam.png"/></a></figure><p id="15d1384e-bcac-80c3-93b3-fa4c071ac807" class="">In this plot, each bar denotes a test-time compute budget, and within each bar we show the relative accuracy of each method. For example, in the group of four bars on difficulty level 2 we see that:</p>
<ul>