Update evaluation/demo_humaneval.md
evaluation/demo_humaneval.md +7 -22
evaluation/demo_humaneval.md
CHANGED
@@ -19,27 +19,12 @@ print(pass_at_k)
|
|
19 |
{'pass@1': 0.5, 'pass@2': 1.0}
|
20 |
```
|
21 |
|
22 |
-
To better understand how pass@k metric works, we will illustrate it with
|
23 |
|
24 |
-
**Problem
|
25 |
|
26 |
```python
|
27 |
|
28 |
-
from typing import List
|
29 |
-
|
30 |
-
|
31 |
-
def separate_paren_groups(paren_string: str) -> List[str]:
|
32 |
-
""" Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
|
33 |
-
separate those group into separate strings and return the list of those.
|
34 |
-
Separate groups are balanced (each open brace is properly closed) and not nested within each other
|
35 |
-
Ignore any spaces in the input string.
|
36 |
-
>>> separate_paren_groups('( ) (( )) (( )( ))')
|
37 |
-
['()', '(())', '(()())']
|
38 |
-
"""
|
39 |
-
````
|
40 |
-
**Problem 2:**
|
41 |
-
```python
|
42 |
-
|
43 |
def truncate_number(number: float) -> float:
|
44 |
""" Given a positive floating point number, it can be decomposed into
|
45 |
and integer part (largest integer smaller than given number) and decimals
|
@@ -51,23 +36,23 @@ def truncate_number(number: float) -> float:
|
|
51 |
"""
|
52 |
````
|
53 |
|
54 |
-
|
55 |
|
56 |
**Remark**:
|
57 |
|
58 |
Regarding the temperature parameter, in the [CodeGen](https://github.com/salesforce/CodeGen) paper, the authors observed that the best-performing temperature increases with the number of permitted samples k. When a model is allowed only a few samples to pass the unit tests, it is beneficial to use the learned distribution, through a low temperature, to select candidates that are likely to pass. But when a model is allowed more chances with a high k, a higher sampling temperature tilts the learned distribution so that it explores more diverse samples and thus has a greater chance of synthesizing a correct program.
|
59 |
|
60 |
|
61 |
-
For our experiment, we compute pass@1, pass@10 and pass@20, each
|
62 |
|
63 |
```
|
64 |
|
65 |
-
Results: {'pass@1': 0.
|
66 |
|
67 |
````
|
68 |
|
69 |
-
If we take a closer look at the unit test results for each candidate solution
|
70 |
-
for pass@20, it is `1
|
71 |
|
72 |
```python
|
73 |
|
|
|
19 |
{'pass@1': 0.5, 'pass@2': 1.0}
|
20 |
```
|
21 |
|
22 |
+
To better understand how the pass@k metric works, we will illustrate it with a concrete example from the HumanEval dataset. We select the problem below and see how CodeParrot 🦜 (110M) performs and which code completions pass the unit tests:
|
23 |
|
24 |
+
**Problem:**
|
25 |
|
26 |
```python
|
27 |
|
28 |
def truncate_number(number: float) -> float:
|
29 |
""" Given a positive floating point number, it can be decomposed into
|
30 |
and integer part (largest integer smaller than given number) and decimals
|
|
|
36 |
"""
|
37 |
````
|
38 |
|
39 |
+
Instead of 200 candidate solutions, we will only generate 20 samples for illustration purposes. We use nucleus (top-p) sampling with `p=0.95` and `temperature=0.2`, and sample tokens from the model until we encounter a stop sequence indicating the end of a method: `\nclass`, `\ndef`, `\n#`, `\nif`, or `\nprint`. For more details about decoding strategies for language generation, we recommend this [blog post](https://huggingface.co/blog/how-to-generate).
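As a rough illustration of the stopping rule, the sketch below cuts a generated string at the earliest stop sequence. The helper name `truncate_at_stop_sequence` and the sample completion are hypothetical and only serve the example; the actual evaluation script may apply the stop sequences differently (for instance, during generation rather than afterwards).

```python
# Stop sequences listed in the text above.
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]


def truncate_at_stop_sequence(completion: str, stop_sequences=STOP_SEQUENCES) -> str:
    """Return `completion` cut at the earliest stop sequence (hypothetical helper)."""
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut]


# Made-up model output: a function body followed by the start of a new function.
generated = "    return number - int(number)\n\ndef helper():\n    pass"
truncated = truncate_at_stop_sequence(generated)
print(repr(truncated))
```

Only the text before the first `\ndef` is kept, so the unrelated `helper` function never reaches the unit tests.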
|
40 |
|
41 |
**Remark**:
|
42 |
|
43 |
Regarding the temperature parameter, in the [CodeGen](https://github.com/salesforce/CodeGen) paper, the authors observed that the best-performing temperature increases with the number of permitted samples k. When a model is allowed only a few samples to pass the unit tests, it is beneficial to use the learned distribution, through a low temperature, to select candidates that are likely to pass. But when a model is allowed more chances with a high k, a higher sampling temperature tilts the learned distribution so that it explores more diverse samples and thus has a greater chance of synthesizing a correct program.
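To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled softmax over a few made-up logits. The logits and the comparison temperature of `0.8` are assumptions chosen for illustration; only `temperature=0.2` comes from the experiment described here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing each logit by `temperature`."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)   # sharper distribution
high = softmax_with_temperature(logits, 0.8)  # flatter distribution
print("temperature=0.2:", [round(p, 3) for p in low])
print("temperature=0.8:", [round(p, 3) for p in high])
```

At the lower temperature almost all probability mass sits on the top token, while the higher temperature spreads it across the alternatives, which is what produces more diverse samples.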
|
44 |
|
45 |
|
46 |
+
For our experiment, we compute pass@1, pass@10 and pass@20, each corresponding to the unit test pass rate when selecting 1, 10 and 20 samples, respectively, from the candidate solutions.
|
47 |
|
48 |
```
|
49 |
|
50 |
+
Results: {'pass@1': 0.1, 'pass@10': 0.7631, 'pass@20': 1.0}
|
51 |
|
52 |
````
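These scores can be reproduced by hand. The sketch below assumes the metric uses the unbiased estimator from the Codex paper, `pass@k = 1 - C(n-c, k) / C(n, k)`, with `n = 20` candidates of which `c = 2` pass the unit tests, as in this experiment. The function name `pass_at_k` is ours and is not taken from the evaluation library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct, given c correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


scores = {f"pass@{k}": pass_at_k(n=20, c=2, k=k) for k in (1, 10, 20)}
print(scores)
```

With these inputs, pass@1 is `1 - 18/20 = 0.1`, pass@10 is `1 - 43758/184756`, which is about `0.7632` and agrees with the reported `0.7631` up to rounding, and pass@20 is `1.0`.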
|
53 |
|
54 |
+
If we take a closer look at the unit test results for each candidate solution, we find that 2 of them passed the unit tests. This means that we have 2 correct solutions among 20, which corresponds to our pass@1 value `2/20 = 0.1`. The scores pass@10 and pass@20 are higher, because the more samples we select from the candidate completions, the more likely we are to include a correct implementation. As
|
55 |
+
for pass@20, it is `1`, since selecting all 20 candidates is guaranteed to include a correct solution, which gives a 100% success rate. If you are curious about the candidate solutions that passed the tests, they both implemented this function:
|
56 |
|
57 |
```python
|
58 |
|