
A popular evaluation framework for code generation models is the [pass@k](https://huggingface.co/metrics/code_eval) metric on the [HumanEval](https://huggingface.co/datasets/openai_humaneval) dataset, which was introduced in the [Codex paper](https://arxiv.org/pdf/2107.03374v2.pdf). The dataset includes 164 handwritten programming problems. In the pass@k metric, k code samples are generated per problem; a problem is considered solved if any of the samples passes the unit tests, and the total fraction of problems solved is reported.
In most papers, 200 candidate program completions are sampled per problem, and pass@1, pass@10, and pass@100 are computed using an unbiased sampling estimator. The table below shows the HumanEval scores of CodeParrot, InCoder, GPT-Neo, GPT-J and Codex (not open-source).


| Model | pass@1 | pass@10 | pass@100|
|-------|--------|---------|---------|
|CodeParrot (110M) | 3.80% | 6.57% | 12.78% | 
|CodeParrot (1.5B) | 3.58% | 8.03% | 14.96% |
|||||
|InCoder (6.7B) | 15.2% | 27.8% | 47.00% |
|||||
|Codex (25M)| 3.21% | 7.1% | 12.89% |
|Codex (300M)| 13.17%| 20.37% | 36.27% |
|Codex (12B)| 28.81%| 46.81% | 72.31% |
|||||
|GPT-Neo (1.3B)| 4.79% | 7.47% | 16.30% |
|GPT-J (6B)| 11.62% | 15.74% | 27.74% |
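
The unbiased estimator mentioned above can be sketched in a few lines. For a single problem with `n` generated samples of which `c` pass the tests, the Codex paper estimates pass@k as `1 - C(n-c, k) / C(n, k)`; the version below uses the equivalent product form, which avoids large binomial coefficients. The function name `estimate_pass_at_k` is our own choice for illustration, not part of any library.

```python
from math import prod


def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: number of samples we are allowed to pick
    """
    # If fewer than k samples fail, any choice of k samples contains a passing one.
    if n - c < k:
        return 1.0
    # Product form of 1 - C(n - c, k) / C(n, k).
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))


# Hypothetical example: 200 samples, 10 of which pass the tests.
for k in (1, 10, 100):
    print(f"pass@{k}: {estimate_pass_at_k(200, 10, k):.4f}")
```

The benchmark score is then the average of this per-problem estimate over all problems.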

We can load the HumanEval dataset from 🤗 [`datasets`](https://huggingface.co/docs/datasets/index) and the pass@k metric from 🤗 `evaluate`:

```python
import evaluate
from datasets import load_dataset

human_eval = load_dataset("openai_humaneval")
# `load_metric` was removed from recent `datasets` releases; metrics now live in `evaluate`.
code_eval_metric = evaluate.load("code_eval")
```

We can easily compute the pass@k for a problem that asks for the implementation of a function that sums two integers:

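```python
import os

# The code_eval metric executes model-generated code, so it must be enabled explicitly.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
pass_at_k, results = code_eval_metric.compute(references=test_cases, predictions=candidates, k=[1, 2])
print(pass_at_k)
# {'pass@1': 0.5, 'pass@2': 1.0}
```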
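
To make "passes the unit tests" concrete, here is a simplified sketch of what happens to each candidate: the candidate code and the test are concatenated and executed, and the candidate counts as correct if no exception is raised. The helper `passes_tests` is hypothetical and written only for illustration; the actual metric runs each program in a separate process with a timeout, whereas this sketch uses a bare `exec` and should only be used on trusted toy snippets.

```python
def passes_tests(candidate: str, test: str) -> bool:
    """Return True if the candidate code runs and the test raises no exception."""
    namespace: dict = {}
    try:
        # Define the candidate function, then run the assert-based test against it.
        exec(candidate + "\n" + test, namespace)
    except Exception:
        # Covers failed asserts as well as syntax or runtime errors in the candidate.
        return False
    return True


test_case = "assert add(2,3)==5"
candidates = ["def add(a,b): return a*b", "def add(a, b): return a+b"]

outcomes = [passes_tests(candidate, test_case) for candidate in candidates]
print(outcomes)  # [False, True]
```

One of the two candidates passes, which matches the pass@1 of 0.5 above: picking a single candidate at random solves the problem half of the time.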

To better understand how the pass@k metric works, we illustrate it with concrete examples. We select two problems from the HumanEval dataset and check how CodeParrot 🦜 (110M) performs, that is, which of its code completions pass the unit tests of the two problems below:

**Problem 1:**

```python

from typing import List


def separate_paren_groups(paren_string: str) -> List[str]:
    """ Input to this function is a string containing multiple groups of nested parentheses. Your goal is to
    separate those group into separate strings and return the list of those.
    Separate groups are balanced (each open brace is properly closed) and not nested within each other
    Ignore any spaces in the input string.
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """
```

**Problem 2:**

```python

def truncate_number(number: float) -> float:
    """ Given a positive floating point number, it can be decomposed into
    and integer part (largest integer smaller than given number) and decimals
    (leftover part always smaller than 1).

    Return the decimal part of the number.
    >>> truncate_number(3.5)
    0.5
    """
```

For each problem, instead of 200 candidate solutions, we generate only 20 samples for illustration purposes. We use nucleus (top-p) sampling with `p=0.95` and `temperature=0.2`, and sample tokens from the model until we encounter a stop sequence indicating the end of a method: `\nclass`, `\ndef`, `\n#`, `\nif`, or `\nprint`. For more details about decoding strategies for language generation, we recommend this [blog](https://huggingface.co/blog/how-to-generate).
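
The stop-sequence rule can be sketched as a post-processing step: the generated completion is cut at the earliest occurrence of any stop sequence, so that only the body of the target function is kept. The helper name `truncate_at_stop` and the example completion are our own, for illustration; in practice the same effect may instead be obtained with a stopping criterion during generation.

```python
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]


def truncate_at_stop(completion: str, stop_sequences=STOP_SEQUENCES) -> str:
    """Cut the completion at the earliest stop sequence, if any is present."""
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut]


# Hypothetical raw completion: the model finishes the function, then starts a new one.
generated = "    return number % 1\n\ndef unrelated():\n    pass\n"
print(repr(truncate_at_stop(generated)))
```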

**Remark**:

Regarding the temperature parameter, the authors of the [CodeGen](https://github.com/salesforce/CodeGen) paper observed that the best-performing temperature increases as the number of permitted samples k increases. When a model is only allowed a few samples to pass the unit tests, it is beneficial to stay close to the learned distribution, through a low temperature, to select candidates that are likely to pass. But when a model is allowed more chances with a high k, using a higher sampling temperature to tilt the learned distribution lets it explore more diverse samples and makes it more likely to synthesize a correct program.
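
As a rough illustration of this effect, the sketch below applies a temperature-scaled softmax to a small set of made-up next-token scores. The logits and the helper name `softmax_with_temperature` are assumptions for the example only; the point is that a low temperature concentrates probability on the top candidate, while a higher one spreads it over more alternatives.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities after dividing by the temperature."""
    scaled = [score / temperature for score in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(scaled)
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for temperature in (0.2, 0.8):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```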


For our experiment, we compute pass@1, pass@10 and pass@20, which correspond to the unit test pass rate when selecting 1, 10 and 20 samples, respectively, from the candidate solutions.

```
Results: {'pass@1': 0.0750, 'pass@10': 0.4473, 'pass@20': 0.5}
```

If we take a closer look at the unit test results for each candidate solution in the two problems, we find that 3 candidates passed the tests for the second problem and none did for the first. This means that we have 3 correct solutions among 40, which corresponds to our pass@1 value `3/40 = 0.075`. The scores pass@10 and pass@20 are higher because the more samples we select from the candidate completions, the more likely we are to include a correct implementation. As for pass@20, it is `1/2 = 0.5`: if we select all 20 candidates for each problem, the second problem gets solved and the first does not, which gives a 50% success rate. If you are curious about the candidate solutions that passed the tests, they all implemented this function:

```python

def truncate_number(number: float) -> float:
    """ Given a positive floating point number, it can be decomposed into
    and integer part (largest integer smaller than given number) and decimals
    (leftover part always smaller than 1).

    Return the decimal part of the number.
    >>> truncate_number(3.5)
    0.5
    """
    return number % 1
```
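
The arithmetic behind the reported scores can be reproduced from the pass counts alone: 0 of 20 samples pass for the first problem and 3 of 20 for the second. The sketch below applies the per-problem formula `1 - C(n-c, k) / C(n, k)` and averages over the two problems; the function and variable names are ours, chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem with n samples, c of them correct."""
    # With fewer than k failing samples, every choice of k samples includes a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# (n, c) per problem: 20 samples each, 0 correct for problem 1 and 3 for problem 2.
counts = [(20, 0), (20, 3)]

scores = {
    f"pass@{k}": sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)
    for k in (1, 10, 20)
}
print(scores)
```

This gives 0.075 for pass@1, about 0.447 for pass@10 and 0.5 for pass@20, in line with the results reported above.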