Spaces: codeparrot/code-generation-models

loubnabnl HF Staff committed on May 24, 2022

Commit

1114a41

1 Parent(s): 8399897

update

Files changed (1)

evaluation/intro.txt +5 -5

evaluation/intro.txt CHANGED

@@ -1,7 +1,7 @@
 A popular evaluation framework for code generation models is the [pass@k](https://huggingface.co/metrics/code_eval) metric on the [HumanEval](https://huggingface.co/datasets/openai_humaneval) dataset, which was introduced in the [Codex paper](https://arxiv.org/pdf/2107.03374v2.pdf). The dataset includes 164 handwritten programming problems. In the pass@k metric, k code samples are generated per problem; a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. Below are some examples for the selected models.
 For most models, we sample 200 candidate program completions and compute pass@1, pass@10, and pass@100 using an unbiased sampling estimator. The table below shows the HumanEval scores of CodeParrot, InCoder, GPT-neo models, GPT-J and Codex (not open-source).
-<center>
 | Model | pass@1 | pass@10 | pass@100|
 |-------|--------|---------|---------|
@@ -16,10 +16,10 @@ For most models, we sample 200 candidate program completions, and compute pass@1
 |GPT-neo (1.5B)| 4.79% | 7.47% | 16.30% |
 |GPT-J (6B)| 11.62% | 15.74% | 27.74% |
-<center>
-To better understand how pass@k metric works, we will illustrate it with some examples. We select 4 tasks from the HumanEval dataset and see how the models performs and which code completions pass the unit tests. We will use CodeParrot 🦜 . We select the three folowwing problem from HumanEval
 ```python
@@ -37,7 +37,7 @@ def has_close_elements(numbers: List[float], threshold: float) -> bool:
 ````
-```
 from typing import List
@@ -53,7 +53,7 @@ def separate_paren_groups(paren_string: str) -> List[str]:
 ````
-```
 def truncate_number(number: float) -> float:
     """ Given a positive floating point number, it can be decomposed into

 A popular evaluation framework for code generation models is the [pass@k](https://huggingface.co/metrics/code_eval) metric on the [HumanEval](https://huggingface.co/datasets/openai_humaneval) dataset, which was introduced in the [Codex paper](https://arxiv.org/pdf/2107.03374v2.pdf). The dataset includes 164 handwritten programming problems. In the pass@k metric, k code samples are generated per problem; a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. Below are some examples for the selected models.
 For most models, we sample 200 candidate program completions and compute pass@1, pass@10, and pass@100 using an unbiased sampling estimator. The table below shows the HumanEval scores of CodeParrot, InCoder, GPT-neo models, GPT-J and Codex (not open-source).
+<div align="center">
 | Model | pass@1 | pass@10 | pass@100|
 |-------|--------|---------|---------|
 |GPT-neo (1.5B)| 4.79% | 7.47% | 16.30% |
 |GPT-J (6B)| 11.62% | 15.74% | 27.74% |
+</div>
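The "unbiased sampling estimator" mentioned above can be sketched as follows. This is a minimal illustration, assuming the estimator takes the form given in the Codex paper linked above: for a problem with n generated samples of which c pass the unit tests, pass@k is estimated as 1 - C(n - c, k) / C(n, k), and the reported score is the mean over problems. The function names `pass_at_k` and `mean_pass_at_k` and the example counts are hypothetical and do not come from the commit.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("expected 0 <= c <= n and 1 <= k <= n")
    # 1 - C(n - c, k) / C(n, k): the probability that a random size-k
    # subset of the n samples contains at least one passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(num_correct: Sequence[int], n: int, k: int) -> float:
    """Average the per-problem estimates over all problems."""
    return sum(pass_at_k(n, c, k) for c in num_correct) / len(num_correct)


# Hypothetical counts of passing samples for three problems, n = 200 each.
example_counts = [0, 3, 200]
print(mean_pass_at_k(example_counts, n=200, k=1))
```

Sampling n = 200 completions and then estimating pass@1, pass@10, and pass@100 from the same pool gives a lower-variance estimate than drawing exactly k samples per problem, which is presumably why 200 samples are drawn here.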
+To better understand how the pass@k metric works, we will illustrate it with some examples. We select 4 problems from the HumanEval dataset and see how the model performs and which code completions pass the unit tests. We will use CodeParrot 🦜 with the three problems below:
 ```python
 ````
+```python
 from typing import List
 ````
+```python
 def truncate_number(number: float) -> float:
     """ Given a positive floating point number, it can be decomposed into