Spaces: codeparrot/code-generation-models
loubnabnl (HF Staff) committed on May 24, 2022

Commit 9196092 (1 parent: d41708b)

update

Files changed (1)

evaluation/intro.txt CHANGED

@@ -1,7 +1,6 @@
 A popular evaluation framework for code generation models is the [pass@k](https://huggingface.co/metrics/code_eval) metric on the [HumanEval](https://huggingface.co/datasets/openai_humaneval) dataset, which was introduced in the [Codex paper](https://arxiv.org/pdf/2107.03374v2.pdf). The dataset includes 164 handwritten programming problems. In the pass@k metric, k code samples are generated per problem; a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported.
 In most papers, 200 candidate program completions are sampled per problem, and pass@1, pass@10, and pass@100 are computed using an unbiased sampling estimator. The table below shows the HumanEval scores of CodeParrot, InCoder, GPT-Neo, GPT-J and Codex (not open source).
-<div align="center">
 | Model | pass@1 | pass@10 | pass@100|
 |-------|--------|---------|---------|
@@ -16,8 +15,6 @@ In most papers, 200 candidate program completions are sampled, and pass@1, pass@
 |GPT-Neo (1.3B)| 4.79% | 7.47% | 16.30% |
 |GPT-J (6B)| 11.62% | 15.74% | 27.74% |
-</div>
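The unbiased estimator mentioned above can be sketched as follows. With n samples per problem, of which c pass the unit tests, the Codex paper estimates pass@k as 1 - C(n-c, k) / C(n, k), the probability that a random subset of k samples contains at least one passing sample. This is a minimal sketch, not the `code_eval` implementation: the function names `pass_at_k` and `mean_pass_at_k` and the `(n, c)` input layout are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass.

    Hypothetical helper: computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every subset of size k
    # must contain at least one passing sample.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs (assumed layout)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a problem with 200 samples and 20 passing ones gives `pass_at_k(200, 20, 1)` equal to 20/200, while larger k raises the estimate because any one of the k drawn samples may pass.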
 To better understand how the pass@k metric works, we will illustrate it with some examples. We select 4 problems from the HumanEval dataset and see how the model performs and which code completions pass the unit tests. We will use CodeParrot 🦜 with the three problems below:
