How to get the same results as the Code Completion Playground above
When I was using it, I found that the Code Completion Playground gives much better results than the offline predictions I get from the StarCoder weights I downloaded. Why is this, and is there any difference? At the same time, the Hosted Inference API results are not as good as the Code Completion Playground's. Does it do anything special?
@rookielyb I'm not sure, but you may want to double-check the hyper-parameters (e.g., temperature).
I have the same problem, but I am running in 8-bit mode; I don't know if that is the reason.
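For reference, this is roughly how I load the checkpoint in 8-bit (a minimal sketch, assuming the public bigcode/starcoder checkpoint and bitsandbytes installed; 8-bit quantization can slightly change the logits compared to the full-precision weights the hosted endpoints use):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load StarCoder with 8-bit quantization (requires bitsandbytes).
# Quantized weights can produce slightly different generations than fp16/bf16.
checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    load_in_8bit=True,
)
```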
@rookielyb I notice that you set top_k=50. Can you try again after removing that constraint? I do not think we have ever set top_k.
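For example, a minimal generation sketch without the top-k constraint (assuming the bigcode/starcoder checkpoint; the prompt and temperature here are only illustrative values):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# top_k=0 disables top-k filtering, so sampling draws from the full
# temperature-scaled distribution (the library default is otherwise top_k=50).
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_k=0,
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```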
Thank you for your reply! Here are the latest parameters. With the same parameters, starcoderbase performs better than starcoder: the left side of the figure is generated with the starcoderbase weights, and the right side with starcoder. I found that the generated results are very sensitive to the parameters. Could you share the parameters used for the HumanEval pass@1 results in the paper?
@rookielyb sure, you can already find them in the paper:
Like Chen et al. (2021), we use sampling temperature 0.2 for pass@1, and temperature 0.8 for k > 1. We generate n = 200 samples for all experiments with open-access models.
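In case it helps with reproduction, pass@k is computed with the unbiased estimator from Chen et al. (2021); here is a minimal sketch of that formula (mirroring the reference human-eval implementation, numpy assumed):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated per problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

With n = 200 samples per problem, pass@1 reduces to the fraction of samples that pass, e.g. pass_at_k(200, 68, 1) ≈ 0.34; the benchmark score averages this over all HumanEval problems.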
Hi, as answered in this issue, the playground doesn't do anything special: it calls the inference endpoint to generate code, which is equivalent to doing model.generate with the same parameters (check the Playground's public code).
The HumanEval score is 33%-40%, so it's normal that the model gets some solutions wrong. If you want to reproduce the HumanEval score, you can run the evaluation-harness on the full benchmark instead of comparing a few problems. As specified in the issue, both the paper settings provided by @SivilTaram and greedy decoding give a pass@1 of 33%-34%. (Btw, it helps to strip the prompt before generation, if you're not already doing it.)
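To illustrate the prompt-stripping point, here is a rough sketch of generating a completion for a HumanEval-style problem (the checkpoint name and prompt are just examples; greedy decoding shown):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype="auto")

# Example HumanEval-style prompt; the original ends with a newline/indentation.
prompt = 'def has_close_elements(numbers, threshold):\n    """Return True if any two numbers are closer than threshold."""\n'

# Strip trailing whitespace/newlines before generation: a dangling newline can
# steer the tokenizer toward odd continuations and hurt the score.
stripped = prompt.strip()
inputs = tokenizer(stripped, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# The decoded text normally starts with the prompt; keep only the newly
# generated part before running the unit tests on it.
completion = text[len(stripped):]
print(completion)
```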