How to get the same results as the Code Completion Playground above
When I was using it, I found that the Code Completion Playground gives much better results than the offline predictions I get from the StarCoder weights I downloaded. Why is this, and is there any difference? At the same time, the Hosted Inference API results are not as good as the Code Completion Playground's. Does it do anything special?
@rookielyb I'm not sure, but you may want to double-check the hyper-parameters (e.g., temperature).
I have the same problem, but I am running in 8-bit mode; I don't know if that is the reason.
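For reference, this is roughly how I load the checkpoint in 8-bit (a minimal sketch, assuming the public bigcode/starcoder checkpoint and bitsandbytes installed; 8-bit quantization can slightly change the logits compared to the full-precision weights the hosted endpoints use):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load StarCoder with 8-bit quantization (requires bitsandbytes).
# Quantized weights can produce slightly different generations than fp16/bf16.
checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    load_in_8bit=True,
)
```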
@rookielyb I notice that you set top_k=50. Can you try again after removing that constraint? I do not think we have ever set top_k.
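For example, a minimal generation sketch without the top-k constraint (assuming the bigcode/starcoder checkpoint; the prompt and temperature here are only illustrative values):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype="auto")

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# top_k=0 disables top-k filtering, so sampling draws from the full
# temperature-scaled distribution (the library default is otherwise top_k=50).
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_k=0,
    max_new_tokens=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```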
Thank you for your reply! Here are the latest parameters. With the same parameters, starcoderbase performs better than starcoder: the left side of the figure is generated with the starcoderbase weights, and the right side with starcoder. I found that the generated results are very sensitive to the parameters. Could you share the parameters used for the HumanEval pass@1 results in the paper?
@rookielyb sure, you can already find them in the paper:
Like Chen et al. (2021), we use sampling temperature 0.2 for pass@1, and temperature 0.8 for k > 1. We generate n = 200 samples for all experiments with open-access models.
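In case it helps with reproduction, pass@k is computed with the unbiased estimator from Chen et al. (2021); here is a minimal sketch of that formula (mirroring the reference human-eval implementation, numpy assumed):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated per problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

With n = 200 samples per problem, pass@1 reduces to the fraction of samples that pass, e.g. pass_at_k(200, 68, 1) ≈ 0.34; the benchmark score averages this over all HumanEval problems.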
Hi, as answered in this issue, the playground doesn't do anything special: it calls the inference endpoint to generate code, which is equivalent to doing model.generate with the same parameters (check the Playground's public code).
The HumanEval score is 33%-40%, so it's normal that the model gets some solutions wrong. If you want to reproduce the HumanEval score, you can run the evaluation-harness on the full benchmark instead of comparing a few problems. As specified in the issue, both the paper settings provided by @SivilTaram and greedy decoding give a pass@1 of 33%-34%. (Btw, it helps to strip the prompt before generation, if you're not already doing it.)
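To illustrate the prompt-stripping point, here is a rough sketch of generating a completion for a HumanEval-style problem (the checkpoint name and prompt are just examples; greedy decoding shown):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype="auto")

# Example HumanEval-style prompt; the original ends with a newline/indentation.
prompt = 'def has_close_elements(numbers, threshold):\n    """Return True if any two numbers are closer than threshold."""\n'

# Strip trailing whitespace/newlines before generation: a dangling newline can
# steer the tokenizer toward odd continuations and hurt the score.
stripped = prompt.strip()
inputs = tokenizer(stripped, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, do_sample=False, max_new_tokens=256)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# The decoded text normally starts with the prompt; keep only the newly
# generated part before running the unit tests on it.
completion = text[len(stripped):]
print(completion)
```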