Underreported HumanEval Scores?

#83
by VaibhavSahai

Hello. I have noticed that after the June update, this model performs significantly better on HumanEval.

It previously scored 64.6% (as measured on the EvalPlus leaderboard), but when I ran the same test after the update, it scored 72%.

My params: temperature = 0 and max_tokens = 2048.
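
For reference, here is a minimal sketch of how a run with those settings could look, using the evalplus package's data helpers against an OpenAI-compatible completions endpoint. The endpoint URL and model id are placeholders I've filled in for illustration, not details from this thread:

```python
# Hypothetical reproduction sketch: greedy decoding (temperature = 0),
# max_tokens = 2048. The endpoint URL and model id are placeholders.
from openai import OpenAI
from evalplus.data import get_human_eval_plus, write_jsonl

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def complete(prompt: str) -> str:
    # Greedy completion of the HumanEval function stub.
    resp = client.completions.create(
        model="phi-3-mini",  # placeholder model id, not confirmed by the thread
        prompt=prompt,
        temperature=0,
        max_tokens=2048,
    )
    return resp.choices[0].text

# EvalPlus scores "solution" entries as full code: prompt + completion.
samples = [
    {"task_id": task_id, "solution": problem["prompt"] + complete(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```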

Could someone verify or recheck these scores? Thank you.

Microsoft org

Thank you for your interest and for the effort to independently run the HumanEval benchmark.

Np @nguyenbh. I also ran EvalPlus (which adds extra test cases to HumanEval) and observed a jump from a pre-June 59.1% to 65.2%. Really like the new model for lightweight coding tasks now 🙂
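
For anyone re-running these numbers, the scoring step is just pointing the EvalPlus evaluator at the generated samples. A sketch, assuming the samples.jsonl from the snippet above and the evalplus console script installed:

```python
# Invoke the EvalPlus evaluator on the generated samples; it reports
# pass@1 for both base HumanEval and the extended HumanEval+ tests.
import subprocess

subprocess.run(
    ["evalplus.evaluate", "--dataset", "humaneval", "--samples", "samples.jsonl"],
    check=True,
)
```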
