Underreported HumanEval Scores?

#83
by VaibhavSahai

Hello. I have noticed that after the June update, this model performs significantly better on HumanEval.

It previously scored 64.6% (as measured on the EvalPlus leaderboard), but when I ran the same test after the update, it scored 72%.

My params: temperature = 0 and max_tokens = 2048.
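
For reference, here is a minimal sketch of how a run with those settings could look, using the evalplus package's data helpers against an OpenAI-compatible completions endpoint. The endpoint URL and model id are placeholders I've filled in for illustration, not details from this thread:

```python
# Hypothetical reproduction sketch: greedy decoding (temperature = 0),
# max_tokens = 2048. The endpoint URL and model id are placeholders.
from openai import OpenAI
from evalplus.data import get_human_eval_plus, write_jsonl

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def complete(prompt: str) -> str:
    # Greedy completion of the HumanEval function stub.
    resp = client.completions.create(
        model="phi-3-mini",  # placeholder model id, not confirmed by the thread
        prompt=prompt,
        temperature=0,
        max_tokens=2048,
    )
    return resp.choices[0].text

# EvalPlus scores "solution" entries as full code: prompt + completion.
samples = [
    {"task_id": task_id, "solution": problem["prompt"] + complete(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```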

Could someone verify or recheck these scores? Thank you.

Microsoft org

Thank you for your interest and for the effort to independently run the HumanEval benchmark.

Np @nguyenbh. I also ran EvalPlus (which adds extra test cases to HumanEval) and observed a jump from a pre-June 59.1% to 65.2%. Really like the new model for lightweight coding tasks now 🙂
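
For anyone re-running these numbers, the scoring step is just pointing the EvalPlus evaluator at the generated samples. A sketch, assuming the samples.jsonl from the snippet above and the evalplus console script installed:

```python
# Invoke the EvalPlus evaluator on the generated samples; it reports
# pass@1 for both base HumanEval and the extended HumanEval+ tests.
import subprocess

subprocess.run(
    ["evalplus.evaluate", "--dataset", "humaneval", "--samples", "samples.jsonl"],
    check=True,
)
```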
