Unable to reproduce gsm8k results

#3
by zhentaocc - opened

With batch size = 1, the result I got was 13.12.
I was using: python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>," --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>
I also found that different batch size settings result in different accuracy.
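Concretely, filled in for GSM8K, the command looked roughly like this (a sketch assuming the leaderboard's 5-shot GSM8K setup; the model and output path are left as placeholders):

```bash
# Rough reconstruction of the command above, filled in for GSM8K.
# Placeholders are kept where the original values were not shown.
python main.py \
    --model hf-causal-experimental \
    --model_args="pretrained=<your_model>" \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size 1 \
    --output_path <output_path>
```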

Owner

Hi @zhentaocc ,

I don't think this is the right place to open this discussion. I noticed that you are part of the Intel organization. Which model did you evaluate? Also, observing different scores with different batch sizes (if that is what you are referring to) is normal.
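If you want to verify that yourself, a minimal sketch (using the same harness command, with a hypothetical output file name) is to evaluate the same task at two batch sizes and compare the reported scores:

```bash
# Hypothetical sanity check: run the same task at two batch sizes and compare
# scores. Small differences usually come from padding and batched generation,
# so some variation across batch sizes is expected.
for bs in 1 8; do
    python main.py \
        --model hf-causal-experimental \
        --model_args="pretrained=<your_model>" \
        --tasks gsm8k \
        --num_fewshot 5 \
        --batch_size "$bs" \
        --output_path "results_bs${bs}.json"
done
```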

As I said, this is not the right place to open this discussion; you should head to the Open LLM Leaderboard.

Related:

@clefourrier @SaylorTwift

Weyaxi changed discussion status to closed

This one: mncai/Llama2-7B-guanaco-dolphin-500 @Weyaxi

Owner

Hi @zhentaocc ,

The GSM8K results for that model were generated before a fix in that specific benchmark, so the real score is probably higher than reported. The leaderboard team re-ran GSM8K to address this issue, but it seems your model has not been re-evaluated yet.

As mentioned earlier, this is not the right place to initiate this discussion. Please go to the leaderboard space and open an issue there :)

Have a nice day :)
