Unable to reproduce gsm8k results

#3
by zhentaocc - opened

With batch size = 1, the result I got was 13.12.
I was using: python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>," --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>
I also found that different batch size settings result in different accuracy.
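Concretely, filled in for GSM8K, the command looked roughly like this (a sketch assuming the leaderboard's 5-shot GSM8K setup; the model and output path are left as placeholders):

```bash
# Rough reconstruction of the command above, filled in for GSM8K.
# Placeholders are kept where the original values were not shown.
python main.py \
    --model hf-causal-experimental \
    --model_args="pretrained=<your_model>" \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size 1 \
    --output_path <output_path>
```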

Owner

Hi @zhentaocc ,

I don't think this is the right place to open this discussion. I noticed that you are part of the Intel organization. Which model did you evaluate? Also, observing different scores with different batch sizes (if that is what you are referring to) is normal.
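If you want to verify that yourself, a minimal sketch (using the same harness command, with a hypothetical output file name) is to evaluate the same task at two batch sizes and compare the reported scores:

```bash
# Hypothetical sanity check: run the same task at two batch sizes and compare
# scores. Small differences usually come from padding and batched generation,
# so some variation across batch sizes is expected.
for bs in 1 8; do
    python main.py \
        --model hf-causal-experimental \
        --model_args="pretrained=<your_model>" \
        --tasks gsm8k \
        --num_fewshot 5 \
        --batch_size "$bs" \
        --output_path "results_bs${bs}.json"
done
```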

As I said, this is not the right place to open this discussion; you should head to the Open LLM Leaderboard.

Related:

@clefourrier @SaylorTwift

Weyaxi changed discussion status to closed

This one: mncai/Llama2-7B-guanaco-dolphin-500 @Weyaxi

Owner

Hi @zhentaocc ,

The GSM8K results for that model were generated before a fix in that specific benchmark, so the real score is probably higher than reported. The leaderboard team re-ran GSM8K to address this issue, but it seems your model has not been re-evaluated yet.

As mentioned earlier, this is not the right place to initiate this discussion. Please go to the leaderboard space and open an issue there :)

Have a nice day :)
