How do I view the results of my submission?
I am in the process of fine-tuning google/gemma-2-2b-jpn-it. The first step is to know the benchmark scores of google/gemma-2-2b-jpn-it itself.
Since Google didn't submit the model, I submitted it myself. According to the requests page, my submission is finished. However, after one day it still hasn't shown up on the leaderboard. Where can I see the results of my submission? The revision hash of my submission is 6b046bbc091084a1ec89fe03e58871fde10868eb.
I did read the FAQ and the docs, but I couldn't find anything about how to view the results of my submission. Thanks a lot in advance.
Taking hints from this discussion,
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/981
I found that the results of my submission can be viewed here:
https://huggingface.co/datasets/open-llm-leaderboard/results
This should probably be added to the FAQ in case people can't find their results.
I would also like to know whether my results will get published or not.
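For anyone else looking for theirs, here is a minimal sketch for fetching the raw result files programmatically. It assumes the `<org>/<model>/results_*.json` layout visible in the dataset above and uses `huggingface_hub`:

```python
# Rough sketch (not official leaderboard tooling): list and download the raw
# result files for one model from the open-llm-leaderboard/results dataset.
import json

from huggingface_hub import HfApi, hf_hub_download

REPO_ID = "open-llm-leaderboard/results"
MODEL_DIR = "google/gemma-2-2b-jpn-it"

api = HfApi()
result_files = sorted(
    f for f in api.list_repo_files(REPO_ID, repo_type="dataset")
    if f.startswith(MODEL_DIR + "/")
)
print(result_files)

# Filenames embed a timestamp, so after sorting the last entry is the newest run.
latest = hf_hub_download(REPO_ID, result_files[-1], repo_type="dataset")
with open(latest) as fh:
    data = json.load(fh)
print(list(data.keys()))
```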
Hi @ymcki,
According to our FAQ, please provide us with the request file for your model next time. Here is the request file for the google/gemma-2-2b-jpn-it submission you made:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/google/gemma-2-2b-jpn-it_eval_request_False_float16_Original.json
According to its status, it is FINISHED. It usually takes approximately one day for results to appear on the leaderboard, but it can take longer over weekends. The model is currently displayed, as you can see from my screenshot.
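The status can also be checked programmatically; a minimal sketch, assuming the request-file path from the link above:

```python
# Sketch: fetch the request file from the requests dataset and read its
# "status" field (e.g. PENDING, RUNNING, FINISHED, FAILED).
import json

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="open-llm-leaderboard/requests",
    filename="google/gemma-2-2b-jpn-it_eval_request_False_float16_Original.json",
    repo_type="dataset",
)
with open(path) as fh:
    request = json.load(fh)
print(request.get("status"))
```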
I should also note that google/gemma-2-2b-jpn-it is a conversational model that has a chat_template, and the correct precision to submit it with is bfloat16 according to its config.json. So I have submitted it with chat_template = true and in bfloat16; please find the request file here:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/google/gemma-2-2b-jpn-it_eval_request_False_bfloat16_Original.json
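If you want to double-check both settings yourself, something like the following works (a sketch: it reads the declared dtype from config.json and checks whether the tokenizer ships a chat template; the repo is gated, so a token with the Gemma license accepted may be needed):

```python
# Sketch: inspect the model's declared dtype and chat-template support.
from transformers import AutoConfig, AutoTokenizer

MODEL_ID = "google/gemma-2-2b-jpn-it"

config = AutoConfig.from_pretrained(MODEL_ID)
print("torch_dtype from config.json:", config.torch_dtype)  # expected: bfloat16

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
print("has chat_template:", tokenizer.chat_template is not None)
```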
Thanks for telling me I submitted with the wrong data type.
But isn't gemma-2-2b-jpn-it an instruct model rather than a chat model? Based on my understanding of the README,
https://huggingface.co/google/gemma-2b-it/blob/main/README.md
models ending with "-it" are instruct models and those without are chat models in Google's naming convention.
Thanks for your clarification. Interestingly, my submission has a higher raw score than yours:
https://huggingface.co/datasets/open-llm-leaderboard/results/raw/main/google/gemma-2-2b-jpn-it/results_2024-10-11T13-51-38.420715.json
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/google/gemma-2-2b-jpn-it/results_2024-10-15T15-21-39.173019.json
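For reference, this is roughly how the per-task numbers below can be pulled out of the two raw files (a sketch; the exact metric keys inside the "results" section are assumptions and may need adjusting):

```python
# Sketch: download both raw result files and dump their per-task scores.
import json

from huggingface_hub import hf_hub_download

FILES = [
    # float16, no chat template
    "google/gemma-2-2b-jpn-it/results_2024-10-11T13-51-38.420715.json",
    # bfloat16, chat template
    "google/gemma-2-2b-jpn-it/results_2024-10-15T15-21-39.173019.json",
]

for filename in FILES:
    path = hf_hub_download("open-llm-leaderboard/results", filename, repo_type="dataset")
    with open(path) as fh:
        data = json.load(fh)
    print(filename)
    for task, metrics in sorted(data.get("results", {}).items()):
        print(f"  {task}: {metrics}")
```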
| Dev | Average | IFEval | BBH | MathLv5 | GPQA | MUSR | MMLU-PRO | Model |
|---|---|---|---|---|---|---|---|---|
| google | 31.82 | 51.37 | 42.21 | 3.474 | 28.52 | 39.56 | 25.78 | gemma-2-2b-jpn-it (float16, not chat) |
| google | 30.82 | 54.11 | 41.43 | 0.0 | 27.52 | 37.17 | 24.67 | gemma-2-2b-jpn-it (bfloat16, chat) |
Is this normal? Does float16 really make a big difference?
If not, doesn't that imply that ticking the chat template option makes the model a bit dumber?
We have experienced similar behaviour with Gemma model evaluations before, and that's normal. The chat_template has a positive effect on the IFEval score, while the MATH and, in particular, MUSR scores might be lower. Nevertheless, we strongly advise people to run the evaluation of instruct models with the chat template applied, as noted in our documentation.
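For anyone wondering what "with the chat template applied" means in practice, here is an illustrative sketch (not the harness's exact code path): the prompt is wrapped in the model's chat format before scoring rather than being passed as raw text.

```python
# Sketch: wrap a prompt in the model's chat template, as the evaluation does
# when chat_template = true.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-jpn-it")

messages = [{"role": "user", "content": "Explain what IFEval measures in one sentence."}]
formatted = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(formatted)  # the raw prompt wrapped in the model's turn markers
```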