Can't reproduce HellaSwag result - getting 42.3% vs. the reported 71.4%
Hi! Hope all is well.
I'm trying to reproduce the HellaSwag result for gemma-2b using lm-evaluation-harness. Following the discussion at https://huggingface.co/google/gemma-2b/discussions/18, I:
- Pulled lm-evaluation-harness from commit b281b0921b636bc36ad05c0b0b0763bd6dd43463 and set it up in a fresh conda environment
- Ran:
$ python main.py --model hf-causal-experimental --model_args pretrained=google/gemma-2b,dtype=float32 --tasks hellaswag --device cuda:0 --batch_size 1
- Got the following results:
{
  "results": {
    "hellaswag": {
      "acc": 0.34116709818761204,
      "acc_stderr": 0.0047313244091332675,
      "acc_norm": 0.42342162915753834,
      "acc_norm_stderr": 0.004930911515084784
    }
  },
  "versions": {
    "hellaswag": 0
  },
  "config": {
    "model": "hf-causal-experimental",
    "model_args": "pretrained=google/gemma-2b,dtype=float32",
    "num_fewshot": 0,
    "batch_size": "1",
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
hf-causal-experimental (pretrained=google/gemma-2b,dtype=float32), limit: None, provide_description: False, num_fewshot: 0, batch_size: 1
| Task |Version| Metric |Value | |Stderr|
|---------|------:|--------|-----:|---|-----:|
|hellaswag| 0|acc |0.3412|± |0.0047|
| | |acc_norm|0.4234|± |0.0049|
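For reference, I believe the same run can also be expressed through the harness's Python API instead of main.py; here is a minimal sketch, assuming evaluator.simple_evaluate at that commit accepts these arguments (they mirror the CLI flags above):

```python
# Minimal sketch: the same 0-shot HellaSwag run via the harness Python API.
# Assumes evaluator.simple_evaluate at commit b281b09 accepts these arguments
# (they mirror the main.py flags used above).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal-experimental",
    model_args="pretrained=google/gemma-2b,dtype=float32",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=1,
    device="cuda:0",
)
print(results["results"]["hellaswag"])  # acc / acc_norm and their stderrs
```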
Am I doing something obviously wrong? As shown in the output above, I'm getting a normalized accuracy (acc_norm) of 42.3%. However, the paper reports an accuracy of 71.4% on HellaSwag (similar to the 71.77% listed on the Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard).
Thanks in advance!
Taking a deeper look at the Open LLM Leaderboard results (https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard, https://huggingface.co/datasets/open-llm-leaderboard-old/details_google__gemma-2b), if I'm reading them correctly, the 71.77% HellaSwag accuracy listed there was obtained using 10 few-shot examples per:
...
"harness|hellaswag|10": {
"hashes": {
"hash_examples": "e1768ecb99d7ecf0",
"hash_full_prompts": "0b4c16983130f84f",
"hash_input_tokens": "11490eb47260730b",
"hash_cont_tokens": "6a8516a792e1673e"
},
"truncated": 0,
"non_truncated": 10042,
"padded": 40055,
"non_padded": 113,
"effective_few_shots": 10.0,
"num_truncated_few_shots": 0
},
...
Is this reading correct?
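If it is, I'd guess a leaderboard-style number can be approximated by adding --num_fewshot 10 to the command above (this is only my guess; I haven't confirmed which harness fork/commit, dtype, or other settings the leaderboard pins):
$ python main.py --model hf-causal-experimental --model_args pretrained=google/gemma-2b,dtype=float32 --tasks hellaswag --num_fewshot 10 --device cuda:0 --batch_size 1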
Also, as noted above, the paper (https://arxiv.org/pdf/2403.08295) and its Hugging Face model card (https://huggingface.co/google/gemma-2b) list a similar accuracy (71.4%) for HellaSwag, but state that it was obtained 0-shot. Is there any way to replicate that 0-shot result through lm-eval-harness or lighteval?
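On a recent harness release (v0.4+), I'd guess the closest 0-shot equivalent would look like the sketch below (assuming lm_eval.simple_evaluate and the "hf" backend still accept these arguments), though whether this matches the setup behind the paper's numbers is exactly what I'm unsure about:

```python
# Minimal sketch: 0-shot HellaSwag with a recent lm-evaluation-harness (v0.4+).
# Assumes lm_eval.simple_evaluate and the "hf" model backend accept these
# arguments; whether this matches the paper's evaluation setup is the open question.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-2b,dtype=float32",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=1,
    device="cuda:0",
)
print(results["results"]["hellaswag"])
```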
Thanks in advance