Trouble Reproducing gemma-3-270m-it IFEval Score
I'm trying to verify my setup by reproducing the IFEval benchmark score for gemma-3-270m-it. The official score is 51.2%, but my accuracy only lands between 20% and 27% across multiple runs.
I am using the following settings:
temperature=1.0
top_p=0.95
top_k=64
min_p=0.0
Am I missing something? I suspect there's a misconfiguration somewhere in my setup.
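For context, here is roughly how those settings map onto a plain Hugging Face transformers generation call (a minimal sketch, not my exact harness; the prompt is just a placeholder for a real IFEval instruction):

```python
# Minimal sketch of the sampling settings above using transformers (not the
# exact evaluation harness). The prompt stands in for a real IFEval instruction.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-3-270m-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Write exactly three sentences about benchmarks."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

output_ids = model.generate(
    input_ids,
    do_sample=True,      # sampling is on, so outputs differ from run to run
    temperature=1.0,
    top_p=0.95,
    top_k=64,
    min_p=0.0,
    max_new_tokens=512,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

With do_sample=True and temperature=1.0 the completions change noticeably between runs, which would fit the 20-27% spread I'm seeing.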
+1. I cannot reproduce the IFEval score either; my results are around ~26%.
I'm working with temperature=0.2 and it's better
With
temperature=0.2
top_p=0.95
top_k=64
min_p=0.0
It got 27.9% on IFEval. That's a slight improvement, but still well short of the reported 51.2%.
I honestly don't know, maybe try temperature 0.0 or 0.1 😅. Good luck.
Try this?
temperature = 0.1 // less random token picking
top_p = 0.95
top_k = 64
min_p = 0.25 // higher minimum-probability cutoff, so unlikely tokens get filtered out
I'm really curious about the IFEval score.
By the way, I’m using llama.cpp. I forgot to mention that last time.
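For anyone else on llama.cpp who wants to try the settings suggested above, here is a minimal sketch using the llama-cpp-python bindings (the GGUF path and prompt are placeholders, and I haven't checked that this matches the official evaluation setup):

```python
# Hypothetical sketch: the suggested sampling settings via the llama-cpp-python
# bindings. The GGUF path and the prompt are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="./gemma-3-270m-it-Q8_0.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Answer in exactly two sentences."}],
    temperature=0.1,   # less random token picking
    top_p=0.95,
    top_k=64,
    min_p=0.25,        # drops tokens below 25% of the top token's probability
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

The same knobs exist as llama-cli flags (--temp, --top-p, --top-k, --min-p) if you are running the CLI directly.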
I am also having trouble replicating the reported results. I am using the standard lm_eval harness. I get the following results; the biggest gap is on IFEval (on the inst_level_loose_acc metric).
Gemma 3 270M IT - Actual Results vs Google's Reported Baseline
Benchmark | n-shot | Actual Results | Google Reported | Delta | Match Status |
---|---|---|---|---|---|
HellaSwag | 0-shot | 33.5% | 37.7% | -4.2% | ❌ Lower |
PIQA | 0-shot | 65.6% | 66.2% | -0.6% | ✅ Close |
ARC-c | 0-shot | 24.5% | 28.2% | -3.7% | ❌ Lower |
WinoGrande | 0-shot | 53.2% | 52.3% | +0.9% | ✅ Close |
BIG-Bench Hard | 3-shot | 26.8% | 26.7% | +0.1% | ✅ Match |
IFEval (inst_level) | 0-shot | 37.7% | 51.2% | -13.5% | ⚠️ Gap |
I am also facing this trouble! I guess I didn't use the same command as the official evaluation did.
This is the command I used:
lm_eval --model hf --model_args pretrained=google/gemma-3-270m-it --tasks leaderboard_ifeval --device cuda:0 --use_cache ./eval_cache/google_gemma-3-270m-it --apply_chat_template --fewshot_as_multiturn --batch_size auto --log_samples --output_path ./eval_out/ --trust_remote_code
In fact, I have trouble reproducing not only this model but also others like gpt-oss-20B. I've also checked the GitHub discussions about IFEval, and some people said batch_size may somehow affect the result...
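One thing I still want to try is forcing greedy decoding, to see whether sampling noise explains part of the gap. Roughly what I have in mind, via the harness's Python API (a sketch; argument names assume a recent lm_eval release, and I haven't verified it matches the official configuration):

```python
# Hypothetical sketch: rerunning leaderboard_ifeval through lm_eval's Python API
# with greedy decoding forced, to compare against the sampled runs above.
# Argument names assume a recent lm-evaluation-harness release.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-3-270m-it",
    tasks=["leaderboard_ifeval"],
    device="cuda:0",
    batch_size="auto",
    apply_chat_template=True,
    gen_kwargs="do_sample=False",  # force greedy decoding for this comparison
)
print(results["results"]["leaderboard_ifeval"])
```

If the greedy run lands much closer to 51.2%, that would point at decoding settings rather than batch_size.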
batch_size alone cannot explain such a large gap; usually we should be able to reproduce within 3-5% of the reported results. The surprising thing is that it only seems to happen on this benchmark.
Thanks for the heads-up. We have forwarded your feedback about the gemma-3-270m-it IFEval score discrepancy to the engineering team for a full investigation. We appreciate you bringing this to our attention.