Trouble Reproducing gemma-3-270m-it IFEval Score

#4
by fongya - opened

I'm trying to verify my setup by reproducing the IFEval benchmark score for gemma-3-270m-it. The official score is 51.2%, but my accuracy only lands between 20% and 27% across multiple runs.

I am using the following settings:

  • temperature=1.0
  • top_p=0.95
  • top_k=64
  • min_p=0.0

Am I missing something? I suspect there's a misconfiguration somewhere in my setup.
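
To be concrete about what I mean, here is a rough sketch of how those settings would be passed to the IFEval task with the lm-evaluation-harness CLI via --gen_kwargs (just an illustration, not necessarily the exact command anyone should copy; flag names differ with other runners, and min_p needs a reasonably recent transformers):

# sketch: the sampling settings above passed through lm-evaluation-harness
lm_eval --model hf \
  --model_args pretrained=google/gemma-3-270m-it \
  --tasks leaderboard_ifeval \
  --apply_chat_template \
  --gen_kwargs do_sample=True,temperature=1.0,top_p=0.95,top_k=64,min_p=0.0 \
  --batch_size auto \
  --device cuda:0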

+1. I cannot reproduce the IFEval score either; my results land at around 26%.

I'm working with temperature=0.2 and it's better

With

  • temperature=0.2
  • top_p=0.95
  • top_k=64
  • min_p=0.0

It got 27.9% on IFEval. That's a slight improvement, but still a long way from the reported 51.2%.

I honestly don’t know—maybe try 0.0 or 0.1 😅. Good luck.

Try this?

temperature = 0.1 // less random token picking
top_p = 0.95  
top_k = 64  
min_p = 0.25 // raise the probability floor so unlikely tokens get filtered out

I'm really curious about the IFEval score.

By the way, I’m using llama.cpp. I forgot to mention that last time.
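
For anyone else trying this in llama.cpp, the suggested samplers map to llama-cli flags roughly like below (a sketch only; the gguf filename and prompt are placeholders):

# sketch: the suggested sampler settings as llama.cpp flags (model file is a placeholder)
llama-cli -m gemma-3-270m-it-Q8_0.gguf \
  --temp 0.1 --top-p 0.95 --top-k 64 --min-p 0.25 \
  -p "Write one sentence about benchmarks, in all lowercase."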

I am also having trouble replicating the reported results. I am using the standard lm_eval harness. I get the results below; the biggest gap is in IFEval (inst_level_loose_acc metric).

Gemma 3 270M IT - Actual Results vs Google's Reported Baseline

Benchmark            n-shot   Actual Results   Google Reported   Delta     Match Status
HellaSwag            0-shot   33.5%            37.7%             -4.2%     ❌ Lower
PIQA                 0-shot   65.6%            66.2%             -0.6%     ✅ Close
ARC-c                0-shot   24.5%            28.2%             -3.7%     ❌ Lower
WinoGrande           0-shot   53.2%            52.3%             +0.9%     ✅ Close
BIG-Bench Hard       3-shot   26.8%            26.7%             +0.1%     ✅ Match
IFEval (inst_level)  0-shot   37.7%            51.2%             -13.5%    ⚠️ Gap
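
One thing I still want to rule out is a metric mix-up: the harness reports four IFEval accuracies (prompt_level_strict_acc, prompt_level_loose_acc, inst_level_strict_acc, inst_level_loose_acc), and I don't actually know which one the 51.2% corresponds to. A quick way to print all of them from the saved results (adjust the path to wherever your results_*.json landed):

# sketch: dump every IFEval accuracy metric from the lm_eval output JSON
grep -R --include='results_*.json' '_acc' ./eval_out/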

I'm facing this trouble too! I suspect I didn't use the same command as the official evaluation. This is the command I used:

lm_eval --model hf --model_args pretrained=google/gemma-3-270m-it --tasks leaderboard_ifeval --device cuda:0 --use_cache ./eval_cache/google_gemma-3-270m-it --apply_chat_template --fewshot_as_multiturn --batch_size auto --log_samples --output_path ./eval_out/ --trust_remote_code

In fact, I have trouble reproducing not only this model but also others like gpt-oss-20B. I've also looked through GitHub discussions about IFEval, and some people said batch_size may somehow affect the result...

batch_size alone cannot explain such a large gap; normally results can be reproduced within 3-5% of the reported numbers. The surprising thing is that this only seems to happen on this one benchmark.
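
If you want to rule both of those variables out anyway, you could pin everything down, something like this (assuming a recent lm-evaluation-harness; greedy decoding via --gen_kwargs removes sampling noise and --batch_size 1 removes any batching effect):

# sketch: fixed batch size plus greedy decoding to take sampling and batching out of the picture
lm_eval --model hf \
  --model_args pretrained=google/gemma-3-270m-it \
  --tasks leaderboard_ifeval \
  --apply_chat_template \
  --batch_size 1 \
  --gen_kwargs do_sample=False \
  --device cuda:0 \
  --output_path ./eval_out/ --log_samples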

Google org

Thanks for the heads-up. We have forwarded your feedback about the gemma-3-270m-it IFEval score discrepancy to the engineering team for a full investigation. We appreciate you bringing this to our attention.
