Trouble Reproducing gemma-3-270m-it IFEval Score

#4
by fongya - opened

I'm trying to verify my setup by reproducing the IFEval benchmark score for gemma-3-270m-it. The official score is 51.2%, but my accuracy only lands between 20% and 27% across multiple runs.

I am using the following settings:

  • temperature=1.0
  • top_p=0.95
  • top_k=64
  • min_p=0.0

Am I missing something? I suspect there's a misconfiguration somewhere in my setup.
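
To be concrete about what I mean, here is a rough sketch of how those settings would be passed to the IFEval task with the lm-evaluation-harness CLI via --gen_kwargs (just an illustration, not necessarily the exact command anyone should copy; flag names differ with other runners, and min_p needs a reasonably recent transformers):

# sketch: the sampling settings above passed through lm-evaluation-harness
lm_eval --model hf \
  --model_args pretrained=google/gemma-3-270m-it \
  --tasks leaderboard_ifeval \
  --apply_chat_template \
  --gen_kwargs do_sample=True,temperature=1.0,top_p=0.95,top_k=64,min_p=0.0 \
  --batch_size auto \
  --device cuda:0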

+1. I cannot reproduce the IFEval score either; my results land at around 26%.

I'm working with temperature=0.2 and it's better

With

  • temperature=0.2
  • top_p=0.95
  • top_k=64
  • min_p=0.0

It got 27.9% on IFEval. That's a slight improvement, but still a long way from the reported 51.2%.

I honestly don’t know—maybe try 0.0 or 0.1 😅. Good luck.

Try this?

temperature = 0.1 // less random token picking
top_p = 0.95  
top_k = 64  
min_p = 0.25 // raise the probability floor so unlikely tokens get filtered out

I'm really curious about the IFEval score.

By the way, I’m using llama.cpp. I forgot to mention that last time.
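
For anyone else trying this in llama.cpp, the suggested samplers map to llama-cli flags roughly like below (a sketch only; the gguf filename and prompt are placeholders):

# sketch: the suggested sampler settings as llama.cpp flags (model file is a placeholder)
llama-cli -m gemma-3-270m-it-Q8_0.gguf \
  --temp 0.1 --top-p 0.95 --top-k 64 --min-p 0.25 \
  -p "Write one sentence about benchmarks, in all lowercase."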

I am also having trouble replicating the reported results. I am using the standard lm_eval harness. I get the results below; the biggest gap is in IFEval (inst_level_loose_acc metric).

Gemma 3 270M IT - Actual Results vs Google's Reported Baseline

Benchmark            n-shot   Actual Results   Google Reported   Delta     Match Status
HellaSwag            0-shot   33.5%            37.7%             -4.2%     ❌ Lower
PIQA                 0-shot   65.6%            66.2%             -0.6%     ✅ Close
ARC-c                0-shot   24.5%            28.2%             -3.7%     ❌ Lower
WinoGrande           0-shot   53.2%            52.3%             +0.9%     ✅ Close
BIG-Bench Hard       3-shot   26.8%            26.7%             +0.1%     ✅ Match
IFEval (inst_level)  0-shot   37.7%            51.2%             -13.5%    ⚠️ Gap
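
One thing I still want to rule out is a metric mix-up: the harness reports four IFEval accuracies (prompt_level_strict_acc, prompt_level_loose_acc, inst_level_strict_acc, inst_level_loose_acc), and I don't actually know which one the 51.2% corresponds to. A quick way to print all of them from the saved results (adjust the path to wherever your results_*.json landed):

# sketch: dump every IFEval accuracy metric from the lm_eval output JSON
grep -R --include='results_*.json' '_acc' ./eval_out/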

I'm facing this trouble too! I suspect I didn't use the same command as the official evaluation. This is the command I used:

lm_eval --model hf --model_args pretrained=google/gemma-3-270m-it --tasks leaderboard_ifeval --device cuda:0 --use_cache ./eval_cache/google_gemma-3-270m-it --apply_chat_template --fewshot_as_multiturn --batch_size auto --log_samples --output_path ./eval_out/ --trust_remote_code

In fact, I have trouble reproducing not only this model but also others like gpt-oss-20B. I've also looked through GitHub discussions about IFEval, and some people said batch_size may somehow affect the result...

batch_size alone cannot explain such a large gap; normally results can be reproduced within 3-5% of the reported numbers. The surprising thing is that this only seems to happen on this one benchmark.
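
If you want to rule both of those variables out anyway, you could pin everything down, something like this (assuming a recent lm-evaluation-harness; greedy decoding via --gen_kwargs removes sampling noise and --batch_size 1 removes any batching effect):

# sketch: fixed batch size plus greedy decoding to take sampling and batching out of the picture
lm_eval --model hf \
  --model_args pretrained=google/gemma-3-270m-it \
  --tasks leaderboard_ifeval \
  --apply_chat_template \
  --batch_size 1 \
  --gen_kwargs do_sample=False \
  --device cuda:0 \
  --output_path ./eval_out/ --log_samples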

Google org

Thanks for the heads-up. We have forwarded your feedback about the gemma-3-270m-it IFEval score discrepancy to the engineering team for a full investigation. We appreciate you bringing this to our attention.
