The benchmarks are evaluated zero-shot, whereas few-shot evaluation is the usual practice. This raises the question of whether the few-shot results compared less favorably with those of similar-sized models.
Hopefully the authors will release few-shot results, in line with common practice (e.g. the HF Open LLM Leaderboard).