As part of the Data is Better Together MPEP project, we are now at the point where some translation efforts have successfully translated 500 highly ranked prompts into a new target language (amazing work from @Rijgersberg et al!).
Our next step is to use these translated prompts to evaluate the performance of LLMs for non-English languages.
Does LLM-as-a-judge work outside of English?
It would be compelling to use LLMs to judge models for non-English languages, since this significantly lowers the barrier to evaluating models (although it doesn't remove that barrier altogether).
What we want to know is:
- does auto/LLM evaluation work in general for a particular language?
- which model(s) work best as a judge?
- do LLMs' judgments of non-English model outputs match human preferences? (A rough sketch of how we might measure that agreement is below.)
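For the last question, one simple way to quantify it is to collect pairwise preferences from both humans and an LLM judge on the same prompt/response pairs and compute raw agreement plus a chance-corrected score. Below is a minimal Python sketch; the label format ("A"/"B"/"tie") and the example data are assumptions for illustration, not a fixed protocol.

```python
# Minimal sketch (assumed data format): how closely an LLM judge's pairwise
# preferences agree with human annotators' preferences on the same items.
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels for the same prompt/response pairs:
# "A", "B", or "tie" indicating which of two model responses was preferred.
human_prefs = ["A", "B", "A", "tie", "B", "A"]
llm_judge_prefs = ["A", "B", "B", "tie", "B", "A"]

# Raw agreement rate
agreement = sum(h == l for h, l in zip(human_prefs, llm_judge_prefs)) / len(human_prefs)

# Chance-corrected agreement (Cohen's kappa)
kappa = cohen_kappa_score(human_prefs, llm_judge_prefs)

print(f"Raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```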
We're starting to think about how to approach this. If you have any ideas for possible approaches, feel free to comment or join the discussion here: https://github.com/huggingface/data-is-better-together/issues/61
Other ideas...
Could an approach like Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models (2404.18796) with the SOTA models for a particular language work? i.e., choose 4 of the best open LLMs for Arabic and use those as the pool of raters rather than relying on one powerful judge LLM? (A rough sketch of this jury setup is below.)
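A minimal sketch of what that jury setup could look like in Python: several open judges each vote on a comparison and the verdicts are aggregated by majority. The model names and the judge_preference() call are placeholders/assumptions, not a specific model list or API.

```python
# Sketch of a "jury" in the spirit of the Panel-of-LLM-evaluators idea:
# several smaller open judges vote on each comparison instead of one big judge.
from collections import Counter

# Hypothetical pool of strong open models for the target language (e.g. Arabic)
JURY = ["open-arabic-llm-1", "open-arabic-llm-2", "open-arabic-llm-3", "open-arabic-llm-4"]


def judge_preference(judge_model: str, prompt: str, response_a: str, response_b: str) -> str:
    """Placeholder: ask `judge_model` which response better answers `prompt`.

    A real implementation would call an inference endpoint with a judging
    prompt template and parse the verdict. Returns "A", "B", or "tie".
    """
    raise NotImplementedError


def jury_verdict(prompt: str, response_a: str, response_b: str) -> str:
    """Aggregate the individual judges' votes by simple majority; split votes count as a tie."""
    votes = Counter(
        judge_preference(judge, prompt, response_a, response_b) for judge in JURY
    )
    top = votes.most_common(2)
    if len(top) > 1 and top[0][1] == top[1][1]:
        return "tie"
    return top[0][0]
```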