RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Abstract
With significant efforts in recent studies, LLM-as-a-Judge has become a cost-effective alternative to human evaluation for assessing text generation quality in a wide range of tasks. However, a reliability gap still remains between LLM-as-a-Judge and human evaluation. One important reason is the lack of guided oracles in the evaluation process. Motivated by the role of references, pervasively used in classic text evaluation, we introduce RevisEval, a novel text generation evaluation paradigm via response-adapted references. RevisEval is driven by the key observation that an ideal reference should maintain the necessary relevance to the response to be evaluated. Specifically, RevisEval leverages the text revision capabilities of large language models (LLMs) to adaptively revise the response, then treats the revised text as the reference (response-adapted reference) for the subsequent evaluation. Extensive experiments demonstrate that RevisEval outperforms traditional reference-free and reference-based evaluation paradigms that use LLM-as-a-Judge across NLG tasks and open-ended instruction-following tasks. More importantly, our response-adapted references can further boost the classical text metrics, e.g., BLEU and BERTScore, compared to traditional references, and can even rival LLM-as-a-Judge. A detailed analysis also confirms RevisEval's effectiveness in reducing bias and examines the impact of inference cost and reference relevance.
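A minimal sketch of the two-step paradigm described in the abstract, assuming a generic `call_llm` helper as a hypothetical stand-in for whatever chat/completions API or local model is used (the prompts and function names below are illustrative, not the paper's implementation):

```python
# Illustrative sketch of the RevisEval pipeline; `call_llm` is a hypothetical
# placeholder for an LLM call, not the authors' released code.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (API or local model)."""
    raise NotImplementedError

def revise_response(instruction: str, response: str) -> str:
    """Step 1: adaptively revise the response into a response-adapted reference."""
    prompt = (
        "Revise the following response so that it answers the instruction as well "
        "as possible, changing only what is necessary.\n"
        f"Instruction: {instruction}\nResponse: {response}\nRevised response:"
    )
    return call_llm(prompt)

def judge_with_reference(instruction: str, response: str, reference: str) -> str:
    """Step 2: evaluate the original response, guided by the adapted reference."""
    prompt = (
        "You are an impartial judge. Rate the response from 1 to 10, "
        "using the reference answer as guidance.\n"
        f"Instruction: {instruction}\nReference: {reference}\n"
        f"Response: {response}\nScore:"
    )
    return call_llm(prompt)

def revis_eval(instruction: str, response: str) -> str:
    adapted_reference = revise_response(instruction, response)
    return judge_with_reference(instruction, response, adapted_reference)
```

Because the reference is derived from the response itself, it stays maximally relevant to the text being judged, which is the key idea behind the paradigm.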
Community
"RevisEval: Improving LLM-as-a-Judge via Response-Adapted References". Evaluation has long been a cornerstone of progress in text generation. Given the limitations of traditional metrics, LLM-as-a-Judge has become a viable method for assessing generative abilities in open-ended tasks, though it still faces a significant reliability gap compared to human evaluation. By harnessing the revision capabilities of LLMs, we unlock the potential of references in traditional evaluation, generating response-adapted references that significantly enhance general evaluation methods across various tasks. This approach not only boosts the accuracy of LLM-as-a-Judge but also revives traditional metrics like BLEU, enabling them to effectively evaluate open-ended tasks on benchmarks such as MT-Bench and AlpacaFarm, with results comparable to those of LLM-as-a-Judge. It also works well when weak LLMs are used as evaluators and helps mitigate positional bias.
Most interestingly, we found that recent efforts to train a strong judge by supervised fine-tuning (SFT) of a weak LLM, such as Llama-2 7B, have encountered significant challenges, particularly bias issues. Surprisingly, our results suggest that instead of fine-tuning a weak LLM into a judge, it may be more effective to use the same resources to train it as a reviser: generating response-adapted references and combining them with traditional metrics achieves better results, as sketched below. This offers a feasible alternative for using LLMs as judges.
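A minimal sketch of this reviser-plus-metric combination, assuming the `sacrebleu` package is installed and using a hypothetical `revise` function as a stand-in for a fine-tuned reviser model (this is not the paper's released code):

```python
# Illustrative sketch: score a response against its response-adapted reference
# with a classic metric. The reviser call is a hypothetical placeholder.
import sacrebleu

def revise(instruction: str, response: str) -> str:
    """Placeholder for a fine-tuned reviser producing a response-adapted reference."""
    raise NotImplementedError

def score_with_adapted_reference(instruction: str, response: str) -> float:
    adapted_reference = revise(instruction, response)
    # Sentence-level BLEU of the original response against the adapted reference.
    return sacrebleu.sentence_bleu(response, [adapted_reference]).score
```

In a pairwise setting, the candidate whose score against its own adapted reference is higher would be preferred; BERTScore or other reference-based metrics can be substituted for BLEU in the same way.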
The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text (2024)
- Direct Judgement Preference Optimization (2024)
- Self-Judge: Selective Instruction Following with Alignment Self-Evaluation (2024)
- HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models (2024)
- Better Instruction-Following Through Minimum Bayes Risk (2024)