A collection of papers and resources on automatic evaluation with LLMs, focusing on their role as judges assessing other LLMs' outputs.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
  Paper • 2306.05685 • Published • 29
- Generative Judge for Evaluating Alignment
  Paper • 2310.05470 • Published • 1
- Humans or LLMs as the Judge? A Study on Judgement Biases
  Paper • 2402.10669 • Published
- JudgeLM: Fine-tuned Large Language Models are Scalable Judges
  Paper • 2310.17631 • Published • 32