This Space applies item response theory to the Alpaca LLM evaluation framework.
Overview
Alpaca maintains a set of prompts, along with responses to those prompts from a collection of LLMs. It evaluates models by how often a judge prefers their responses to those of a designated "baseline" model. For every other model m and every prompt p, it presents the baseline's response to p and m's response to p to a judge, who determines which response better addresses the prompt and assigns a "win" to the preferred model. Alpaca then ranks models by win percentage, as sketched below.
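The following is a minimal sketch of that ranking procedure. The record layout and model names are hypothetical; the actual framework stores prompts, responses, and judge verdicts in its own format.

```python
# Hypothetical pairwise judgments: (model, prompt_id, judge_preferred_model).
# True means the judge preferred the model's response over the baseline's.
from collections import defaultdict

judgments = [
    ("model_a", 0, True), ("model_a", 1, False),
    ("model_b", 0, True), ("model_b", 1, True),
]

wins = defaultdict(int)
totals = defaultdict(int)
for model, _prompt, won in judgments:
    wins[model] += won
    totals[model] += 1

# Rank models by their win percentage against the baseline.
win_rates = {m: wins[m] / totals[m] for m in totals}
for model, rate in sorted(win_rates.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rate:.0%} wins vs. baseline")
```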
This Space presents an alternative view of model comparison based on item response theory (IRT), which is often used to jointly estimate student ability and exam difficulty. With respect to Alpaca, models are treated as students and the collection of prompts as their exam. A model's answer to a prompt is "correct" if the judge prefers it to the baseline's. The Alpaca data were fit to a two-parameter IRT model, and models and prompts are ranked by the posterior medians of their respective parameters. Uncertainty in those posteriors is reported as 95% HDIs (highest density intervals).
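Below is a minimal sketch of fitting such a two-parameter logistic (2PL) IRT model with PyMC and summarizing it with ArviZ. The priors, variable names, and synthetic data are illustrative assumptions, not the Space's actual model specification.

```python
import numpy as np
import pymc as pm
import arviz as az

# Hypothetical win/loss matrix: y[i, j] = 1 if model i's response to
# prompt j was preferred over the baseline's, else 0.
rng = np.random.default_rng(0)
n_models, n_prompts = 10, 50
y = rng.integers(0, 2, size=(n_models, n_prompts))

# Flatten to one observation per (model, prompt) pair.
model_idx, prompt_idx = np.indices((n_models, n_prompts))
model_idx, prompt_idx, y_flat = model_idx.ravel(), prompt_idx.ravel(), y.ravel()

with pm.Model() as irt_2pl:
    ability = pm.Normal("ability", 0.0, 1.0, shape=n_models)         # theta_i per model
    difficulty = pm.Normal("difficulty", 0.0, 1.0, shape=n_prompts)  # b_j per prompt
    discrimination = pm.LogNormal("discrimination", 0.0, 0.5,        # a_j > 0 per prompt
                                  shape=n_prompts)

    # 2PL likelihood: P(win) = logistic(a_j * (theta_i - b_j))
    logit_p = discrimination[prompt_idx] * (ability[model_idx] - difficulty[prompt_idx])
    pm.Bernoulli("wins", logit_p=logit_p, observed=y_flat)

    trace = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Rank models by posterior median ability; report 95% HDIs for uncertainty.
ability_medians = trace.posterior["ability"].median(dim=("chain", "draw"))
ability_hdi = az.hdi(trace, var_names=["ability"], hdi_prob=0.95)
print(ability_medians.values)
print(ability_hdi)
```

The posterior medians of `difficulty` and `discrimination` would rank prompts analogously.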
This Space is a work in progress. Comments and suggestions are welcome; please share them via the Community tab.