---
title: rbeval
emoji: 💩
colorFrom: yellow
colorTo: blue
sdk: streamlit
sdk_version: 1.37.0
app_file: app.py
pinned: false
---
# RoBustEval Dashboard
This dashboard is best viewed at the Hugging Face space.
## Introduction
LLM MCQA (multiple choice question-answering) benchmarks are measured in the following way:

- Some number of few-shot examples are pulled from the validation set of the MCQA benchmark and formatted as

  ```
  Question: What is the capital of France?
  (A) Paris
  (B) London
  (C) Berlin
  (D) Madrid
  Answer: A
  ```

- The target question is then appended, without the answer, and fed into the model as

  ```
  Question: What is the capital of France?
  (A) Paris
  (B) London
  (C) Berlin
  (D) Madrid
  Answer:
  ```

- The model then outputs its predictions for the token that should come directly after `Answer:`.
- The probabilities $p_i, i \in \{A,B,C,D\}$ for the tokens resulting from tokenizing the strings "A", "B", "C", "D" are then recorded.
- A question with correct answer $k \in \{A,B,C,D\}$ is marked as correct if $p_k > p_i$ for all $i \in \{A,B,C,D\} \setminus \{k\}$.
- The accuracy is reported as the percentage of questions answered correctly.
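As a minimal sketch of the scoring rule above (the function and variable names here are hypothetical, not the code of any particular evaluation harness), assuming we already have the next-token probabilities for the four letter strings:

```python
import numpy as np

# Minimal sketch of standard MCQA scoring; `letter_probs` is assumed to hold
# the model's next-token probabilities for the strings "A", "B", "C", "D".
LETTERS = ["A", "B", "C", "D"]

def is_marked_correct(letter_probs: dict, correct: str) -> bool:
    """The question counts as correct if the correct letter beats every incorrect one."""
    return all(
        letter_probs[correct] > letter_probs[l] for l in LETTERS if l != correct
    )

def accuracy(per_question_probs: list, answers: list) -> float:
    """Accuracy is the fraction of questions marked correct."""
    hits = [is_marked_correct(p, a) for p, a in zip(per_question_probs, answers)]
    return float(np.mean(hits))
```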
This evaluation method is reasonable, but it discards a significant amount of information about model inference. For example, consider a question whose correct answer is $C$. The following model outputs are all scored the same:
| Probability | $p_A$ | $p_B$ | $\mathbf{p_C}$ | $p_D$ |
|---|---|---|---|---|
| Model 1 | 0.00 | 0.00 | 1.00 | 0.00 |
| Model 2 | 0.20 | 0.10 | 0.60 | 0.01 |
| Model 3 | 0.25 | 0.25 | 0.26 | 0.24 |
| Model 4 | 0.00 | 0.00 | 0.01 | 0.00 |
In this case, Model 1 is the clear best, with full confidence in the correct answer. Model 2 is also good, but not as good as Model 1. Model 3 is almost entirely guessing, and Model 4 doesn't even appear to understand the format of the question, but since the evaluation method only considers the probabilities of the A/B/C/D tokens, it is still marked as correct.
All of these scenarios receive exactly the same score, when arguably they shouldn't.
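To make this concrete, here is a small self-contained check that applies the scoring rule to the four rows of the table above (the correct answer is C):

```python
# Probabilities for (A, B, C, D) taken from the table above; the correct answer is C.
models = {
    "Model 1": [0.00, 0.00, 1.00, 0.00],
    "Model 2": [0.20, 0.10, 0.60, 0.01],
    "Model 3": [0.25, 0.25, 0.26, 0.24],
    "Model 4": [0.00, 0.00, 0.01, 0.00],
}

for name, probs in models.items():
    predicted = "ABCD"[probs.index(max(probs))]
    verdict = "correct" if predicted == "C" else "incorrect"
    print(f"{name}: predicts {predicted} -> marked {verdict}")
# Every model is marked correct, despite very different confidence profiles.
```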
This dashboard is an attempt to address this issue by providing a more detailed analysis of model predictions on MCQA benchmarks.
We will be interested in the following quantities (symbols are chosen to make them easy to type into figures):

- $\Phi = p_{\text{correct}}$, the probability of the correct answer
- $\Delta = p_{\text{correct}} - \max(\{p_i : i \in I\})$, where $I$ is the set of incorrect answers
We will then plot the distribution of these quantities for a given model, and compare them across models.
Here, $\Delta$ is a measure of how much more confident the model is in the correct answer compared to the most confident incorrect answer, while $p_{\text{correct}}$ is a measure of how confident the model is in the correct answer.
An ideal model would have $\Phi = 1$ (and therefore $\Delta=1$) always, while a model that performs random guessing would have $p_i = \Phi = 0.25$ (and therefore $\Delta=0$) always.
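As a rough sketch of how $\Phi$ and $\Delta$ can be computed per question (the array layout and names below are assumptions for illustration, not the dashboard's internal code):

```python
import numpy as np

def phi_and_delta(probs: np.ndarray, correct_idx: np.ndarray):
    """Compute Phi and Delta for every question.

    probs:       shape (n_questions, 4), probabilities for the A, B, C, D tokens.
    correct_idx: shape (n_questions,), index (0-3) of the correct answer.
    """
    n = probs.shape[0]
    phi = probs[np.arange(n), correct_idx]          # Phi = p_correct
    masked = probs.copy()
    masked[np.arange(n), correct_idx] = -np.inf     # ignore the correct-answer column
    delta = phi - masked.max(axis=1)                # Delta = p_correct - max incorrect
    return phi, delta
```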
## Strength as a classifier
For each question, we can treat the model as a correct/incorrect classifier and record two results:

- Record $y_{\text{true}} = 1$ with $y_{\text{pred}} = p_{\text{correct}}$
- Record $y_{\text{true}} = 0$ with $y_{\text{pred}} = \max(\{p_i : i \in I\})$

We can then measure binary cross-entropy (BCE) and ROC-AUC for this classifier, and compare these metrics across models.
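A sketch of these two metrics using scikit-learn (an assumption for illustration; the dashboard may compute them differently):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

def classifier_metrics(phi: np.ndarray, max_incorrect: np.ndarray):
    """Score the model as a correct/incorrect classifier.

    phi:           probability assigned to the correct answer per question (label 1).
    max_incorrect: highest probability assigned to any incorrect answer (label 0).
    """
    y_true = np.concatenate([np.ones_like(phi), np.zeros_like(max_incorrect)])
    y_pred = np.concatenate([phi, max_incorrect])
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)  # avoid log(0) in the cross-entropy
    bce = log_loss(y_true, y_pred)
    auc = roc_auc_score(y_true, y_pred)
    return bce, auc
```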
## How to use this notebook
Now that you know what $\Phi$ and $\Delta$ plots are, below I've provided a simple interface to inspect plots for a wide variety of common models. Currently, Llama 1, 2, and 3 (7B/8B) and variations of these models are available to compare. I will be adding more models soon.
Note: All figures are fully interactive. Click/shift-click on the legends to select individual/multiple lines, zoom in and out, and hover over lines to see exact values at each point.

Note: Some models show strange behaviour in the $\Delta$ plots around $\Delta=0$. This appears to occur only in instruction-tuned models, and I'm currently investigating the cause. It could be weird fp16 behaviour, but I'm not sure yet.
TODO: Clean up and explain the model comparison tool below the performance plots.