---
title: rbeval
emoji: 💩
colorFrom: yellow
colorTo: blue
sdk: streamlit
sdk_version: 1.37.0
app_file: app.py
pinned: false
---

# RoBustEval Dashboard

This dashboard is best viewed at [the huggingface space](https://huggingface.co/spaces/mli-will/rbeval)

### Introduction

LLM MCQA (multiple choice question-answering) benchmarks are measured in the following way:

1. Some number of few-shot examples are pulled from the validation set of the MCQA benchmark and formatted as
   > **Question**: What is the capital of France? \
   > (A) Paris \
   > (B) London \
   > (C) Berlin \
   > (D) Madrid \
   > **Answer**: A
2. The target question is then appended, without the answer, and fed into the model as
   > **Question**: What is the capital of France? \
   > (A) Paris \
   > (B) London \
   > (C) Berlin \
   > (D) Madrid \
   > **Answer**:
3. The model then outputs its predictions for the token that should come directly after **Answer**:
4. The probabilities $p_i, i \in \{A,B,C,D\}$ for the tokens resulting from tokenizing the strings `"A", "B", "C", "D"` are then recorded
5. A question with correct answer $k \in \{A,B,C,D\}$ is marked as correct if $p_k > p_i$ for all $i \in \{A,B,C,D\} \setminus \{k\}$
6. The accuracy is reported as the percentage of questions answered correctly

This method of evaluation is reasonable, but it discards a significant amount of information about the model's predictions. For example, consider a question with correct answer $C$. *The following model outputs are all scored the same*:

| Probability | $p_A$ | $p_B$ | $\mathbf{p_C}$ | $p_D$ |
|-------------|-------|-------|----------------|-------|
| Model 1     | 0.00  | 0.00  | **1.00**       | 0.00  |
| Model 2     | 0.20  | 0.10  | **0.60**       | 0.01  |
| Model 3     | 0.25  | 0.25  | **0.26**       | 0.24  |
| Model 4     | 0.00  | 0.00  | **0.01**       | 0.00  |

In this case, Model 1 is the clear best, with full confidence in the correct answer. Model 2 is also good, but not as good as Model 1. Model 3 is _almost entirely guessing_, and Model 4 doesn't even understand the format of the question, but since the evaluation method only compares the probabilities of the ABCD tokens against each other, it still gets marked as correct. All of these scenarios receive exactly the same score, when arguably they shouldn't.

**This dashboard is an attempt to address this issue by providing a more detailed analysis of model predictions on MCQA benchmarks.**

We will be interested in the following quantities (symbols are chosen to make them easy to type into figures):

1. $\Phi = p_{\text{correct}}$, the probability of the correct answer
2. $\Delta = p_{\text{correct}} - \max(\{p_i : i \in I\})$, where $I$ is the set of incorrect answers

We will then plot the distribution of these quantities for a given model, and compare them across models. Here, $\Delta$ measures how much more confident the model is in the correct answer than in the most confident incorrect answer, while $\Phi = p_{\text{correct}}$ measures how confident the model is in the correct answer. An ideal model would always have $\Phi = 1$ (and therefore $\Delta = 1$), while a model that guesses at random would always have $p_i = \Phi = 0.25$ (and therefore $\Delta = 0$).

### Strength as a classifier

From each question we can record two results, treating the model as a correct/incorrect classifier:

* Record `y_true=1` with `y_pred` $= p_{\text{correct}}$
* Record `y_true=0` with `y_pred` $= \max(\{p_i : i \in I\})$

We can then measure binary cross-entropy (BCE) and ROC-AUC for this classifier, and compare these metrics across models.
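To make these definitions concrete, here is a minimal sketch (not the dashboard's actual code) of how $\Phi$, $\Delta$, and the classifier metrics could be computed, assuming the per-question probabilities over the A/B/C/D tokens are available as a NumPy array. The function names are illustrative, and scikit-learn is used here purely for convenience.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score


def phi_delta(probs: np.ndarray, correct_idx: np.ndarray):
    """probs: (n_questions, 4) probabilities for the A/B/C/D tokens.
    correct_idx: (n_questions,) index (0-3) of the correct answer."""
    n = probs.shape[0]
    phi = probs[np.arange(n), correct_idx]           # Phi = p_correct
    masked = probs.astype(float)                     # copy so we can mask
    masked[np.arange(n), correct_idx] = -np.inf      # exclude the correct answer
    delta = phi - masked.max(axis=1)                 # Delta = p_correct - max incorrect
    return phi, delta


def classifier_metrics(probs: np.ndarray, correct_idx: np.ndarray):
    """Each question contributes two classifier samples:
    (y_true=1, y_pred=p_correct) and (y_true=0, y_pred=max incorrect prob)."""
    phi, delta = phi_delta(probs, correct_idx)
    max_incorrect = phi - delta
    y_true = np.concatenate([np.ones_like(phi), np.zeros_like(phi)])
    y_pred = np.concatenate([phi, max_incorrect])
    bce = log_loss(y_true, np.clip(y_pred, 1e-7, 1 - 1e-7))
    auc = roc_auc_score(y_true, y_pred)
    return bce, auc


# The four models from the table above; the correct answer is C (index 2).
probs = np.array([
    [0.00, 0.00, 1.00, 0.00],   # Model 1: fully confident, correct
    [0.20, 0.10, 0.60, 0.01],   # Model 2: correct, less confident
    [0.25, 0.25, 0.26, 0.24],   # Model 3: essentially guessing
    [0.00, 0.00, 0.01, 0.00],   # Model 4: barely follows the format
])
phi, delta = phi_delta(probs, np.full(4, 2))
print(phi)    # approximately [1.0, 0.6, 0.26, 0.01]
print(delta)  # approximately [1.0, 0.4, 0.01, 0.01]
```

In this worked example, Model 1 gets $\Phi = \Delta = 1$, while Models 3 and 4 end up with $\Delta \approx 0.01$, even though all four are scored identically under standard accuracy.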
## How to use this notebook

Now that you know what $\Phi$ and $\Delta$ plots are, below I've provided a simple interface to inspect plots for a wide variety of common models. Currently, Llama 1, 2, and 3 at the 7/8B scale, along with variations of these models, are available to compare. I will be adding more models soon.

**Note**: All figures are *fully interactive*. Click or shift-click on legend entries to select individual or multiple lines, zoom in and out, and hover over lines to see exact values at each point.

**Note**: Some models show strange behaviour in the $\Delta$ plots around $\Delta = 0$. This appears to affect only instruction-tuned models, and I'm currently investigating the cause. It could be odd fp16 behaviour, but I'm not sure yet.

**TODO**: Clean up and explain the model comparison tool below the performance plots.