fix footnote
app/src/index.html +1 -1
@@ -247,7 +247,7 @@
 <p>As outlined in the task formulations, we are using MCF for this evaluation and employing a 5-shot approach, as recommended by OLMES<d-cite key="gu2024olmesstandardlanguagemodel"></d-cite> (and made possible by the large context size of the models).</p>

 <h3>Computing a global "multilingual" score</h3>

-<p>In the previous sections, we treated each task independently. However, to determine an overall "multilingual" score of a model, we need to <b>aggregate</b> the results from these tasks. We begin by <b>rescaling</b> the individual task scores in line with the OpenLLM leaderboard <d-cite key="open-llm-leaderboard-v2"></d-cite>. Then, we <b>average the scores</b> across task types (GK, RES, etc) for each language separately. To compute the score for each language, we take the average of the task type scores
+<p>In the previous sections, we treated each task independently. However, to determine an overall "multilingual" score of a model, we need to <b>aggregate</b> the results from these tasks. We begin by <b>rescaling</b> the individual task scores in line with the OpenLLM leaderboard <d-cite key="open-llm-leaderboard-v2"></d-cite>. Then, we <b>average the scores</b> across task types (GK, RES, etc) for each language separately. To compute the score for each language, we take the average of the task type scores.<d-footnote>We first average by task type to properly measure all model capabilities without letting a single category dominate.</d-footnote></p>

 <p>For the final global "multilingual" score we followed a different approach. Instead of averaging the language scores directly, we <b>ranked the model's performance across languages</b> in comparison to other models and then averaged those rank scores. This method ensures that the result reflects the overall model's performance across all languages, preventing an exceptionally high score in one language from skewing the final outcome.</p>
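For readers of the changed paragraphs above, here is a minimal sketch of the aggregation they describe: rescale raw task scores, average within task types per language, then rank models per language and average the ranks. All names (`rescale`, `language_score`, `global_multilingual_score`) and the toy numbers are illustrative assumptions, not the post's actual implementation.

```python
# Minimal sketch of the score aggregation described above (illustrative names,
# not the post's actual code).
from collections import defaultdict

import numpy as np
import pandas as pd


def rescale(score: float, random_baseline: float) -> float:
    """Map a raw accuracy so that random guessing -> 0 and a perfect score -> 1,
    in the spirit of the OpenLLM leaderboard normalization."""
    return max(0.0, (score - random_baseline) / (1.0 - random_baseline))


def language_score(task_scores: dict[tuple[str, str], float],
                   random_baselines: dict[str, float]) -> float:
    """task_scores maps (task_type, task_name) -> raw accuracy for one language.
    Rescale each task, average within each task type (GK, RES, ...),
    then average the task-type means into a single per-language score."""
    by_type: dict[str, list[float]] = defaultdict(list)
    for (task_type, task_name), score in task_scores.items():
        by_type[task_type].append(rescale(score, random_baselines[task_name]))
    return float(np.mean([np.mean(vals) for vals in by_type.values()]))


def global_multilingual_score(per_language: pd.DataFrame) -> pd.Series:
    """per_language has one row per model and one column per language.
    Rank models within each language (rank 1 = best), then average the ranks,
    so an exceptionally high score in one language cannot dominate."""
    ranks = per_language.rank(axis=0, ascending=False)
    return ranks.mean(axis=1).sort_values()


# Toy usage with made-up per-language scores:
scores = pd.DataFrame({"fr": [0.61, 0.55], "zh": [0.48, 0.52], "de": [0.70, 0.58]},
                      index=["model_a", "model_b"])
print(global_multilingual_score(scores))  # lower mean rank = stronger overall
```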