blogpost-scaling-test-time-compute

hynky HF staff committed on Oct 23, 2024

Commit

8507b02

1 Parent(s): ed9481f

hide aside on small screens + move one aside up

Files changed (2)

app/src/index.html +1 -1
app/src/style.css +11 -4

app/src/index.html

@@ -125,12 +125,12 @@
         <div class="task-signal-plot" data-language="Telugu" data-task="tydiqa_tel" data-show-controls="false" data-task-metrics="snr" data-metric="acc_norm_token" data-group-seeds="false" data-title="❌ Bad SNR: tydiqa_tel [te]"></div>
     </div>
+    <aside>Assuming model performance is normally distributed across different seeds, we want the benchmark-run performance to be at least 3 final-stds above the benchmark random baseline. This would mean that 99.85% of seed scores are above the random baseline (formally, benchmark-run performance - benchmark random baseline > 3 * final-std).</aside>
     <h4>Non-Random Performance</h4>
     <p>Many model capabilities are acquired later in training, thus <b>many tasks</b> (especially harder ones, such as math-related ones) <b>show baseline-level performance for an extended period</b>. While these tasks are useful, they're not ideal for early pre-training evaluation, and <b>we did not want to keep them</b> for this setting.</p>
     <p>We first computed the baseline random performance of the task (as the sum of 1/n_choices for all samples for multiple choice questions, and as zero for generative evaluations). Then we calculated the task's distance from the baseline as the maximum score across all models minus the baseline.</p>
-    <aside>Assuming model performance is normally distributed across different seeds, we want the benchmark-run performance to be at least 3 final-stds above the benchmark random baseline. This would mean that 99.85% of seed scores are above the random baseline (formally, benchmark-run performance - benchmark random baseline > 3 * final-std).</aside>
     <div style="display: flex; grid-column: middle">
         <div class="task-signal-plot" data-language="Chinese" data-task="agieval_zho_cf:_average" data-show-controls="false" data-task-metrics="randomness" data-metric="acc_norm_pmi" data-group-seeds="true" data-title="✅ Non-random: agieval_zho_cf/acc_pmi [zh]"></div>
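
Note on the moved aside: the criterion it states (benchmark-run performance - benchmark random baseline > 3 * final-std) can be checked with a few lines of Python. This is only a minimal sketch. The function names, the `k` parameter, and the sample numbers are hypothetical and do not come from the commit. Under the aside's normality assumption, the one-sided share of seed scores above the baseline at exactly 3 standard deviations works out to about 99.87%, close to the figure quoted in the aside.

```python
from statistics import NormalDist


def is_non_random(score: float, baseline: float, final_std: float, k: float = 3.0) -> bool:
    """Return True if the score clears the random baseline by more than k final-stds.

    Hypothetical helper: mirrors the inequality in the aside,
    score - baseline > k * final_std.
    """
    return (score - baseline) > k * final_std


def one_sided_coverage(k: float = 3.0) -> float:
    """Share of a normal distribution lying above (mean - k * std)."""
    return NormalDist().cdf(k)


if __name__ == "__main__":
    # Illustrative numbers only: a 4-choice task has a 0.25 random baseline.
    print(is_non_random(score=0.40, baseline=0.25, final_std=0.04))
    print(is_non_random(score=0.30, baseline=0.25, final_std=0.04))
    print(round(one_sided_coverage(3.0), 5))
```

A margin of 0.15 against a final-std of 0.04 passes the check (0.15 > 0.12), while a margin of 0.05 does not.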

app/src/style.css

@@ -121,10 +121,17 @@ d-contents nav > div > a {
 }
 d-article aside {
-    height: 0px;
-    overflow: visible;
-    margin-bottom: 1em;
-    z-index: 1000;
+    display: none;
+}
+@media (min-width: 768px) {
+    d-article aside {
+        display: block;
+        height: 0px;
+        overflow: visible;
+        margin-bottom: 1em;
+        z-index: 1000;
+    }
 }
 @media (min-width: 768px) {