JosephusCheung committed 08818c1 (parent: 1357db5)
Update index.html

index.html (+186 −19)
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>miniG: Goodhart's Law on Benchmarks</title>
    <style>
        body {
            background-color: #111;
            font-family: Arial, sans-serif;
            color: #fff;
            display: flex;
            justify-content: center;
            align-items: center;
            flex-direction: column;
            min-height: 100vh; /* min-height so a tall table does not overflow the viewport */
            margin: 0;
        }

        h1 {
            font-size: 36px;
            margin-bottom: 10px;
        }

        h2 {
            font-size: 18px;
            font-weight: normal;
            margin-bottom: 30px;
            color: #ccc;
        }

        table {
            width: 90%;
            border-collapse: separate;
            border-spacing: 0;
            background-color: #1b1b1b;
            border-radius: 12px;
            overflow: hidden;
            margin: 20px 0;
            table-layout: fixed;
        }

        th, td {
            text-align: center;
            padding: 12px;
            border: 1px solid #333;
            vertical-align: middle;
        }

        th {
            background-color: #222;
            font-weight: bold;
            font-size: 14px;
        }

        td {
            background-color: #1b1b1b;
            font-size: 14px;
            word-wrap: break-word;
        }

        .highlight-column {
            border-left: 3px solid #0066ff;
            border-right: 3px solid #0066ff;
        }

        .highlight-header {
            border-top: 3px solid #0066ff;
            border-top-left-radius: 12px;
            border-top-right-radius: 12px;
        }

        .highlight-footer {
            border-bottom: 3px solid #0066ff;
            border-bottom-left-radius: 12px;
            border-bottom-right-radius: 12px;
        }

        .bold {
            font-weight: 900; /* Extra bold */
        }

        tr:first-child th:first-child {
            border-top-left-radius: 12px;
        }

        tr:first-child th:last-child {
            border-top-right-radius: 12px;
        }

        tr:last-child td:first-child {
            border-bottom-left-radius: 12px;
        }

        tr:last-child td:last-child {
            border-bottom-right-radius: 12px;
        }

        .footnote {
            font-size: 12px;
            color: #888;
            text-align: left;
            max-width: 90%;
            margin-top: 20px;
        }
    </style>
</head>
<body>

    <h1>田忌赛马 (Tian Ji's Horse Racing)</h1>
    <h2>Goodhart's Law on Benchmarks</h2>

    <table>
        <tr>
            <th>Capability</th>
            <th>Description</th>
            <th class="highlight-column highlight-header">miniG</th>
            <th>Gemini-Flash</th>
            <th>GLM-4-9B-Chat</th>
            <th>Llama 3.1 8B Instruct</th>
        </tr>
        <tr>
            <td class="bold">MMLU</td>
            <td>Multiple-choice questions across 57 subjects<br>(incl. STEM, humanities, and others)</td>
            <td class="highlight-column bold">85.45</td>
            <td>78.9</td>
            <td>72.4</td>
            <td>69.4</td>
        </tr>
        <tr>
            <td class="bold">IFEval</td>
            <td>Evaluation of instruction-following<br>using verifiable prompts</td>
            <td class="highlight-column">74.22</td>
            <td>-</td>
            <td>69</td>
            <td class="bold">80.4</td>
        </tr>
        <tr>
            <td class="bold">GSM8K</td>
            <td>Challenging math problems<br>(5-shot evaluation)</td>
            <td class="highlight-column">75.89 (5-shot)</td>
            <td class="bold">86.2 (11-shot)</td>
            <td>79.6</td>
            <td>84.5 (8-shot CoT)</td>
        </tr>
        <tr>
            <td class="bold">HumanEval</td>
            <td>Python code generation on a held-out dataset<br>(0-shot)</td>
            <td class="highlight-column bold">79.88</td>
            <td>74.3</td>
            <td>71.8</td>
            <td>72.6</td>
        </tr>
        <tr>
            <td class="bold">GPQA</td>
            <td>Challenging dataset of questions<br>from biology, physics, and chemistry</td>
            <td class="highlight-column">37.37</td>
            <td class="bold">39.5</td>
            <td>34.3 (base)</td>
            <td>34.2</td>
        </tr>
        <tr>
            <td class="bold">Context Window</td>
            <td>Maximum context length<br>the model can handle</td>
            <td class="highlight-column bold">1M</td>
            <td class="bold">1M</td>
            <td>128K</td>
            <td>128K</td>
        </tr>
        <tr>
            <td class="bold">Input</td>
            <td>Supported input modalities</td>
            <td class="highlight-column highlight-footer">Text, image<br>(single model)</td>
            <td>Text, image, audio, video</td>
            <td>Text only</td>
            <td>Text only</td>
        </tr>
    </table>

    <div class="footnote">
        1. miniG is a 14B-parameter model derived from the 9B-parameter glm-4-9b-chat-1m model weights. It was further pre-trained on a curated corpus of 20B tokens while retaining long-context capabilities, then fine-tuned on a dataset of over 120M conversation entries synthesized from this corpus through RAG-like cross-page clustering. Additionally, miniG underwent multimodal training in two stages for single-image input, with the second stage reinitializing 5B parameters of a Vision Transformer from glm-4v-9b for Locked-Image Tuning.<br>
        2. miniG's outputs are formatted similarly to those of Gemini 1.5 Flash, but the model was not trained on data generated by the Gemini models.
    </div>

</body>
</html>