<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<style>
body {
background-color: #111;
font-family: Arial, sans-serif;
color: #fff;
display: flex;
justify-content: center;
align-items: center;
flex-direction: column;
min-height: 100vh; /* grow with content so the centered table is not clipped on short viewports */
margin: 0;
}
h1 {
font-size: 36px;
margin-bottom: 10px;
}
h2 {
font-size: 18px;
font-weight: normal;
margin-bottom: 30px;
color: #ccc;
}
table {
width: 90%;
border-collapse: separate;
border-spacing: 0;
background-color: #1b1b1b;
border-radius: 12px;
overflow: hidden;
margin: 20px 0;
table-layout: fixed;
}
th, td {
text-align: center;
padding: 12px;
border: 1px solid #333;
vertical-align: middle;
}
th {
background-color: #222;
font-weight: bold;
font-size: 14px;
}
td {
background-color: #1b1b1b;
font-size: 14px;
word-wrap: break-word;
}
.highlight-column {
border-left: 3px solid #0066ff;
border-right: 3px solid #0066ff;
}
.highlight-header {
border-top: 3px solid #0066ff;
border-top-left-radius: 12px;
border-top-right-radius: 12px;
}
.highlight-footer {
border-bottom: 3px solid #0066ff;
border-bottom-left-radius: 12px;
border-bottom-right-radius: 12px;
}
.bold {
font-weight: 900; /* Extra bold */
}
tr:first-child th:first-child {
border-top-left-radius: 12px;
}
tr:first-child th:last-child {
border-top-right-radius: 12px;
}
tr:last-child td:first-child {
border-bottom-left-radius: 12px;
}
tr:last-child td:last-child {
border-bottom-right-radius: 12px;
}
.footnote {
font-size: 12px;
color: #888;
text-align: left;
max-width: 90%;
margin-top: 20px;
}
</style>
</head>
<body>
<h1>Tian Ji's Horse Racing (田忌赛马)</h1>
<h2>Goodhart's Law on Benchmarks</h2>
<table>
<tr>
<th>Capability</th>
<th>Description</th>
<th class="highlight-column highlight-header">miniG</th>
<th>Gemini 1.5 Flash</th>
<th>GLM-4-9B-Chat</th>
<th>Llama 3.1 8B Instruct</th>
</tr>
<tr>
<td class="bold">MMLU</td>
<td>Multiple-choice questions across 57 subjects<br>(incl. STEM, humanities, and others)</td>
<td class="highlight-column bold">85.45</td>
<td>78.9</td>
<td>72.4</td>
<td>69.4</td>
</tr>
<tr>
<td class="bold">IFEval</td>
<td>Evaluation of instruction-following<br>using verifiable prompts</td>
<td class="highlight-column">74.22</td>
<td>-</td>
<td>69</td>
<td class="bold">80.4</td>
</tr>
<tr>
<td class="bold">GSM8K</td>
<td>Grade-school math word problems<br>(shot settings vary by model)</td>
<td class="highlight-column">75.89 (5-shot)</td>
<td class="bold">86.2 (11-shot)</td>
<td>79.6</td>
<td>84.5 (8-shot CoT)</td>
</tr>
<tr>
<td class="bold">HumanEval</td>
<td>Python code generation on a held-out dataset<br>(0-shot)</td>
<td class="highlight-column bold">79.88</td>
<td>74.3</td>
<td>71.8</td>
<td>72.6</td>
</tr>
<tr>
<td class="bold">GPQA</td>
<td>Challenging dataset of questions<br>from biology, physics, and chemistry</td>
<td class="highlight-column">37.37</td>
<td class="bold">39.5</td>
<td>34.3 (base)</td>
<td>34.2</td>
</tr>
<tr>
<td class="bold">Context Window</td>
<td>Maximum context length<br>the model can handle</td>
<td class="highlight-column bold">1M</td>
<td class="bold">1M</td>
<td>128K</td>
<td>128K</td>
</tr>
<tr>
<td class="bold">Input</td>
<td>Supported input modalities</td>
<td class="highlight-column highlight-footer">Text, image<br>(single model)</td>
<td>Text, image, audio, video</td>
<td>Text only</td>
<td>Text only</td>
</tr>
</table>
<div class="footnote">
1. miniG is a 14B-parameter model derived from the weights of the 9B-parameter glm-4-9b-chat-1m model. It was further pre-trained on a selected corpus of 20B tokens while retaining its long-context capabilities, then fine-tuned on a dataset of 120M+ conversation entries synthesized from that corpus through cross-page clustering, similar to RAG. For single-image input, miniG also underwent two-stage multimodal training; the second stage reinitialized 5B parameters of a Vision Transformer from glm-4v-9b for Locked-Image Tuning.<br>
2. miniG's outputs are formatted similarly to those of Gemini 1.5 Flash, but the model was not trained on data generated by Gemini models.
</div>
</body>
</html>