<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>田忌赛马: Goodhart's Law on Benchmarks</title>
<style>
body {
    background-color: #111;
    font-family: Arial, sans-serif;
    color: #fff;
    display: flex;
    justify-content: center;
    align-items: center;
    flex-direction: column;
    height: 100vh;
    margin: 0;
}

h1 {
    font-size: 36px;
    margin-bottom: 10px;
}

h2 {
    font-size: 18px;
    font-weight: normal;
    margin-bottom: 30px;
    color: #ccc;
}

table {
    width: 90%;
    border-collapse: separate;
    border-spacing: 0;
    background-color: #1b1b1b;
    border-radius: 12px;
    overflow: hidden;
    margin: 20px 0;
    table-layout: fixed;
}

th, td {
    text-align: center;
    padding: 12px;
    border: 1px solid #333;
    vertical-align: middle;
}

th {
    background-color: #222;
    font-weight: bold;
    font-size: 14px;
}

td {
    background-color: #1b1b1b;
    font-size: 14px;
    word-wrap: break-word;
}

.highlight-column {
    border-left: 3px solid #0066ff;
    border-right: 3px solid #0066ff;
}

.highlight-header {
    border-top: 3px solid #0066ff;
    border-top-left-radius: 12px;
    border-top-right-radius: 12px;
}

.highlight-footer {
    border-bottom: 3px solid #0066ff;
    border-bottom-left-radius: 12px;
    border-bottom-right-radius: 12px;
}

.bold {
    font-weight: 900;
}

tr:first-child th:first-child {
    border-top-left-radius: 12px;
}

tr:first-child th:last-child {
    border-top-right-radius: 12px;
}

tr:last-child td:first-child {
    border-bottom-left-radius: 12px;
}

tr:last-child td:last-child {
    border-bottom-right-radius: 12px;
}

.footnote {
    font-size: 12px;
    color: #888;
    text-align: left;
    max-width: 90%;
    margin-top: 20px;
}
</style>
</head>
<body>

<h1>田忌赛马 (Tian Ji's Horse Racing)</h1>
<h2>Goodhart's Law on Benchmarks</h2>

<table>
    <tr>
        <th>Capability</th>
        <th>Description</th>
        <th class="highlight-column highlight-header">miniG</th>
        <th>Gemini-Flash</th>
        <th>GLM-4-9B-Chat</th>
        <th>Llama 3.1 8B Instruct</th>
    </tr>
    <tr>
        <td class="bold">MMLU</td>
        <td>Multiple-choice questions across 57 subjects<br>(incl. STEM, humanities, and others)</td>
        <td class="highlight-column bold">85.45</td>
        <td>78.9</td>
        <td>72.4</td>
        <td>69.4</td>
    </tr>
    <tr>
        <td class="bold">IFEval</td>
        <td>Instruction-following evaluation<br>using verifiable prompts</td>
        <td class="highlight-column">74.22</td>
        <td>-</td>
        <td>69</td>
        <td class="bold">80.4</td>
    </tr>
    <tr>
        <td class="bold">GSM8K</td>
        <td>Grade-school math word problems<br>(shot count varies by source)</td>
        <td class="highlight-column">75.89 (5-shot)</td>
        <td class="bold">86.2 (11-shot)</td>
        <td>79.6</td>
        <td>84.5 (8-shot CoT)</td>
    </tr>
    <tr>
        <td class="bold">HumanEval</td>
        <td>Python code generation on a held-out dataset<br>(0-shot)</td>
        <td class="highlight-column bold">79.88</td>
        <td>74.3</td>
        <td>71.8</td>
        <td>72.6</td>
    </tr>
    <tr>
        <td class="bold">GPQA</td>
        <td>Graduate-level questions<br>in biology, physics, and chemistry</td>
        <td class="highlight-column">37.37</td>
        <td class="bold">39.5</td>
        <td>34.3 (base)</td>
        <td>34.2</td>
    </tr>
    <tr>
        <td class="bold">Context Window</td>
        <td>Maximum context length<br>the model can handle</td>
        <td class="highlight-column bold">1M</td>
        <td class="bold">1M</td>
        <td>128K</td>
        <td>128K</td>
    </tr>
    <tr>
        <td class="bold">Input</td>
        <td>Supported input modalities</td>
        <td class="highlight-column highlight-footer">Text, image<br>(single model)</td>
        <td>Text, image, audio, video</td>
        <td>Text only</td>
        <td>Text only</td>
    </tr>
</table>

<div class="footnote">
    1. miniG is a 14B-parameter model derived from the weights of the 9B-parameter glm-4-9b-chat-1m model. It was further pre-trained on a curated corpus of 20B tokens while retaining long-context capability, then fine-tuned on a dataset of 120M+ conversation entries synthesized via cross-page clustering (similar to RAG) over the same corpus. miniG also underwent two-stage multimodal training for single-image input, with the second stage reinitializing 5B parameters of a Vision Transformer from glm-4v-9b for Locked-Image Tuning.<br>
    2. miniG outputs are formatted similarly to Gemini 1.5 Flash's, but the model was not trained on data generated by the Gemini models.
</div>

</body>
</html>