JosephusCheung committed 08818c1 (parent: 1357db5)
Update index.html

index.html (+186 −19)
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>miniG: Goodhart's Law on Benchmarks</title>
    <style>
        body {
            background-color: #111;
            font-family: Arial, sans-serif;
            color: #fff;
            display: flex;
            justify-content: center;
            align-items: center;
            flex-direction: column;
            min-height: 100vh; /* min-height so a tall table does not overflow the viewport */
            margin: 0;
        }

        h1 {
            font-size: 36px;
            margin-bottom: 10px;
        }

        h2 {
            font-size: 18px;
            font-weight: normal;
            margin-bottom: 30px;
            color: #ccc;
        }

        table {
            width: 90%;
            border-collapse: separate;
            border-spacing: 0;
            background-color: #1b1b1b;
            border-radius: 12px;
            overflow: hidden;
            margin: 20px 0;
            table-layout: fixed;
        }

        th, td {
            text-align: center;
            padding: 12px;
            border: 1px solid #333;
            vertical-align: middle;
        }

        th {
            background-color: #222;
            font-weight: bold;
            font-size: 14px;
        }

        td {
            background-color: #1b1b1b;
            font-size: 14px;
            word-wrap: break-word;
        }

        .highlight-column {
            border-left: 3px solid #0066ff;
            border-right: 3px solid #0066ff;
        }

        .highlight-header {
            border-top: 3px solid #0066ff;
            border-top-left-radius: 12px;
            border-top-right-radius: 12px;
        }

        .highlight-footer {
            border-bottom: 3px solid #0066ff;
            border-bottom-left-radius: 12px;
            border-bottom-right-radius: 12px;
        }

        .bold {
            font-weight: 900; /* Extra bold */
        }

        tr:first-child th:first-child {
            border-top-left-radius: 12px;
        }

        tr:first-child th:last-child {
            border-top-right-radius: 12px;
        }

        tr:last-child td:first-child {
            border-bottom-left-radius: 12px;
        }

        tr:last-child td:last-child {
            border-bottom-right-radius: 12px;
        }

        .footnote {
            font-size: 12px;
            color: #888;
            text-align: left;
            max-width: 90%;
            margin-top: 20px;
        }
    </style>
</head>
<body>

    <h1>田忌赛马 (Tian Ji's Horse Racing)</h1>
    <h2>Goodhart's Law on Benchmarks</h2>

    <table>
        <tr>
            <th>Capability</th>
            <th>Description</th>
            <th class="highlight-column highlight-header">miniG</th>
            <th>Gemini-Flash</th>
            <th>GLM-4-9B-Chat</th>
            <th>Llama 3.1 8B Instruct</th>
        </tr>
        <tr>
            <td class="bold">MMLU</td>
            <td>Multiple-choice questions across 57 subjects<br>(incl. STEM, humanities, and others)</td>
            <td class="highlight-column bold">85.45</td>
            <td>78.9</td>
            <td>72.4</td>
            <td>69.4</td>
        </tr>
        <tr>
            <td class="bold">IFEval</td>
            <td>Evaluation of instruction-following<br>using verifiable prompts</td>
            <td class="highlight-column">74.22</td>
            <td>-</td>
            <td>69</td>
            <td class="bold">80.4</td>
        </tr>
        <tr>
            <td class="bold">GSM8K</td>
            <td>Challenging math problems<br>(5-shot evaluation)</td>
            <td class="highlight-column">75.89 (5-shot)</td>
            <td class="bold">86.2 (11-shot)</td>
            <td>79.6</td>
            <td>84.5 (8-shot CoT)</td>
        </tr>
        <tr>
            <td class="bold">HumanEval</td>
            <td>Python code generation on a held-out dataset<br>(0-shot)</td>
            <td class="highlight-column bold">79.88</td>
            <td>74.3</td>
            <td>71.8</td>
            <td>72.6</td>
        </tr>
        <tr>
            <td class="bold">GPQA</td>
            <td>Challenging dataset of questions<br>from biology, physics, and chemistry</td>
            <td class="highlight-column">37.37</td>
            <td class="bold">39.5</td>
            <td>34.3 (base)</td>
            <td>34.2</td>
        </tr>
        <tr>
            <td class="bold">Context Window</td>
            <td>Maximum context length<br>the model can handle</td>
            <td class="highlight-column bold">1M</td>
            <td class="bold">1M</td>
            <td>128K</td>
            <td>128K</td>
        </tr>
        <tr>
            <td class="bold">Input</td>
            <td>Supported input modalities</td>
            <td class="highlight-column highlight-footer">Text, image<br>(single model)</td>
            <td>Text, image, audio, video</td>
            <td>Text only</td>
            <td>Text only</td>
        </tr>
    </table>

    <div class="footnote">
        1. miniG is a 14B-parameter model derived from the 9B-parameter glm-4-9b-chat-1m model weights. It was further pre-trained on a curated corpus of 20B tokens while retaining long-context capabilities, then fine-tuned on a dataset of over 120M conversation entries synthesized from this corpus through RAG-like cross-page clustering. Additionally, miniG underwent multimodal training in two stages for single-image input, with the second stage reinitializing 5B parameters of a Vision Transformer from glm-4v-9b for Locked-Image Tuning.<br>
        2. miniG's outputs are formatted similarly to those of Gemini 1.5 Flash, but the model was not trained on data generated by the Gemini models.
    </div>

</body>
</html>