luisrguerra committed
Commit c302561
Parent(s): acfc3b2
Update index.html
Browse files: index.html (+36 -11)

index.html CHANGED
@@ -32,6 +32,7 @@
 <p>The MMLU (Massive Multitask Language Understanding) test is a benchmark that measures language understanding and performance on 57 tasks.</p>
 <p>MT-Bench: Benchmark test with questions prepared by the Chatbot Arena team. Uses GPT-4 to evaluate responses.</p>
 <p>GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.</p>
+<p>Vectara's Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document.</p>
 <div id="tableBenchMark"></div>
 <h4>Best models for solving math problems:</h4>
 <ul>
@@ -39,18 +40,31 @@
 <li>gpt-4-1106-preview (turbo)</li>
 <li>gpt-4-0613</li>
 <li>gpt-4-0314</li>
+<li>Gemini Ultra 1.0</li>
+<li>Gemini Pro 1.5</li>
+<li>Claude 3 Opus</li>
+</ul>
+<h4>Best models for large text:</h4>
+<ul>
+<li>gpt-4-0125-preview (turbo)</li>
+<li>gpt-4-1106-preview (turbo)</li>
 <li>Gemini Ultra</li>
 <li>Gemini Pro 1.5</li>
+<li>Claude 3 Opus</li>
+<li>Claude 3 Sonnet</li>
+<li>Claude 3 Haiku</li>
+<li>Claude 2-2.1</li>
+<li>Claude Instant 1-1.2</li>
 </ul>
 <h4>Models with the best cost benefit:</h4>
 <ul>
-<li>Gemini Pro</li>
+<li>Gemini Pro 1.0</li>
 <li>Gemini Pro 1.5</li>
 <li>gpt-3.5-turbo-0613</li>
 <li>gpt-3.5-turbo-1106</li>
-<li>Claude
+<li>Claude 3 Haiku</li>
+<li>Claude Instant 1-1.2</li>
 <li>Mixtral 8x7B Instruct</li>
-<li>Mistral Medium</li>
 </ul>
 <h4>Models with fewer hallucinations:</h4>
 <ul>
@@ -58,12 +72,16 @@
 <li>gpt-4-1106-preview (turbo)</li>
 <li>gpt-4-0613</li>
 <li>gpt-4-0314</li>
-<li>Gemini Ultra</li>
+<li>Gemini Ultra 1.0</li>
 <li>Gemini Pro 1.5</li>
 <li>Claude 2.1</li>
+<li>Intel Neural Chat 7B</li>
 </ul>
 <h4>Models with a high level of hallucinations:</h4>
 <ul>
+<li>Microsoft Phi 2</li>
+<li>Mistral 7B</li>
+<li>Google Palm 2</li>
 <li>Mixtral 8x7B Instruct</li>
 <li>Yi 34B</li>
 </ul>
@@ -91,8 +109,10 @@
 <li>gpt-4-0314 - OpenAI</li>
 <li>gpt-3.5-turbo-1106 - OpenAI</li>
 <li>gpt-4-0314 - OpenAI</li>
-<li>Gemini Pro - Openrouter with compatibility with OpenAI api, Google service
-<li>Claude
+<li>Gemini Pro 1.0 - Openrouter with compatibility with OpenAI api, Google api service.</li>
+<li>Claude 3 - Openrouter with compatibility with OpenAI api, Anthropic api service.</li>
+<li>Claude 2-2.1 - Openrouter with compatibility with OpenAI api, Anthropic api service.</li>
+<li>Claude Instant 1-1.2 - Openrouter with compatibility with OpenAI api, Anthropic api service.</li>
 <li>Mistral Medium - Openrouter with compatibility with OpenAI api, Mistral service has a waiting list.</li>
 <li>Mixtral 8x7B Instruct - Deepinfra with compatibility with OpenAI api.</li>
 <li>Yi 34B - Deepinfra with compatibility with OpenAI api.</li>
@@ -102,18 +122,23 @@
 <li>Gemini Ultra</li>
 <li>Gemini Pro 1.5</li>
 <li>Gemini Pro (Bard/Online)</li>
+<li>Claude 3 Opus</li>
 </ul>
 <h4>Models with the same level or better than GPT-3.5 but lower than GPT-4:</h4>
 <ul>
 <li>Gemini Pro</li>
-<li>Claude
-<li>Claude
-<li>Claude 1
-<li>Claude
+<li>Claude 3 Sonnet</li>
+<li>Claude 3 Haiku</li>
+<li>Claude 2-2.1</li>
+<li>Claude 1</li>
+<li>Claude Instant 1-1.2</li>
 <li>Mistral Medium</li>
 </ul>
-<h4>Versions of models already surpassed by fine-tune or new architectures:</h4>
+<h4>Versions of models already surpassed by fine-tune, new versions or new architectures:</h4>
 <ul>
+<li>gpt-4-0314</li>
+<li>Claude 2-2.1</li>
+<li>Claude Instant 1-1.2</li>
 <li>Falcon 180B</li>
 <li>Llama 1 and Llama 2</li>
 <li>Guanaco 65B</li>