luisrguerra committed
Commit c302561
Parent(s): acfc3b2
Update index.html
Browse files: index.html (+36 -11)

index.html CHANGED
@@ -32,6 +32,7 @@
 <p>The MMLU (Massive Multitask Language Understanding) test is a benchmark that measures language understanding and performance on 57 tasks.</p>
 <p>MT-Bench: Benchmark test with questions prepared by the Chatbot Arena team. Uses GPT-4 to evaluate responses.</p>
 <p>GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.</p>
+<p>Vectara's Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document.</p>
 <div id="tableBenchMark"></div>
 <h4>Best models for solving math problems:</h4>
 <ul>
@@ -39,18 +40,31 @@
 <li>gpt-4-1106-preview (turbo)</li>
 <li>gpt-4-0613</li>
 <li>gpt-4-0314</li>
+<li>Gemini Ultra 1.0</li>
+<li>Gemini Pro 1.5</li>
+<li>Claude 3 Opus</li>
+</ul>
+<h4>Best models for large text:</h4>
+<ul>
+<li>gpt-4-0125-preview (turbo)</li>
+<li>gpt-4-1106-preview (turbo)</li>
 <li>Gemini Ultra</li>
 <li>Gemini Pro 1.5</li>
+<li>Claude 3 Opus</li>
+<li>Claude 3 Sonnet</li>
+<li>Claude 3 Haiku</li>
+<li>Claude 2-2.1</li>
+<li>Claude Instant 1-1.2</li>
 </ul>
 <h4>Models with the best cost benefit:</h4>
 <ul>
-<li>Gemini Pro</li>
+<li>Gemini Pro 1.0</li>
 <li>Gemini Pro 1.5</li>
 <li>gpt-3.5-turbo-0613</li>
 <li>gpt-3.5-turbo-1106</li>
-<li>Claude
+<li>Claude 3 Haiku</li>
+<li>Claude Instant 1-1.2</li>
 <li>Mixtral 8x7B Instruct</li>
-<li>Mistral Medium</li>
 </ul>
 <h4>Models with fewer hallucinations:</h4>
 <ul>
@@ -58,12 +72,16 @@
 <li>gpt-4-1106-preview (turbo)</li>
 <li>gpt-4-0613</li>
 <li>gpt-4-0314</li>
-<li>Gemini Ultra</li>
+<li>Gemini Ultra 1.0</li>
 <li>Gemini Pro 1.5</li>
 <li>Claude 2.1</li>
+<li>Intel Neural Chat 7B</li>
 </ul>
 <h4>Models with a high level of hallucinations:</h4>
 <ul>
+<li>Microsoft Phi 2</li>
+<li>Mistral 7B</li>
+<li>Google Palm 2</li>
 <li>Mixtral 8x7B Instruct</li>
 <li>Yi 34B</li>
 </ul>
@@ -91,8 +109,10 @@
 <li>gpt-4-0314 - OpenAI</li>
 <li>gpt-3.5-turbo-1106 - OpenAI</li>
 <li>gpt-4-0314 - OpenAI</li>
-<li>Gemini Pro - Openrouter with compatibility with OpenAI api, Google service
-<li>Claude
+<li>Gemini Pro 1.0 - Openrouter with compatibility with OpenAI api, Google api service.</li>
+<li>Claude 3 - Openrouter with compatibility with OpenAI api, Anthropic api service.</li>
+<li>Claude 2-2.1 - Openrouter with compatibility with OpenAI api, Anthropic api service.</li>
+<li>Claude Instant 1-1.2 - Openrouter with compatibility with OpenAI api, Anthropic api service.</li>
 <li>Mistral Medium - Openrouter with compatibility with OpenAI api, Mistral service has a waiting list.</li>
 <li>Mixtral 8x7B Instruct - Deepinfra with compatibility with OpenAI api.</li>
 <li>Yi 34B - Deepinfra with compatibility with OpenAI api.</li>
@@ -102,18 +122,23 @@
 <li>Gemini Ultra</li>
 <li>Gemini Pro 1.5</li>
 <li>Gemini Pro (Bard/Online)</li>
+<li>Claude 3 Opus</li>
 </ul>
 <h4>Models with the same level or better than GPT-3.5 but lower than GPT-4:</h4>
 <ul>
 <li>Gemini Pro</li>
-<li>Claude
-<li>Claude
-<li>Claude 1
-<li>Claude
+<li>Claude 3 Sonnet</li>
+<li>Claude 3 Haiku</li>
+<li>Claude 2-2.1</li>
+<li>Claude 1</li>
+<li>Claude Instant 1-1.2</li>
 <li>Mistral Medium</li>
 </ul>
-<h4>Versions of models already surpassed by fine-tune or new architectures:</h4>
+<h4>Versions of models already surpassed by fine-tune, new versions or new architectures:</h4>
 <ul>
+<li>gpt-4-0314</li>
+<li>Claude 2-2.1</li>
+<li>Claude Instant 1-1.2</li>
 <li>Falcon 180B</li>
 <li>Llama 1 and Llama 2</li>
 <li>Guanaco 65B</li>