luisrguerra committed on
Commit
c302561
1 Parent(s): acfc3b2

Update index.html

Files changed (1)
  1. index.html +36 -11
index.html CHANGED
@@ -32,6 +32,7 @@
  <p>The MMLU (Massive Multitask Language Understanding) test is a benchmark that measures language understanding and performance on 57 tasks.</p>
  <p>MT-Bench: Benchmark test with questions prepared by the Chatbot Arena team. Uses GPT-4 to evaluate responses.</p>
  <p>GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.</p>
+ <p>Vectara's Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document.</p>
  <div id="tableBenchMark"></div>
  <h4>Best models for solving math problems:</h4>
  <ul>
@@ -39,18 +40,31 @@
  <li>gpt-4-1106-preview (turbo)</li>
  <li>gpt-4-0613</li>
  <li>gpt-4-0314</li>
+ <li>Gemini Ultra 1.0</li>
+ <li>Gemini Pro 1.5</li>
+ <li>Claude 3 Opus</li>
+ </ul>
+ <h4>Best models for large text:</h4>
+ <ul>
+ <li>gpt-4-0125-preview (turbo)</li>
+ <li>gpt-4-1106-preview (turbo)</li>
  <li>Gemini Ultra</li>
  <li>Gemini Pro 1.5</li>
+ <li>Claude 3 Opus</li>
+ <li>Claude 3 Sonnet</li>
+ <li>Claude 3 Haiku</li>
+ <li>Claude 2-2.1</li>
+ <li>Claude Instant 1-1.2</li>
  </ul>
  <h4>Models with the best cost benefit:</h4>
  <ul>
- <li>Gemini Pro</li>
+ <li>Gemini Pro 1.0</li>
  <li>Gemini Pro 1.5</li>
  <li>gpt-3.5-turbo-0613</li>
  <li>gpt-3.5-turbo-1106</li>
- <li>Claude Instant 1</li>
+ <li>Claude 3 Haiku</li>
+ <li>Claude Instant 1-1.2</li>
  <li>Mixtral 8x7B Instruct</li>
- <li>Mistral Medium</li>
  </ul>
  <h4>Models with fewer hallucinations:</h4>
  <ul>
@@ -58,12 +72,16 @@
  <li>gpt-4-1106-preview (turbo)</li>
  <li>gpt-4-0613</li>
  <li>gpt-4-0314</li>
- <li>Gemini Ultra</li>
+ <li>Gemini Ultra 1.0</li>
  <li>Gemini Pro 1.5</li>
  <li>Claude 2.1</li>
+ <li>Intel Neural Chat 7B</li>
  </ul>
  <h4>Models with a high level of hallucinations:</h4>
  <ul>
+ <li>Microsoft Phi 2</li>
+ <li>Mistral 7B</li>
+ <li>Google Palm 2</li>
  <li>Mixtral 8x7B Instruct</li>
  <li>Yi 34B</li>
  </ul>
@@ -91,8 +109,10 @@
  <li>gpt-4-0314 - OpenAI</li>
  <li>gpt-3.5-turbo-1106 - OpenAI</li>
  <li>gpt-4-0314 - OpenAI</li>
- <li>Gemini Pro - Openrouter with compatibility with OpenAI api, Google service has a waiting list.</li>
- <li>Claude Instant 1 - Openrouter with compatibility with OpenAI api, Anthropic service has a waiting list.</li>
+ <li>Gemini Pro 1.0 - Openrouter with compatibility with OpenAI api, Google api service.</li>
+ <li>Claude 3 - Openrouter with compatibility with OpenAI api, Anthropic api service.</li>
+ <li>Claude 2-2.1 - Openrouter with compatibility with OpenAI api, Anthropic api service.</li>
+ <li>Claude Instant 1-1.2 - Openrouter with compatibility with OpenAI api, Anthropic api service.</li>
  <li>Mistral Medium - Openrouter with compatibility with OpenAI api, Mistral service has a waiting list.</li>
  <li>Mixtral 8x7B Instruct - Deepinfra with compatibility with OpenAI api.</li>
  <li>Yi 34B - Deepinfra with compatibility with OpenAI api.</li>
@@ -102,18 +122,23 @@
  <li>Gemini Ultra</li>
  <li>Gemini Pro 1.5</li>
  <li>Gemini Pro (Bard/Online)</li>
+ <li>Claude 3 Opus</li>
  </ul>
  <h4>Models with the same level or better than GPT-3.5 but lower than GPT-4:</h4>
  <ul>
  <li>Gemini Pro</li>
- <li>Claude 2.1</li>
- <li>Claude 2.0</li>
- <li>Claude 1.0</li>
- <li>Claude Instant 1</li>
+ <li>Claude 3 Sonnet</li>
+ <li>Claude 3 Haiku</li>
+ <li>Claude 2-2.1</li>
+ <li>Claude 1</li>
+ <li>Claude Instant 1-1.2</li>
  <li>Mistral Medium</li>
  </ul>
- <h4>Versions of models already surpassed by fine-tune or new architectures:</h4>
+ <h4>Versions of models already surpassed by fine-tune, new versions or new architectures:</h4>
  <ul>
+ <li>gpt-4-0314</li>
+ <li>Claude 2-2.1</li>
+ <li>Claude Instant 1-1.2</li>
  <li>Falcon 180B</li>
  <li>Llama 1 and Llama 2</li>
  <li>Guanaco 65B</li>