Guerra LLM Ranking

The MMLU (Massive Multitask Language Understanding) test is a benchmark that measures language understanding and performance on 57 tasks.

MT-Bench: Benchmark test with questions prepared by the Chatbot Arena team. Uses GPT-4 to evaluate responses.

GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.

Vectara's Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document.

Best models for solving math problems:

gpt-4-0125-preview (turbo)
gpt-4-1106-preview (turbo)
gpt-4-0613
gpt-4-0314
Gemini Ultra 1.0
Gemini Pro 1.5
Claude 3 Opus

Best models for large text:

gpt-4-0125-preview (turbo)
gpt-4-1106-preview (turbo)
Gemini Ultra
Gemini Pro 1.5
Claude 3 Opus
Claude 3 Sonnet
Claude 3 Haiku
Claude 2-2.1
Claude Instant 1-1.2

Models with the best cost benefit:

Gemini Pro 1.0
Gemini Pro 1.5
gpt-3.5-turbo-0613
gpt-3.5-turbo-1106
Claude 3 Haiku
Claude Instant 1-1.2
Mixtral 8x7B Instruct

Models with fewer hallucinations:

gpt-4-0125-preview (turbo)
gpt-4-1106-preview (turbo)
gpt-4-0613
gpt-4-0314
Gemini Ultra 1.0
Gemini Pro 1.5
Claude 2.1
Intel Neural Chat 7B

Models with a high level of hallucinations:

Microsoft Phi 2
Mistral 7B
Google Palm 2
Mixtral 8x7B Instruct
Yi 34B

Open Models:

Mixtral 8x7B Instruct
Yi 34B

Can be trained in online service:

gpt-3.5-turbo-1106
gpt-3.5-turbo-0613
gpt-4-0613

Can be trained locally:

Mixtral 8x7B Instruct
Yi 34B

Has widely available api service:

gpt-4-0125-preview (turbo) - OpenAI
gpt-4-1106-preview (turbo) - OpenAI
gpt-4-0613 - OpenAI
gpt-4-0314 - OpenAI
gpt-3.5-turbo-1106 - OpenAI
gpt-4-0314 - OpenAI
Gemini Pro 1.0 - Openrouter with compatibility with OpenAI api, Google api service.
Claude 3 - Openrouter with compatibility with OpenAI api, Anthropic api service.
Claude 2-2.1 - Openrouter with compatibility with OpenAI api, Anthropic api service.
Claude Instant 1-1.2 - Openrouter with compatibility with OpenAI api, Anthropic api service.
Mistral Medium - Openrouter with compatibility with OpenAI api, Mistral service has a waiting list.
Mixtral 8x7B Instruct - Deepinfra with compatibility with OpenAI api.
Yi 34B - Deepinfra with compatibility with OpenAI api.

Models with the same level of GPT-4:

Gemini Ultra
Gemini Pro 1.5
Gemini Pro (Bard/Online)
Claude 3 Opus

Models with the same level or better than GPT-3.5 but lower than GPT-4:

Gemini Pro
Claude 3 Sonnet
Claude 3 Haiku
Claude 2-2.1
Claude 1
Claude Instant 1-1.2
Mistral Medium

Versions of models already surpassed by fine-tune, new versions or new architectures:

gpt-4-0314
Claude 2-2.1
Claude Instant 1-1.2
Falcon 180B
Llama 1 and Llama 2
Guanaco 65B
Palm 2 Chat Bison
Dolly V2
Alpaca
CodeLlama-34b-Instruct-hf
Mistral-7B-v0.1
MythoMax-L2
Zephyr 7B Alpha and Beta
Airoboros 70b
OpenChat-3.5-1210
StableLM Tuned Alpha
Stable Beluga 2