Guerra LLM Ranking

The MMLU (Massive Multitask Language Understanding) test is a benchmark that measures language understanding and performance on 57 tasks.

MT-Bench: Benchmark test with questions prepared by the Chatbot Arena team. Uses GPT-4 to evaluate responses.

GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.

Best models for solving math problems:

gpt-4-0125-preview (turbo)
gpt-4-1106-preview (turbo)
gpt-4-0613
gpt-4-0314
Gemini Ultra
Gemini Pro 1.5

Models with the best cost benefit:

Gemini Pro
Gemini Pro 1.5
gpt-3.5-turbo-0613
gpt-3.5-turbo-1106
Claude Instant 1
Mixtral 8x7B Instruct
Mistral Medium

Models with fewer hallucinations:

gpt-4-0125-preview (turbo)
gpt-4-1106-preview (turbo)
gpt-4-0613
gpt-4-0314
Gemini Ultra
Gemini Pro 1.5
Claude 2.1

Models with a high level of hallucinations:

Mixtral 8x7B Instruct
Yi 34B

Open Models:

Mixtral 8x7B Instruct
Yi 34B

Can be trained in online service:

gpt-3.5-turbo-1106
gpt-3.5-turbo-0613
gpt-4-0613

Can be trained locally:

Mixtral 8x7B Instruct
Yi 34B

Has widely available api service:

gpt-4-0125-preview (turbo) - OpenAI
gpt-4-1106-preview (turbo) - OpenAI
gpt-4-0613 - OpenAI
gpt-4-0314 - OpenAI
gpt-3.5-turbo-1106 - OpenAI
gpt-4-0314 - OpenAI
Gemini Pro - Openrouter with compatibility with OpenAI api, Google service has a waiting list.
Claude Instant 1 - Openrouter with compatibility with OpenAI api, Anthropic service has a waiting list.
Mistral Medium - Openrouter with compatibility with OpenAI api, Mistral service has a waiting list.
Mixtral 8x7B Instruct - Deepinfra with compatibility with OpenAI api.
Yi 34B - Deepinfra with compatibility with OpenAI api.

Models with the same level of GPT-4:

Gemini Ultra
Gemini Pro 1.5
Gemini Pro (Bard/Online)

Models with the same level or better than GPT-3.5 but lower than GPT-4:

Gemini Pro
Claude 2.1
Claude 2.0
Claude 1.0
Claude Instant 1
Mistral Medium

Versions of models already surpassed by fine-tune or new architectures:

Falcon 180B
Llama 1 and Llama 2
Guanaco 65B
Palm 2 Chat Bison
Dolly V2
Alpaca
CodeLlama-34b-Instruct-hf
Mistral-7B-v0.1
MythoMax-L2
Zephyr 7B Alpha and Beta
Airoboros 70b
OpenChat-3.5-1210
StableLM Tuned Alpha
Stable Beluga 2