The MMLU (Massive Multitask Language Understanding) test is a benchmark that measures language understanding and performance on 57 tasks.
MT-Bench: Benchmark test with questions prepared by the Chatbot Arena team. Uses GPT-4 to evaluate responses.
GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.
Vectara's Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document.
Best models for solving math problems:
- gpt-4o-2024-05-13
- gpt-4-Turbo-2024-04-09
- gpt-4-0125-preview (turbo)
- gpt-4-1106-preview (turbo)
- gpt-4-0613
- gpt-4-0314
- Gemini Ultra 1.0
- Gemini Pro 1.5
- Gemini Advanced
- Claude 3 Opus
- Claude 3 Sonnet
Best models for large text:
- gpt-4o-2024-05-13
- gpt-4-Turbo-2024-04-09
- gpt-4-0125-preview (turbo)
- gpt-4-1106-preview (turbo)
- Gemini Ultra 1.0
- Gemini Pro 1.5
- Gemini Advanced
- Claude 3 Opus
- Claude 3 Sonnet
- Claude 3 Haiku
- Claude 2-2.1
- Claude Instant 1-1.2
Models with the best cost benefit:
- gpt-4o-2024-05-13
- Gemini Pro 1.5
- gpt-3.5-turbo-0125
- gpt-3.5-turbo-0613
- Claude 3 Haiku
- Meta Llama 3 70B Instruct
Models with fewer hallucinations:
- gpt-4o-2024-05-13
- gpt-4-0125-preview (turbo)
- gpt-4-1106-preview (turbo)
- gpt-4-0613
- gpt-4-0314
- Gemini Ultra 1.0
- Gemini Pro 1.5
- Claude 2.1
- Snowflake Arctic Instruct
- Intel Neural Chat 7B
Models with a high level of hallucinations:
- Gemma 1-1.1 7B
- DBRX Instruct
- Microsoft Phi 2
- Mistral 7B
- Google Palm 2
- Mixtral 8x7B Instruct
- Yi 34B
Open Models:
- Mixtral 8x7B Instruct
- Mistral 7B
- Phi-3
- Yi 34B
- Grok 1
- DBRX Instruct
- Llama 3 8-70B
- Gemma 2-7B
Can be trained in online service:
- gpt-3.5-turbo-1106
- gpt-3.5-turbo-0613
- gpt-4-0613
Can be trained locally:
- Llama 3 8-70B
- Mixtral 8x7B Instruct
- Yi 34B
Has widely available api service:
- gpt-4-0125-preview (turbo) - OpenAI
- gpt-4-1106-preview (turbo) - OpenAI
- gpt-4-0613 - OpenAI
- gpt-4-0314 - OpenAI
- gpt-3.5-turbo-1106 - OpenAI
- gpt-4-0314 - OpenAI
- Gemini Pro 1.0-1.5 - Openrouter with compatibility with OpenAI api, Google api service.
- Claude 3 - Openrouter with compatibility with OpenAI api, Anthropic api service.
- Claude 2-2.1 - Openrouter with compatibility with OpenAI api, Anthropic api service.
- Claude Instant 1-1.2 - Openrouter with compatibility with OpenAI api, Anthropic api service.
- Mistral Medium - Openrouter with compatibility with OpenAI api, Mistral service has a waiting list.
- Mixtral 8x7B Instruct - Deepinfra with compatibility with OpenAI api.
- Yi 34B - Deepinfra with compatibility with OpenAI api.
Models with the same level of GPT-4 Turbo:
Models with the same level of GPT-4 but lower than GPT-4 Turbo:
- Gemini Ultra 1.0
- Gemini Pro 1.5
- Gemini Advanced
- Gemini Pro (Bard/Online)
- Claude 3 Sonnet
Models with the same level or better than GPT-3.5 but lower than GPT-4:
- Claude 3 Haiku
- Claude 2-2.1
- Claude 1
- Claude Instant 1-1.2
- Phi-3 Medium
- Llama 3 70B Instruct
- Gemini-1.5-Flash-API-0514
- Command R+
Versions of models already surpassed by fine-tune, new versions or new architectures:
- gpt-4-0613
- gpt-4-0314
- Gemini Pro 1.0
- Grok 1
- Phi-2
- DBRX Instruct
- Mistral Medium
- Gemma 1.0 7B
- Zephyr-ORPO-141b-A35b-v0.1
- Yi 1.0 34B
- gpt-4-0613
- gpt-4-0314
- Claude 2-2.1
- Claude Instant 1-1.2
- Qwen 1.0
- Falcon 180B
- Llama 1 and Llama 2
- Guanaco 65B
- Palm 2 Chat Bison
- Dolly V2
- Alpaca
- CodeLlama-34b-Instruct-hf
- SOLAR-10.7B-Instruct-v1.0
- Mistral-7B-v0.2
- Mistral-7B-v0.1
- MythoMax-L2
- Zephyr 7B Alpha and Beta
- Airoboros 70b
- OpenChat-3.5-1210
- StableLM Tuned Alpha
- Stable Beluga 2
Best OpenAI Models:
- gpt-4o-2024-05-13
- gpt-4-Turbo-2024-04-09
- gpt-4-0125-preview (turbo)
- gpt-4-1106-preview (turbo)
- gpt-3.5-turbo-0613
- gpt-3.5-turbo-0125
API services:
- Openrouter
- OpenAI
- Google Cloud
- Anthropic
- Azure
- Deepinfra
- Together
- OctoAI
- Lepton
- Fireworks
- Perplexity
- Groq
- Mistral
- NovitaAI
- Cohere
- DeepSeek