The MMLU (Massive Multitask Language Understanding) test is a benchmark that measures language understanding and performance on 57 tasks.
MT-Bench: Benchmark test with questions prepared by the Chatbot Arena team. Uses GPT-4 to evaluate responses.
GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.
Vectara's Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document.
Best models for solving math problems:
- gpt-4-0125-preview (turbo)
- gpt-4-1106-preview (turbo)
- gpt-4-0613
- gpt-4-0314
- Gemini Ultra 1.0
- Gemini Pro 1.5
- Claude 3 Opus
Best models for large text:
- gpt-4-0125-preview (turbo)
- gpt-4-1106-preview (turbo)
- Gemini Ultra
- Gemini Pro 1.5
- Claude 3 Opus
- Claude 3 Sonnet
- Claude 3 Haiku
- Claude 2-2.1
- Claude Instant 1-1.2
Models with the best cost benefit:
- Gemini Pro 1.0
- Gemini Pro 1.5
- gpt-3.5-turbo-0613
- gpt-3.5-turbo-1106
- Claude 3 Haiku
- Claude Instant 1-1.2
- Mixtral 8x7B Instruct
Models with fewer hallucinations:
- gpt-4-0125-preview (turbo)
- gpt-4-1106-preview (turbo)
- gpt-4-0613
- gpt-4-0314
- Gemini Ultra 1.0
- Gemini Pro 1.5
- Claude 2.1
- Intel Neural Chat 7B
Models with a high level of hallucinations:
- Microsoft Phi 2
- Mistral 7B
- Google Palm 2
- Mixtral 8x7B Instruct
- Yi 34B
Open Models:
- Mixtral 8x7B Instruct
- Yi 34B
Can be trained in online service:
- gpt-3.5-turbo-1106
- gpt-3.5-turbo-0613
- gpt-4-0613
Can be trained locally:
- Mixtral 8x7B Instruct
- Yi 34B
Has widely available api service:
- gpt-4-0125-preview (turbo) - OpenAI
- gpt-4-1106-preview (turbo) - OpenAI
- gpt-4-0613 - OpenAI
- gpt-4-0314 - OpenAI
- gpt-3.5-turbo-1106 - OpenAI
- gpt-4-0314 - OpenAI
- Gemini Pro 1.0 - Openrouter with compatibility with OpenAI api, Google api service.
- Claude 3 - Openrouter with compatibility with OpenAI api, Anthropic api service.
- Claude 2-2.1 - Openrouter with compatibility with OpenAI api, Anthropic api service.
- Claude Instant 1-1.2 - Openrouter with compatibility with OpenAI api, Anthropic api service.
- Mistral Medium - Openrouter with compatibility with OpenAI api, Mistral service has a waiting list.
- Mixtral 8x7B Instruct - Deepinfra with compatibility with OpenAI api.
- Yi 34B - Deepinfra with compatibility with OpenAI api.
Models with the same level of GPT-4:
- Gemini Ultra
- Gemini Pro 1.5
- Gemini Pro (Bard/Online)
- Claude 3 Opus
Models with the same level or better than GPT-3.5 but lower than GPT-4:
- Gemini Pro
- Claude 3 Sonnet
- Claude 3 Haiku
- Claude 2-2.1
- Claude 1
- Claude Instant 1-1.2
- Mistral Medium
Versions of models already surpassed by fine-tune, new versions or new architectures:
- gpt-4-0314
- Claude 2-2.1
- Claude Instant 1-1.2
- Falcon 180B
- Llama 1 and Llama 2
- Guanaco 65B
- Palm 2 Chat Bison
- Dolly V2
- Alpaca
- CodeLlama-34b-Instruct-hf
- Mistral-7B-v0.1
- MythoMax-L2
- Zephyr 7B Alpha and Beta
- Airoboros 70b
- OpenChat-3.5-1210
- StableLM Tuned Alpha
- Stable Beluga 2