The MMLU (Massive Multitask Language Understanding) test is a benchmark that measures language understanding and performance on 57 tasks.
MT-Bench: Benchmark test with questions prepared by the Chatbot Arena team. Uses GPT-4 to evaluate responses.
GSM8K is a dataset of 8.5K high quality linguistically diverse grade school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.
Best models for solving math problems:
- gpt-4-0125-preview (turbo)
- gpt-4-1106-preview (turbo)
- gpt-4-0613
- gpt-4-0314
- Gemini Ultra
- Gemini Pro 1.5
Models with the best cost benefit:
- Gemini Pro
- Gemini Pro 1.5
- gpt-3.5-turbo-0613
- gpt-3.5-turbo-1106
- Claude Instant 1
- Mixtral 8x7B Instruct
- Mistral Medium
Models with fewer hallucinations:
- gpt-4-0125-preview (turbo)
- gpt-4-1106-preview (turbo)
- gpt-4-0613
- gpt-4-0314
- Gemini Ultra
- Gemini Pro 1.5
- Claude 2.1
Models with a high level of hallucinations:
- Mixtral 8x7B Instruct
- Yi 34B
Open Models:
- Mixtral 8x7B Instruct
- Yi 34B
Can be trained in online service:
- gpt-3.5-turbo-1106
- gpt-3.5-turbo-0613
- gpt-4-0613
Can be trained locally:
- Mixtral 8x7B Instruct
- Yi 34B
Has widely available api service:
- gpt-4-0125-preview (turbo) - OpenAI
- gpt-4-1106-preview (turbo) - OpenAI
- gpt-4-0613 - OpenAI
- gpt-4-0314 - OpenAI
- gpt-3.5-turbo-1106 - OpenAI
- gpt-4-0314 - OpenAI
- Gemini Pro - Openrouter with compatibility with OpenAI api, Google service has a waiting list.
- Claude Instant 1 - Openrouter with compatibility with OpenAI api, Anthropic service has a waiting list.
- Mistral Medium - Openrouter with compatibility with OpenAI api, Mistral service has a waiting list.
- Mixtral 8x7B Instruct - Deepinfra with compatibility with OpenAI api.
- Yi 34B - Deepinfra with compatibility with OpenAI api.
Models with the same level of GPT-4:
- Gemini Ultra
- Gemini Pro 1.5
- Gemini Pro (Bard/Online)
Models with the same level or better than GPT-3.5 but lower than GPT-4:
- Gemini Pro
- Claude 2.1
- Claude 2.0
- Claude 1.0
- Claude Instant 1
- Mistral Medium
Versions of models already surpassed by fine-tune or new architectures:
- Falcon 180B
- Llama 1 and Llama 2
- Guanaco 65B
- Palm 2 Chat Bison
- Dolly V2
- Alpaca
- CodeLlama-34b-Instruct-hf
- Mistral-7B-v0.1
- MythoMax-L2
- Zephyr 7B Alpha and Beta
- Airoboros 70b
- OpenChat-3.5-1210
- StableLM Tuned Alpha
- Stable Beluga 2