MMLU (Massive Multitask Language Understanding) is a benchmark that measures a model's knowledge and language understanding across 57 tasks covering subjects from elementary mathematics to law.
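
A minimal sketch of how one MMLU item can be rendered as a four-option multiple-choice prompt. It assumes the cais/mmlu dataset on the Hugging Face Hub, whose items carry a question, a list of four choices, and an integer answer index; the prompt format itself is an illustrative convention, not part of the benchmark.

```python
# Sketch: format one MMLU item as a multiple-choice prompt.
# Assumes the cais/mmlu dataset on the Hugging Face Hub.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")

def format_prompt(item):
    """Render the question plus four lettered options, ending with 'Answer:'."""
    letters = ["A", "B", "C", "D"]
    options = "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(letters, item["choices"])
    )
    return f"{item['question']}\n{options}\nAnswer:"

item = mmlu[0]
print(format_prompt(item))
print("Gold answer:", ["A", "B", "C", "D"][item["answer"]])
```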

MT-Bench: a benchmark of multi-turn, open-ended questions prepared by the LMSYS (Chatbot Arena) team. Uses GPT-4 as a judge to score responses.
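
A minimal sketch of the GPT-4-as-judge pattern MT-Bench relies on. The official judge prompts and rubric live in the LMSYS FastChat repository, so the template and function below are simplified illustrative assumptions, not the official ones; it assumes the openai Python client (v1+) and an OPENAI_API_KEY environment variable.

```python
# Sketch of single-answer LLM-as-judge grading in the MT-Bench style.
# JUDGE_TEMPLATE is a simplified stand-in for the official judge prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = (
    "Please act as an impartial judge and evaluate the quality of the response "
    "provided by an AI assistant to the user question displayed below. "
    "Rate the response on a scale of 1 to 10 and reply with only the number.\n\n"
    "[Question]\n{question}\n\n"
    "[Assistant's Answer]\n{answer}"
)

def judge_score(question: str, answer: str) -> str:
    """Ask GPT-4 to grade one response; returns the raw rating text."""
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep grading deterministic
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(question=question, answer=answer),
        }],
    )
    return completion.choices[0].message.content
```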

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade-school math word problems created by human problem writers. A bright middle school student should be able to solve every problem.
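
A minimal sketch of GSM8K answer checking. It assumes the gsm8k dataset on the Hugging Face Hub, where each reference solution ends with the final number after a "####" marker; extracting the last number from the model's output is a common scoring heuristic, not part of the dataset itself.

```python
# Sketch: score a model's output against a GSM8K reference solution.
# Reference solutions end with "#### <final number>".
import re
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main", split="test")

def gold_answer(solution: str) -> str:
    """Extract the final numeric answer after the '####' marker."""
    return solution.split("####")[-1].strip().replace(",", "")

def is_correct(model_output: str, solution: str) -> bool:
    """Common heuristic: compare the last number the model wrote to the gold answer."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", model_output)
    return bool(numbers) and numbers[-1].replace(",", "") == gold_answer(solution)

item = gsm8k[0]
print(item["question"])
print("Gold:", gold_answer(item["answer"]))
print(is_correct("So the answer is 18.", item["answer"]))
```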

Best models for solving math problems:

Models with the best cost-benefit ratio:

Models with a low rate of hallucinations:

Models with a high rate of hallucinations:

Open models:

Can be trained through an online service:

Can be trained locally:

Has a widely available API service:

Models at the same level as GPT-4:

Models at or above the level of GPT-3.5 but below GPT-4:

Model versions already surpassed by fine-tunes or newer architectures: