---
license: llama2
datasets:
  - EleutherAI/proof-pile-2
language:
  - en
tags:
  - math
  - reasoning
---

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck

Github | ArXiv

Llemma 7B is a language model for mathematics. It was initialized with Code Llama 7B weights and then trained on Proof-Pile-2 for 200B tokens.

This model also comes in a 34B parameter version: Llemma 34B.
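As a rough starting point, the model can probably be loaded like any other causal language model through the Hugging Face `transformers` library. The sketch below assumes the Hub identifier `EleutherAI/llemma_7b` and a simple "Problem / Solution" prompt layout; neither is specified in this README, so treat both as placeholders to adjust.

```python
# Assumed Hub identifier; this README does not state it explicitly.
MODEL_ID = "EleutherAI/llemma_7b"


def build_prompt(problem: str) -> str:
    """Format a problem as a plain completion prompt (hypothetical layout)."""
    return f"Problem:\n{problem.strip()}\n\nSolution:\n"


def generate_solution(problem: str, max_new_tokens: int = 256) -> str:
    """Generate a continuation for the prompt with default (greedy) decoding."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated continuation.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_solution("What is 12 * 7?")` downloads the weights on first use, so it needs substantial disk space and memory for a 7B-parameter model.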

Evaluations

Llemma models are particularly strong at chain-of-thought mathematical reasoning and using computational tools for mathematics, such as Python and formal theorem provers.

Chain-of-thought Math

On chain-of-thought mathematics tasks, Llemma models outperform Llama 2 and Code Llama and, when controlled for model size, also outperform Minerva.

| Model      | Size | GSM8k | OCW   | MMLU-STEM | SAT   | MATH  |
|------------|------|-------|-------|-----------|-------|-------|
| Llama 2    | 7B   | 11.8% | 3.7%  | 29.9%     | 25%   | 3.2%  |
| Code Llama | 7B   | 10.5% | 4.4%  | 25.1%     | 9.4%  | 4.4%  |
| LLEMMA     | 7B   | 36.4% | 7.7%  | 37.7%     | 53.1% | 17.2% |
| Minerva    | 8B   | 16.2% | 7.7%  | 35.6%     | -     | 14.1% |
| Code Llama | 34B  | 29.6% | 7.0%  | 40.5%     | 40.6% | 11.9% |
| LLEMMA     | 34B  | 51.5% | 11.8% | 49.0%     | 71.9% | 24.1% |
| Minerva    | 62B  | 52.4% | 12.0% | 53.9%     | -     | 27.6% |
| Minerva    | 540B | 58.8% | 17.6% | 63.9%     | -     | 33.6% |

Further performance can be extracted by using majority voting:

| Model   | Size | GSM8k maj@100 | OCW maj@100 | MMLU-STEM maj@16 | SAT maj@16 | MATH maj@256 |
|---------|------|---------------|-------------|------------------|------------|--------------|
| LLEMMA  | 7B   | 54.0%         | 14.3%       | 49.9%            | 78.1%      | 32.0%        |
| Minerva | 8B   | 28.4%         | 12.5%       | 43.4%            | -          | 25.4%        |
| LLEMMA  | 34B  | 69.3%         | 18.4%       | 59.7%            | 81.3%      | 41.0%        |
| Minerva | 62B  | 68.5%         | 23.5%       | 63.5%            | -          | 43.4%        |
| Minerva | 540B | 78.5%         | 30.8%       | 75.0%            | -          | 50.3%        |
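For readers unfamiliar with the maj@k columns, the idea is to sample k solutions per problem and score the most common final answer. The following is a minimal sketch of that aggregation step, assuming final answers have already been extracted as strings and compared by exact match after stripping whitespace. The function names are illustrative, and the answer extraction and equivalence checks used for the reported numbers are not described in this README and may differ.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer among the k samples.

    Ties are broken in favor of the answer that appeared first.
    """
    # Light normalization so that " 42" and "42" count as the same answer.
    cleaned = [a.strip() for a in answers if a is not None and a.strip()]
    if not cleaned:
        return None
    # most_common keeps first-seen order among answers with equal counts.
    return Counter(cleaned).most_common(1)[0][0]


def maj_at_k_accuracy(
    samples_per_problem: Sequence[Sequence[Optional[str]]],
    references: Sequence[str],
) -> float:
    """Fraction of problems whose majority answer exactly matches the reference."""
    correct = sum(
        majority_vote(samples) == reference
        for samples, reference in zip(samples_per_problem, references)
    )
    return correct / len(references)


# Toy data: two problems with three sampled answers each (k = 3).
toy_samples = [["42", " 42", "41"], ["7", "8", "9"]]
toy_references = ["42", "8"]
toy_accuracy = maj_at_k_accuracy(toy_samples, toy_references)
```

In the toy data, the first problem is solved by a clear majority, while the second is a three-way tie that resolves to the first sampled answer and is therefore counted as wrong.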

Tool Use and Theorem Proving

In addition to chain-of-thought reasoning, Llemma has strong capabilities in computational mathematics tasks. For tool use and formal theorem proving evaluations, see our paper.