ArXiv | Models | Data | Code | Blog | Sample Explorer

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck

Llemma 7B is a language model for mathematics. It was initialized with Code Llama 7B weights, and trained on the Proof-Pile-2 for 200B tokens.

This model also comes in a 34B parameter version: Llemma 34B.

Evaluations

Llemma models are particularly strong at chain-of-thought mathematical reasoning and using computational tools for mathematics, such as Python and formal theorem provers.

Chain-of-thought Math

On chain-of-thought mathematics tasks, Llemma models outperform Llama 2 and Code Llama and, when controlled for model size, Minerva.

Model       Size   GSM8k   OCW     MMLU-STEM   SAT     MATH
----------  -----  ------  ------  ----------  ------  ------
Llama 2     7B     11.8%   3.7%    29.9%       25%     3.2%
Code Llama  7B     10.5%   4.4%    25.1%       9.4%    4.5%
LLEMMA      7B     36.4%   7.7%    37.7%       53.1%   18.0%
Minerva     8B     16.2%   7.7%    35.6%       -       14.1%
----------  -----  ------  ------  ----------  ------  ------
Code Llama  34B    29.6%   7.0%    40.5%       40.6%   12.2%
LLEMMA      34B    51.5%   11.8%   49.0%       71.9%   25.0%
----------  -----  ------  ------  ----------  ------  ------
Minerva     62B    52.4%   12.0%   53.9%       -       27.6%
Minerva     540B   58.8%   17.6%   63.9%       -       33.6%

Further performance can be extracted by using majority voting:

Model       Size   GSM8k maj@100  OCW maj@100  MMLU-STEM maj@16  SAT maj@16  MATH maj@256
----------  -----  -------------  -----------  ----------------  ----------  ------------
LLEMMA      7B     54.0%          14.3%        49.9%             78.1%       33.5%
Minerva     8B     28.4%          12.5%        43.4%             -           25.4%
----------  -----  -------------  -----------  ----------------  ----------  ------------
LLEMMA      34B    69.3%          18.4%        59.7%             81.3%       43.1%
----------  -----  -------------  -----------  ----------------  ----------  ------------
Minerva     62B    68.5%          23.5%        63.5%             -           43.4%
Minerva     540B   78.5%          30.8%        75.0%             -           50.3%
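
For readers unfamiliar with the maj@k columns, the idea can be sketched as follows: sample k solutions per problem, extract a final answer from each, and score the most frequent answer. This is a minimal illustration, not the evaluation code behind the numbers above. The function names `majority_vote` and `maj_at_k_accuracy` are hypothetical, and the exact string comparison is an assumption; a real evaluation would typically normalize answers or check mathematical equivalence before counting votes.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer among k samples, or None."""
    # Discard samples from which no final answer could be extracted.
    valid = [a.strip() for a in answers if a is not None and a.strip()]
    if not valid:
        return None
    # Counter.most_common keeps equal counts in first-seen order,
    # so ties go to the answer that appeared earliest.
    return Counter(valid).most_common(1)[0][0]


def maj_at_k_accuracy(
    samples_per_problem: Sequence[Sequence[Optional[str]]],
    references: Sequence[str],
) -> float:
    """Fraction of problems whose majority answer matches the reference.

    Assumes exact string match (hypothetical simplification).
    """
    if len(samples_per_problem) != len(references):
        raise ValueError("need one reference answer per problem")
    if not references:
        return 0.0
    correct = sum(
        majority_vote(samples) == ref.strip()
        for samples, ref in zip(samples_per_problem, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # Two toy problems with k = 3 sampled answers each.
    samples = [["4", "4", "5"], ["7", "8", "8"]]
    print(maj_at_k_accuracy(samples, ["4", "7"]))
```

In this toy run the first problem is scored correct (the vote is "4") and the second incorrect (the vote is "8" against a reference of "7"). With k = 1 the procedure reduces to ordinary single-sample accuracy.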

Tool Use and Theorem Proving

In addition to chain-of-thought reasoning, Llemma has strong capabilities in computational mathematics tasks. For tool use and formal theorem proving evaluations, see our paper.

Citation

@misc{azerbayev2023llemma,
      title={Llemma: An Open Language Model For Mathematics}, 
      author={Zhangir Azerbayev and Hailey Schoelkopf and Keiran Paster and Marco Dos Santos and Stephen McAleer and Albert Q. Jiang and Jia Deng and Stella Biderman and Sean Welleck},
      year={2023},
      eprint={2310.10631},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}