arxiv:2509.20186

Thinking Augmented Pre-training

Published on Sep 24

· Submitted by

Liang Wang on Sep 26

Upvote

Authors:

Liang Wang ,

Shaohan Huang ,

Li Dong ,

Furu Wei

Abstract

Thinking augmented pre-training improves data efficiency and performance of large language models by augmenting text with automatically generated thinking trajectories.

AI-generated summary

This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories. The compute for pre-training LLMs has been growing at an unprecedented rate, while the availability of high-quality data remains limited. Consequently, maximizing the utility of available data constitutes a significant research challenge. A primary impediment is that certain high-quality tokens are difficult to learn given a fixed model capacity, as the underlying rationale for a single token can be exceptionally complex and deep. To address this issue, we propose Thinking augmented Pre-Training (TPT), a universal methodology that augments text with automatically generated thinking trajectories. Such augmentation effectively increases the volume of the training data and makes high-quality tokens more learnable through step-by-step reasoning and decomposition. We apply TPT across diverse training configurations up to 100B tokens, encompassing pre-training with both constrained and abundant data, as well as mid-training from strong open-source checkpoints. Experimental results indicate that our method substantially improves the performance of LLMs across various model sizes and families. Notably, TPT enhances the data efficiency of LLM pre-training by a factor of 3. For a 3B parameter model, it improves the post-training performance by over 10% on several challenging reasoning benchmarks.

View arXiv page View PDF Add to collection

Community

intfloat

Paper author Paper submitter 19 days ago

•

edited 19 days ago

We introduce a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories.