Abstract
RLP, an information-driven reinforcement pretraining objective, enhances reasoning models by integrating exploration into pretraining, leading to significant performance improvements across various benchmarks.
The dominant paradigm for training large reasoning models starts with pre-training using a next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful for scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. But is this dominant recipe an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning -- exploration -- to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, teaching independent thinking behavior earlier in pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both the context and a sampled reasoning chain, compared to conditioning on the context alone. This approach yields a verifier-free, dense reward signal, allowing for efficient training on the full document stream during pretraining. In this way, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
Community
Let’s discuss: Why Reinforcement Pretraining (RLP) for reasoning
Motivation.
Today’s LLMs learn almost everything through next‑token prediction, and only after pretraining are they taught to reason via SFT and RLHF/RLVR. That split means the base model never practices “thinking before predicting” while it’s learning from raw text. We asked: what if exploration happened during pretraining itself, and we rewarded thoughts that genuinely help predict the next token, without any task‑specific verifiers?
What we built.
RLP is a pretraining objective that treats a short chain‑of‑thought (CoT) as an action taken before predicting each token.
For each position, the model samples a brief thought, then scores the true next token twice:
- with the sampled thought, and
- with a “no‑think” baseline (a slowly updated EMA teacher).
The reward is the increase in the model’s log‑likelihood of the observed next token with the thought compared to without it.
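A minimal sketch of this information‑gain reward, assuming a Hugging Face causal LM stands in for both the policy and its EMA “no‑think” teacher (GPT‑2 is used purely as a runnable stand‑in, and the helper names `next_token_logprob` / `information_gain_reward` are illustrative, not the paper’s implementation):

```python
# Hedged sketch: reward = log p_policy(next | context, thought) - log p_teacher(next | context).
# In RLP the thought is sampled from the policy itself and the teacher is a slowly
# updated EMA copy of the policy; here both are just pretrained GPT-2 for a runnable toy.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")
ema_teacher = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the EMA copy

@torch.no_grad()
def next_token_logprob(model, prefix_ids: torch.Tensor, target_id: int) -> float:
    """Log-probability the model assigns to `target_id` right after `prefix_ids`."""
    logits = model(prefix_ids.unsqueeze(0)).logits[0, -1]
    return F.log_softmax(logits, dim=-1)[target_id].item()

def information_gain_reward(context: str, thought: str, next_token: str) -> float:
    """Dense, verifier-free reward: how much the thought improved next-token prediction."""
    target_id = tok(next_token, add_special_tokens=False)["input_ids"][0]
    ctx_ids = tok(context, return_tensors="pt")["input_ids"][0]
    ctx_thought_ids = tok(context + thought, return_tensors="pt")["input_ids"][0]
    with_thought = next_token_logprob(policy, ctx_thought_ids, target_id)
    without_thought = next_token_logprob(ema_teacher, ctx_ids, target_id)
    return with_thought - without_thought  # positive iff the thought actually helped

# Example: score a thought that (hopefully) makes the observed next token more predictable.
r = information_gain_reward(
    context="2 + 2 =", thought=" Let me add: two plus two is four, so", next_token=" 4"
)
print(f"information-gain reward: {r:.4f}")
```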
We update only the thought tokens using a clipped, group‑relative advantage objective; we do not backprop through the reward scores.
The signal is dense (every position gets a reward), verifier‑free (works on ordinary text), and scales to full documents (no entropy filtering or curated checkers).
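And a hedged sketch of the update step described above: sample a group of thoughts per position, standardize their scalar rewards into group‑relative advantages (treated as constants, so no gradient flows through the reward scores), and apply a clipped surrogate loss restricted to the thought tokens. The normalization and clipping constants below are assumptions, not the paper’s settings:

```python
# Hedged sketch of a clipped, group-relative policy update on thought tokens only.
import torch

def clipped_group_relative_loss(
    logp_new: torch.Tensor,      # [G, T] thought-token log-probs under the current policy
    logp_old: torch.Tensor,      # [G, T] thought-token log-probs under the sampling policy
    rewards: torch.Tensor,       # [G] scalar information-gain rewards (no grad through these)
    thought_mask: torch.Tensor,  # [G, T] 1.0 on thought tokens, 0.0 elsewhere
    clip_eps: float = 0.2,       # assumed clipping range
) -> torch.Tensor:
    # Group-relative advantage: standardize rewards within the group of G sampled thoughts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    adv = adv.detach().unsqueeze(-1)                              # [G, 1], constant w.r.t. the policy

    ratio = torch.exp(logp_new - logp_old.detach())               # per-token importance ratios
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = torch.minimum(unclipped, clipped) * thought_mask  # update thought tokens only
    return -per_token.sum() / thought_mask.sum().clamp(min=1.0)   # negate to maximize the surrogate

# Toy usage with random tensors, just to show the shapes (G=4 thoughts, T=8 thought tokens).
G, T = 4, 8
logp_old = torch.randn(G, T)
logp_new = (logp_old + 0.05 * torch.randn(G, T)).requires_grad_()
loss = clipped_group_relative_loss(logp_new, logp_old, torch.randn(G), torch.ones(G, T))
loss.backward()
print(f"surrogate loss: {loss.item():.4f}")
```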
Why earlier approaches (e.g., RPT) fell short.
Prior “reinforcement pretraining” with prefix‑matching rewards (RPT) usually relies on:
- Sparse, binary rewards tied to next‑token correctness that ignore the content of the thought,
- Auxiliary entropy filters to pick a subset of tokens to train on, and
- Experiments on distilled checkpoints, leaving open whether it helps base models.
RLP instead delivers a continuous, per‑token improvement signal, trains on all tokens in full documents, needs no side model or heuristics, and is explicitly designed to shape base‑model thinking early.
Key results.
Qwen3‑1.7B, base pretraining:
- Overall average across our math‑and‑science suite improves by about 19% vs. the base model (36.03 vs. 30.32) and about 17% vs. continuous pretraining on the same tokens (36.03 vs. 30.85).
- Under compute matching, we gave the CPT baseline 35× more tokens to equalize FLOPs; RLP still leads on the same setup by +5.32 points overall on the NC corpus (43.36 vs. 38.04).
- Versus RPT with matched data/compute, RLP improves Overall Avg by +1.66 points (about +4% relative) and also leads on Math and Science aggregates.
After identical post‑training (SFT + RLVR) for all models:
- Gains compound: RLP+Post 42.51 vs. Base+Post 39.34 (about +8% relative) and vs. CPT+Post 39.90 (about +6.5% relative).
- The biggest lifts are on reasoning‑heavy tasks such as AIME25 and MMLU‑Pro.
Scaling to a 12B hybrid (Nemotron‑Nano‑12B‑v2):
- Applying only 250M RLP tokens to an intermediate checkpoint boosts the overall average from 42.81 to 61.32 (+18.51 points; about +43% relative).
- Science Avg rises from 34.51 to 57.26 (about +23 points), showing strong cross‑domain transfer, not just math‑specific gains.
Domain breadth and data practicality:
- RLP works on SFT‑style reasoning corpora and general pretraining sources (academic papers, textbooks, web QA).
- Improvements persist even when the baseline sees vastly more tokens to match FLOPs, indicating the benefits come from the objective rather than from a larger data budget.
In short, RLP rewards “useful thoughts” during pretraining. The signal is simple, dense, verifier‑free, and compatible with standard pipelines—yielding stronger base models whose gains survive and compound after alignment.
Paper: https://arxiv.org/pdf/2510.01265
Code: https://github.com/NVlabs/RLP
The "code" has only a license and a PDF in there??
Also any reason to not put the artifacts/models on Hugging Face?
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Reinforcement Mid-Training (2025)
- Reinforcement Learning on Pre-Training Data (2025)
- VCRL: Variance-based Curriculum Reinforcement Learning for Large Language Models (2025)
- Learning to Reason as Action Abstractions with Scalable Mid-Training RL (2025)
- Proximal Supervised Fine-Tuning (2025)
- One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient (2025)
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification (2025)