Qwen2.5-Math-7B Abstract-CoT (compressed reasoning via GRPO)

Reimplementation of Abstract-CoT (Ramji, Naseem, Astudillo, "Thinking Without Words", arXiv:2604.22709) at academic-scale compute on a single H100. Built on Qwen2.5-Math-7B-Instruct.

This is the GRPO best-by-eval checkpoint (step 750 / 800).

Headline numbers (MATH-500)

Setup	Accuracy	Mean tokens	n
Qwen2.5-Math-7B verbal CoT (unconstrained, maj@8 T=0.6)	87.2%	625	500
Qwen2.5-Math-7B verbal CoT (truncated@240, maj@8 T=0.6)	19.0%	235	100
This model (maj@8 T=0.6, max_answer_len=256)	51.0%	237	100
This model (maj@8 T=0.6, max_answer_len=320)	57.0%	271	100

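At max_answer_len=320, the compressed-CoT model retains about 65% of the unconstrained verbal-CoT accuracy (57.0% vs. 87.2%) while using about 43% of the tokens (271 vs. 625). Both compressed settings score well above the budget-matched truncated verbal baseline: +32pp at max_answer_len=256 and +38pp at max_answer_len=320. Note that the unconstrained baseline was evaluated on n=500 problems, while the other rows use n=100.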
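For readers unfamiliar with the maj@8 metric in the table above, the following is a minimal sketch of majority-vote scoring. It assumes answers have already been extracted and normalized to strings, and that correctness is plain string equality; the actual evaluation code may use a math-aware equivalence check instead. The function names are illustrative and not taken from the repository.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent extracted answer among k samples.

    `None` entries (samples with no parseable answer) are ignored.
    Ties go to the answer that appeared first among the samples.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]

def maj_at_k_accuracy(samples_per_problem, gold_answers):
    """Fraction of problems whose majority-vote answer equals the gold answer."""
    correct = sum(
        majority_vote(samples) == gold
        for samples, gold in zip(samples_per_problem, gold_answers)
    )
    return correct / len(gold_answers)

# Toy example with k=4 samples per problem (the table uses k=8).
samples = [["4", "4", "5", None], ["7", "9", "9", "9"]]
gold = ["4", "7"]
print(maj_at_k_accuracy(samples, gold))  # 0.5
```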

Honest caveats

The abstract-token mechanism reported in the paper does not emerge at this compute budget. Direct ablation on this checkpoint (n=100, maj@8) yields:

Condition	Accuracy	Δ vs baseline
Normal z̃	51%	—
Random z̃ (replace each position with a random V_abs token)	54%	+3pp
Zero z̃ (skip the abstract block entirely)	52%	+1pp

The model produces a near-constant 9-token abstract prefix regardless of input; all of the load-bearing reasoning happens in the answer phase as verbal CoT, just compressed under a length budget. This is a budget-constrained verbal-CoT model with a vestigial abstract prefix, not a latent reasoning model in the mechanistic sense the paper claims.
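To put the ablation deltas in context, here is a rough back-of-the-envelope check of sampling noise at n=100. It treats each condition as an independent binomial sample, which is an assumption: the three conditions presumably share the same 100 problems, so a paired test on per-problem outcomes would be more appropriate, but those outcomes are not reported here.

```python
import math

def binomial_se(p, n):
    """Standard error of an accuracy estimate p measured on n problems."""
    return math.sqrt(p * (1.0 - p) / n)

n = 100
baseline = 0.51  # Normal z̃
conditions = {"random z̃": 0.54, "zero z̃": 0.52}

print(f"baseline SE: {binomial_se(baseline, n):.3f}")
for name, acc in conditions.items():
    delta = acc - baseline
    # SE of the difference between two independent proportions (assumption).
    se_diff = math.sqrt(binomial_se(baseline, n) ** 2 + binomial_se(acc, n) ** 2)
    print(f"{name}: delta={delta:+.2f}, SE of difference={se_diff:.3f}")
```

The standard error of a single accuracy near 50% at n=100 is roughly 5 percentage points, so the +3pp and +1pp differences are well inside the noise. This is consistent with reading the ablation as "no measurable effect of the abstract block" rather than as evidence that random or empty prefixes help.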

Training recipe

Substrate: Qwen2.5-Math-7B-Instruct
Warmup: 1 policy-iteration round (Phase A + Phase B), 3000 examples from Dolci-Think-SFT-7B, 1 epoch each, full fine-tuning, on 1× H100. ~70 minutes. Post-warmup MATH-500 accuracy: 15.6% (n=32 probe).
GRPO: 800 steps, beta=0.03, lr=5e-6, length_penalty_coef=0.1, m_max=64 abstract tokens, min_z_length=8, max_answer_len=256, group_size=4, per_prompt_batch=2, grad_accum=4, optimizer_8bit (paged AdamW). RL data: NuminaMath-CoT problems. Reward: MathVerifierReward (boxed-answer match) + length penalty.
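As an illustration of how the GRPO settings above fit together, the sketch below computes group-relative advantages for one prompt with group_size=4. The group normalization (subtract the group mean, divide by the group standard deviation) is the standard GRPO formulation. The reward shape is an assumption: the exact form of the length penalty in this training run is not documented here, so `reward` below is a hypothetical stand-in that subtracts length_penalty_coef times the fraction of max_answer_len used.

```python
import statistics

def reward(is_correct, answer_len, max_answer_len=256, length_penalty_coef=0.1):
    """Hypothetical reward: 1 for a verified boxed answer, minus a length penalty.

    The penalty form (coef * fraction of the answer budget used) is an
    assumption for illustration, not the repository's definition.
    """
    used = min(answer_len / max_answer_len, 1.0)
    return float(is_correct) - length_penalty_coef * used

def group_advantages(rewards, eps=1e-6):
    """Standard GRPO advantage: normalize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, group_size=4: (is_correct, answer length in tokens).
group = [(True, 120), (True, 250), (False, 256), (False, 90)]
rewards = [reward(ok, length) for ok, length in group]
advantages = group_advantages(rewards)
for (ok, length), r, a in zip(group, rewards, advantages):
    print(f"correct={ok!s:5} len={length:3d} reward={r:+.3f} advantage={a:+.3f}")
```

Under this sketch, a correct short answer receives the largest advantage and an incorrect answer that exhausts the budget the smallest. If every sample in a group receives the same reward, all advantages are zero and the group contributes no policy gradient.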

Total compute: ~6 hours on a single H100 (Lambda GH200 80GB). This is approximately 0.5% of the original paper's warmup compute and ~5% of its RL compute.

Architecture & inference

The model expects the two-phase decode used in Abstract-CoT:

Constrained decoding over V_abs = {TOKEN_A, ..., TOKEN_BL} for up to m_max=64 tokens, terminating with <endabstract>.
Free decoding for the answer phase, with V_abs tokens forbidden, terminating at EOS or max_answer_len.
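The two phases above can be sketched as a logit-masking loop. This is a toy, framework-free illustration and not the repository's inference code. `next_logits` is a hypothetical callable that maps a token-id sequence to a list of next-token logits, the token ids are made up, and greedy selection is used for brevity (the reported evaluations sample at T=0.6). It also assumes `<endabstract>` is forced once m_max abstract tokens have been emitted, and it omits the min_z_length constraint used in training.

```python
import math

def masked_argmax(logits, allowed=None, forbidden=None):
    """Greedy pick after masking: keep only `allowed` ids and drop `forbidden` ids."""
    best_id, best_score = None, -math.inf
    for tok, score in enumerate(logits):
        if allowed is not None and tok not in allowed:
            continue
        if forbidden is not None and tok in forbidden:
            continue
        if score > best_score:
            best_id, best_score = tok, score
    return best_id

def two_phase_decode(next_logits, prompt_ids, abs_ids, end_abstract_id, eos_id,
                     m_max=64, max_answer_len=256):
    seq = list(prompt_ids)
    # Phase 1: constrained decoding over V_abs plus <endabstract>.
    phase1_allowed = set(abs_ids) | {end_abstract_id}
    n_abs = 0
    while True:
        if n_abs >= m_max:
            seq.append(end_abstract_id)  # assumption: force-close at the cap
            break
        tok = masked_argmax(next_logits(seq), allowed=phase1_allowed)
        seq.append(tok)
        if tok == end_abstract_id:
            break
        n_abs += 1
    # Phase 2: free decoding with V_abs tokens forbidden.
    forbidden = set(abs_ids)
    for _ in range(max_answer_len):
        tok = masked_argmax(next_logits(seq), forbidden=forbidden)
        seq.append(tok)
        if tok == eos_id:
            break
    return seq

# Toy vocabulary: 0,1 = abstract tokens, 2 = <endabstract>, 3 = EOS, 4,5 = words.
def toy_next_logits(seq):
    eos_score = 10.0 if len(seq) >= 8 else -1.0
    return [5.0, 1.0, 0.0, eos_score, 3.0, 0.5]

out = two_phase_decode(toy_next_logits, [4, 5], abs_ids=[0, 1],
                       end_abstract_id=2, eos_id=3, m_max=3, max_answer_len=16)
print(out)  # [4, 5, 0, 0, 0, 2, 4, 4, 3]
```

The toy model always prefers abstract token 0, so phase 1 runs to the m_max cap and is closed with `<endabstract>`. In phase 2 the same preference is masked out and the highest-scoring non-abstract token is chosen instead. With a real model the same masking would typically be applied to the logits tensor before sampling.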

See LauraGomezjurado/latent-reasoning-interp for full inference / eval code.

Intended use & limitations

Engineering use case: token-efficient chain-of-thought math reasoning at a usable point on the accuracy/token-budget trade-off, trained with academic-scale compute.
Not a working latent reasoning model — the abstract tokens are essentially decorative. If you are studying real latent reasoning, this checkpoint is useful as a negative-result reference.
Performance on out-of-distribution math problems is unverified beyond MATH-500.

Citation

If you use this model, please cite the original Abstract-CoT paper:

@misc{ramji2026thinkingwithoutwords,
  title={Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought},
  author={Keshav Ramji and Tahira Naseem and Ramón Fernandez Astudillo},
  year={2026},
  eprint={2604.22709},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}