Qwen2.5-Math-7B Abstract-CoT (compressed reasoning via GRPO)
Reimplementation of Abstract-CoT (Ramji, Naseem, Astudillo, "Thinking Without Words", arXiv:2604.22709) at academic-scale compute on a single H100. Built on Qwen2.5-Math-7B-Instruct.
This is the GRPO best-by-eval checkpoint (step 750 / 800).
Headline numbers (MATH-500)
| Setup | Accuracy | Mean tokens | n |
|---|---|---|---|
| Qwen2.5-Math-7B verbal CoT (unconstrained, maj@8 T=0.6) | 87.2% | 625 | 500 |
| Qwen2.5-Math-7B verbal CoT (truncated@240, maj@8 T=0.6) | 19.0% | 235 | 100 |
| This model (maj@8 T=0.6, max_answer_len=256) | 51.0% | 237 | 100 |
| This model (maj@8 T=0.6, max_answer_len=320) | 57.0% | 271 | 100 |
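At max_answer_len=320, the compressed-CoT model retains 65% of the unconstrained verbal-CoT accuracy (57.0% vs. 87.2%) at 43% of the tokens (271 vs. 625). Both settings are well above the budget-matched truncated verbal baseline: +32pp at max_answer_len=256 and +38pp at max_answer_len=320. Note that the unconstrained baseline was evaluated on n=500 problems, while all other rows use n=100.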
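The ratios quoted above follow directly from the table. A minimal sketch of the arithmetic, using only the figures reported in the table (the variable and function names are illustrative, not taken from the eval code):

```python
# Figures copied from the MATH-500 table above.
verbal_acc, verbal_tokens = 87.2, 625         # unconstrained verbal CoT
truncated_acc = 19.0                          # verbal CoT truncated at 240 tokens
runs = {256: (51.0, 237), 320: (57.0, 271)}   # max_answer_len -> (accuracy %, mean tokens)


def summarize(acc, tokens):
    """Return (fraction of accuracy retained, fraction of tokens used, pp gain over truncated)."""
    return acc / verbal_acc, tokens / verbal_tokens, acc - truncated_acc


for max_len, (acc, tokens) in runs.items():
    retained, token_frac, gain = summarize(acc, tokens)
    print(f"max_answer_len={max_len}: {retained:.0%} of accuracy, "
          f"{token_frac:.0%} of tokens, +{gain:.0f}pp vs truncated")
```

The 65% / 43% pair corresponds to the max_answer_len=320 row; the max_answer_len=256 row works out to roughly 58% of the accuracy at 38% of the tokens.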
Honest caveats
The abstract-token mechanism reported in the paper does not emerge at this compute budget. Direct ablation on this checkpoint (n=100, maj@8) yields:
| Condition | Accuracy | Δ vs baseline |
|---|---|---|
| Normal z̃ | 51% | — |
| Random z̃ (replace each position with a random V_abs token) | 54% | +3pp |
| Zero z̃ (skip the abstract block entirely) | 52% | +1pp |
The model produces a near-constant 9-token abstract prefix regardless of input; all of the load-bearing reasoning happens in the answer phase as verbal CoT, just compressed under a length budget. This is a budget-constrained verbal-CoT model with a vestigial abstract prefix, not a latent reasoning model in the mechanistic sense the paper claims.
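As a rough sanity check on the ablation table, the sketch below estimates how much sampling noise to expect at n=100. It assumes problems are independent and treats each condition separately, ignoring that all three conditions share the same 100 problems and that maj@8 voting is itself stochastic, so it is only a coarse guide and not the analysis used for this checkpoint.

```python
import math


def binomial_se(acc, n):
    """Standard error of an accuracy estimate over n independent problems."""
    return math.sqrt(acc * (1.0 - acc) / n)


n = 100
conditions = {"normal": 0.51, "random": 0.54, "zero": 0.52}  # accuracies from the table
for name, acc in conditions.items():
    half_width = 1.96 * binomial_se(acc, n)  # normal-approximation 95% interval
    print(f"{name:>6}: {acc:.0%} +/- {half_width:.1%}")
```

Under these assumptions the standard error is roughly 5 percentage points per condition, so the +3pp and +1pp differences are within noise. That is consistent with reading the abstract prefix as having no measurable effect in either direction.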
Training recipe
- Substrate: Qwen2.5-Math-7B-Instruct
- Warmup: 1 policy-iteration round (Phase A + Phase B), 3000 examples from Dolci-Think-SFT-7B, 1 epoch each, full fine-tuning, on 1× H100. ~70 minutes. Post-warmup MATH-500 accuracy: 15.6% (n=32 probe).
- GRPO: 800 steps, beta=0.03, lr=5e-6, length_penalty_coef=0.1, m_max=64 abstract tokens, min_z_length=8, max_answer_len=256, group_size=4, per_prompt_batch=2, grad_accum=4, optimizer_8bit (paged AdamW). RL data: NuminaMath-CoT problems. Reward: MathVerifierReward (boxed-answer match) + length penalty.
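To make the GRPO reward signal concrete, here is a minimal sketch of group-relative advantages for one prompt. The constants come from the recipe above; the exact form of the length penalty is not stated in this card, so `shaped_reward` is an assumed shaping (correctness minus a penalty proportional to the fraction of the answer budget used), and the within-group normalization is the commonly described GRPO formulation, not necessarily what the training code does.

```python
from statistics import mean, pstdev

LENGTH_PENALTY_COEF = 0.1   # from the recipe
MAX_ANSWER_LEN = 256        # from the recipe
GROUP_SIZE = 4              # from the recipe


def shaped_reward(correct, answer_len):
    """ASSUMED shaping: 1.0 for a verified boxed answer, minus a length-proportional penalty."""
    used = min(answer_len, MAX_ANSWER_LEN) / MAX_ANSWER_LEN
    return (1.0 if correct else 0.0) - LENGTH_PENALTY_COEF * used


def group_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One hypothetical group: (answer verified?, answer length in tokens)
samples = [(True, 180), (True, 250), (False, 120), (False, 256)]
rewards = [shaped_reward(ok, length) for ok, length in samples]
advantages = group_advantages(rewards)
for (ok, length), adv in zip(samples, advantages):
    print(f"correct={ok!s:<5} len={length:>3} advantage={adv:+.2f}")
```

With this shaping, a shorter correct answer gets a higher advantage than a longer correct one, which is the pressure that would push the answer phase toward compressed verbal CoT.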
Total compute: ~6 hours on a single H100 (Lambda GH200 80GB). Relative to the original paper, that is approximately 0.5% of the warmup compute and ~5% of the RL compute.
Architecture & inference
The model expects the two-phase decode used in Abstract-CoT:
- Constrained decoding over V_abs = {TOKEN_A, ..., TOKEN_BL} for up to m_max=64 tokens, terminating with <endabstract>.
- Free decoding for the answer phase, with V_abs tokens forbidden, terminating at EOS or max_answer_len.
See LauraGomezjurado/latent-reasoning-interp for full inference / eval code.
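For orientation, the two-phase decode described above can be sketched with plain logit masking. This is a toy, greedy illustration and not the repository's implementation: `next_logits` is a hypothetical stand-in for a model forward pass, and the sketch assumes that <endabstract> is forced if the m_max budget runs out, that <endabstract> is also forbidden in the answer phase, and that abstract tokens follow spreadsheet-style lettering (which yields exactly 64 names from TOKEN_A to TOKEN_BL). The min_z_length constraint from training is omitted.

```python
import math
from string import ascii_uppercase


def abstract_token_names(count=64):
    """Hypothetical spreadsheet-style names: TOKEN_A .. TOKEN_Z, TOKEN_AA, ..."""
    names = []
    for index in range(1, count + 1):
        n, letters = index, ""
        while n > 0:
            n, rem = divmod(n - 1, 26)
            letters = ascii_uppercase[rem] + letters
        names.append(f"TOKEN_{letters}")
    return names


def mask_logits(logits, allowed_ids):
    """Set every logit outside allowed_ids to -inf so it can never be selected."""
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def two_phase_decode(next_logits, prompt, abs_ids, end_abs_id, eos_id,
                     m_max=64, max_answer_len=256):
    """Greedy two-phase decode; next_logits(seq) returns one logit per vocab id."""
    seq = list(prompt)
    vocab_size = len(next_logits(seq))
    phase1_allowed = set(abs_ids) | {end_abs_id}
    phase2_allowed = set(range(vocab_size)) - phase1_allowed

    # Phase 1: only abstract tokens (or the terminator) may be emitted.
    for _ in range(m_max):
        token = argmax(mask_logits(next_logits(seq), phase1_allowed))
        seq.append(token)
        if token == end_abs_id:
            break
    else:
        seq.append(end_abs_id)  # assumption: close the block when the budget is spent

    # Phase 2: free decoding with the abstract vocabulary masked out.
    for _ in range(max_answer_len):
        token = argmax(mask_logits(next_logits(seq), phase2_allowed))
        seq.append(token)
        if token == eos_id:
            break
    return seq
```

In a real setup the same masking would typically be applied to the model's logits at each step of generation (for example through a logits-processor hook), with sampling at T=0.6 in place of the greedy `argmax` used here.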
Intended use & limitations
- Engineering use case: token-efficient chain-of-thought math reasoning, at a practical accuracy/token-budget trade-off reachable on academic compute.
- Not a working latent reasoning model — the abstract tokens are essentially decorative. If you are studying real latent reasoning, this checkpoint is useful as a negative-result reference.
- Performance on out-of-distribution math problems is unverified beyond MATH-500.
Citation
If you use this model, please cite the original Abstract-CoT paper:
@misc{ramji2026thinkingwithoutwords,
title={Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought},
author={Keshav Ramji and Tahira Naseem and Ramón Fernandez Astudillo},
year={2026},
eprint={2604.22709},
archivePrefix={arXiv},
primaryClass={cs.CL}
}