Qwen2.5-Math-7B Abstract-CoT (compressed reasoning via GRPO)

Reimplementation of Abstract-CoT (Ramji, Naseem, Astudillo, "Thinking Without Words", arXiv:2604.22709) at academic-scale compute on a single H100. Built on Qwen2.5-Math-7B-Instruct.

This is the best-by-eval GRPO checkpoint (step 750 of 800).

Headline numbers (MATH-500)

Setup | Accuracy | Mean tokens | n
Qwen2.5-Math-7B verbal CoT (unconstrained, maj@8, T=0.6) | 87.2% | 625 | 500
Qwen2.5-Math-7B verbal CoT (truncated@240, maj@8, T=0.6) | 19.0% | 235 | 100
This model (maj@8, T=0.6, max_answer_len=256) | 51.0% | 237 | 100
This model (maj@8, T=0.6, max_answer_len=320) | 57.0% | 271 | 100
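
The maj@8 figures are majority votes over 8 sampled completions per problem. A minimal sketch of the voting step, assuming answers have already been extracted and normalized to strings (the function name and the tie-breaking rule are illustrative, not taken from the eval code):

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common parsed answer among sampled completions.

    `None` marks a sample with no parseable answer and is ignored.
    Ties go to the answer that appeared first (an assumption of this sketch).
    """
    parsed = [a for a in answers if a is not None]
    if not parsed:
        return None
    return Counter(parsed).most_common(1)[0][0]


# Eight hypothetical samples for a single problem.
samples = ["4", "5", "4", None, "4", "5", "7", "4"]
voted = majority_vote(samples)
```

A problem counts as correct when the voted answer matches the reference.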

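At max_answer_len=320, the compressed-CoT model retains about 65% of full-budget verbal accuracy (57.0% vs. 87.2%) at about 43% of the tokens (271 vs. 625). Both settings are well above the budget-matched truncated verbal baseline (+32pp at max_answer_len=256, +38pp at 320). The unconstrained baseline was evaluated on n=500; all other rows use n=100.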
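
The ratios quoted above follow directly from the headline table. A short check, using only the numbers reported there (variable names are illustrative):

```python
# Numbers copied from the headline table (MATH-500).
full_acc, full_tokens = 87.2, 625   # unconstrained verbal CoT
trunc_acc = 19.0                    # verbal CoT truncated at 240 tokens
settings = {256: (51.0, 237), 320: (57.0, 271)}  # max_answer_len -> (acc, mean tokens)


def summarize(acc, tokens):
    return {
        "acc_retained": acc / full_acc,           # fraction of full-budget accuracy
        "token_fraction": tokens / full_tokens,   # fraction of full-budget tokens
        "gain_pp": acc - trunc_acc,               # points above truncated baseline
    }


summary = {k: summarize(*v) for k, v in settings.items()}
for max_len, s in summary.items():
    print(max_len, {name: round(value, 3) for name, value in s.items()})
```

The 65% / 43% figures correspond to the max_answer_len=320 row; the 256 row retains roughly 58% of the accuracy at roughly 38% of the tokens.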

Honest caveats

The abstract-token mechanism reported in the paper does not emerge at this compute budget. Direct ablation on this checkpoint (n=100, maj@8) yields:

Condition | Accuracy | Δ vs baseline
Normal z̃ | 51% | (baseline)
Random z̃ (replace each position with a random V_abs token) | 54% | +3pp
Zero z̃ (skip the abstract block entirely) | 52% | +1pp
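
For scale, a rough estimate of the sampling noise at n=100 is sketched below. It assumes a simple binomial model with independent problems; because all three conditions share the same problems, a paired analysis would be tighter, so treat this only as an order-of-magnitude guide.

```python
import math


def binomial_se(p, n):
    """Standard error of an accuracy estimate p measured on n independent problems."""
    return math.sqrt(p * (1.0 - p) / n)


se = binomial_se(0.51, 100)   # baseline accuracy from the ablation table
half_width_95 = 1.96 * se     # normal-approximation 95% half-width
print(f"SE = {100 * se:.1f}pp, 95% half-width = {100 * half_width_95:.1f}pp")
```

Under these assumptions the standard error is about 5pp, so the +3pp and +1pp differences are within noise: the ablations are indistinguishable from the baseline rather than genuinely better.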

The model produces a near-constant 9-token abstract prefix regardless of input; all of the load-bearing reasoning happens in the answer phase as verbal CoT, just compressed under a length budget. This is a budget-constrained verbal-CoT model with a vestigial abstract prefix, not a latent reasoning model in the mechanistic sense the paper claims.
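
The three ablation conditions can be sketched as a transformation of the abstract prefix before the answer phase is decoded. This is an illustration only: the function name, the token ids, and whether the "zero" condition still emits <endabstract> are assumptions, not details of the actual eval code.

```python
import random


def ablate_abstract(z, v_abs_ids, mode, rng):
    """Return the abstract prefix to condition the answer phase on.

    z         : abstract token ids produced by the model
    v_abs_ids : ids of the abstract vocabulary V_abs
    mode      : "normal", "random", or "zero"
    """
    if mode == "normal":
        return list(z)
    if mode == "random":
        # Same length, each position replaced by a uniformly random V_abs token.
        return [rng.choice(v_abs_ids) for _ in z]
    if mode == "zero":
        # Skip the abstract block entirely.
        return []
    raise ValueError(f"unknown ablation mode: {mode}")


rng = random.Random(0)
v_abs = [100, 101, 102, 103]          # hypothetical ids
z = [100, 100, 102]                   # hypothetical model output
randomized = ablate_abstract(z, v_abs, "random", rng)
```

If the answer phase carried information through z̃, the "random" and "zero" conditions would be expected to lose accuracy; here they do not.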

Training recipe

  • Substrate: Qwen2.5-Math-7B-Instruct
  • Warmup: 1 policy-iteration round (Phase A + Phase B) on 3000 examples from Dolci-Think-SFT-7B, 1 epoch each, full fine-tuning on 1× H100 (~70 minutes). Post-warmup MATH-500 accuracy: 15.6% (n=32 probe).
  • GRPO: 800 steps, beta=0.03, lr=5e-6, length_penalty_coef=0.1, m_max=64 abstract tokens, min_z_length=8, max_answer_len=256, group_size=4, per_prompt_batch=2, grad_accum=4, optimizer_8bit (paged AdamW). RL data: NuminaMath-CoT problems. Reward: MathVerifierReward (boxed-answer match) + length penalty.
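
To make the GRPO signal concrete, here is a minimal sketch of a shaped reward and the group-relative advantage for one prompt with group_size=4. The exact form of the length penalty is not specified above; the linear penalty on the fraction of the answer budget used is an assumption of this sketch, as are the function names.

```python
import statistics


def shaped_reward(is_correct, answer_len, max_answer_len=256, length_penalty_coef=0.1):
    """Correctness reward minus a length penalty.

    Assumption: the penalty is linear in the fraction of the answer budget used.
    """
    return float(is_correct) - length_penalty_coef * (answer_len / max_answer_len)


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One hypothetical group of 4 completions for the same prompt.
group = [(True, 128), (False, 256), (True, 256), (False, 64)]
rewards = [shaped_reward(ok, n) for ok, n in group]
advantages = group_advantages(rewards)
```

Under this shaping, a correct short answer earns a higher advantage than a correct long one, which is the pressure that compresses the verbal CoT. When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient.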

Total compute: ~6 hours on a single H100 (Lambda GH200 80GB). This is approximately 0.5% of the original paper's warmup compute and ~5% of its RL compute.

Architecture & inference

The model expects the two-phase decode used in Abstract-CoT:

  1. Constrained decoding over V_abs = {TOKEN_A, ..., TOKEN_BL} for up to m_max=64 tokens, terminating with <endabstract>.
  2. Free decoding for the answer phase, with V_abs tokens forbidden, terminating at EOS or max_answer_len.
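
The two phases above can be sketched with a framework-free toy decoder. This is not the repository's implementation: it uses greedy decoding instead of sampling at T=0.6, `next_logits` stands in for a model forward pass, all token ids are hypothetical, and forcing <endabstract> when m_max is reached is an assumption.

```python
def masked_argmax(logits, allowed_ids):
    """Greedy choice restricted to allowed_ids (all other tokens are masked out)."""
    return max(allowed_ids, key=lambda i: logits[i])


def two_phase_decode(next_logits, prompt_ids, abs_ids, end_abs_id, eos_id,
                     m_max=64, max_answer_len=256):
    abs_set = set(abs_ids)
    seq = list(prompt_ids)

    # Phase 1: only V_abs tokens and <endabstract> are allowed.
    z = []
    phase1_allowed = sorted(abs_set | {end_abs_id})
    while len(z) < m_max:
        tok = masked_argmax(next_logits(seq), phase1_allowed)
        if tok == end_abs_id:
            break
        z.append(tok)
        seq.append(tok)
    seq.append(end_abs_id)  # close the abstract block (forced at m_max: assumption)

    # Phase 2: free decoding with V_abs tokens forbidden.
    answer = []
    while len(answer) < max_answer_len:
        logits = next_logits(seq)
        allowed = [i for i in range(len(logits)) if i not in abs_set]
        tok = masked_argmax(logits, allowed)
        if tok == eos_id:
            break
        answer.append(tok)
        seq.append(tok)
    return z, answer


# Toy 6-token vocabulary: ids 0-1 are V_abs, 2 is <endabstract>, 3 is EOS, 4-5 are text.
ABS, END_ABS, EOS = [0, 1], 2, 3


def toy_next_logits(seq):
    """Hypothetical stand-in for a model: logits depend only on sequence length."""
    if END_ABS not in seq:
        end_logit = 2.0 if len(seq) >= 3 else 0.0
        return [1.0, 0.5, end_logit, 9.0, 8.0, 7.0]
    answer_len = len(seq) - seq.index(END_ABS) - 1
    eos_logit = 5.0 if answer_len >= 2 else 1.0
    return [10.0, 10.0, 0.0, eos_logit, 3.0, 2.0]


z, answer = two_phase_decode(toy_next_logits, [5], ABS, END_ABS, EOS)
```

In the toy model the unconstrained argmax would be a text token during phase 1 and an abstract token during phase 2; the masks are what keep each phase inside its own vocabulary.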

See LauraGomezjurado/latent-reasoning-interp for full inference / eval code.

Intended use & limitations

  • Engineering use case: token-efficient chain-of-thought math reasoning at a practical accuracy/token-budget trade-off, trained on academic-scale compute.
  • Not a working latent reasoning model: the abstract tokens are essentially decorative. If you are studying real latent reasoning, this checkpoint is useful as a negative-result reference.
  • Performance on out-of-distribution math problems is unverified beyond MATH-500.

Citation

If you use this model, please cite the original Abstract-CoT paper:

@misc{ramji2026thinkingwithoutwords,
  title={Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought},
  author={Keshav Ramji and Tahira Naseem and Ramón Fernandez Astudillo},
  year={2026},
  eprint={2604.22709},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}