arxiv:2509.19894

PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning

Published on Sep 24 · Submitted by Xueliang Zhao on Sep 29
Abstract

AI-generated summary

PromptCoT 2.0 uses an EM loop to generate harder and more diverse synthetic prompts, improving reasoning capabilities in large language models through self-play and supervised fine-tuning.

Large language models (LLMs) are evolving from conversational systems into strong reasoners for tasks such as Olympiad mathematics and competitive programming. While scaling parameters and test-time computation has driven progress, a key bottleneck is the lack of high-quality training problems: human-curated datasets are costly and limited, while existing synthetic corpora are often too easy or narrow. PromptCoT 1.0 showed that injecting rationales into prompt synthesis increases problem difficulty. Building on this, we present PromptCoT 2.0, a scalable framework that replaces hand-crafted heuristics with an expectation-maximization (EM) loop, where rationales are iteratively refined to guide prompt construction. This produces problems that are both harder and more diverse than prior corpora. The synthetic prompts support two post-training regimes: (1) Self-Play, where strong models improve autonomously via verifiable feedback without stronger teachers; and (2) Supervised Fine-Tuning (SFT), where weaker models learn from teacher-distilled traces. Extensive experiments demonstrate the effectiveness of this approach. In self-play, applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 sets new state-of-the-art results at the 30B scale, with +4.4, +4.8, and +5.3 on AIME 24/25 and HMMT 25, +6.1 and +5.0 on LiveCodeBench v5/v6, and +35 Elo on Codeforces. In SFT, training Qwen2.5-7B-Instruct solely on synthetic prompts boosts accuracy to 73.1 (AIME 24), 65.6 (AIME 25), and 53.4 (LiveCodeBench v5), surpassing models trained on human or hybrid data. Analyses further confirm that PromptCoT 2.0 yields fundamentally harder and distributionally distinct problems. These results establish prompt synthesis as a new axis for scaling reasoning and position PromptCoT 2.0 as a scalable foundation for future open-source models. The implementation is available at https://github.com/inclusionAI/PromptCoT.
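
To make the synthesis procedure described above more concrete, here is a minimal, illustrative Python sketch of an EM-style rationale-refinement loop (concept → rationale → problem). All names here (`generate`, `score`, `finetune`, `Candidate`) are hypothetical stand-ins for LLM calls and are not taken from the released code; the authoritative implementation is at https://github.com/inclusionAI/PromptCoT.

```python
# Illustrative sketch of an EM-style rationale-driven synthesis loop
# (concept -> rationale -> problem). All helpers (`generate`, `score`,
# `finetune`) are hypothetical stand-ins for LLM calls, NOT the released
# PromptCoT 2.0 implementation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    concept: str
    rationale: str
    problem: str
    weight: float  # quality/difficulty score used as a posterior-style weight


def em_prompt_synthesis(
    concepts: List[str],
    generate: Callable[[str], str],                # LLM sampler: prompt text -> completion
    score: Callable[[str], float],                 # difficulty/quality scorer in [0, 1]
    finetune: Callable[[List[Candidate]], None],   # updates the synthesis model
    num_rounds: int = 3,
    samples_per_concept: int = 4,
    keep_top_k: int = 2,
) -> List[Candidate]:
    """Alternate between sampling rationales (E-step) and refitting the generator (M-step)."""
    corpus: List[Candidate] = []
    for _ in range(num_rounds):
        round_best: List[Candidate] = []
        for concept in concepts:
            drafts: List[Candidate] = []
            for _ in range(samples_per_concept):
                # E-step: sample a latent rationale for the concept, then a problem from it
                rationale = generate(f"Write a rationale for a hard problem about: {concept}")
                problem = generate(f"Concept: {concept}\nRationale: {rationale}\nWrite the problem:")
                drafts.append(Candidate(concept, rationale, problem, score(problem)))
            # keep the rationales that best explain hard, well-formed problems
            drafts.sort(key=lambda c: c.weight, reverse=True)
            round_best.extend(drafts[:keep_top_k])
        # M-step: refit the synthesis model on the highest-weight rationale/problem pairs
        finetune(round_best)
        corpus.extend(round_best)
    return corpus
```

Under this reading, the E-step amounts to sampling and reweighting latent rationales for each concept, and the M-step to refitting the generator on the highest-weight rationale/problem pairs, so later rounds are biased toward harder, better-explained problems.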

Community

Paper author Paper submitter

TL;DR

  1. 🧩 Method Upgrade – PromptCoT 2.0:

    We introduce an EM-style rationale-driven synthesis loop (concept → rationale → problem) that generates harder, more diverse math & code problems than previous datasets, without relying on handcrafted heuristics.

  2. 📚 SFT with Fully Synthetic Data:

    Training a 7B model on our 4.8M synthetic prompts + trajectories, without any human-written problems, outperforms OpenMathReasoning and OpenCodeReasoning.

    👉 This shows that purely synthetic prompts can serve as a stronger and more scalable alternative to the best human-curated corpora.

  3. 🏆 Self-Play for Larger Models:

    Applying PromptCoT 2.0 to Qwen3-30B-A3B-Thinking-2507 via self-play achieves new SOTA at the 30B scale, with results competitive with Gemini 2.5 Pro and OpenAI o3, while activating only 3B parameters (a minimal sketch of the self-play loop is given after this list).

  4. ⚡ Open Resources for the Community:

    We release 4.8M prompts with GPT-OSS-120B (medium) responses; these responses are much shorter than those in OpenMathReasoning / OpenCodeReasoning (which mainly come from DeepSeek-R1).

    This makes our dataset especially suitable for next-gen architectures (e.g., diffusion LLMs) and efficient training pipelines.
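
As referenced in item 3, below is a minimal sketch of what one verifiable-feedback self-play round could look like: the model attempts each synthetic problem, an automatic checker (answer match or unit tests) provides binary feedback, and only verified trajectories feed the next policy update. All function names are illustrative assumptions, not the paper's training code.

```python
# Minimal sketch of a verifiable-feedback self-play round, assuming the regime
# described above: the model attempts synthetic problems, an automatic checker
# gives binary feedback, and only verified trajectories feed the next update.
# `policy_generate`, `verify`, and `update_policy` are illustrative stand-ins.

from typing import Callable, Dict, List, Tuple


def self_play_round(
    problems: List[Dict],                                      # each: {"prompt": ..., "answer_or_tests": ...}
    policy_generate: Callable[[str], str],                     # current model: prompt -> solution attempt
    verify: Callable[[str, Dict], bool],                       # answer match or unit tests
    update_policy: Callable[[List[Tuple[str, str]]], None],    # e.g. rejection-sampling SFT or an RL step
    attempts_per_problem: int = 8,
) -> float:
    """Run one round; return the fraction of problems solved at least once."""
    verified: List[Tuple[str, str]] = []
    solved = 0
    for item in problems:
        hit = False
        for _ in range(attempts_per_problem):
            attempt = policy_generate(item["prompt"])
            if verify(attempt, item):          # verifiable feedback, no stronger teacher needed
                verified.append((item["prompt"], attempt))
                hit = True
        solved += int(hit)
    update_policy(verified)
    return solved / max(len(problems), 1)
```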

Paper author Paper submitter

Self-Play (30B-A3B)

  • Establishes new SOTA at 30B-A3B scale across math (AIME, HMMT) and code (LiveCodeBench, Codeforces).
Paper author Paper submitter

SFT (7B, 100% synthetic)

  • Outperforms OpenMathReasoning & OpenCodeReasoning (both rely on human-written prompts).

Models citing this paper 4

Datasets citing this paper 3


Collections including this paper 6