SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
Abstract
A novel policy optimization algorithm, SofT-GRPO, enhances soft-thinking in Large Language Models by integrating Gumbel noise and the Gumbel-Softmax technique, leading to improved performance over discrete-token methods.
The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while discrete-token CoT reasoning can be reinforced through policy optimization algorithms such as Group Relative Policy Optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. The difficulty stems from injecting stochasticity into soft-thinking tokens and updating the soft-thinking policy accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, that reinforces LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects Gumbel noise into the logits, employs the Gumbel-Softmax technique to keep soft-thinking tokens within the pre-trained embedding space, and leverages the reparameterization trick in the policy gradient. We conduct experiments on base LLMs ranging from 1.5B to 7B parameters, and the results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% average accuracy) while exhibiting a substantial gain on Pass@32 (+2.19% average accuracy). Code and weights are available at https://github.com/zz1358m/SofT-GRPO-master
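To make the mechanism concrete, the following minimal PyTorch sketch shows one way a Gumbel-Softmax soft-thinking token could be formed from next-token logits. It is an illustrative assumption based on the abstract, not the released SofT-GRPO implementation; the names `temperature` and `embedding_matrix` are placeholders.

```python
# Illustrative sketch (not the official SofT-GRPO code): a stochastic
# soft-thinking token built with the Gumbel-Softmax trick, so that the
# token stays inside the convex hull of the pre-trained embedding space.
import torch
import torch.nn.functional as F

def soft_thinking_token(logits: torch.Tensor,
                        embedding_matrix: torch.Tensor,
                        temperature: float = 1.0) -> torch.Tensor:
    """logits: (vocab,) next-token logits; embedding_matrix: (vocab, hidden)."""
    # Sample Gumbel(0, 1) noise and perturb the logits (reparameterized noise).
    u = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)
    # Temperature-controlled softmax over the perturbed logits; equivalent to
    # torch.nn.functional.gumbel_softmax(logits, tau=temperature, hard=False).
    weights = F.softmax((logits + gumbel) / temperature, dim=-1)
    # The soft token is a probability-weighted mixture of pre-trained token
    # embeddings, so it never leaves the pre-trained embedding space.
    return weights @ embedding_matrix
```

Because the randomness enters only through the external Gumbel noise, the soft token is a differentiable function of the logits, which is what makes a reparameterized policy gradient of the kind described above possible.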
Community
This paper develops the first powerful RLVR (Reinforcement Learning with Verifiable Rewards) algorithm, SofT-GRPO, for soft-thinking. It integrates the Gumbel-Softmax technique into the group rollout process, yielding diverse yet valid soft-thinking reasoning paths. We also propose a gradient estimation approach based on Gumbel reparameterization, which attributes improvements precisely to the LLM's output probability distributions during policy optimization. Comprehensive experiments on LLMs of 1.5B–7B parameters across five benchmarks demonstrate that SofT-GRPO consistently outperforms discrete-token GRPO baselines, especially at larger sampling budgets (Pass@16 and Pass@32). SofT-GRPO also improves the out-of-domain generalization of LLMs.
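For context, the group-relative advantage used within each rollout group follows the standard GRPO normalization. The sketch below reflects that common formulation rather than this repository's code; per the description above, SofT-GRPO pairs such advantages with gradients estimated through Gumbel reparameterization rather than a discrete-token likelihood ratio.

```python
# Standard GRPO-style group-relative advantage (illustrative, not this paper's
# released code): rewards from the G rollouts of one prompt are normalized
# within the group to form advantages.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (G,), one scalar reward per rollout of the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of four rollouts where only the last answer is verified correct.
print(group_relative_advantages(torch.tensor([0.0, 0.0, 0.0, 1.0])))
```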
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Token-Regulated Group Relative Policy Optimization for Stable Reinforcement Learning in Large Language Models (2025)
- Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization (2025)
- How Reinforcement Learning After Next-Token Prediction Facilitates Learning (2025)
- Soft Tokens, Hard Truths (2025)
- Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners (2025)
- HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs (2025)
- Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning (2025)