Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps
Abstract
The proposed decoding strategies and reinforcement learning algorithm improve the performance and efficiency of masked diffusion language models during inference.
Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example: 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and an Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between the rollout trajectory and the optimization trajectory and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL
Community
More Consistent Trajectories, Fewer Steps, Stronger Reasoning! Masked Diffusion Language Models Shine with Reinforcement Learning
Fudan University, Shanghai Artificial Intelligence Laboratory, and Shanghai Jiao Tong University jointly present their latest research:
"Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps"
Code: https://github.com/yjyddq/EOSER-ASS-RL
Paper: https://arxiv.org/pdf/2509.23924
What Problems Did We Solve?
Masked diffusion language models (MDLMs) like LLaDA show great potential but face three major challenges:
Full diffusion-style decoding tends to "end too early", falling into the trap of prematurely predicting the EOS token
Uniform step-size decoding is inefficient
Existing reinforcement learning algorithms suffer from inconsistency between the rollout and optimization trajectories during training, hurting performance
Our Three Innovative Solutions:
1. EOS Early Rejection (EOSER)
Actively suppresses the confidence of the EOS token in early decoding steps
Gradually restores it in later steps to ensure proper completion
Prevents the model from "giving up halfway"
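A minimal sketch of this idea, assuming a logit-level implementation: the EOS logit is down-weighted by a factor that grows over the denoising steps, so EOS is nearly impossible to select early on and unconstrained by the end. The function name, the linear schedule, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import math
import torch

def eos_early_rejection(logits: torch.Tensor, step: int, total_steps: int,
                        eos_token_id: int) -> torch.Tensor:
    """Suppress the EOS token's confidence early in decoding, restore it later.

    logits: (batch, seq_len, vocab_size) token logits at the current denoising step.
    """
    progress = step / max(total_steps - 1, 1)   # 0.0 at the first step, 1.0 at the last
    scale = max(progress, 1e-6)                 # multiplier applied to the EOS probability
    out = logits.clone()
    out[..., eos_token_id] += math.log(scale)   # log-space down-weighting of EOS
    return out
```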
2. Ascending Step-Size (ASS) Scheduler
Decodes cautiously in early steps, aggressively in later steps
Reduces the number of decoding steps from O(L) to O(log L)
Significantly accelerates inference!
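To make the step count concrete, here is a sketch of one possible ascending schedule in which the number of tokens unmasked per step doubles; the doubling rule and the function name are assumptions for illustration, and they yield roughly log2(L) steps for a length-L response.

```python
def ascending_step_sizes(seq_len: int, base: int = 2) -> list[int]:
    """Tokens to unmask at each decoding step: 1, 2, 4, ..., capped to the
    remaining length, so the sizes sum to seq_len in ~log_base(seq_len) steps."""
    sizes, remaining, size = [], seq_len, 1
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size *= base
    return sizes

# e.g. a 128-token response is decoded in 8 steps instead of 128:
# ascending_step_sizes(128) -> [1, 2, 4, 8, 16, 32, 64, 1]
```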
3. Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO)
Aligns training and inference trajectories for masked diffusion language models
Resolves the optimization errors caused by trajectory inconsistency
Enables more stable training and better performance
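A rough sketch of the consistency idea, assuming an HF-style model whose forward pass over a partially masked sequence returns logits for every position and a rollout that records every intermediate masked state: the policy log-probability used for optimization is accumulated over exactly those recorded states, instead of jumping from the fully masked prompt to the final answer in a single skip-step update. The names `trajectory`, `state`, `tokens`, and `revealed_mask` are hypothetical, and the GRPO-style clipping with group-relative advantages that would consume this log-probability is omitted.

```python
import torch

def trajectory_log_prob(model, trajectory):
    """Sum the log-probabilities of the tokens revealed at each rollout step,
    evaluated on the same intermediate masked states visited during rollout.

    trajectory: list of dicts with
      "state":         (B, L) token ids; still-masked positions hold the mask token id
      "tokens":        (B, L) token ids of the fully decoded response
      "revealed_mask": (B, L) 1.0 at positions unmasked at this step, else 0.0
    """
    total = 0.0
    for step in trajectory:
        logits = model(step["state"]).logits                                  # (B, L, V)
        logp = torch.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, step["tokens"].unsqueeze(-1)).squeeze(-1)  # (B, L)
        total = total + (tok_logp * step["revealed_mask"]).sum(dim=-1)        # (B,)
    return total  # per-sample log-prob of the full rollout trajectory
```

In a GRPO-style objective, this trajectory log-probability (under the current and old policies) would take the place of the per-token log-probability used for AR models, keeping the optimized trajectory consistent with the one actually rolled out.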
Impressive Experimental Results:
On mathematical reasoning (GSM8K, MATH500) and planning tasks (Countdown, Sudoku):
Consistency trajectory optimization outperforms baselines across all mathematical and planning tasks
Planning task performance improved by 2–4× compared to baselines
Matches the performance of L/2-step decoding while using only log L steps
Discovered that planning tasks suit parallel reasoning, while math problems fit sequential reasoning
Truly achieves "faster and better"
Research Significance:
Identified suitable scenarios for parallel reasoning (planning tasks) and sequential reasoning (mathematical tasks)
Lays the foundation for next-generation hybrid reasoning models
In One Sentence:
We optimize diffusion language models with more consistent trajectories and fewer decoding steps, enabling complex reasoning with reduced computation and opening a new chapter for practical non-autoregressive models!