arxiv:2509.23924

Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps

Published on Sep 28
· Submitted by Jingyi Yang on Sep 30

Abstract

Proposed decoding strategies and reinforcement learning algorithms improve the performance and efficiency of masked diffusion language models during inference.

AI-generated summary

Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and an Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between the rollout trajectory and the optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.

Community


🚀 More Consistent Trajectories, Fewer Steps, Stronger Reasoning! Masked Diffusion Language Models Shine with Reinforcement Learning

Fudan University, Shanghai Artificial Intelligence Laboratory, and Shanghai Jiao Tong University jointly present their latest research:
"Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps"

🔗 Code: https://github.com/yjyddq/EOSER-ASS-RL
🔗 Paper: https://arxiv.org/pdf/2509.23924

🌟 What Problems Did We Solve?
Masked diffusion language models (MDLMs) like LLaDA show great potential but face three major challenges:
❌ Full diffusion-style decoding tends to "end too early", falling into the trap of predicting the EOS token prematurely and truncating the response
❌ Uniform step-size decoding is inefficient
❌ Existing reinforcement learning algorithms suffer from inconsistency between the rollout and optimization trajectories during training, hurting performance

💡 Our Three Innovative Solutions:

1๏ธโƒฃ EOS Early Rejection

Actively suppresses the confidence of the EOS token in early decoding steps

Gradually restores it later to ensure proper completion

Prevents the model from "giving up halfway" (see the sketch below)
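
For intuition, here is a minimal sketch of what EOS Early Rejection could look like inside a decoding loop, assuming PyTorch logits of shape (batch, seq_len, vocab_size). The linear penalty schedule and its strength (`max_penalty`) are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def apply_eos_early_rejection(logits: torch.Tensor,
                              step: int,
                              total_steps: int,
                              eos_token_id: int,
                              max_penalty: float = 10.0) -> torch.Tensor:
    """Down-weight the EOS logit early in decoding and relax the penalty later."""
    progress = step / max(total_steps - 1, 1)   # 0.0 at the first step, 1.0 at the last
    penalty = max_penalty * (1.0 - progress)    # strong rejection early, none at the end
    logits = logits.clone()
    logits[..., eos_token_id] -= penalty        # suppress EOS confidence for this step
    return logits
```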

2๏ธโƒฃ Ascending Step-Size Scheduler

Decodes cautiously early, aggressively later

Reduces steps from O(L) to O(log L)

Significantly accelerates inference (see the schedule sketch below)!
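
A toy sketch of an ascending schedule, assuming the simplest possible rule of doubling the number of tokens unmasked at each step; the paper's actual scheduler may differ, but doubling already yields roughly log2(L) + 1 steps for a length-L response.

```python
def ascending_step_sizes(seq_len: int) -> list[int]:
    """Return how many masked tokens to reveal at each decoding step."""
    sizes, revealed, step_size = [], 0, 1
    while revealed < seq_len:
        take = min(step_size, seq_len - revealed)  # never exceed the remaining masked tokens
        sizes.append(take)
        revealed += take
        step_size *= 2                             # decode cautiously early, aggressively later
    return sizes

# Example: a 64-token response needs only 7 steps instead of 64:
# ascending_step_sizes(64) == [1, 2, 4, 8, 16, 32, 1]
```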

3๏ธโƒฃ Consistency Trajectory Optimization

Aligns training and inference trajectories for masked diffusion language models

Resolves optimization errors caused by trajectory inconsistency

Enables more stable training and better performance (see the sketch below)
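
To make the "consistency" idea concrete, below is a highly simplified sketch of a clipped, GRPO-style objective applied to every consecutive rollout step rather than skipping steps. All names (`log_prob_fn`, `states`, `actions`, `group_advantages`) are illustrative assumptions; the paper's CJ-GRPO objective contains details not shown here.

```python
import torch

def consistency_trajectory_loss(log_prob_fn, states, actions,
                                old_log_probs, group_advantages,
                                clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient loss over each consecutive pair of rollout states."""
    losses = []
    for t in range(len(states) - 1):
        # Log-prob (under the current policy) of the tokens unmasked between step t and t+1,
        # so the optimization trajectory matches the rollout trajectory step by step.
        new_log_probs = log_prob_fn(states[t], actions[t])
        ratio = torch.exp(new_log_probs - old_log_probs[t])
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        # group_advantages: reward normalized within the rollout group, as in GRPO
        losses.append(-torch.min(ratio * group_advantages,
                                 clipped * group_advantages).mean())
    return torch.stack(losses).mean()
```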

🎯 Impressive Experimental Results:
On mathematical reasoning (GSM8K, MATH500) and planning tasks (Countdown, Sudoku):
✅ Consistency trajectory optimization outperforms baselines across all mathematical and planning tasks
✅ Planning task performance improved by 2–4× compared to baselines
✅ With only log L decoding steps, matches the performance of L/2-step decoding
✅ Discovered that planning tasks suit parallel reasoning, while math problems favor sequential reasoning
✅ Truly achieves "faster and better"

🔮 Research Significance:

Identified suitable scenarios for parallel reasoning (planning tasks) and sequential reasoning (mathematical tasks)

Lays the foundation for next-generation hybrid reasoning models

💫 In One Sentence:
We optimize diffusion language models with more consistent trajectories and fewer decoding steps, enabling complex reasoning with reduced computation and opening a new chapter for practical non-autoregressive models!
