arxiv:2509.23924

Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps

Published on Sep 28
· Submitted by Jingyi Yang on Sep 30

Abstract

Proposed decoding strategies and reinforcement learning algorithms improve the performance and efficiency of masked diffusion language models during inference.

AI-generated summary

Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and an Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between the rollout trajectory and the optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.

Community


🚀 More Consistent Trajectories, Fewer Steps, Stronger Reasoning! Masked Diffusion Language Models Shine with Reinforcement Learning

Fudan University, Shanghai Artificial Intelligence Laboratory, and Shanghai Jiao Tong University jointly present their latest research:
"Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Steps"

🔗 Code: https://github.com/yjyddq/EOSER-ASS-RL
🔗 Paper: https://arxiv.org/pdf/2509.23924

🌟 What Problems Did We Solve?
Masked diffusion language models (MDLMs) like LLaDA show great potential but face three major challenges:
❌ Full diffusion-style decoding tends to "end too early", falling into the trap of predicting the EOS token prematurely and truncating the response
❌ Uniform step-size decoding is inefficient
❌ Existing reinforcement learning algorithms suffer from inconsistency between the rollout and optimization trajectories during training, hurting performance

💡 Our Three Innovative Solutions:

1๏ธโƒฃ EOS Early Rejection

Actively suppresses the confidence of the EOS token in early decoding steps

Gradually restores it later to ensure proper completion

Prevents the model from "giving up halfway" (see the sketch below)
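
For intuition, here is a minimal sketch of what EOS Early Rejection could look like inside a decoding loop, assuming PyTorch logits of shape (batch, seq_len, vocab_size). The linear penalty schedule and its strength (`max_penalty`) are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def apply_eos_early_rejection(logits: torch.Tensor,
                              step: int,
                              total_steps: int,
                              eos_token_id: int,
                              max_penalty: float = 10.0) -> torch.Tensor:
    """Down-weight the EOS logit early in decoding and relax the penalty later."""
    progress = step / max(total_steps - 1, 1)   # 0.0 at the first step, 1.0 at the last
    penalty = max_penalty * (1.0 - progress)    # strong rejection early, none at the end
    logits = logits.clone()
    logits[..., eos_token_id] -= penalty        # suppress EOS confidence for this step
    return logits
```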

2๏ธโƒฃ Ascending Step-Size Scheduler

Decodes cautiously early, aggressively later

Reduces steps from O(L) to O(log L)

Significantly accelerates inference (see the schedule sketch below)!
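
A toy sketch of an ascending schedule, assuming the simplest possible rule of doubling the number of tokens unmasked at each step; the paper's actual scheduler may differ, but doubling already yields roughly log2(L) + 1 steps for a length-L response.

```python
def ascending_step_sizes(seq_len: int) -> list[int]:
    """Return how many masked tokens to reveal at each decoding step."""
    sizes, revealed, step_size = [], 0, 1
    while revealed < seq_len:
        take = min(step_size, seq_len - revealed)  # never exceed the remaining masked tokens
        sizes.append(take)
        revealed += take
        step_size *= 2                             # decode cautiously early, aggressively later
    return sizes

# Example: a 64-token response needs only 7 steps instead of 64:
# ascending_step_sizes(64) == [1, 2, 4, 8, 16, 32, 1]
```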

3๏ธโƒฃ Consistency Trajectory Optimization

Aligns training and inference trajectories for masked diffusion language models

Resolves optimization errors caused by trajectory inconsistency

Enables more stable training and better performance (see the sketch below)
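
To make the "consistency" idea concrete, below is a highly simplified sketch of a clipped, GRPO-style objective applied to every consecutive rollout step rather than skipping steps. All names (`log_prob_fn`, `states`, `actions`, `group_advantages`) are illustrative assumptions; the paper's CJ-GRPO objective contains details not shown here.

```python
import torch

def consistency_trajectory_loss(log_prob_fn, states, actions,
                                old_log_probs, group_advantages,
                                clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped policy-gradient loss over each consecutive pair of rollout states."""
    losses = []
    for t in range(len(states) - 1):
        # Log-prob (under the current policy) of the tokens unmasked between step t and t+1,
        # so the optimization trajectory matches the rollout trajectory step by step.
        new_log_probs = log_prob_fn(states[t], actions[t])
        ratio = torch.exp(new_log_probs - old_log_probs[t])
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
        # group_advantages: reward normalized within the rollout group, as in GRPO
        losses.append(-torch.min(ratio * group_advantages,
                                 clipped * group_advantages).mean())
    return torch.stack(losses).mean()
```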

🎯 Impressive Experimental Results:
On mathematical reasoning (GSM8K, MATH500) and planning tasks (Countdown, Sudoku):
✅ Consistency trajectory optimization outperforms baselines across all mathematical and planning tasks
✅ Planning task performance improved by 2–4× compared to baselines
✅ With only log L decoding steps, matches the performance of L/2-step decoding
✅ Discovered that planning tasks suit parallel reasoning, while math problems favor sequential reasoning
✅ Truly achieves "faster and better"

🔮 Research Significance:

Identified suitable scenarios for parallel reasoning (planning tasks) and sequential reasoning (mathematical tasks)

Lays the foundation for next-generation hybrid reasoning models

💫 In One Sentence:
We optimize diffusion language models with more consistent trajectories and fewer decoding steps, enabling complex reasoning with reduced computation and opening a new chapter for practical non-autoregressive models!
