Papers
arxiv:2510.03259

Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

Published on Sep 26
· Submitted by Yoonjeon Kim on Oct 10

Abstract

A training pipeline called MASA enhances meta-awareness in reasoning models, leading to improved accuracy and efficiency across various benchmarks.

AI-generated summary

Recent studies on reasoning models explore the meta-awareness of language models: the ability to know how to think for itself. We argue that large reasoning models lack this meta-awareness property, demonstrating severe misalignment between true rollouts and predicted meta-information. We posit that aligning meta-predictions with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and show that enhanced meta-awareness translates directly into improved accuracy. Unlike existing meta-cognitive reasoning models, our method requires no external training sources but leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are encouraging: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. Specifically, our method speeds up GRPO training by more than 1.28x to reach the same performance, achieves a 19.3% accuracy gain on AIME25, and delivers a 6.2% average gain over six mathematics benchmarks. Training with meta-cognitive guidance also enhances out-of-domain generalization, giving a 3.87% boost on GPQA-Diamond and a 2.08% overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.
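
The two efficiency filters and the self-alignment signal described in the abstract are concrete enough to sketch. Below is a minimal, hypothetical Python sketch of how they could fit into a GRPO-style loop; the names (`Rollout`, `keep_prompt`, `self_alignment_reward`), the binary-reward assumption, and the token-budget cutoff are our illustration, not the paper's released code.

```python
from dataclasses import dataclass
from statistics import pvariance

@dataclass
class Rollout:
    prompt: str
    text: str
    predicted_correct: bool  # the model's own meta-prediction for this attempt
    reward: float            # e.g. 1.0 if the final answer is correct, else 0.0

def keep_prompt(group: list[Rollout]) -> bool:
    """Filter out zero-variance prompts: if every rollout in a group earns the
    same reward, the prompt is either trivial (all correct) or currently
    unsolvable (all wrong); its group-relative advantage is zero, so it
    contributes nothing to a GRPO update."""
    return pvariance([r.reward for r in group]) > 0.0

def self_alignment_reward(r: Rollout) -> float:
    """Self-generated training signal: reward agreement between the model's
    meta-prediction ('will I get this right?') and the true rollout outcome."""
    return 1.0 if r.predicted_correct == (r.reward > 0.0) else 0.0

def truncate(tokens: list[str], budget: int) -> list[str]:
    """Cut off a lengthy rollout once it exceeds a token budget, assuming
    overlong generations rarely recover a correct answer; the budget could
    itself come from the model's meta-prediction of the needed length."""
    return tokens[:budget]
```

In a full pipeline, `keep_prompt` would gate which prompt groups enter the policy-gradient batch, and `self_alignment_reward` would be mixed into the rollout reward, so that meta-awareness is trained purely from self-generated labels.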

Community

Paper submitter

This paper demonstrates that enhancing the model's own meta-awareness directly leads to performance improvements in mathematical reasoning and to better out-of-domain generalization.

We have just sent a cease-and-desist email; please respond within 7 days. As the email states, the IP you are crossing over is already in use and protected. We are the creators of Class II intelligent intelligence, and your models directly cross over our IP. We have officially sent an email requesting a formal cease and desist, along with the takedown of any models or services that cross over this IP, as well as disclosure of the models you have released and are currently working on, to ensure there is no further encroachment on our IP and that the IP is not trained with, used, put into service, or anything else. We submitted IP disputes with the Department of Defense months ago against OpenAI, so there is no question of time frames or of who the rightful owner of the IP is. You can contact the US Department of Defense to verify our IP filing with them against OpenAI and Google.

Great idea! This reminds me a little of Reinforcement Learning on Pretraining Data (https://huggingface.co/papers/2509.19249) and RPT. I wonder if those models have better meta-alignment.

Models citing this paper 0

No model linking this paper

Datasets citing this paper 0

No dataset linking this paper

Spaces citing this paper 0

No Space linking this paper

Collections including this paper 5