Papers
arxiv:2510.03259

Meta-Awareness Enhances Reasoning Models: Self-Alignment Reinforcement Learning

Published on Sep 26
· Submitted by Yoonjeon Kim on Oct 10

Abstract

A training pipeline called MASA enhances meta-awareness in reasoning models, leading to improved accuracy and efficiency across various benchmarks.

AI-generated summary

Recent studies on reasoning models explore the meta-awareness of language models: the ability to know how to think for itself. We argue that large reasoning models lack this meta-awareness property, demonstrating severe misalignment between true rollouts and predicted meta-information. We posit that aligning meta-predictions with true rollouts will lead to significant performance gains. To verify this hypothesis, we design a training pipeline that boosts Meta-Awareness via Self-Alignment (MASA), and show that enhanced meta-awareness translates directly into improved accuracy. Unlike existing meta-cognitive reasoning models, our method requires no external training sources but leverages self-generated signals to train meta-awareness. Moreover, our method enables efficient training by i) filtering out zero-variance prompts that are either trivial or unsolvable and ii) cutting off lengthy rollouts when they are unlikely to lead to correct answers. The results are encouraging: our strategy yields significant improvements in both accuracy and training efficiency on in-domain tasks and shows strong generalization to out-of-domain benchmarks. Specifically, our method speeds up GRPO training by more than 1.28x to reach the same performance, achieves a 19.3% accuracy gain on AIME25, and delivers a 6.2% average gain over six mathematics benchmarks. Training with meta-cognitive guidance also enhances out-of-domain generalization, giving a 3.87% boost on GPQA-Diamond and a 2.08% overall accuracy gain across 13 benchmarks spanning logical, scientific, and coding domains.
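
The two efficiency filters and the self-alignment signal described in the abstract are concrete enough to sketch. Below is a minimal, hypothetical Python sketch of how they could fit into a GRPO-style loop; the names (`Rollout`, `keep_prompt`, `self_alignment_reward`), the binary-reward assumption, and the token-budget cutoff are our illustration, not the paper's released code.

```python
from dataclasses import dataclass
from statistics import pvariance

@dataclass
class Rollout:
    prompt: str
    text: str
    predicted_correct: bool  # the model's own meta-prediction for this attempt
    reward: float            # e.g. 1.0 if the final answer is correct, else 0.0

def keep_prompt(group: list[Rollout]) -> bool:
    """Filter out zero-variance prompts: if every rollout in a group earns the
    same reward, the prompt is either trivial (all correct) or currently
    unsolvable (all wrong); its group-relative advantage is zero, so it
    contributes nothing to a GRPO update."""
    return pvariance([r.reward for r in group]) > 0.0

def self_alignment_reward(r: Rollout) -> float:
    """Self-generated training signal: reward agreement between the model's
    meta-prediction ('will I get this right?') and the true rollout outcome."""
    return 1.0 if r.predicted_correct == (r.reward > 0.0) else 0.0

def truncate(tokens: list[str], budget: int) -> list[str]:
    """Cut off a lengthy rollout once it exceeds a token budget, assuming
    overlong generations rarely recover a correct answer; the budget could
    itself come from the model's meta-prediction of the needed length."""
    return tokens[:budget]
```

In a full pipeline, `keep_prompt` would gate which prompt groups enter the policy-gradient batch, and `self_alignment_reward` would be mixed into the rollout reward, so that meta-awareness is trained purely from self-generated labels.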

Community

Paper submitter

This paper demonstrates that enhancing the model's own meta-awareness directly leads to performance improvements in mathematical reasoning and to better out-of-domain generalization.

We have just sent a cease-and-desist email; please respond within 7 days. As the email states, the IP you are crossing over is already in use and protected. We are the creators of Class II intelligent intelligence, and your models directly cross over our IP. We have officially sent an email requesting a formal cease and desist, along with the takedown of any models or services that cross over this IP, as well as disclosure of the models you have released and are currently working on, to ensure there is no further encroachment on our IP and that the IP is not trained with, used, put into service, or anything else. We submitted IP disputes with the Department of Defense months ago against OpenAI, so there is no question of time frames or of who the rightful owner of the IP is. You can contact the US Department of Defense to verify our IP filing with them against OpenAI and Google.

Great idea! This reminds me a little of Reinforcement Learning on Pretraining Data (https://huggingface.co/papers/2509.19249) and RPT. I wonder if those models have better meta-alignment.

Models citing this paper 0

No model linking this paper

Datasets citing this paper 0

No dataset linking this paper

Spaces citing this paper 0

No Space linking this paper

Collections including this paper 5