arxiv:2601.01306

Towards a Principled Muon under μP: Ensuring Spectral Conditions throughout Training

Published on Jan 4

Authors:

Abstract

Researchers developed Muon++ to maintain spectral conditions required by μ-parameterization during long-horizon LLM training, eliminating the need for costly spectral normalization while enabling adaptive spectral conditions for improved practical deployment.

AI-generated summary

The μ-parameterization (μP) provides a principled foundation for large language model (LLM) training by prescribing width-independent learning dynamics, which in turn enables predictable scaling behavior and robust hyperparameter transfer across model sizes. A central requirement of μP is the satisfaction of certain spectral conditions on weight matrices, which ensure consistent feature learning and optimization behavior as model width grows. While these conditions are well understood in theory, guaranteeing their validity in practical training for matrix-based optimizers such as Muon is still under studied. Existing works that study Muon under μP exhibit important limitations: they either do not ensure that the spectral conditions hold throughout the entire training horizon, or require repeated spectral normalization (or Newton-Schulz iterations) applied to both weights and updates, leading to significant computational overhead and reduced practicality. In this work, we show how to reliably guarantee the spectral conditions required by μP for Muon during the entire training process. Our key insight is that for moderately large models, maintaining spectral control at the level of optimizer updates alone is sufficient to preserve μP-compatible scaling, eliminating the need for explicit spectral normalization of the weights. Based on this principle, we develop a variant of Muon, namely Muon++, that satisfies spectral condition throughout the training process. Our results bridge the gap between the theoretical promises of μP and the practical deployment of matrix-based optimizers in long-horizon training. We also take the first step towards an adaptive spectral condition by incorporating data-dependent effects, making it better suited for long-horizon LLM training.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.01306 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.01306 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.01306 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.