Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
Abstract
Reverse-engineering a model that learns multi-digit multiplication via implicit chain-of-thought reveals that it uses attention to encode long-range dependencies and represents partial products efficiently, insights that help address limitations in standard fine-tuning.
Language models are increasingly capable, yet still fail at the seemingly simple task of multi-digit multiplication. In this work we study why, by reverse-engineering a model that successfully learns multiplication via implicit chain-of-thought, and report three findings: (1) Evidence of long-range structure: logit attributions and linear probes indicate that the model encodes the long-range dependencies required for multi-digit multiplication. (2) Mechanism: the model encodes these long-range dependencies by using attention to construct a directed acyclic graph that "caches" and "retrieves" pairwise partial products. (3) Geometry: attention heads implement partial products by forming Minkowski sums between pairs of digits, and digits are represented in a Fourier basis, both intuitive and efficient representations that a model trained with standard fine-tuning lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the "running sum" via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model, we uncover a pitfall in how Transformers learn long-range dependencies and provide an example of how the correct inductive bias can address it.
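To make the auxiliary-loss idea concrete, here is a minimal sketch assuming a scalar running-sum target per output position, a linear probe on the decoder's hidden states, and a fixed loss weight. The probe placement, target layout, weight, and all names below are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the auxiliary "running sum" loss described in the abstract.
# Probe placement, target construction, and the loss weight are assumptions.
import torch
import torch.nn as nn

class RunningSumProbe(nn.Module):
    """Linear regression probe that reads a scalar 'running sum' off hidden states."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model) -> (batch, seq_len)
        return self.linear(hidden_states).squeeze(-1)

def total_loss(lm_loss: torch.Tensor,
               hidden_states: torch.Tensor,
               running_sum_targets: torch.Tensor,
               probe: RunningSumProbe,
               aux_weight: float = 0.1) -> torch.Tensor:
    """Standard LM loss plus an auxiliary regression loss on the running sum."""
    aux_loss = nn.functional.mse_loss(probe(hidden_states), running_sum_targets)
    return lm_loss + aux_weight * aux_loss

# Stand-in usage with random tensors in place of real model outputs and targets.
probe = RunningSumProbe(d_model=768)
h = torch.randn(2, 16, 768)        # hidden states: (batch, seq_len, d_model)
targets = torch.randn(2, 16)       # running-sum target at each output position
loss = total_loss(torch.tensor(1.0), h, targets, probe)
```

The intent is that gradients from the probe push the hidden states to carry the intermediate running sum explicitly, supplying the long-range structure that plain fine-tuning fails to discover.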
Community
Transformers stumble on multi-digit multiplication, but why?
By reverse-engineering a model that does succeed, one trained with implicit chain-of-thought (ICoT), we found:
1️⃣ Attention builds a DAG to cache/retrieve partial products
2️⃣ Digits are encoded via a Fourier basis
3️⃣ Attention heads implement Minkowski sums (toy sketch after this list)
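A toy sketch of the representational claims in 2️⃣ and 3️⃣, purely for intuition: digits embedded in a Fourier basis, the Minkowski sum of two embedding sets (every pairwise sum), and the table of pairwise partial products the model is described as caching and retrieving. The frequencies, operand length, and function names are illustrative assumptions, not values from the paper.

```python
# Illustrative only: Fourier-basis digit features and the pairwise structures
# (Minkowski sums of digit embeddings, partial-product table) named above.
import numpy as np

def fourier_digit(d: int, freqs=(1, 2, 5)) -> np.ndarray:
    """Embed a base-10 digit as cos/sin features at a few frequencies."""
    angles = 2 * np.pi * np.array(freqs) * d / 10
    return np.concatenate([np.cos(angles), np.sin(angles)])

def minkowski_sum(embeds_a: np.ndarray, embeds_b: np.ndarray) -> np.ndarray:
    """Minkowski sum of two sets of digit embeddings: every pairwise sum.
    Returns shape (len(A), len(B), dim); each slot is a joint code for one digit pair."""
    return embeds_a[:, None, :] + embeds_b[None, :, :]

def partial_products(a_digits, b_digits) -> np.ndarray:
    """All pairwise digit products a_i * b_j, the quantities the ICoT model
    is described as caching with attention and later retrieving."""
    return np.outer(a_digits, b_digits)

a, b = [3, 7, 2, 9], [8, 1, 4, 6]                 # digits of two 4-digit operands
E_a = np.stack([fourier_digit(d) for d in a])     # (4, 6) Fourier features
E_b = np.stack([fourier_digit(d) for d in b])
print(minkowski_sum(E_a, E_b).shape)              # (4, 4, 6) pairwise digit codes
print(partial_products(a, b))                     # 4x4 table of partial products
```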
Hey, nice work!
I just checked the repo page but didn't find a training script. Do you plan to share it?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper, recommended by the Semantic Scholar API:
- Language Modeling with Learned Meta-Tokens (2025)
- R-Capsule: Compressing High-Level Plans for Efficient Large Language Model Reasoning (2025)
- Understanding and Enhancing the Planning Capability of Language Models via Multi-Token Prediction (2025)
- Fresh in memory: Training-order recency is linearly encoded in language model activations (2025)
- Exploring System 1 and 2 communication for latent reasoning in LLMs (2025)
- Identity Bridge: Enabling Implicit Reasoning via Shared Latent Memory (2025)
- Bilinear relational structure fixes reversal curse and enables consistent model editing (2025)