The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain
Abstract
BDH, a biologically inspired Large Language Model, combines scale-free network architecture with Hebbian learning to achieve Transformer-like performance while maintaining interpretability.
The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalization over time, which remains the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce 'Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free, biologically inspired network of n locally interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant, state-of-the-art attention-based state-space sequence-learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically, BDH rivals GPT-2 performance on language and translation tasks at the same number of parameters (10M to 1B) and on the same training data. BDH can be represented as a brain model. The working memory of BDH during inference relies entirely on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen their connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with a heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism that human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.
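As a rough illustration of the mechanism the abstract describes, with working memory held in synapses, updated by a Hebbian rule, and sparse, positive activations, here is a minimal sketch in the style of fast-weight / linear-attention models. The function names, the decay factor, and the exact update rule are assumptions made for illustration, not the authors' equations.

```python
import numpy as np

def hebbian_working_memory(tokens, W_in, W_out, decay=0.99):
    """Illustrative sketch, not the authors' code: working memory lives in a
    fast-weight 'synapse' matrix S updated by a Hebbian outer-product rule,
    with ReLU keeping activations sparse and positive as the abstract states.
    The names, the decay factor, and the exact update rule are assumptions."""
    n = W_in.shape[0]                          # number of neuron units
    S = np.zeros((n, n))                       # synaptic state = working memory
    outputs = []
    for x in tokens:                           # x: embedding of one input token
        pre = np.maximum(W_in @ x, 0.0)        # pre-synaptic activity (sparse, positive)
        post = np.maximum(pre + S @ pre, 0.0)  # post-synaptic activity, read through current synapses
        S = decay * S + np.outer(post, pre)    # Hebbian rule: co-active pairs strengthen their synapse
        outputs.append(W_out @ post)           # readout for this token
    return outputs
```

The point of the sketch is only that the state carried between tokens is a matrix of synapse strengths rather than a key-value cache, which is what makes individual synapses inspectable in the sense the abstract claims.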
Community
Brains are networks of neurons with very specific connection patterns. Transformers use dense matrix multiplications, which hide this network structure. What happens when an LLM is designed to be a brain-like, scale-free information transportation network?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API.
- SpikingBrain Technical Report: Spiking Brain-inspired Large Models (2025)
- Learning Internal Biological Neuron Parameters and Complexity-Based Encoding for Improved Spiking Neural Networks Performance (2025)
- Gated Associative Memory: A Parallel O(N) Architecture for Efficient Sequence Modeling (2025)
- Fast weight programming and linear transformers: from machine learning to neurobiology (2025)
- A Fully Spectral Neuro-Symbolic Reasoning Architecture with Graph Signal Processing as the Computational Backbone (2025)
- Understanding Transformers through the Lens of Pavlovian Conditioning (2025)
- Neuro-inspired Ensemble-to-Ensemble Communication Primitives for Sparse and Efficient ANNs (2025)
Excited to explore the BDH architecture as part of the Pathway hackathon!
Can we train TTS models using this architecture? I'm not talking about the fancy new LLM-based ones, but the smaller ones like Piper.
Great!!
So you removed softmax and added ReLU gating after attention? How is this anything new? The modifications aren't even good; they degrade standard attention performance.
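For readers trying to parse this criticism, here is a minimal sketch of the contrast the comment gestures at: standard softmax attention versus a linear-attention-style variant in which a non-negative feature map stands in for the softmax and a ReLU gates the output. This follows the commenter's characterization, not necessarily the paper's actual formulation; all names are illustrative.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # standard scaled dot-product attention with a softmax over keys
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def relu_linear_attention(Q, K, V):
    # linear-attention-style variant: non-negative feature maps replace the
    # softmax, and a ReLU gates the output (the commenter's reading of BDH)
    phi_q, phi_k = np.maximum(Q, 0.0), np.maximum(K, 0.0)
    out = phi_q @ (phi_k.T @ V)   # associativity keeps compute linear in sequence length
    return np.maximum(out, 0.0)   # ReLU gating after attention
```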
Fascinating.
I'll be studying this intensely over the next week.
High dimensionality and connectivity are indeed important links.
I may have missed it, but it seems like you only used letter-level tokenization when testing BDH. Did you test using word-level tokenization?
You're spot on – for this whole experiment, we stuck strictly to character-level tokenization. Honestly, it was a deliberate choice. We wanted to keep things as simple as possible to really isolate and test the BDH architecture itself. By feeding it raw characters, we gave it the toughest possible challenge: it had to figure out everything from scratch, including what a 'word' even is. The fact that it managed to learn spelling, grammar, and style just from individual letters is, for us, the most powerful proof that the architecture is doing something special. Testing it with a proper BPE tokenizer is definitely the next logical step, and I'd be really curious to see how it performs then. Thanks again for the great question!
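For context, a character-level tokenizer of the kind described in this reply is essentially just an index over the characters seen in the corpus, so the vocabulary is tiny and word boundaries must be learned by the model itself. A minimal sketch (illustrative, not the authors' training code):

```python
def char_tokenize(text):
    # build a character vocabulary and map the text to integer ids
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}
    return [stoi[ch] for ch in text], vocab

ids, vocab = char_tokenize("the dragon hatchling")
print(len(vocab), ids[:5])  # vocabulary size and the first few token ids
```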