arxiv:2509.26507

The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

Published on Sep 30 · Submitted by Jan Chorowski on Oct 1
#1 Paper of the day

Abstract

BDH, a biologically inspired Large Language Model, combines scale-free network architecture with Hebbian learning to achieve Transformer-like performance while maintaining interpretability.

AI-generated summary

The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks, such as the brain, have powerful properties, including generalizing over time, which is the main barrier for Machine Learning on the path to Universal Reasoning Models. We introduce 'Dragon Hatchling' (BDH), a new Large Language Model architecture based on a scale-free, biologically inspired network of n locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance. BDH is a practical, performant, state-of-the-art attention-based state space sequence learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: empirically, BDH rivals GPT-2 performance on language and translation tasks at the same number of parameters (10M to 1B) and for the same training data. BDH can be represented as a brain model. The working memory of BDH during inference relies entirely on synaptic plasticity with Hebbian learning using spiking neurons. We confirm empirically that specific, individual synapses strengthen their connection whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with a heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism that human neurons could use to achieve speech. BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.
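
The abstract's claim that working memory during inference relies entirely on Hebbian synaptic plasticity can be pictured with a minimal sketch. The names (`sigma`, `eta`), the outer-product update, and the ReLU placement below are illustrative assumptions, not the BDH equations from the paper: a fast synaptic state strengthens whenever pre- and post-synaptic neurons are co-active, and that state then shapes the next step's activations.

```python
# Minimal, hypothetical sketch of Hebbian working memory during inference.
# The names (sigma, eta) and the exact update rule are illustrative
# assumptions, not the BDH equations from the paper.
import torch

n = 8          # number of neuron "particles" (toy size)
eta = 0.1      # learning rate of the fast, Hebbian synaptic state

sigma = torch.zeros(n, n)        # synaptic state (working memory), starts empty
W = torch.randn(n, n) * 0.1      # slow weights, fixed during inference

def step(x_prev: torch.Tensor) -> torch.Tensor:
    """One inference step: compute activations, then update the synaptic state."""
    global sigma
    # Sparse, positive activations: ReLU keeps only the excited neurons.
    y = torch.relu(W @ x_prev + sigma @ x_prev)
    # Hebbian update: synapse (i, j) strengthens when post-synaptic neuron i
    # and pre-synaptic neuron j are co-active on the current input.
    sigma = sigma + eta * torch.outer(y, x_prev)
    return y

x = torch.relu(torch.randn(n))   # toy initial activation
for _ in range(5):
    x = step(x)
print(sigma)  # specific synapses have strengthened over the sequence
```

In this toy, the slow weights W play the role of trained parameters, while sigma is the fast, input-dependent state that the paper describes as the model's working memory.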

Community

Paper author · Paper submitter

Brains are a network of neurons with very specific connection patterns. Transformers use dense matrix multiplications, which hide this network structure. What happens when an LLM is designed to be a brain-like, scale-free information transportation network?


“Excited to explore the BDH architecture as part of the Pathway hackathon!”

Can we train TTS models using this architecture? I'm not talking about the fancy new LLM-based ones, but the smaller ones like Piper.

Great!!

So you removed softmax and added ReLU gating after attention? How is this anything new? The modifications aren't even good; they degrade standard attention performance.


@xz259 There are about 5 main differences, which should not be tried in isolation, or they will indeed not work (and cannot work in isolation, as explained in the theory part of the paper). See also the table in the comments below.

From the authors: For readers looking for a condensed (less rigorous) comparison to the Transformer, here is a summary table of differences.

[Image: bdh_versus_transformer — summary table of differences between BDH and the Transformer]
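
As a rough, hedged illustration of just two of the differences discussed above (a softmax-free, state-space-style attention and a ReLU nonlinearity that keeps activations sparse and positive), here is a toy sketch. The shapes, the gating placement, and the update rule are assumptions for illustration, not the BDH attention defined in the paper or its GPU formulation.

```python
# Toy contrast between standard softmax attention and a softmax-free,
# ReLU-gated, linear-attention-style recurrence. Purely illustrative.
import torch

d, T = 16, 10
q = torch.randn(T, d)
k = torch.randn(T, d)
v = torch.randn(T, d)

# Standard causal softmax attention, for reference.
scores = (q @ k.T) / d**0.5
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))
out_softmax = torch.softmax(scores, dim=-1) @ v

# Softmax-free variant: accumulate an outer-product state S = sum_t k_t v_t^T
# and read it out with the query, passing the result through a ReLU gate.
S = torch.zeros(d, d)
outs = []
for t in range(T):
    S = S + torch.outer(k[t], v[t])        # state-space style update
    outs.append(torch.relu(q[t] @ S))      # sparse, positive readout
out_linear = torch.stack(outs)

print(out_softmax.shape, out_linear.shape)  # both (T, d)
```

Taken alone, such a swap is indeed a known degradation of standard attention; the authors' point is that it only makes sense in combination with the other architectural changes listed in the table.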

Fascinating.

I'll be studying this intensely over the next week.

High dimensionality and connectivity are indeed important links.

I may have missed it, but it seems like you only used letter-level tokenization when testing BDH. Did you test using word-level tokenization?


You're spot on – for this whole experiment, we stuck strictly to character-level tokenization. Honestly, it was a deliberate choice. We wanted to keep things as simple as possible to really isolate and test the BDH architecture itself. By feeding it raw characters, we gave it the toughest possible challenge: it had to figure out everything from scratch, including what a 'word' even is. The fact that it managed to learn spelling, grammar, and style just from individual letters is, for us, the most powerful proof that the architecture is doing something special. Testing it with a proper BPE tokenizer is definitely the next logical step, and I'd be really curious to see how it performs then. Thanks again for the great question!
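
For readers unfamiliar with the distinction, here is a minimal, generic sketch of character-level tokenization of the kind described above, contrasted with word/BPE tokenization. It is an illustration only, not the data pipeline used in the paper.

```python
# Generic character-level tokenizer, for illustration only.
text = "the dragon hatchling learns spelling from raw characters"

# Character-level: the vocabulary is just the set of characters seen.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

ids = encode("dragon")
print(ids)          # one id per character
print(decode(ids))  # "dragon"

# A word- or BPE-level tokenizer would instead map multi-character units
# (whole words or frequent subwords) to single ids, giving shorter
# sequences but a much larger vocabulary.
```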

