Full YatNMN (Attn + MLP) d=12 (96M) — FineWeb-Edu 3x Chinchilla — PyTorch

A 96M-parameter GPT with both YatNMN attention and a YatNMN MLP, and no value embeddings. Trained on FineWeb-Edu 100BT at 3× the Chinchilla-optimal token budget.

YatNMN Attention (novel — no Q/K projections)

Unlike standard attention, which learns separate Q, K, and V projections, YatNMN attention uses the input itself as both query and key:

x_heads = RoPE(x)                                 # rotary-encoded input, split into heads
dots = x_heads @ x_heads^T                        # pairwise dot products
dist² = ||x_i||² + ||x_j||² - 2·dots              # pairwise squared distances
scores = (dots + softplus(b))² / (dist² + softplus(ε))   # per-head learnable b, ε
scores = L1_normalize(scores)                     # NOT softmax
y = scores @ V

Key differences from standard attention:

  • No Q/K projections — saves 2/3 of attention parameters
  • L1 normalization instead of softmax (scores are non-negative by construction)
  • Per-head learnable bias b and epsilon ε, both kept positive via softplus
  • Strictly causal: token i attends only to positions j < i, so a token never attends to itself (see the sketch below)
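
As a concrete reference, here is a minimal single-head PyTorch sketch of the mechanism above (RoPE and the head split are omitted; the function name, shapes, and numerical clamps are illustrative assumptions, not the released implementation):

import torch
import torch.nn.functional as F

def yatnmn_attention(x, v, b, eps):
    # x: (T, d) token features, used directly as both query and key
    # v: (T, d) values; b, eps: learnable scalars (per-head in the full model)
    T = x.size(0)
    dots = x @ x.T                                    # pairwise dot products, (T, T)
    sq = x.pow(2).sum(-1)                             # squared norms ||x_i||²
    dist_sq = (sq[:, None] + sq[None, :] - 2 * dots).clamp_min(0.0)
    scores = (dots + F.softplus(b)) ** 2 / (dist_sq + F.softplus(eps))
    # strict causal mask: token i sees only j < i, never itself
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-1)
    scores = scores.masked_fill(~causal, 0.0)
    # L1 normalization instead of softmax; row 0 has no keys, so the clamp keeps it finite
    scores = scores / scores.sum(-1, keepdim=True).clamp_min(1e-9)
    return scores @ v

Since every unmasked score is a square divided by a positive quantity, the rows are non-negative and dividing by the row sum is a valid normalization.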

YatNMN MLP

Each MLP linear layer uses the same squared-dot-over-squared-distance form, applied per output neuron (w_j is row j of W):

y_j = α · (x·w_j + softplus(b))² / (||x - w_j||² + softplus(ε))
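
A matching sketch of one YatNMN MLP layer, with the same caveats (the card does not say whether b, ε, α are scalars or per-neuron vectors; broadcasting covers both):

import torch
import torch.nn.functional as F

def yatnmn_mlp(x, W, b, eps, alpha):
    # x: (T, d_in); W: (d_out, d_in), each row w_j acts as both filter and prototype
    dots = x @ W.T                                    # x·w_j for every neuron, (T, d_out)
    dist_sq = (x.pow(2).sum(-1, keepdim=True)         # ||x||²
               + W.pow(2).sum(-1)                     # ||w_j||²
               - 2 * dots).clamp_min(0.0)             # expands to ||x - w_j||²
    return alpha * (dots + F.softplus(b)) ** 2 / (dist_sq + F.softplus(eps))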

Training

Parameters          95,945,942
Final smoothed loss 2.8675
Tokens              5.76B (3× Chinchilla)
Data                FineWeb-Edu 100BT
Learning rate       0.01, warmup-cosine schedule
Hardware            TPU v6e-8, batch size 32
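
The schedule named above is a standard warmup-cosine; for reference, a sketch with the peak LR from the table (warmup length and total steps are placeholder assumptions, not from the card):

import math

def warmup_cosine_lr(step, total_steps, peak_lr=0.01, warmup_steps=500):
    # linear warmup to peak_lr, then cosine decay to zero;
    # warmup_steps=500 is illustrative, not a documented hyperparameter
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))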

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mlnomad/yatnmn-full-noVE-fineweb-3x-d12-pytorch",
    trust_remote_code=True,
)
# the tokenizer is loaded separately: the model pairs with Mistral-7B-v0.1's
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
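
A quick smoke test, assuming the custom model class supports the standard generate() interface:

prompt = "The distance between two points"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))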

Flax version

mlnomad/yatnmn-full-noVE-fineweb-3x-d12

License

Apache 2.0.
