Full YatNMN (Attn + MLP) d=12 (96M) — FineWeb-Edu 3x Chinchilla — PyTorch

A 96M-parameter GPT with both YatNMN attention and a YatNMN MLP, and no value embeddings. Trained on FineWeb-Edu 100BT at 3× the Chinchilla-optimal token budget.

YatNMN Attention (novel — no Q/K projections)

Unlike standard attention, which learns separate Q, K, and V projections, YatNMN attention uses the input itself as both query and key:

x_heads = RoPE(x)                                 # rotary-encoded input, split into heads
dots = x_heads @ x_heads^T                        # pairwise dot products
dist² = ||x_i||² + ||x_j||² - 2·dots              # pairwise squared distances
scores = (dots + softplus(b))² / (dist² + softplus(ε))   # per-head learnable b, ε
scores = L1_normalize(scores)                     # NOT softmax
y = scores @ V

Key differences from standard attention:

  • No Q/K projections — saves 2/3 of attention parameters
  • L1 normalization instead of softmax (scores are non-negative by construction)
  • Per-head learnable bias b and epsilon ε, both kept positive via softplus
  • Strictly causal: token i attends only to positions j < i, so a token never attends to itself (see the sketch below)
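
As a concrete reference, here is a minimal single-head PyTorch sketch of the mechanism above (RoPE and the head split are omitted; the function name, shapes, and numerical clamps are illustrative assumptions, not the released implementation):

import torch
import torch.nn.functional as F

def yatnmn_attention(x, v, b, eps):
    # x: (T, d) token features, used directly as both query and key
    # v: (T, d) values; b, eps: learnable scalars (per-head in the full model)
    T = x.size(0)
    dots = x @ x.T                                    # pairwise dot products, (T, T)
    sq = x.pow(2).sum(-1)                             # squared norms ||x_i||²
    dist_sq = (sq[:, None] + sq[None, :] - 2 * dots).clamp_min(0.0)
    scores = (dots + F.softplus(b)) ** 2 / (dist_sq + F.softplus(eps))
    # strict causal mask: token i sees only j < i, never itself
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-1)
    scores = scores.masked_fill(~causal, 0.0)
    # L1 normalization instead of softmax; row 0 has no keys, so the clamp keeps it finite
    scores = scores / scores.sum(-1, keepdim=True).clamp_min(1e-9)
    return scores @ v

Since every unmasked score is a square divided by a positive quantity, the rows are non-negative and dividing by the row sum is a valid normalization.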

YatNMN MLP

Each MLP linear layer uses the same squared-dot-over-squared-distance form, applied per output neuron (w_j is row j of W):

y_j = α · (x·w_j + softplus(b))² / (||x - w_j||² + softplus(ε))
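
A matching sketch of one YatNMN MLP layer, with the same caveats (the card does not say whether b, ε, α are scalars or per-neuron vectors; broadcasting covers both):

import torch
import torch.nn.functional as F

def yatnmn_mlp(x, W, b, eps, alpha):
    # x: (T, d_in); W: (d_out, d_in), each row w_j acts as both filter and prototype
    dots = x @ W.T                                    # x·w_j for every neuron, (T, d_out)
    dist_sq = (x.pow(2).sum(-1, keepdim=True)         # ||x||²
               + W.pow(2).sum(-1)                     # ||w_j||²
               - 2 * dots).clamp_min(0.0)             # expands to ||x - w_j||²
    return alpha * (dots + F.softplus(b)) ** 2 / (dist_sq + F.softplus(eps))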

Training

Parameters          95,945,942
Final smoothed loss 2.8675
Tokens              5.76B (3× Chinchilla)
Data                FineWeb-Edu 100BT
Learning rate       0.01, warmup-cosine schedule
Hardware            TPU v6e-8, batch size 32
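
The schedule named above is a standard warmup-cosine; for reference, a sketch with the peak LR from the table (warmup length and total steps are placeholder assumptions, not from the card):

import math

def warmup_cosine_lr(step, total_steps, peak_lr=0.01, warmup_steps=500):
    # linear warmup to peak_lr, then cosine decay to zero;
    # warmup_steps=500 is illustrative, not a documented hyperparameter
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))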

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mlnomad/yatnmn-full-noVE-fineweb-3x-d12-pytorch",
    trust_remote_code=True,
)
# the tokenizer is loaded separately: the model pairs with Mistral-7B-v0.1's
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
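
A quick smoke test, assuming the custom model class supports the standard generate() interface:

prompt = "The distance between two points"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))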

Flax version

mlnomad/yatnmn-full-noVE-fineweb-3x-d12

License

Apache 2.0.
