A 96M-parameter GPT with both YatNMN attention and a YatNMN MLP, and no value embeddings. Trained on FineWeb-Edu 100BT at 3× Chinchilla-optimal compute.

How to use mlnomad/yatnmn-full-noVE-fineweb-3x-d12-pytorch with Transformers:

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "mlnomad/yatnmn-full-noVE-fineweb-3x-d12-pytorch",
    trust_remote_code=True,
    dtype="auto",
)
```
Unlike standard attention, which projects Q, K, and V separately, YatNMN attention uses the input itself as both query and key:
```
x_heads = RoPE(x)
dots    = x_heads @ x_heads^T                              # pairwise dot products
dist²   = ||x_i||² + ||x_j||² - 2·dots                     # pairwise squared distances
scores  = (dots + softplus(b))² / (dist² + softplus(ε))    # per-head b, ε
scores  = L1_normalize(scores)                             # NOT softmax
y       = scores @ V
```
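A minimal PyTorch sketch of that score computation, for a single head with RoPE, batching, and causal masking omitted; the function name and the handling of the per-head scalars are illustrative assumptions, not the repository's code:

```python
import torch
import torch.nn.functional as F

def yatnmn_attention(x, v, b, eps):
    """x: (T, d) used as both query and key; v: (T, d) values;
    b, eps: learned per-head scalar tensors."""
    dots = x @ x.t()                                      # pairwise dot products, (T, T)
    norms = x.pow(2).sum(-1)                              # ||x_i||² for each position, (T,)
    dist_sq = norms[:, None] + norms[None, :] - 2 * dots  # pairwise squared distances, (T, T)
    scores = (dots + F.softplus(b)) ** 2 / (dist_sq + F.softplus(eps))
    scores = scores / scores.sum(-1, keepdim=True).clamp_min(1e-9)  # L1 normalization, not softmax
    return scores @ v                                     # (T, d)

# Toy usage with random activations and zero-initialized b, ε.
x = torch.randn(8, 64)
v = torch.randn(8, 64)
out = yatnmn_attention(x, v, b=torch.tensor(0.0), eps=torch.tensor(0.0))
```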
Key differences from standard attention:

- The input (after RoPE) serves as both query and key; there are no separate Q and K projections.
- Scores are squared dot products divided by squared pairwise distances, with learned per-head b and ε.
- Scores are L1-normalized instead of passed through a softmax.

The YatNMN MLP is built from the same neuron formulation:

y = α · (x·W + softplus(b))² / (||x - W||² + softplus(ε))
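A hedged PyTorch sketch of that neuron as a drop-in replacement for a linear layer; the class name, initialization, and treatment of α are assumptions rather than the checkpoint's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class YatNMNLinear(nn.Module):
    """y = alpha * (x·W + softplus(b))² / (||x - W||² + softplus(eps)), per output unit."""

    def __init__(self, in_features: int, out_features: int, alpha: float = 1.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) / in_features ** 0.5)
        self.b = nn.Parameter(torch.zeros(out_features))    # learned bias inside the squared term
        self.eps = nn.Parameter(torch.zeros(out_features))  # learned denominator offset
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        xw = x @ self.weight.t()                  # (..., out_features)
        num = (xw + F.softplus(self.b)) ** 2      # (x·W + softplus(b))²
        dist_sq = (                               # ||x - W_j||² for each output unit j
            x.pow(2).sum(-1, keepdim=True)
            + self.weight.pow(2).sum(-1)
            - 2 * xw
        )
        return self.alpha * num / (dist_sq + F.softplus(self.eps))
```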
| Training detail | Value |
|---|---|
| Parameters | 95,945,942 |
| Final smoothed loss | 2.8675 |
| Tokens | 5.76B (3× Chinchilla) |
| Data | FineWeb-Edu 100BT |
| Learning rate | 0.01, warmup-cosine schedule |
| Hardware | TPU v6e-8, batch size 32 |
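For reference, at the commonly cited Chinchilla-optimal ratio of roughly 20 tokens per parameter, 96M parameters correspond to about 1.92B tokens, so 3× that budget gives the 5.76B tokens listed above.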
For text generation, load the causal-LM class together with a tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mlnomad/yatnmn-full-noVE-fineweb-3x-d12-pytorch",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```
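Continuing from the snippet above, a minimal generation example; the prompt and sampling settings are illustrative only, not the card's recommendation:

```python
prompt = "The FineWeb-Edu dataset is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```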
License: Apache 2.0.