# Scaling Laws Report: strict ±1 1-bit LLM (BitLMv57 architecture)

## 1. Measured data points

**Architecture:** `BitLMv57`: all weights ±1 via `sign_ste`, 32-head Gumbel-hard-argmax attention, `sign_ste(x + attn + ffn)` residual, no normalization, no per-channel α. Byte-level vocab (128), ALiBi positional bias, T=256.

| run | params | dataset | steps | bs | LR | val_bpc | notes |
|---|---|---|---|---|---|---|---|
| v70 | 5.5M | TinyStories | 20K | 256 | 3.2e-3 | **1.481** | 5M sweep peak (strict ±1) |
| v71 | 46.4M | TinyStories | 10K | 64 | 3e-4 | **1.335** | hit 1.3 target |
| v72 | 46.4M | TinyStories | 20K | 64 | 3e-4 | **1.238** | step scaling |
| v73 | 46.4M | TinyStories | 40K | 64 | 3e-4 | **1.169** | overnight best |
| glm_50M | 46.4M | GLM-Reasoning | 20K | 64 | 3e-4 | **1.877** | first GLM |
| glm_50M_40K | 46.4M | GLM-Reasoning | 40K | 64 | 3e-4 | **1.794** | 40K version |
| glm_50M_a100_bf16 | 46.4M | GLM-Reasoning | 40K | 64 | 3e-4 | **1.792** | A100 rerun, cached ALiBi |
| **glm_50M_bs256_hlr** | 46.4M | GLM-Reasoning | 15K | 256 | 6e-4 | **1.766** | **best GLM result** |

A100 runs use bf16 autocast + torch.compile; the 5090 runs used fp32. The effective forward math is the same in both cases, since all weights and activations are rounded to ±1 via STE.

---

## 2. Scaling relationships

### 2.1 Step scaling (fixed params, fixed batch; the Kaplan "D" axis)

| setup | steps | val_bpc | Δ vs half the steps |
|---|---|---|---|
| TinyStories, 46.4M, bs=64 | 10K | 1.335 | n/a |
| TinyStories, 46.4M, bs=64 | 20K | 1.238 | −0.097 |
| TinyStories, 46.4M, bs=64 | 40K | 1.169 | −0.069 |
| GLM, 46.4M, bs=64 | 20K | 1.877 | n/a |
| GLM, 46.4M, bs=64 | 40K | 1.794 | −0.083 |

**Rule of thumb: each doubling of training steps ≈ −0.08 BPC.** This holds across both datasets and all three doubling intervals (−0.097, −0.069, −0.083; mean −0.083). It suggests `loss − loss_∞ ∝ steps^−β`. The exponent depends heavily on the assumed floor: with `loss_∞` near 1.0 for TinyStories and 1.3 for GLM, these deltas imply β ≈ 0.49 and β ≈ 0.22 respectively, while taking `loss_∞ = 0` gives β ≈ 0.07–0.11.
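The step-scaling numbers can be re-derived whenever the floor assumption changes. The following is a minimal sketch, not the script used for this report: `RUNS`, `LOSS_FLOOR` and `doubling_stats` are illustrative names, the (steps, val_bpc) pairs are copied from the table in §2.1, and the floors are the assumed values from the text rather than measurements.

```python
import math

# (steps, val_bpc) pairs copied from the step-scaling table in section 2.1.
RUNS = {
    "TinyStories": [(10_000, 1.335), (20_000, 1.238), (40_000, 1.169)],
    "GLM": [(20_000, 1.877), (40_000, 1.794)],
}
# Assumed irreducible losses (guesses from the text, not measured values).
LOSS_FLOOR = {"TinyStories": 1.0, "GLM": 1.3}


def doubling_stats(points, floor):
    """Return (delta_bpc, beta) for each pair of consecutive runs."""
    stats = []
    for (s0, l0), (s1, l1) in zip(points, points[1:]):
        delta = l1 - l0
        # loss - floor ∝ steps^-beta  =>  beta = -log(excess ratio) / log(step ratio)
        beta = -math.log((l1 - floor) / (l0 - floor)) / math.log(s1 / s0)
        stats.append((delta, beta))
    return stats


results = {name: doubling_stats(pts, LOSS_FLOOR[name]) for name, pts in RUNS.items()}
for name, stats in results.items():
    for delta, beta in stats:
        print(f"{name:12s} delta={delta:+.3f} BPC  beta={beta:.2f}")
```

Changing `LOSS_FLOOR` shows how strongly β depends on the floor, which is why the per-doubling delta is the more robust quantity to quote.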

### 2.2 Parameter scaling (fixed compute per param; the Kaplan "N" axis)

TinyStories at ~20K-step budget:

| params | val_bpc | Δ |
|---|---|---|
| 5.5M (v70) | 1.481 | n/a |
| 46.4M (v72) | 1.238 | −0.243 |

**Rule of thumb: 8.4× more params ≈ −0.24 BPC, i.e. roughly −0.25 BPC per decade.** This is in the same ballpark as published 1-bit LLM curves (BitNet, OneBit). It is an extrapolation from only two points, so use it with caution. The two runs share a step count but not a recipe (v70: bs=256, LR 3.2e-3; v72: bs=64, LR 3e-4), so v70 saw 4× more tokens and the comparison is not strictly compute-matched.

### 2.3 Batch + LR scaling (comparable budget, different grad SNR)

GLM at 46.4M:

| bs | LR | steps | tokens | wall-clock | val_bpc |
|---|---|---|---|---|---|
| 64 | 3e-4 | 40,000 | 655 M | 108 min | 1.792 |
| **256** | **6e-4** | **15,000** | **983 M** | **69 min** | **1.766** |

bs=256 + √4·LR boost: **−0.026 BPC at −36% wall-clock** AND 1.5× more tokens consumed. The smoother gradient at 4× bs lets LR scale up by √4 = 2×, which trades "more steps" for "more data per step" and wins on both axes for this dataset.

### 2.4 Dataset difficulty offset

Same 46.4M model, 40K bs=64 steps: TinyStories 1.169 vs GLM 1.794 → **GLM is ~0.63 BPC harder** (richer/multi-domain reasoning vs simple stories). We read this as a dataset-entropy offset. The D-scaling slope is similar on both datasets (§2.1); the N-scaling slope was measured under different recipes on the two datasets (§2.2 vs §3.1), so it cannot be compared directly.

---

## 3. Prediction for 300M

### 3.1 Point estimate (REVISED with 3-point fit on GLM)

Additional scale points at 5M and 14M were run on GLM with the same bs=256/lr=6e-4/15K recipe as the 46M run:

| params | val_bpc | local slope (BPC/decade) |
|---|---|---|
| 5.52M | **2.213** | n/a |
| 13.70M | **2.003** | −0.53 |
| 46.45M | **1.766** | −0.45 |

**The slope is flattening**, consistent with approaching an irreducible loss floor. A log-linear extrapolation would therefore predict too low a loss at 300M.

**Chinchilla-form fit:** `L = A + B/N^α` with A=1.0, α=0.218, B=1.76 (N in millions of parameters) reproduces all three points to within 0.02 BPC.
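
As a sanity check on the fit, the sketch below evaluates `L = A + B/N^α` at the three measured sizes and recomputes the local slopes. It assumes N is expressed in millions of parameters (the unit under which the quoted constants reproduce the table); `MEASURED` and `predicted_bpc` are illustrative names, not part of the training code.

```python
import math

# Constants quoted in the text. N is ASSUMED to be in millions of parameters.
A, B, ALPHA = 1.0, 1.76, 0.218

# (params in millions, measured val_bpc) from the GLM table in section 3.1.
MEASURED = [(5.52, 2.213), (13.70, 2.003), (46.45, 1.766)]


def predicted_bpc(n_millions: float) -> float:
    """Evaluate the Chinchilla-form fit L = A + B / N^alpha."""
    return A + B / n_millions ** ALPHA


# Fit minus measurement at each scale point.
residuals = [predicted_bpc(n) - bpc for n, bpc in MEASURED]

# Slope between neighbouring points, in BPC per decade of parameters.
local_slopes = [
    (l1 - l0) / math.log10(n1 / n0)
    for (n0, l0), (n1, l1) in zip(MEASURED, MEASURED[1:])
]

for (n, bpc), r in zip(MEASURED, residuals):
    print(f"N={n:6.2f}M  measured={bpc:.3f}  fit={predicted_bpc(n):.3f}  residual={r:+.4f}")
print("local slopes (BPC/decade):", [round(s, 2) for s in local_slopes])
print(f"fit at 300M: {predicted_bpc(300):.2f}")
```

Under this unit assumption the residuals stay below 0.01 BPC. Evaluating the same expression at N = 300 gives about 1.51, slightly above the ~1.47 quoted for the 15K budget but inside the stated ±0.05.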

Predictions from this fit:

| training budget | val_bpc at 300M |
|---|---|
| 15K bs=256 (1B tokens) | **~1.47** |
| 30K bs=256 (2B tokens) | **~1.39** |
| 60K bs=256 (4B tokens) | **~1.31** |

The fit itself has no token-budget term; the 30K and 60K rows are consistent with applying the −0.08 BPC per doubling rule (§2.1) to the 15K estimate.

**Uncertainty: ±0.05 BPC** (tightened from ±0.10 with the third scale point).

### 3.2 Caveats

1. **Three scale points is still thin.** The original two-point slope left ~±0.1 BPC uncertainty on the 300M prediction; the 13.70M point tightened that to ±0.05. A further point (e.g., 150M) would tighten it again.
2. **1-bit capacity ceiling.** Each 1-bit weight carries at most 1 bit of information (vs ~24 bits of significand precision in an fp32 weight). So our 300M-param strict-±1 model stores ~37.5 MB of *weight entropy*, less than a 50M fp32 model (200 MB). Expect the per-decade slope (−0.25 BPC on TinyStories, §2.2) to flatten as you scale: at some N*, additional params stop helping because they are all ±1-constrained.
3. **Architecture lift available.** The `v52 BitNet variant` (per-channel float α + RMSNorm + softmax attention, still ±1-stored weights) reached 1.361 BPC at 5.5M/20K, **0.12 BPC better than our strict-±1 v70 at the same scale**. If reproduced at 300M, that alone might give −0.12 BPC "for free" relative to the straight extrapolation.
4. **GLM train set is only 387 MB (~390M tokens).** For Chinchilla-optimal training at 300M params (20 tokens/param), we'd want ~6B tokens. Even the 30K-step budget (≈2B tokens) is about 5 passes over the train set, so we'd be training multi-epoch in a data-limited regime, which bends the D curve upward (overfitting).

---

## 4. Recommended 300M hyperparameters

Starting point based on our measurements plus standard 1-bit LLM practice:

### Architecture

- `d_model = 2048` (2× the 50M model's 1024)
- `n_layers = 16` (2×)
- `n_heads = 32` (same; 32 heads was the peak at 50M, and "more heads" stopped helping at 64)
- `d_ff = 1024` (2×)
- `head_dim = 64` (vs 32 at 50M; enables hardware tile alignment)
- Total: doubling `d_model`, `d_ff` and `n_layers` together scales the weight matrices by about 8×, which from 46.4M gives ~370M rather than ~290M. Trim `n_layers` or `d_ff` to land near the 300M that the budgets below assume.

### Training

- `bs = 256` (our bs sweep shows this is peak throughput on A100 40GB at 50M; fall back to bs=128 at 300M if activation memory exceeds the estimate below)
- `lr = 6e-4` with cosine decay to 0, 500-step warmup
- `steps = 30K` (≈ 2B tokens; one doubling beyond the 15K recipe, worth about −0.08 BPC by the empirical rule in §2.1)
- `T = 256`, AdamW `(β1=0.9, β2=0.95)`, `weight_decay = 0.01`, grad clip 1.0
- bf16 autocast + torch.compile
- `tau_start = 2.0 → tau_end = 0.1` log schedule (matches our 50M recipe)

### Memory budget at 300M

- Latent fp32 weights: 300M × 4 bytes = **1.2 GB**
- AdamW m + v (fp32 each): **2.4 GB**
- Grads: 1.2 GB
- Activations at bs=256, T=256, 16 layers: ~4 GB
- Total: 1.2 + 2.4 + 1.2 + ~4 ≈ 8.8 GB, i.e. **~8-10 GB** → fits comfortably on A100 40GB

### Wall-clock estimate

- At 50M / bs=256: 275 ms/step
- At 300M the linear projections cost about 6× more per step, and attention takes a smaller share of step time as the ratio shifts toward the projections
- Estimated: 1.0–1.3 s/step on A100
- 30K steps at 1.0–1.3 s/step: **~8-11 hours on a single A100**

---

## 5. Ablations worth running BEFORE committing 10 hours

Rather than go straight to 300M/30K, run these cheap sanity checks first (each ~1-2 hours):

1. **Third scale point**: train a 15M config (d=640, L=6, H=20, d_ff=320) at bs=256/15K on GLM. The two-point extrapolation predicted val_bpc ≈ 1.92. The 13.70M point reported in §3.1 (val_bpc 2.003) fills this slot and came in about +0.08 higher; the 300M estimate there already reflects it.
2. **Architecture upgrade test**: port the v52 stack (α + RMSNorm + softmax attention, still ±1 weights) to v57's 32-head / bs=256 setup at 46M. If it beats 1.766 significantly (say 1.65), consider switching the default before scaling.
   Worth the risk because 0.1 BPC at 300M is huge.
3. **Bigger d_ff ratio**: our 50M has d_ff/d_model = 0.5. Most production 1-bit LLMs use 2-4×. Try 50M with d_ff=2048 (d_ff/d_model=2.0) and see if we get a free −0.05 BPC. Since this also adds parameters, judge the gain against the N-scaling curve rather than against 1.766 alone.

These three experiments cost ~5 hours total and would substantially de-risk the 300M commit.

---

## 6. TL;DR

- **Step scaling**: −0.08 BPC per doubling of training steps, consistent across 3 doubling intervals (5 runs, 2 datasets)
- **Param scaling**: −0.25 BPC per decade on TinyStories (5M→50M, only 2 points); on GLM the three-point sweep gives −0.53 then −0.45 per decade, i.e. flattening
- **Batch scaling**: bs=64→256 with √4·LR gives −0.026 BPC in 64% of the wall-clock, effectively "free"
- **Predicted 300M @ 30K bs=256 steps**: **~1.39 ± 0.05 BPC on GLM** (three-point fit, §3.1)
- **Honest confidence**: moderate. Three param points is still thin, and 300M is a ~6.5× extrapolation beyond the largest run. A further scale point (e.g. 150M) would tighten the estimate.
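
The token, memory and wall-clock figures in §2.3 and §4 follow from simple arithmetic. The sketch below re-derives them so they can be updated when the config changes. The helper names are hypothetical, 1 GB is taken as 1e9 bytes, and the ~4 GB activation figure is taken as given from §4 rather than derived.

```python
def tokens_seen(batch_size: int, seq_len: int, steps: int) -> int:
    """Tokens (bytes, for a byte-level vocab) consumed over a run."""
    return batch_size * seq_len * steps


def training_state_gb(n_params: float, bytes_per_value: int = 4) -> dict:
    """fp32 latent weights, grads and AdamW moments, in GB (1 GB = 1e9 bytes)."""
    weights = n_params * bytes_per_value / 1e9
    return {"latent_weights": weights, "grads": weights, "adamw_m_and_v": 2 * weights}


def wall_clock_hours(steps: int, sec_per_step: float) -> float:
    """Pure step time, ignoring evaluation and checkpointing."""
    return steps * sec_per_step / 3600


ACTIVATIONS_GB = 4.0  # quoted in section 4, not derived here

state = training_state_gb(300e6)
total_gb = sum(state.values()) + ACTIVATIONS_GB

print("bs=64,  40K steps:", tokens_seen(64, 256, 40_000), "tokens")
print("bs=256, 15K steps:", tokens_seen(256, 256, 15_000), "tokens")
print("bs=256, 30K steps:", tokens_seen(256, 256, 30_000), "tokens")
print(f"training state + activations at 300M: {total_gb:.1f} GB")
for sec in (1.0, 1.3):
    print(f"30K steps at {sec} s/step: {wall_clock_hours(30_000, sec):.1f} h")
```

Swapping in a different `n_params`, batch size or measured step time gives the corresponding budget without redoing the arithmetic by hand.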