Scaling Laws Report — strict ±1 1-bit LLM (BitLMv57 architecture)

1. Measured data points

Architecture: BitLMv57 — all weights ±1 via sign_ste, 32-head Gumbel-hard-argmax attention, sign_ste(x + attn + ffn) residual, no normalization, no per-channel α. Byte-level vocab (128), ALiBi positional bias, T=256.

run	params	dataset	steps	bs	LR	val_bpc	notes
v70	5.5M	TinyStories	20K	256	3.2e-3	1.481	5M sweep peak (strict ±1)
v71	46.4M	TinyStories	10K	64	3e-4	1.335	hit 1.3 target
v72	46.4M	TinyStories	20K	64	3e-4	1.238	step scaling
v73	46.4M	TinyStories	40K	64	3e-4	1.169	overnight best
glm_50M	46.4M	GLM-Reasoning	20K	64	3e-4	1.877	first GLM
glm_50M_40K	46.4M	GLM-Reasoning	40K	64	3e-4	1.794	40K version
glm_50M_a100_bf16	46.4M	GLM-Reasoning	40K	64	3e-4	1.792	A100 rerun, cached ALiBi
glm_50M_bs256_hlr	46.4M	GLM-Reasoning	15K	256	6e-4	1.766	best GLM result

All runs use bf16 autocast + torch.compile on A100 (the 5090 runs used fp32). Effective forward math is identical (all weights/activations rounded to ±1 via STE).

2. Scaling relationships

2.1 Step scaling (fixed params, fixed batch — the Kaplan "D" axis)

dataset	steps	val_bpc	Δ vs 2× fewer
TinyStories, 46.4M, bs=64	10K	1.335	—
TinyStories, 46.4M, bs=64	20K	1.238	−0.097
TinyStories, 46.4M, bs=64	40K	1.169	−0.069
GLM, 46.4M, bs=64	20K	1.877	—
GLM, 46.4M, bs=64	40K	1.794	−0.083

Rule of thumb: each doubling of training steps ≈ −0.08 BPC. Holds remarkably consistently across both datasets and four doubling intervals. Suggests loss − loss_∞ ∝ steps^−β with β ≈ 0.12–0.15 (assuming loss_∞ near 1.0 for TinyStories, 1.3 for GLM).

2.2 Parameter scaling (fixed compute per param — the Kaplan "N" axis)

TinyStories at ~20K-step budget:

params	val_bpc	Δ
5.5M (v70)	1.481	—
46.4M (v72)	1.238	−0.243

Rule of thumb: 9× more params ≈ −0.24 BPC, i.e. −0.25 BPC per decade. Same ballpark slope as published 1-bit LLM curves (BitNet, OneBit). Extrapolation from only two points so use with caution.

2.3 Batch + LR scaling (same compute, different grad SNR)

GLM at 46.4M:

bs	LR	steps	tokens	wall-clock	val_bpc
64	3e-4	40,000	655 M	108 min	1.792
256	6e-4	15,000	983 M	69 min	1.766

bs=256 + √4·LR boost: −0.026 BPC at −36% wall-clock AND 1.5× more tokens consumed. The smoother gradient at 4× bs lets LR scale up by √4, which trades "more steps" for "more data per step" and wins on both axes for this dataset.

2.4 Dataset difficulty offset

Same 46.4M model, 40K bs=64 steps: TinyStories 1.169 vs GLM 1.794 → GLM is ~0.63 BPC harder (richer/multi-domain reasoning vs simple stories). This is a dataset-entropy offset; the N- and D-scaling slopes remain the same within each dataset.

3. Prediction for 300M

3.1 Point estimate — REVISED with 3-point fit on GLM

After running additional scale points at 5M and 14M on GLM with identical bs=256/lr=6e-4/15K recipe:

params	val_bpc	decade-local slope
5.52M	2.213	—
13.70M	2.003	−0.53/decade
46.45M	1.766	−0.45/decade

The slope is flattening — consistent with approaching an irreducible loss floor. Log-linear extrapolation overshoots.

Chinchilla-form fit: L = A + B/N^α with A=1.0, α=0.218, B=1.76 reproduces all three points to within 0.02 BPC.

Predictions from this fit:

training budget	val_bpc at 300M
15K bs=256 (1B tokens)	~1.47
30K bs=256 (2B tokens)	~1.39
60K bs=256 (4B tokens)	~1.31

Uncertainty: ±0.05 BPC (tightened from ±0.10 with the third scale point).

3.2 Caveats

Two-point slope is unreliable. A 3rd scale point (e.g., 15M or 150M) would tighten the prediction by >2×. Currently the 300M prediction has ~±0.1 BPC uncertainty.
1-bit capacity ceiling. Each 1-bit weight carries ~1 bit of information (vs ~24 usable bits in bf16 after catastrophic quantization). So our 300M-param strict-±1 model stores ~37.5 MB of weight entropy, less than a 50M fp32 model (200 MB). Expect the −0.25 BPC/decade slope to start flattening as you scale — at some N*, additional params don't help because they're all ±1-constrained.
Architecture lift available. The v52 BitNet variant (per-channel float α + RMSNorm + softmax attention, still ±1-stored weights) reached 1.361 BPC at 5.5M/20K — 0.12 BPC better than our strict-±1 v70 at the same scale. If reproduced at 300M, that alone might give −0.12 BPC "for free" relative to the straight extrapolation.
GLM train set is only 387 MB (~390M tokens). For Chinchilla-optimal at 300M params (20 tokens/param), we'd want ~6B tokens. We'd be training multi-epoch and hitting data-limited regime, which bends the D curve upward (overfitting).

4. Recommended 300M hyperparameters

Starting point based on our measurements plus standard 1-bit LLM practice:

Architecture

d_model = 2048 (2x of 50M's 1024)
n_layers = 16 (2x)
n_heads = 32 (same — 32 heads was the peak at 50M; "more heads" stopped helping at 64)
d_ff = 1024 (2x)
head_dim = 64 (vs 32 at 50M — enables hardware tile alignment)
Total: ~290M params (configurable to hit exactly 300M if needed)

Training

bs = 256 (our bs sweep shows this is peak throughput on A100 40GB at 50M; at 300M the mem budget becomes tight, may need bs=128)
lr = 6e-4 with cosine decay to 0, 500-step warmup
steps = 30K (≈ 2B tokens, respects empirical ~-0.08/doubling on top of the 300M base)
T = 256, AdamW (β1=0.9, β2=0.95), weight_decay = 0.01, grad clip 1.0
bf16 autocast + torch.compile
tau_start = 2.0 → tau_end = 0.1 log schedule (matches our 50M recipe)

Memory budget at 300M

Latent fp32 weights: 300M × 4 = 1.2 GB
AdamW m + v (fp32 each): 2.4 GB
Grads: 1.2 GB
Activations at bs=256, T=256, 16 layers: ~4 GB
Total: ~8-10 GB → fits comfortably on A100 40GB

Wall-clock estimate

At 50M / bs=256: 275 ms/step
At 300M the linear projections scale 6×, attention dominates less since ratio shifts
Estimated: 1.0–1.3 s/step on A100
30K steps: ~10-12 hours on single A100

5. Ablations worth running BEFORE committing 10 hours

Rather than go straight to 300M/30K, run these cheap sanity checks first (each ~1-2 hours):

Third scale point: train a 15M config (d=640, L=6, H=20, d_ff=320) at bs=256/15K on GLM. Expected val_bpc ≈ 1.92 by the two-point extrapolation. If it comes in +0.1 off, the 300M prediction shifts by the same amount.
Architecture upgrade test: port the v52 stack (α + RMSNorm + softmax attention, still ±1 weights) to v57's 32-head / bs=256 setup at 46M. If it beats 1.766 significantly (say 1.65), consider switching the default before scaling. Worth the risk because 0.1 BPC at 300M is huge.
Bigger d_ff ratio: our 50M has d_ff/d_model = 0.5. Most production 1-bit LLMs use 2-4×. Try 50M with d_ff=2048 (d_ff/d_model=2.0) and see if we get a free −0.05 BPC.

These three experiments cost ~5 hours total and would substantially de-risk the 300M commit.

6. TL;DR

Step scaling: −0.08 BPC per doubling of training steps, consistent across 4 data points
Param scaling: −0.25 BPC per decade (5M→50M), only 2 points so ±0.1 uncertainty
Batch scaling: bs=64→256 with √4·LR gives −0.026 BPC in 64% wall-clock — "free"
Predicted 300M @ 30K bs=256 steps: 1.45–1.55 BPC on GLM (best estimate)
Honest confidence: moderate — 2 param points is thin. Getting a 15M data point first would cut uncertainty by >50% for negligible cost.