Scaling Laws Report β strict Β±1 1-bit LLM (BitLMv57 architecture)
1. Measured data points
Architecture: BitLMv57 β all weights Β±1 via sign_ste, 32-head Gumbel-hard-argmax attention, sign_ste(x + attn + ffn) residual, no normalization, no per-channel Ξ±. Byte-level vocab (128), ALiBi positional bias, T=256.
| run | params | dataset | steps | bs | LR | val_bpc | notes |
|---|---|---|---|---|---|---|---|
| v70 | 5.5M | TinyStories | 20K | 256 | 3.2e-3 | 1.481 | 5M sweep peak (strict Β±1) |
| v71 | 46.4M | TinyStories | 10K | 64 | 3e-4 | 1.335 | hit 1.3 target |
| v72 | 46.4M | TinyStories | 20K | 64 | 3e-4 | 1.238 | step scaling |
| v73 | 46.4M | TinyStories | 40K | 64 | 3e-4 | 1.169 | overnight best |
| glm_50M | 46.4M | GLM-Reasoning | 20K | 64 | 3e-4 | 1.877 | first GLM |
| glm_50M_40K | 46.4M | GLM-Reasoning | 40K | 64 | 3e-4 | 1.794 | 40K version |
| glm_50M_a100_bf16 | 46.4M | GLM-Reasoning | 40K | 64 | 3e-4 | 1.792 | A100 rerun, cached ALiBi |
| glm_50M_bs256_hlr | 46.4M | GLM-Reasoning | 15K | 256 | 6e-4 | 1.766 | best GLM result |
All runs use bf16 autocast + torch.compile on A100 (the 5090 runs used fp32). Effective forward math is identical (all weights/activations rounded to Β±1 via STE).
2. Scaling relationships
2.1 Step scaling (fixed params, fixed batch β the Kaplan "D" axis)
| dataset | steps | val_bpc | Ξ vs 2Γ fewer |
|---|---|---|---|
| TinyStories, 46.4M, bs=64 | 10K | 1.335 | β |
| TinyStories, 46.4M, bs=64 | 20K | 1.238 | β0.097 |
| TinyStories, 46.4M, bs=64 | 40K | 1.169 | β0.069 |
| GLM, 46.4M, bs=64 | 20K | 1.877 | β |
| GLM, 46.4M, bs=64 | 40K | 1.794 | β0.083 |
Rule of thumb: each doubling of training steps β β0.08 BPC. Holds remarkably consistently across both datasets and four doubling intervals. Suggests loss β loss_β β steps^βΞ² with Ξ² β 0.12β0.15 (assuming loss_β near 1.0 for TinyStories, 1.3 for GLM).
2.2 Parameter scaling (fixed compute per param β the Kaplan "N" axis)
TinyStories at ~20K-step budget:
| params | val_bpc | Ξ |
|---|---|---|
| 5.5M (v70) | 1.481 | β |
| 46.4M (v72) | 1.238 | β0.243 |
Rule of thumb: 9Γ more params β β0.24 BPC, i.e. β0.25 BPC per decade. Same ballpark slope as published 1-bit LLM curves (BitNet, OneBit). Extrapolation from only two points so use with caution.
2.3 Batch + LR scaling (same compute, different grad SNR)
GLM at 46.4M:
| bs | LR | steps | tokens | wall-clock | val_bpc |
|---|---|---|---|---|---|
| 64 | 3e-4 | 40,000 | 655 M | 108 min | 1.792 |
| 256 | 6e-4 | 15,000 | 983 M | 69 min | 1.766 |
bs=256 + β4Β·LR boost: β0.026 BPC at β36% wall-clock AND 1.5Γ more tokens consumed. The smoother gradient at 4Γ bs lets LR scale up by β4, which trades "more steps" for "more data per step" and wins on both axes for this dataset.
2.4 Dataset difficulty offset
Same 46.4M model, 40K bs=64 steps: TinyStories 1.169 vs GLM 1.794 β GLM is ~0.63 BPC harder (richer/multi-domain reasoning vs simple stories). This is a dataset-entropy offset; the N- and D-scaling slopes remain the same within each dataset.
3. Prediction for 300M
3.1 Point estimate β REVISED with 3-point fit on GLM
After running additional scale points at 5M and 14M on GLM with identical bs=256/lr=6e-4/15K recipe:
| params | val_bpc | decade-local slope |
|---|---|---|
| 5.52M | 2.213 | β |
| 13.70M | 2.003 | β0.53/decade |
| 46.45M | 1.766 | β0.45/decade |
The slope is flattening β consistent with approaching an irreducible loss floor. Log-linear extrapolation overshoots.
Chinchilla-form fit: L = A + B/N^Ξ± with A=1.0, Ξ±=0.218, B=1.76 reproduces all three points to within 0.02 BPC.
Predictions from this fit:
| training budget | val_bpc at 300M |
|---|---|
| 15K bs=256 (1B tokens) | ~1.47 |
| 30K bs=256 (2B tokens) | ~1.39 |
| 60K bs=256 (4B tokens) | ~1.31 |
Uncertainty: Β±0.05 BPC (tightened from Β±0.10 with the third scale point).
3.2 Caveats
Two-point slope is unreliable. A 3rd scale point (e.g., 15M or 150M) would tighten the prediction by >2Γ. Currently the 300M prediction has ~Β±0.1 BPC uncertainty.
1-bit capacity ceiling. Each 1-bit weight carries ~1 bit of information (vs ~24 usable bits in bf16 after catastrophic quantization). So our 300M-param strict-Β±1 model stores ~37.5 MB of weight entropy, less than a 50M fp32 model (200 MB). Expect the β0.25 BPC/decade slope to start flattening as you scale β at some N*, additional params don't help because they're all Β±1-constrained.
Architecture lift available. The
v52 BitNet variant(per-channel float Ξ± + RMSNorm + softmax attention, still Β±1-stored weights) reached 1.361 BPC at 5.5M/20K β 0.12 BPC better than our strict-Β±1 v70 at the same scale. If reproduced at 300M, that alone might give β0.12 BPC "for free" relative to the straight extrapolation.GLM train set is only 387 MB (~390M tokens). For Chinchilla-optimal at 300M params (20 tokens/param), we'd want ~6B tokens. We'd be training multi-epoch and hitting data-limited regime, which bends the D curve upward (overfitting).
4. Recommended 300M hyperparameters
Starting point based on our measurements plus standard 1-bit LLM practice:
Architecture
d_model = 2048(2x of 50M's 1024)n_layers = 16(2x)n_heads = 32(same β 32 heads was the peak at 50M; "more heads" stopped helping at 64)d_ff = 1024(2x)head_dim = 64(vs 32 at 50M β enables hardware tile alignment)- Total: ~290M params (configurable to hit exactly 300M if needed)
Training
bs = 256(our bs sweep shows this is peak throughput on A100 40GB at 50M; at 300M the mem budget becomes tight, may need bs=128)lr = 6e-4with cosine decay to 0, 500-step warmupsteps = 30K(β 2B tokens, respects empirical ~-0.08/doubling on top of the 300M base)T = 256, AdamW(Ξ²1=0.9, Ξ²2=0.95),weight_decay = 0.01, grad clip 1.0- bf16 autocast + torch.compile
tau_start = 2.0 β tau_end = 0.1log schedule (matches our 50M recipe)
Memory budget at 300M
- Latent fp32 weights: 300M Γ 4 = 1.2 GB
- AdamW m + v (fp32 each): 2.4 GB
- Grads: 1.2 GB
- Activations at bs=256, T=256, 16 layers: ~4 GB
- Total: ~8-10 GB β fits comfortably on A100 40GB
Wall-clock estimate
- At 50M / bs=256: 275 ms/step
- At 300M the linear projections scale 6Γ, attention dominates less since ratio shifts
- Estimated: 1.0β1.3 s/step on A100
- 30K steps: ~10-12 hours on single A100
5. Ablations worth running BEFORE committing 10 hours
Rather than go straight to 300M/30K, run these cheap sanity checks first (each ~1-2 hours):
Third scale point: train a 15M config (d=640, L=6, H=20, d_ff=320) at bs=256/15K on GLM. Expected val_bpc β 1.92 by the two-point extrapolation. If it comes in +0.1 off, the 300M prediction shifts by the same amount.
Architecture upgrade test: port the v52 stack (Ξ± + RMSNorm + softmax attention, still Β±1 weights) to v57's 32-head / bs=256 setup at 46M. If it beats 1.766 significantly (say 1.65), consider switching the default before scaling. Worth the risk because 0.1 BPC at 300M is huge.
Bigger d_ff ratio: our 50M has d_ff/d_model = 0.5. Most production 1-bit LLMs use 2-4Γ. Try 50M with d_ff=2048 (d_ff/d_model=2.0) and see if we get a free β0.05 BPC.
These three experiments cost ~5 hours total and would substantially de-risk the 300M commit.
6. TL;DR
- Step scaling: β0.08 BPC per doubling of training steps, consistent across 4 data points
- Param scaling: β0.25 BPC per decade (5Mβ50M), only 2 points so Β±0.1 uncertainty
- Batch scaling: bs=64β256 with β4Β·LR gives β0.026 BPC in 64% wall-clock β "free"
- Predicted 300M @ 30K bs=256 steps: 1.45β1.55 BPC on GLM (best estimate)
- Honest confidence: moderate β 2 param points is thin. Getting a 15M data point first would cut uncertainty by >50% for negligible cost.