bitnet-1bitllm / SCALING_REPORT.md
hidude562's picture
1bitllm code (checkpoints to follow)
4754707 verified

Scaling Laws Report β€” strict Β±1 1-bit LLM (BitLMv57 architecture)

1. Measured data points

Architecture: BitLMv57 β€” all weights Β±1 via sign_ste, 32-head Gumbel-hard-argmax attention, sign_ste(x + attn + ffn) residual, no normalization, no per-channel Ξ±. Byte-level vocab (128), ALiBi positional bias, T=256.

run params dataset steps bs LR val_bpc notes
v70 5.5M TinyStories 20K 256 3.2e-3 1.481 5M sweep peak (strict Β±1)
v71 46.4M TinyStories 10K 64 3e-4 1.335 hit 1.3 target
v72 46.4M TinyStories 20K 64 3e-4 1.238 step scaling
v73 46.4M TinyStories 40K 64 3e-4 1.169 overnight best
glm_50M 46.4M GLM-Reasoning 20K 64 3e-4 1.877 first GLM
glm_50M_40K 46.4M GLM-Reasoning 40K 64 3e-4 1.794 40K version
glm_50M_a100_bf16 46.4M GLM-Reasoning 40K 64 3e-4 1.792 A100 rerun, cached ALiBi
glm_50M_bs256_hlr 46.4M GLM-Reasoning 15K 256 6e-4 1.766 best GLM result

All runs use bf16 autocast + torch.compile on A100 (the 5090 runs used fp32). Effective forward math is identical (all weights/activations rounded to Β±1 via STE).


2. Scaling relationships

2.1 Step scaling (fixed params, fixed batch β€” the Kaplan "D" axis)

dataset steps val_bpc Ξ” vs 2Γ— fewer
TinyStories, 46.4M, bs=64 10K 1.335 β€”
TinyStories, 46.4M, bs=64 20K 1.238 βˆ’0.097
TinyStories, 46.4M, bs=64 40K 1.169 βˆ’0.069
GLM, 46.4M, bs=64 20K 1.877 β€”
GLM, 46.4M, bs=64 40K 1.794 βˆ’0.083

Rule of thumb: each doubling of training steps β‰ˆ βˆ’0.08 BPC. Holds remarkably consistently across both datasets and four doubling intervals. Suggests loss βˆ’ loss_∞ ∝ steps^βˆ’Ξ² with Ξ² β‰ˆ 0.12–0.15 (assuming loss_∞ near 1.0 for TinyStories, 1.3 for GLM).

2.2 Parameter scaling (fixed compute per param β€” the Kaplan "N" axis)

TinyStories at ~20K-step budget:

params val_bpc Ξ”
5.5M (v70) 1.481 β€”
46.4M (v72) 1.238 βˆ’0.243

Rule of thumb: 9Γ— more params β‰ˆ βˆ’0.24 BPC, i.e. βˆ’0.25 BPC per decade. Same ballpark slope as published 1-bit LLM curves (BitNet, OneBit). Extrapolation from only two points so use with caution.

2.3 Batch + LR scaling (same compute, different grad SNR)

GLM at 46.4M:

bs LR steps tokens wall-clock val_bpc
64 3e-4 40,000 655 M 108 min 1.792
256 6e-4 15,000 983 M 69 min 1.766

bs=256 + √4Β·LR boost: βˆ’0.026 BPC at βˆ’36% wall-clock AND 1.5Γ— more tokens consumed. The smoother gradient at 4Γ— bs lets LR scale up by √4, which trades "more steps" for "more data per step" and wins on both axes for this dataset.

2.4 Dataset difficulty offset

Same 46.4M model, 40K bs=64 steps: TinyStories 1.169 vs GLM 1.794 β†’ GLM is ~0.63 BPC harder (richer/multi-domain reasoning vs simple stories). This is a dataset-entropy offset; the N- and D-scaling slopes remain the same within each dataset.


3. Prediction for 300M

3.1 Point estimate β€” REVISED with 3-point fit on GLM

After running additional scale points at 5M and 14M on GLM with identical bs=256/lr=6e-4/15K recipe:

params val_bpc decade-local slope
5.52M 2.213 β€”
13.70M 2.003 βˆ’0.53/decade
46.45M 1.766 βˆ’0.45/decade

The slope is flattening β€” consistent with approaching an irreducible loss floor. Log-linear extrapolation overshoots.

Chinchilla-form fit: L = A + B/N^Ξ± with A=1.0, Ξ±=0.218, B=1.76 reproduces all three points to within 0.02 BPC.

Predictions from this fit:

training budget val_bpc at 300M
15K bs=256 (1B tokens) ~1.47
30K bs=256 (2B tokens) ~1.39
60K bs=256 (4B tokens) ~1.31

Uncertainty: Β±0.05 BPC (tightened from Β±0.10 with the third scale point).

3.2 Caveats

  1. Two-point slope is unreliable. A 3rd scale point (e.g., 15M or 150M) would tighten the prediction by >2Γ—. Currently the 300M prediction has ~Β±0.1 BPC uncertainty.

  2. 1-bit capacity ceiling. Each 1-bit weight carries ~1 bit of information (vs ~24 usable bits in bf16 after catastrophic quantization). So our 300M-param strict-Β±1 model stores ~37.5 MB of weight entropy, less than a 50M fp32 model (200 MB). Expect the βˆ’0.25 BPC/decade slope to start flattening as you scale β€” at some N*, additional params don't help because they're all Β±1-constrained.

  3. Architecture lift available. The v52 BitNet variant (per-channel float Ξ± + RMSNorm + softmax attention, still Β±1-stored weights) reached 1.361 BPC at 5.5M/20K β€” 0.12 BPC better than our strict-Β±1 v70 at the same scale. If reproduced at 300M, that alone might give βˆ’0.12 BPC "for free" relative to the straight extrapolation.

  4. GLM train set is only 387 MB (~390M tokens). For Chinchilla-optimal at 300M params (20 tokens/param), we'd want ~6B tokens. We'd be training multi-epoch and hitting data-limited regime, which bends the D curve upward (overfitting).


4. Recommended 300M hyperparameters

Starting point based on our measurements plus standard 1-bit LLM practice:

Architecture

  • d_model = 2048 (2x of 50M's 1024)
  • n_layers = 16 (2x)
  • n_heads = 32 (same β€” 32 heads was the peak at 50M; "more heads" stopped helping at 64)
  • d_ff = 1024 (2x)
  • head_dim = 64 (vs 32 at 50M β€” enables hardware tile alignment)
  • Total: ~290M params (configurable to hit exactly 300M if needed)

Training

  • bs = 256 (our bs sweep shows this is peak throughput on A100 40GB at 50M; at 300M the mem budget becomes tight, may need bs=128)
  • lr = 6e-4 with cosine decay to 0, 500-step warmup
  • steps = 30K (β‰ˆ 2B tokens, respects empirical ~-0.08/doubling on top of the 300M base)
  • T = 256, AdamW (Ξ²1=0.9, Ξ²2=0.95), weight_decay = 0.01, grad clip 1.0
  • bf16 autocast + torch.compile
  • tau_start = 2.0 β†’ tau_end = 0.1 log schedule (matches our 50M recipe)

Memory budget at 300M

  • Latent fp32 weights: 300M Γ— 4 = 1.2 GB
  • AdamW m + v (fp32 each): 2.4 GB
  • Grads: 1.2 GB
  • Activations at bs=256, T=256, 16 layers: ~4 GB
  • Total: ~8-10 GB β†’ fits comfortably on A100 40GB

Wall-clock estimate

  • At 50M / bs=256: 275 ms/step
  • At 300M the linear projections scale 6Γ—, attention dominates less since ratio shifts
  • Estimated: 1.0–1.3 s/step on A100
  • 30K steps: ~10-12 hours on single A100

5. Ablations worth running BEFORE committing 10 hours

Rather than go straight to 300M/30K, run these cheap sanity checks first (each ~1-2 hours):

  1. Third scale point: train a 15M config (d=640, L=6, H=20, d_ff=320) at bs=256/15K on GLM. Expected val_bpc β‰ˆ 1.92 by the two-point extrapolation. If it comes in +0.1 off, the 300M prediction shifts by the same amount.

  2. Architecture upgrade test: port the v52 stack (Ξ± + RMSNorm + softmax attention, still Β±1 weights) to v57's 32-head / bs=256 setup at 46M. If it beats 1.766 significantly (say 1.65), consider switching the default before scaling. Worth the risk because 0.1 BPC at 300M is huge.

  3. Bigger d_ff ratio: our 50M has d_ff/d_model = 0.5. Most production 1-bit LLMs use 2-4Γ—. Try 50M with d_ff=2048 (d_ff/d_model=2.0) and see if we get a free βˆ’0.05 BPC.

These three experiments cost ~5 hours total and would substantially de-risk the 300M commit.


6. TL;DR

  • Step scaling: βˆ’0.08 BPC per doubling of training steps, consistent across 4 data points
  • Param scaling: βˆ’0.25 BPC per decade (5Mβ†’50M), only 2 points so Β±0.1 uncertainty
  • Batch scaling: bs=64β†’256 with √4Β·LR gives βˆ’0.026 BPC in 64% wall-clock β€” "free"
  • Predicted 300M @ 30K bs=256 steps: 1.45–1.55 BPC on GLM (best estimate)
  • Honest confidence: moderate β€” 2 param points is thin. Getting a 15M data point first would cut uncertainty by >50% for negligible cost.