A pure-binary character-level language model: POC progress report

Scope: we set out to test the maximalist claim from the ±1 LM whitepaper, namely that a language model can be built in which every weight, activation, and inference-time operation reduces to binary/integer arithmetic. We started from a 5M-parameter TinyStories character model and scaled to 21M. This report covers what was attempted, what worked, what didn't, and where to go next for the 100M–500M scale-up.

Headline results

| Model | Val BPC | Ratio to FP32 | Inference |
| --- | --- | --- | --- |
| FP32 reference (5.3M params, standard transformer) | 0.96 | 1.0× | full FP |
| v21 (21M params, v18 arch, best model) | 1.47 | 1.53× | pure integer |
| v20 (~12M params, v18 arch) | 1.48 | 1.54× | pure integer |
| v17 (5M params, v18 arch) | 1.68 | 1.75× | pure integer |
| POC plateau (v3, standard bool-threshold attention) | 3.20 | 3.33× | pure integer |

The best model (v21, 21M params) produces qualitatively readable English on TinyStories, with dialogue, character names, and coherent sentence structure, while running inference using only XNOR, popcount, integer compare, and gather operations. There is no floating-point arithmetic on the hot path, and the integer path is verified to produce byte-identical predictions to the float path on 16 384 test positions.

Deployed inference is 8 200 tokens/second on a single AMD Ryzen 9 (4 OpenMP threads) for the 21M model, with weights stored as 2.8 MB of packed bits. All float scalars from training (1/√d scaling, ALiBi slopes, logit temperature, bias vector) are absorbed into integer thresholds at checkpoint-load time.

The research journey: 21 architectural variants

The whitepaper identified "selective sparsity in attention" as the binding constraint for binary LMs. We iterated through 21 concrete variants trying to close that gap at ≤5M params on TinyStories char-level. In chronological order:

Phase 1: Establish the plateau (v2–v4)

  • v2: fully ±1 with BiBERT bool-threshold attention: 3.62 BPC (had a +1 tie-break bias in majority-vote residuals that compounded across 8 layers)
  • v3: fixed v2 with a 3-way parallel residual sign(x + attn(x) + ffn(x)): 3.20 BPC plateau. This became the reference "fully-±1 floor"
  • v4: same as v3 but relaxed attention scores to softmax (float concession): 2.72 BPC. This showed that attention was the limiting component

Conclusion: The 0.48 BPC gap between v3 and v4 quantified the cost of binary attention. Everything after this tried to close that gap without a float softmax.
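
The tie-break issue that separated v2 from v3 can be illustrated with a small enumeration. This is only a sketch: it assumes v2's vote had an even number of ±1 inputs (the report does not say how many), and `sign_with_tiebreak` and `vote_stats` are hypothetical helper names, not functions from the repository.

```python
from itertools import product

def sign_with_tiebreak(total: int) -> int:
    """sign() that maps a zero sum to +1, i.e. a biased tie-break."""
    return 1 if total >= 0 else -1

def vote_stats(n_inputs: int):
    """Enumerate every combination of n_inputs values in {-1, +1}.

    Returns (number of tied sums, mean output under the +1 tie-break).
    """
    ties, out_sum, count = 0, 0, 0
    for combo in product((-1, 1), repeat=n_inputs):
        total = sum(combo)
        if total == 0:
            ties += 1
        out_sum += sign_with_tiebreak(total)
        count += 1
    return ties, out_sum / count

print(vote_stats(2))  # even-way vote: ties exist and the output skews positive
print(vote_stats(3))  # 3-way vote: the sum is always odd, so no ties and no skew
```

With three ±1 inputs the sum is always in {-3, -1, 1, 3}, which is why the v3 residual never needs a tie-break at all.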

Phase 2: Sprint plan variants, all negative (v5–v8)

Following a detailed sprint plan from the paper's literature survey:

  • v5: Hadamard rotation + learnable integer τ + 5-way residual with constant bias: 3.37 BPC (worse)
  • v6: XOR-multiplicative residual y = x ⊙ F(x): 3.57 BPC (worse), indicating that multiplicative composition hurts in this setting
  • v7: real 5-way parallel residual with multiple FFN branches: 3.31 BPC (neutral vs v3)
  • v8: Sign-JL attention with fixed random projections: 3.33 BPC (no gain over v3, confirming that the unbiasedness of JL estimation doesn't help when it is followed by a sign())

Conclusion: The four top-ranked bets from the sprint plan all failed at matched step count. The gradient-flow hypothesis was also wrong (v11, in Phase 4, had grad_nz = 0.997 and still hit a worse plateau).

Phase 3: Different paradigms (v9–v10)

  • v9: pure-binary spiking RWKV char-LM (no attention matrix, linear recurrence + LIF neurons): 3.38 BPC. This confirmed that the SpikeGPT/BitNet synthesis doesn't outperform standard binary attention at this scale
  • v10: pure SDM char-LM (zero SGD, single-pass Hamming-ball retrieval): 6.12 BPC. This confirmed the research report's own prediction that classical associative memory plateaus at n-gram quality

Phase 4: Capacity and training-dynamics isolation (v11–v13)

  • v11: top-k ternary attention ({-1, 0, +1} per query, sparse): 3.49 BPC (strictly worse; STE through discrete top-k gives a noisy gradient)
  • v12: noise-annealed STE: unstable, collapsed to the uniform-prediction level of 6.9 BPC at high noise
  • v13: time-multiplexed blocks (T=3 passes per token with random mask perturbations): 3.55 BPC plateau

Phase 5: Width sweep (v14–v15)

Testing the hypothesis that hidden-state capacity binds the plateau:

  • v14: d_model=512, 4 layers (same 5M params as v3): 3.18 BPC (small improvement)
  • v15: d_model=768, 2 layers: 3.20 BPC (width helps but depth matters)

Phase 6: The breakthrough (v16), Gumbel hard-attention

Tried replacing the bool-threshold attention matrix with a Gumbel-softmax one-hot selection:

# At training:
g = gumbel_noise(scores.shape)
y_soft = softmax((scores + g) / tau, dim=-1)
y_hard = one_hot(argmax(y_soft))
A = y_soft + (y_hard - y_soft).detach()    # straight-through

# At inference:
A = one_hot(argmax(scores))                 # pure argmax

Each query attends to exactly one key, so the attention matrix becomes binary {0, 1} with a single 1 per row. The temperature τ anneals from 2.0 to 0.1 over training.

Result at step 1500: 2.14 BPC, vs v3's 2.14 at step 13500. At step 3500, v16 crossed below 2.0 BPC, the whitepaper's H1 target. Final v16 at step 10k: 1.72 BPC.

The diagnosis of why this worked was surprising. Grad-flow went from 0.79 (v3) to 0.997 (v16), but v11 also had 0.997 and hit a 3.49 BPC plateau. The difference was gradient quality: Gumbel-softmax gives a proper continuous gradient through the attention selection, while STE-through-top-k gives the direction but not the magnitude of the desired update.
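
To make the two selection paths concrete, here is a dependency-free sketch of the forward computation for a single query row. It is not the training code: the straight-through step needs an autograd framework (the forward value would be `y_hard`, the backward pass would use the gradient of `y_soft`), and the function names and the clamped uniform sampling are assumptions made for illustration.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def gumbel_hard_select(scores, tau, rng):
    """Training-time forward pass for one query row.

    Returns (y_soft, y_hard): the relaxed distribution that would carry the
    gradient, and the one-hot selection used as the forward value.
    """
    # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1); u is kept
    # away from 0 and 1 so both logarithms stay finite.
    noise = [-math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9))) for _ in scores]
    y_soft = softmax([(s + g) / tau for s, g in zip(scores, noise)])
    k = max(range(len(scores)), key=lambda i: y_soft[i])
    y_hard = [1 if i == k else 0 for i in range(len(scores))]
    return y_soft, y_hard

def argmax_select(scores):
    """Inference-time selection: plain argmax, no noise, no temperature."""
    k = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == k else 0 for i in range(len(scores))]

rng = random.Random(0)
y_soft, y_hard = gumbel_hard_select([3.0, 1.0, 0.0, 2.0], tau=0.5, rng=rng)
print(y_soft, y_hard)
print(argmax_select([3.0, 1.0, 0.0, 2.0]))
```

At high τ the noisy selection explores keys other than the top-scoring one; as τ is annealed down, `y_soft` concentrates on a single key and the training-time selection approaches the inference-time argmax.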

Phase 7: Combine wins (v17)

  • v17: v16 + d_model=512 (width of v14 + attention of v16): 1.68 BPC at 10k steps, our best 5M-scale result.

Phase 8: Deployment, pure-integer inference (v18)

v16/v17's inference path still had a few float scalars: 1/√d_head score scaling, fractional ALiBi slopes, float logit_scale/out_bias. All are positive-monotone under their argmax, so they can be absorbed into integer thresholds without changing any prediction. v18 makes this explicit:

  • Integer ALiBi slopes (powers of 2)
  • BitLinear thresholds stored as ceil(threshold_float · √in_features)
  • Output bias stored as round(out_bias · 2^16 / logit_scale) in INT64
  • Attention matrix as argmax(int_scores) → one-hot gather from V

Verified to 100.0% byte-identical predictions to the float path over 16 384 test positions. v18 at step 10k: 1.74 BPC.
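
The threshold fold can be checked in a few lines. The sketch below assumes the float-path activation fires when `popcount_sum / sqrt(in_features) >= threshold_float`; that comparison form and all three function names are assumptions, since the report only gives the stored formula `ceil(threshold_float · √in_features)`.

```python
import math

def float_fire(popcount_sum: int, in_features: int, threshold_float: float) -> bool:
    """Float path (assumed form): scale the integer pre-activation, then compare."""
    return popcount_sum / math.sqrt(in_features) >= threshold_float

def int_threshold(in_features: int, threshold_float: float) -> int:
    """Export-time fold: ceil(threshold_float * sqrt(in_features))."""
    return math.ceil(threshold_float * math.sqrt(in_features))

def int_fire(popcount_sum: int, threshold_int: int) -> bool:
    """Deployed path: a single integer compare."""
    return popcount_sum >= threshold_int

# Because popcount_sum is an integer, s >= t*sqrt(n) holds exactly when
# s >= ceil(t*sqrt(n)), so both paths agree for every possible sum.
n = 64
for t in (-0.7, -0.25, 0.0, 0.3, 1.1):
    k = int_threshold(n, t)
    assert all(float_fire(s, n, t) == int_fire(s, k) for s in range(-n, n + 1))
print(int_threshold(64, 0.3))  # 0.3 * 8 = 2.4, stored as 3
```

The equivalence is exact in real arithmetic; when `threshold_float · √in_features` lands within floating-point rounding of an integer the two paths could in principle disagree, so an empirical agreement check like the one reported above remains useful.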

Phase 9: Scale (v20–v21)

With the architecture and inference pipeline locked, scaling up:

  • v20: d_model=512, 6 layers, d_ff=768 (~12M params), 12k steps: 1.48 BPC (−0.20 vs v17)
  • v21: d_model=512, 8 layers, d_ff=1024 (~21M params), 12k steps: 1.47 BPC (−0.01 vs v20)

Scaling wins are large from 5M→12M and diminishing from 12M→21M at fixed step budget, so we're training-budget-bound more than capacity-bound at this size. Longer training would likely unlock more v21 capacity.

The final architecture: v18 in one page

Input: char IDs (int) of length T

Embedding:
    x = gather_row(embed_codebook_±1, char_id)        # ±1 vector of d_model

For each of L layers:
    Token mixer (Gumbel hard-attention):
        Q, K, V = three BitLinear projections of x    # ±1 d_model each
        scores = Q · Kᵀ                                # integer popcount
        scores -= int_alibi_slope · |i − j|            # integer subtract
        at train: A = gumbel_softmax_straight_through(scores, τ)
        at eval:  A = one_hot(argmax(scores))          # single integer argmax per query
        O = gather(V, argmax_index)                    # pointer, no multiply
        attn_out = BitLinear(O)                        # ±1

    Channel mixer (XNOR-gated FFN):
        g, u = BitLinear_gate(x), BitLinear_up(x)      # ±1 each, d_ff
        h = g XNOR u                                   # ±1, d_ff
        ffn_out = BitLinear_down(h)                    # ±1, d_model

    Residual (3-way majority):
        x = sign(x + attn_out + ffn_out)               # {-3,-1,1,3} → sign, no ties

Output:
    logits = popcount(x · embed_codebookᵀ) · 2^16 + int_out_bias    # int64
    next_char = argmax(logits)                         # integer argmax over vocab
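
As an illustration of the eval-time token mixer in the listing above, the following sketch runs one attention head with integer scores, an integer ALiBi penalty, and a gather. Two details are assumptions that the listing does not state: keys are restricted causally to positions 0..i, and ties go to the earliest key. `hard_attention` is a hypothetical name.

```python
def hard_attention(scores, values, slope):
    """One-hot attention for a single head.

    scores[i][j]: integer Q·K score for query i and key j.
    values[j]:    the ±1 value vector for position j.
    slope:        integer ALiBi slope.
    Returns one gathered value vector per query (a pointer copy, no multiply).
    """
    out = []
    for i, row in enumerate(scores):
        best_j, best = 0, None
        for j in range(i + 1):                 # causal: keys 0..i only (assumed)
            s = row[j] - slope * (i - j)       # integer ALiBi distance penalty
            if best is None or s > best:       # ties keep the earliest key (assumed)
                best_j, best = j, s
        out.append(values[best_j])
    return out

scores = [[4, 0, 0], [6, 2, 0], [6, 4, 4]]
values = [[1, -1], [-1, -1], [1, 1]]
print(hard_attention(scores, values, slope=2))  # last query picks position 2
print(hard_attention(scores, values, slope=0))  # without ALiBi it picks position 0
```

Every step here is an integer subtract, an integer compare, or an index lookup, which is what lets the attention block run without any floating-point unit.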

Forward invariants:

  • All weights are stored as 1-bit signed (+1/−1) in the deployed model
  • All activations between blocks are 1-bit signed
  • Attention matrix is binary (one-hot per row)
  • The closest thing to a "float" at inference is a single int64 accumulator per output token
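
Because weights and activations are both ±1, every dot product in a BitLinear reduces to XNOR plus popcount. The sketch below shows the identity on Python integers used as bit vectors; the bit convention (bit set means +1) and the helper names are assumptions, and the deployed C code works on uint64 words rather than arbitrary-precision ints.

```python
def pack_bits(vec):
    """Pack a ±1 vector into an int, with bit i set when vec[i] == +1."""
    word = 0
    for i, v in enumerate(vec):
        if v == 1:
            word |= 1 << i
    return word

def xnor_popcount_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two packed ±1 vectors of length n.

    XNOR marks the positions where the signs agree. Each agreement contributes
    +1 and each disagreement -1, so dot = agree - (n - agree) = 2*agree - n.
    """
    mask = (1 << n) - 1
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agree - n

a = [1, -1, 1, 1, -1]
b = [1, 1, -1, 1, -1]
print(xnor_popcount_dot(pack_bits(a), pack_bits(b), len(a)))  # 1
print(sum(x * y for x, y in zip(a, b)))                       # 1
```

The same identity explains the XNOR gate in the channel mixer: for ±1 values, XNOR of the bits is just the elementwise product.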

Training concessions:

  • Float latent weight (small Gaussian) → sign_ste yields the ±1 forward weight
  • Float Gumbel-softmax for the attention-selection gradient (absent at inference)
  • Float cross-entropy loss over integer popcount logits
  • Float AdamW optimizer state

None of these concessions are present in the deployed model; they're all paid once at training time.

Deployment pipeline

A complete training → export → C-inference pipeline is working end-to-end:

  1. Training (train.py): PyTorch training with Gumbel hard-attention, torch.compile, and optional integer ALiBi path. Runs at 492 K tokens/sec on an RTX 5090 for 5M models, ~130 K tokens/sec for 21M models.

  2. Export (export_v18.py): Reads a trained checkpoint and writes a flat binary file:

    • 40-byte header with config
    • ±1 weight bits packed in uint64 words, row-major
    • int32 thresholds (one per output row of each BitLinear)
    • int32 ALiBi slopes (powers of 2)
    • int64 output bias (pre-scaled by 2^16 for exact integer argmax)

    Total: 738 KB for 5.3M params, 2.8 MB for 21M params.

  3. C inference (infer_omp.c): Single-file C, ~300 LOC, compiles with gcc -O3 -march=native -fopenmp. Uses AVX-512 VPOPCNTQ, 256-bit XNORs, KV cache, integer argmax reductions, and OpenMP parallelism. Four progressively more optimized versions ship:

    | Version | Optimizations | Time (100 tok) | Speedup |
    | --- | --- | --- | --- |
    | infer | baseline scalar, no KV cache | 905 ms | 1× |
    | infer_kv | + KV cache | 17 ms | 53× |
    | infer_simd | + AVX-512 VPOPCNTQ | 13 ms | 70× |
    | infer_omp | + OpenMP (4 threads) | 7 ms | 130× |

    All four versions produce byte-identical output on the same prompt.

  4. Verification (verify_binary_inference.py): Runs the Python float path and the Python integer path on the same inputs and confirms 100% next-token argmax agreement. Verified at both early (step-500) and converged (step-3000 and later) checkpoints.
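
The weight section of the export format can be sketched with the standard `struct` module. This is a minimal illustration, not the layout written by export_v18.py: the bit convention (1 for +1), the little-endian byte order, and the zero padding of the last word are all assumptions, and the 40-byte header is left out because its fields are not documented here.

```python
import struct

def pack_row(row):
    """Pack one ±1 weight row into little-endian uint64 words (bit = 1 for +1)."""
    words = []
    for start in range(0, len(row), 64):
        word = 0
        for bit, w in enumerate(row[start:start + 64]):
            if w == 1:
                word |= 1 << bit
        words.append(word)  # a short final chunk is zero-padded
    return struct.pack(f"<{len(words)}Q", *words)

def unpack_row(blob, n):
    """Inverse of pack_row for a row of n weights."""
    words = struct.unpack(f"<{len(blob) // 8}Q", blob)
    return [1 if (words[i // 64] >> (i % 64)) & 1 else -1 for i in range(n)]

row = [1 if (i * 7) % 3 else -1 for i in range(100)]
blob = pack_row(row)
print(len(blob))                     # 100 weights fit in two uint64 words
print(unpack_row(blob, 100) == row)  # round-trip check
```

Packing row-major in 64-bit words means a C reader can XNOR a whole word of weights against a word of activations and popcount the result in one step.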

Training-side optimizations measured

For the 5M v18 model at 1500 steps with full data pipeline:

| Configuration | Step time | Throughput | BPC |
| --- | --- | --- | --- |
| Baseline | 67 ms | 244 K tok/s | 2.27 |
| torch.compile | 33 ms | 492 K tok/s | 2.26 |

torch.compile gives a 2× wall-clock speedup with essentially identical convergence (2.26 vs 2.27 BPC). It was blocked initially by a subtle retracing bug: the Gumbel τ was stored in a Python dict instead of a CUDA tensor buffer, which was fixed by switching to a mutable tensor.

Bop optimizer (Helwegen et al. 2019) was tested as a truly-±1 training alternative: weights stay strictly ±1 at every training step, with flips decided by a momentum threshold. At matched 1500 steps, Bop hit 2.97 BPC vs Adam+STE's 2.26. Bop converges about 3× slower but provides the "no float latent" philosophical purity if that matters for a publication claim.
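
For reference, the Bop flip rule can be written in a few lines. This is a sketch of the rule as described by Helwegen et al., not the contents of optim_bop.py; the function name and the `gamma` and `tau` values used in the example are illustrative assumptions.

```python
def bop_step(weights, momentum, grads, gamma, tau):
    """One Bop update on lists of ±1 weights, modified in place.

    momentum is an exponential moving average of the gradient. A weight flips
    only when that average is strong enough (|m| > tau) and has the same sign
    as the weight, meaning the loss would decrease if the weight changed sign.
    """
    for i, g in enumerate(grads):
        momentum[i] = (1.0 - gamma) * momentum[i] + gamma * g
        m = momentum[i]
        if abs(m) > tau and (m > 0) == (weights[i] > 0):
            weights[i] = -weights[i]
    return weights, momentum

w = [1, -1, 1]
m = [0.0, 0.0, 0.0]
bop_step(w, m, grads=[1.0, 1.0, -1.0], gamma=0.1, tau=0.05)
print(w)  # only the first weight flips: gradient and weight share a sign
```

There is no float latent weight anywhere in this update; the only float state is the momentum buffer, which is what makes the optimizer "truly ±1" on the weight side.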

What's proven

  1. Binary char-LMs can cross the 2.0 BPC threshold at modest (5–21M) parameter counts with the right architecture. The whitepaper's H1 hypothesis is confirmed.

  2. The binding constraint is attention selection, not gradient flow. Every architectural attempt that kept bool-threshold attention plateaued near 3.2 BPC, regardless of what else was changed (residual form, width, depth, optimizer). Only replacing the attention selector with a Gumbel-softmax one-hot mechanism broke through.

  3. Inference is deployable with no floating-point ALU. Every operation on the deployed hot path is realizable on INT1 tensor cores, FPGAs, or spike-based neuromorphic substrates. The proof is a self-contained C file that produces byte-identical results to Python.

  4. Scale monotonically improves BPC in the tested range (5M → 21M). The diminishing returns from v20 → v21 most likely reflect a binding training budget rather than a capacity limit.

What's open / next

For the 100M–500M scale POC:

  1. Bit-packed weight storage during training. At 100M ±1 params in fp32, weights alone are 400 MB; with Adam state and gradients, optimizer memory hits several GB. Packing weights as 1-bit in-training (materialized to fp32 only inside the BitLinear forward via a Triton kernel) reduces weight memory 32× and makes 100M–1B scale tractable on a single 5090.

  2. Triton XNOR-popcount GEMM kernel for training. Current fp32 "fake binary" matmul wastes memory bandwidth. A real INT1 kernel should give 3–5× matmul speedup at 100M+ scale.

  3. Longer training budget. Our 12k-step runs are training-budget-bound. The v20 → v21 gain of only −0.01 BPC at fixed steps suggests 50k+ steps would unlock more of the larger model.

  4. FP32 teacher distillation. A ~5M FP32 teacher at 0.96 BPC could close roughly half of the teacher-student gap for the binary student, plausibly pushing the 21M model from 1.47 to 1.1–1.3 BPC at modest extra cost.

  5. Real-world text. TinyStories is a simplified corpus. The next meaningful experiment is a 100M model on enwik8 or a FineWeb-edu subset, where the whitepaper's H1 was originally formulated.

  6. Scaling-law map. Train v18-arch models at 5M, 25M, 100M, 500M with matched compute per param, fit a BPC(params) curve. This produces a publishable scaling-law plot for strict ±1 LMs that the literature doesn't currently have.
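
The storage figures in item 1 follow from simple arithmetic, reproduced here as a sanity check. The function name is hypothetical, and the estimate covers weight storage only; gradients, Adam moments, and activations are deliberately left out.

```python
def weight_bytes(n_params: int) -> dict:
    """Weight storage for n_params values in fp32 versus packed 1-bit."""
    return {
        "fp32": n_params * 4,     # 4 bytes per latent float weight
        "packed": n_params // 8,  # 1 bit per deployed ±1 weight
    }

mem = weight_bytes(100_000_000)
print(mem["fp32"] / 1e6, "MB as fp32")       # 400.0
print(mem["packed"] / 1e6, "MB packed")      # 12.5
print(mem["fp32"] // mem["packed"], "x smaller")  # 32
```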

Concrete deliverables

Everything is checked into /home/nathan/1bitllm/ (local) and mirrored to /root/bitnet1/ on the experimentation VM:

  • Model code: model.py, model_v3.py, model_v4.py, …, model_v18.py (each variant self-contained)
  • Training: train.py (PyTorch, all variants, torch.compile support), train_fp32.py (reference)
  • Optimizers: optim.py (SignSGD-counter), optim_bop.py (Bop)
  • Export: export_v18.py (packed binary format)
  • C inference: infer.c, infer_kv.c, infer_simd.c, infer_omp.c (4 progressively optimized versions)
  • Verification: verify_binary_inference.py, sample_ckpt.py
  • Data prep: prep_data.py (TinyStories char-level byte memmap)
  • Checkpoints: ckpt/v17_gumbel_wide_last.pt, ckpt/v20_scale_d512_last.pt, ckpt/v21_scale_21M_last.pt + exported .bin files
  • Logs: logs/*.jsonl (full training trajectories, grad diagnostics, and sample outputs for every run)

The architecture and deployment pipeline are ready to scale. The remaining engineering for 100M–500M is custom Triton kernels for bit-packed memory layout, not fundamental architecture changes.