A pure-binary character-level language model: POC progress report
Scope: we set out to test the maximalist claim from the ±1 LM whitepaper, namely that a language model can be built where every weight, activation, and inference-time operation is reducible to binary/integer arithmetic. We started from a 5M-parameter TinyStories character model and scaled to 21M. This report covers what was attempted, what worked, what didn't, and where to go next for the 100M-500M scale-up.
Headline results
| Model | Val BPC | Ratio to FP32 | Inference |
|---|---|---|---|
| FP32 reference (5.3M params, standard transformer) | 0.96 | 1.0× | full FP |
| v21 (21M params, v18 arch, best model) | 1.47 | 1.53× | pure integer |
| v20 (~12M params, v18 arch) | 1.48 | 1.54× | pure integer |
| v17 (5M params, v18 arch) | 1.68 | 1.75× | pure integer |
| POC plateau (v3, standard bool-threshold attention) | 3.20 | 3.33× | pure integer |
The best model (v21, 21M params) produces qualitatively readable English on TinyStories with dialogue, character names, and coherent sentence structure, while running inference using only XNOR, popcount, integer compare, and gather operations. There is no floating-point arithmetic on the hot path, and the integer path is verified to produce byte-identical predictions to the float path on 16 384 test positions.
Deployed inference is 8 200 tokens/second on a single AMD Ryzen 9 (4 OpenMP
threads) for the 21M model, with weights stored as 2.8 MB of packed bits. All
float scalars from training (1/√d scaling, ALiBi slopes, logit temperature,
bias vector) are absorbed into integer thresholds at checkpoint-load time.
The research journey: 21 architectural variants
The whitepaper identified "selective sparsity in attention" as the binding constraint for binary LMs. We iterated through 21 concrete variants trying to close that gap at ≤5M params on TinyStories char-level. In chronological order:
Phase 1: Establish the plateau (v2-v4)
- v2, fully ±1 with BiBERT bool-threshold attention: 3.62 BPC (had a +1 tie-break bias in majority-vote residuals that compounded across 8 layers)
- v3, fixed v2 with a 3-way parallel residual `sign(x + attn(x) + ffn(x))`: 3.20 BPC plateau; this became the reference "fully-±1 floor"
- v4, same as v3 but relaxed attention scores to softmax (float concession): 2.72 BPC; showed that attention was the limiting component
Conclusion: The 0.48 BPC gap between v3 and v4 quantified the cost of binary attention. Everything after this tried to close that gap without a float softmax.
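The tie-break problem in v2 and the reason a 3-way residual avoids it can be checked by brute force. The sketch below is illustrative only: it assumes a two-input majority vote for the v2-style case (the exact v2 residual form is not spelled out here), and the helper names `sign_with_tiebreak`, `count_ties`, and `mean_output` are hypothetical.

```python
from itertools import product


def sign_with_tiebreak(total):
    """Sign of an integer sum, mapping a 0 tie to +1 (the biased tie-break)."""
    return 1 if total >= 0 else -1


def count_ties(n_inputs):
    """Count ±1 input combinations whose sum is exactly zero."""
    return sum(1 for combo in product((-1, 1), repeat=n_inputs) if sum(combo) == 0)


def mean_output(n_inputs):
    """Average vote output over all ±1 input combinations (0.0 means unbiased)."""
    combos = list(product((-1, 1), repeat=n_inputs))
    return sum(sign_with_tiebreak(sum(c)) for c in combos) / len(combos)


if __name__ == "__main__":
    for n in (2, 3):
        # An even input count produces ties; an odd count never does.
        print(n, count_ties(n), mean_output(n))
```

With an even number of ±1 inputs, half of the combinations in this toy case tie and are all pushed to +1, so the output is skewed positive. With three inputs the sum is always odd, so no tie-break is ever needed.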
Phase 2: Sprint plan variants, all negative (v5-v8)
Following a detailed sprint plan from the paper's literature survey:
- v5, Hadamard rotation + learnable integer τ + 5-way residual with constant bias: 3.37 BPC (worse)
- v6, XOR-multiplicative residual `y = x ⊕ F(x)`: 3.57 BPC (worse); demonstrated that multiplicative composition strictly hurts
- v7, real 5-way parallel residual with multiple FFN branches: 3.31 BPC (neutral vs v3)
- v8, Sign-JL attention with fixed random projections: 3.33 BPC (matches v3 exactly, confirming the unbiasedness of JL estimation doesn't help when followed by a `sign()`)
Conclusion: The four top-ranked bets from the sprint plan all failed at matched step count. The gradient-flow hypothesis was wrong (v11, introduced in Phase 4, had grad_nz = 0.997 and still hit a worse plateau).
Phase 3: Different paradigms (v9-v10)
- v9, pure-binary spiking RWKV char-LM (no attention matrix, linear recurrence + LIF neurons): 3.38 BPC; confirmed the SpikeGPT/BitNet synthesis doesn't outperform standard binary attention at this scale
- v10, pure SDM char-LM (zero SGD, single-pass Hamming-ball retrieval): 6.12 BPC; confirmed the research report's own prediction that classical associative memory plateaus at n-gram quality
Phase 4: Capacity and training-dynamics isolation (v11-v13)
- v11, top-k ternary attention (`{-1, 0, +1}` per query, sparse): 3.49 BPC (strictly worse; STE through discrete top-k gives noisy gradient)
- v12, noise-annealed STE: unstable, hit uniform 6.9 BPC at high noise
- v13, time-multiplexed blocks (T=3 passes per token with random mask perturbations): 3.55 BPC plateau
Phase 5: Width sweep (v14-v15)
Testing the hypothesis that hidden-state capacity binds the plateau:
- v14, d_model=512, 4 layers (same 5M params as v3): 3.18 BPC (small improvement)
- v15, d_model=768, 2 layers: 3.20 BPC (width helps but depth matters)
Phase 6: The breakthrough (v16), Gumbel hard-attention
We tried replacing the bool-threshold attention matrix with a Gumbel-softmax one-hot selection:
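
    import torch
    import torch.nn.functional as F

    # At training: `scores` is a float tensor of shape [..., T_query, T_key]
    u = torch.rand_like(scores).clamp_min(1e-10)           # uniform noise, kept away from 0
    g = -torch.log(-torch.log(u))                          # Gumbel(0, 1) noise
    y_soft = F.softmax((scores + g) / tau, dim=-1)
    y_hard = F.one_hot(y_soft.argmax(dim=-1), scores.size(-1)).to(y_soft.dtype)
    A = y_soft + (y_hard - y_soft).detach()                # straight-through

    # At inference:
    A = F.one_hot(scores.argmax(dim=-1), scores.size(-1))  # pure argmax
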
Each query attends to exactly one key. Attention matrix becomes binary {0, 1}
with one 1 per row. The temperature τ anneals from 2.0 to 0.1 over training.
Result at step 1500: 2.14 BPC, vs v3's 2.14 at step 13500. At step 3500, v16 crossed below 2.0 BPC, the whitepaper's H1 target. Final v16 at step 10k: 1.72 BPC.
The diagnosis of why this worked was surprising. Grad-flow (grad_nz) went from 0.79 (v3) to 0.997 (v16), but v11 also had 0.997 and hit a 3.49 BPC plateau. The difference was gradient quality: Gumbel-softmax gives a proper continuous gradient through the attention selection, while STE-through-top-k gives the direction but not the magnitude of the desired update.
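One property of the Gumbel selection is easy to check in isolation: for a fixed noise draw, the hard (forward) choice is the argmax of `scores + g` and does not depend on τ, while the soft weights that carry the gradient sharpen as τ is annealed down. The following is a minimal pure-Python sketch of that behaviour for a single query row. The function names and the example scores are illustrative, not taken from `train.py`.

```python
import math
import random


def gumbel_noise(n, rng):
    """Draw n independent Gumbel(0, 1) samples via -log(-log(U))."""
    out = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        out.append(-math.log(-math.log(u)))
    return out


def softmax(logits, tau):
    """Temperature softmax, shifted by the max for numerical stability."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_select(scores, noise, tau):
    """Return (soft weights, hard index) for one query row and one noise draw."""
    noisy = [s + g for s, g in zip(scores, noise)]
    hard_index = max(range(len(noisy)), key=lambda i: noisy[i])  # independent of tau
    return softmax(noisy, tau), hard_index


if __name__ == "__main__":
    scores = [3.0, 1.0, 4.0, 1.0, 5.0]
    noise = gumbel_noise(len(scores), random.Random(0))
    for tau in (2.0, 0.5, 0.1):  # the annealing range quoted above
        y_soft, idx = gumbel_select(scores, noise, tau)
        print(tau, idx, round(max(y_soft), 3))
```

Under this reading, annealing τ changes how much gradient reaches the non-selected keys, not which key is selected on the forward pass.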
Phase 7: Combine wins (v17)
- v17, v16 with d_model=512 (width of v14 + attention of v16): 1.68 BPC at 10k steps, our best 5M-scale result.
Phase 8: Deployment with pure-integer inference (v18)
v16/v17's inference path still had a few float scalars: 1/√d_head score
scaling, fractional ALiBi slopes, float logit_scale/out_bias. All are
positive-monotone under their argmax, so they can be absorbed into integer
thresholds without changing any prediction. v18 makes this explicit:
- Integer ALiBi slopes (powers of 2)
- BitLinear thresholds stored as `ceil(threshold_float · √in_features)`
- Output bias stored as `round(out_bias · 2^16 / logit_scale)` in INT64
- Attention matrix as `argmax(int_scores)` → one-hot gather from V
Verified to 100.0% byte-identical predictions to the float path over 16 384 test positions. v18 at step 10k: 1.74 BPC.
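The threshold absorption can be sanity-checked on a toy case. The sketch below assumes the float path computes `sum / sqrt(in_features) >= threshold_float` on an integer popcount sum; the comparison direction and the function names are assumptions for illustration. It uses `in_features = 64` so that the square root is exact in floating point and the equivalence holds without rounding caveats.

```python
import math


def float_path_bit(popcount_sum, in_features, threshold_float):
    """Float-style BitLinear output: scale the integer sum, then compare."""
    return 1 if popcount_sum / math.sqrt(in_features) >= threshold_float else -1


def int_threshold(in_features, threshold_float):
    """Fold the 1/sqrt(in_features) scale into an integer threshold."""
    return math.ceil(threshold_float * math.sqrt(in_features))


def int_path_bit(popcount_sum, threshold_int):
    """Integer-only BitLinear output: a single integer compare."""
    return 1 if popcount_sum >= threshold_int else -1


def mismatches(in_features, thresholds):
    """Count (sum, threshold) pairs where the float and integer paths disagree."""
    bad = 0
    for t in thresholds:
        t_int = int_threshold(in_features, t)
        for s in range(-in_features, in_features + 1):
            if float_path_bit(s, in_features, t) != int_path_bit(s, t_int):
                bad += 1
    return bad


if __name__ == "__main__":
    thresholds = [-0.5, -0.3, 0.0, 0.1, 0.375, 0.9]
    print([int_threshold(64, t) for t in thresholds])
    print(mismatches(64, thresholds))
```

Because the sum is an integer, `s >= t * sqrt(n)` and `s >= ceil(t * sqrt(n))` select the same set of sums, which is why the scale can be dropped from the hot path.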
Phase 9: Scale (v20-v21)
With the architecture and inference pipeline locked, scaling up:
- v20, d_model=512, 6 layers, d_ff=768 (~12M params), 12k steps: 1.48 BPC (−0.20 vs v17)
- v21, d_model=512, 8 layers, d_ff=1024 (~21M params), 12k steps: 1.47 BPC (−0.01 vs v20)
Scaling wins are large from 5M→12M and diminishing from 12M→21M at fixed step budget, so we're training-budget-bound more than capacity-bound at this size. Longer training would likely unlock more v21 capacity.
The final architecture: v18 in one page

    Input: char IDs (int) of length T

    Embedding:
      x = gather_row(embed_codebook_±1, char_id)       # ±1 vector of d_model

    For each of L layers:
      Token mixer (Gumbel hard-attention):
        Q, K, V = three BitLinear projections of x     # ±1 d_model each
        scores  = Q · Kᵀ                               # integer popcount
        scores -= int_alibi_slope · |i − j|            # integer subtract
        at train: A = gumbel_softmax_straight_through(scores, τ)
        at eval:  A = one_hot(argmax(scores))          # single integer argmax per query
        O = gather(V, argmax_index)                    # pointer, no multiply
        attn_out = BitLinear(O)                        # ±1
      Channel mixer (XNOR-gated FFN):
        g, u = BitLinear_gate(x), BitLinear_up(x)      # ±1 each, d_ff
        h = g XNOR u                                   # ±1, d_ff
        ffn_out = BitLinear_down(h)                    # ±1, d_model
      Residual (3-way majority):
        x = sign(x + attn_out + ffn_out)               # {-3,-1,1,3}→sign, no ties

    Output:
      logits = popcount(x · embed_codebookᵀ) · 2^16 + int_out_bias   # int64
      next_char = argmax(logits)                       # integer argmax over vocab

Forward invariants:
- All weights are stored as 1-bit signed (`+1/−1`) in the deployed model
- All activations between blocks are 1-bit signed
- Attention matrix is binary (one-hot per row)
- The only "float" at inference is a single `int64` accumulator per output token
Training concessions:
- Float latent weight (small Gaussian) → `sign_ste` yields the ±1 forward weight
- Float Gumbel-softmax for the attention-selection gradient (absent at inference)
- Float cross-entropy loss over integer popcount logits
- Float AdamW optimizer state
None of these concessions are present in the deployed model; they're all paid once at training time.
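The eval-time token mixer above reduces to three integer steps per query: an XNOR-popcount dot product, an integer ALiBi subtract, and an argmax followed by a gather. Below is a small pure-Python sketch of that path for one head and one query. It makes several assumptions that are not stated in the report: bit value 1 encodes +1, only keys at or before the query position are considered (causal masking), and score ties go to the earliest key. The function names are illustrative.

```python
def pack_bits(vec):
    """Pack a list of ±1 values into one Python int (bit i = 1 means element i is +1)."""
    word = 0
    for i, v in enumerate(vec):
        if v == 1:
            word |= 1 << i
    return word


def dot_pm1(a_bits, b_bits, n):
    """±1 dot product of two packed n-element vectors: n - 2 * popcount(a XOR b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")


def hard_attention_index(q_bits, key_bits, query_pos, n, alibi_slope):
    """Return the one key position this query attends to, using integer ops only.

    Assumptions of this sketch: causal masking (keys 0..query_pos only) and
    ties resolved in favour of the earliest key.
    """
    best_idx, best_score = 0, None
    for j in range(query_pos + 1):
        score = dot_pm1(q_bits, key_bits[j], n) - alibi_slope * (query_pos - j)
        if best_score is None or score > best_score:
            best_idx, best_score = j, score
    return best_idx


if __name__ == "__main__":
    n = 8
    q = [1, 1, -1, 1, -1, -1, 1, 1]
    k0 = list(q)               # identical to q: dot = 8
    k1 = [-v for v in q]       # opposite of q: dot = -8
    k2 = [-q[0]] + q[1:]       # one element flipped: dot = 6
    keys = [pack_bits(k) for k in (k0, k1, k2)]
    values = ["V0", "V1", "V2"]  # stand-ins for packed V rows
    for slope in (0, 2):
        idx = hard_attention_index(pack_bits(q), keys, query_pos=2, n=n, alibi_slope=slope)
        print(slope, idx, values[idx])  # the gather is a lookup, no multiply
```

With slope 0 the query picks the exact match at position 0. With slope 2 the distance penalty outweighs the 2-point score difference and the query picks the nearer, slightly worse match at position 2.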
Deployment pipeline
A complete training → export → C-inference pipeline is working end-to-end:
Training (`train.py`): PyTorch training with Gumbel hard-attention, torch.compile, and optional integer ALiBi path. Runs at 492 K tokens/sec on an RTX 5090 for 5M models, ~130 K tokens/sec for 21M models.

Export (`export_v18.py`): Reads a trained checkpoint and writes a flat binary file:
- 40-byte header with config
- ±1 weight bits packed in `uint64` words, row-major
- `int32` thresholds (one per output row of each BitLinear)
- `int32` ALiBi slopes (powers of 2)
- `int64` output bias (pre-scaled by 2^16 for exact integer argmax)

Total: 738 KB for 5.3M params, 2.8 MB for 21M params.

C inference (`infer_omp.c`): Single-file C, ~300 LOC, compiles with `gcc -O3 -march=native -fopenmp`. Uses AVX-512 `VPOPCNTQ`, 256-bit XNORs, KV cache, integer argmax reductions, and OpenMP parallelism. A baseline and three progressively more optimized versions ship:

| Version | Optimizations | Speed (100 tok) | Speedup |
|---|---|---|---|
| `infer` | baseline scalar, no KV cache | 905 ms | 1× |
| `infer_kv` | + KV cache | 17 ms | 53× |
| `infer_simd` | + AVX-512 VPOPCNTQ | 13 ms | 70× |
| `infer_omp` | + OpenMP (4 threads) | 7 ms | 130× |

All four versions produce byte-identical output on the same prompt.

Verification (`verify_binary_inference.py`): Runs the Python float path and the Python integer path on the same inputs, confirms 100% next-token argmax agreement. Pre-verified at both small (step-500) and converged (step-3000 and later) checkpoints.
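The weight section of the export format (±1 bits packed in `uint64` words, row-major) can be sketched as follows. This is not the `export_v18.py` code: the bit order within a word, the little-endian byte order, the zero padding of the last word, and the function names are all assumptions made for illustration, and the 40-byte header is deliberately not reproduced.

```python
import struct


def pack_row(row):
    """Pack a ±1 row into little-endian uint64 words, zero-padding the last word.

    Bit i of word w holds element 64*w + i, with 1 meaning +1 (assumed convention).
    """
    words = []
    for start in range(0, len(row), 64):
        word = 0
        for i, v in enumerate(row[start:start + 64]):
            if v == 1:
                word |= 1 << i
        words.append(word)
    return struct.pack(f"<{len(words)}Q", *words)


def unpack_row(buf, n):
    """Recover the first n ±1 values from a buffer written by pack_row."""
    words = struct.unpack(f"<{len(buf) // 8}Q", buf)
    return [1 if (words[i // 64] >> (i % 64)) & 1 else -1 for i in range(n)]


def packed_weight_bytes(rows, cols):
    """Bytes needed for a rows x cols ±1 matrix stored row-major in uint64 words."""
    return rows * ((cols + 63) // 64) * 8


if __name__ == "__main__":
    row = [1 if (i * 7) % 3 == 0 else -1 for i in range(70)]
    buf = pack_row(row)
    print(len(buf), unpack_row(buf, 70) == row)
    print(packed_weight_bytes(512, 512))  # one 512 x 512 BitLinear matrix
```

For scale, 21M one-bit weights alone come to roughly 2.6 MB, which is roughly in line with the 2.8 MB total reported above once thresholds, slopes, and the output bias are added.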
Training-side optimizations measured
For the 5M v18 model at 1500 steps with full data pipeline:
| Configuration | Step time | Throughput | BPC |
|---|---|---|---|
| Baseline | 67 ms | 244 K tok/s | 2.27 |
| `torch.compile` | 33 ms | 492 K tok/s | 2.26 |
`torch.compile` is a 2× wallclock speedup with near-identical convergence (2.26 vs 2.27 BPC). It was blocked initially by a subtle retracing bug: Gumbel τ was stored in a Python dict instead of a CUDA tensor buffer, which we fixed by switching to a mutable tensor.
Bop optimizer (Helwegen et al. 2019) was tested as a truly-±1 training alternative: weights stay strictly ±1 at every training step, with flips decided by a momentum threshold. At matched 1500 steps, Bop hit 2.97 BPC vs Adam+STE's 2.26. Bop converges about 3× slower but provides the "no float latent" philosophical purity if that matters for a publication claim.
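For readers unfamiliar with Bop, its update rule as described by Helwegen et al. can be sketched in a few lines: keep an exponential moving average of the gradient, and flip a weight when that average is large enough and has the same sign as the weight. The code below is a plain-Python illustration, not the contents of `optim_bop.py`. The values of `gamma` and `tau` are arbitrary placeholders, and this `tau` is the Bop flip threshold, unrelated to the Gumbel temperature.

```python
def bop_step(weights, momentum, grads, gamma, tau):
    """One in-place Bop update on lists of ±1 weights and float momenta.

    The momentum is an exponential moving average of the gradient. A weight is
    flipped when the average exceeds tau in magnitude and shares the weight's sign.
    """
    for i, g in enumerate(grads):
        momentum[i] = (1.0 - gamma) * momentum[i] + gamma * g
        same_sign = (momentum[i] > 0) == (weights[i] > 0)
        if abs(momentum[i]) > tau and same_sign:
            weights[i] = -weights[i]
    return weights, momentum


if __name__ == "__main__":
    w = [1, -1, 1]
    m = [0.0, 0.0, 0.0]
    # Illustrative hyperparameters only; not the values used in the report.
    bop_step(w, m, grads=[1.0, 1.0, 0.001], gamma=0.5, tau=0.25)
    print(w, m)
```

The weights never leave {−1, +1}, which is the "no float latent" property mentioned above. The only float state is the per-weight momentum.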
What's proven
Binary char-LMs can cross the 2.0 BPC threshold at modest (5-21M) parameter counts with the right architecture. The whitepaper's H1 hypothesis is confirmed.
The binding constraint is attention selection, not gradient flow. Every architectural attempt that kept bool-threshold attention plateaued near 3.2 BPC, regardless of what else was changed (residual form, width, depth, optimizer). Only replacing the attention selector with a Gumbel-softmax one-hot mechanism broke through.
Inference is deployable with no floating-point ALU. Every operation on the deployed hot path is realizable on INT1 tensor cores, FPGAs, or spike-based neuromorphic substrates. The proof is a self-contained C file that produces byte-identical results to Python.
Scale monotonically improves BPC in the tested range (5M → 21M). v20 → v21 showed diminishing returns, which we attribute to the fixed training budget rather than to capacity.
What's open / next
For the 100M-500M scale POC:
Bit-packed weight storage during training. At 100M ±1 params in fp32, weights alone are 400 MB; with Adam state and gradients, optimizer memory hits several GB. Packing weights as 1-bit in-training (materialized to fp32 only inside the BitLinear forward via a Triton kernel) reduces weight memory 32× and makes 100M-1B scale tractable on a single 5090.
Triton XNOR-popcount GEMM kernel for training. Current fp32 "fake binary" matmul wastes memory bandwidth. A real INT1 kernel should give 3-5× matmul speedup at 100M+ scale.
Longer training budget. Our 12k-step runs are training-budget-bound. The v20→v21 gain of only −0.01 BPC at fixed steps suggests 50k+ steps would unlock more of the larger model.
FP32 teacher distillation. A ~5M FP32 teacher at 0.96 BPC could close roughly half of the teacher-student gap for the binary student, plausibly pushing the 21M model from 1.47 to 1.1-1.3 BPC at modest extra cost.
Real-world text. TinyStories is a simplified corpus. The next meaningful experiment is a 100M model on enwik8 or a FineWeb-edu subset, where the whitepaper's H1 was originally formulated.
Scaling-law map. Train v18-arch models at 5M, 25M, 100M, 500M with matched compute per param, fit a BPC(params) curve. This produces a publishable scaling-law plot for strict ±1 LMs that the literature doesn't currently have.
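The memory figures in the bit-packing item follow from simple arithmetic, sketched here. The fp32 training-state estimate assumes one fp32 copy each for weights and gradients plus two fp32 Adam moments, and ignores activations; that breakdown and the function names are assumptions of this sketch, not measurements.

```python
def tensor_bytes(n_params, bits_per_value):
    """Bytes needed to store n_params values at the given bit width."""
    return n_params * bits_per_value // 8


def fp32_training_state_bytes(n_params):
    """fp32 weights + fp32 gradients + two fp32 Adam moments (assumed layout)."""
    return 4 * tensor_bytes(n_params, 32)


if __name__ == "__main__":
    for n in (100_000_000, 500_000_000):
        fp32_w = tensor_bytes(n, 32)
        packed_w = tensor_bytes(n, 1)
        print(n, fp32_w, packed_w, fp32_w // packed_w, fp32_training_state_bytes(n))
```

At 100M parameters this gives 400 MB of fp32 weights against 12.5 MB packed, the 32× factor quoted above. Under the stated assumptions the full fp32 training state is about 1.6 GB at 100M and about 8 GB at 500M.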
Concrete deliverables
Everything is checked into /home/nathan/1bitllm/ (local) and mirrored to
/root/bitnet1/ on the experimentation VM:
- Model code: `model.py`, `model_v3.py`, `model_v4.py`, …, `model_v18.py` (each variant self-contained)
- Training: `train.py` (PyTorch, all variants, torch.compile support), `train_fp32.py` (reference)
- Optimizers: `optim.py` (SignSGD-counter), `optim_bop.py` (Bop)
- Export: `export_v18.py` (packed binary format)
- C inference: `infer.c`, `infer_kv.c`, `infer_simd.c`, `infer_omp.c` (4 progressively optimized versions)
- Verification: `verify_binary_inference.py`, `sample_ckpt.py`
- Data prep: `prep_data.py` (TinyStories char-level byte memmap)
- Checkpoints: `ckpt/v17_gumbel_wide_last.pt`, `ckpt/v20_scale_d512_last.pt`, `ckpt/v21_scale_21M_last.pt` + exported `.bin` files
- Logs: `logs/*.jsonl` (full training trajectories, grad diagnostics, and sample outputs for every run)

The architecture and deployment pipeline are ready to scale. The remaining engineering for 100M-500M is custom Triton kernels for bit-packed memory layout, not fundamental architecture changes.