Joysulem committed on
Commit 9cdc2ab (verified)
1 Parent(s): f849236

Create Bible Readme

Bible Readme ADDED
@@ -0,0 +1,667 @@
1
+ ================================================================================
2
+ FIREECHO ENGINE
3
+ High-Performance Single-GPU Inference Kernel for 30B+ MoE Models
4
+ ================================================================================
5
+
6
+ Creator & Sole Author: Luis E. Davila Flores (@Joysulem)
7
+ License: CC BY-NC 4.0 (free for research, attribution required)
8
+ Status: Production-quality single-GPU decode, research extensions active
9
+
10
+ ================================================================================
11
+ WHAT IS FIREECHO?
12
+ ================================================================================
13
+
14
+ FireEcho is a custom inference engine that runs a 30 BILLION parameter
15
+ Mixture-of-Experts model (Qwen3-Omni-30B) on a SINGLE consumer GPU at
16
+ 45+ tokens/second, using only 20 GB VRAM.
17
+
18
+ No multi-GPU. No cloud. No NVIDIA proprietary libraries.
19
+ Just Triton + PyTorch + one GPU.
20
+
21
+ Key numbers:
22
+ - 30.5B total params, ~3.3B active per token (128 experts, top-8 routing)
23
+ - 4x compression of expert weights via Goliath FP4 fused dequant-matmul (61 GB -> 20 GB overall, about 3x)
24
+ - 124x speedup from baseline (0.4 -> 49.4 tok/s) through 7 optimization layers
25
+ - Zero NVIDIA proprietary dependencies (no cuQuantizer, CUTLASS, TensorRT)
26
+ - Designed to run anywhere Triton compiles (NVIDIA CUDA, AMD ROCm, Intel XPU); tested primarily on NVIDIA
27
+
28
+ What makes FireEcho different from vLLM/TGI/llama.cpp:
29
+ - Goliath kernel: FP4 dequantization INSIDE the Triton matmul loop (no separate
30
+ dequant step, no global memory materialization)
31
+ - Packed MoE: All 128 experts packed into one contiguous buffer per layer,
32
+ expert IDs stay on GPU, so there is zero CPU-GPU sync during decode
33
+ - FlashDecode: Custom Triton attention kernel with online softmax for M=1 GQA
34
+ - Hebbian Memory: Biologically-inspired fast weights that learn at inference time
35
+ - FE-XC/INT2: Cold experts auto-demote to 2-bit (codebook or scalar) for further
36
+ bandwidth savings without touching hot experts
37
+ - CUDA Graph decode: Entire decode step captured as a graph, ~15.8ms/step
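The online-softmax idea named in the FlashDecode bullet can be illustrated outside Triton. The sketch below is plain Python and is not the FireEcho kernel: it reads keys and values in blocks, keeps a running max and running sum, and never materializes the full score vector for the single query (M=1). Function names, the block size, and the toy data layout are assumptions chosen for illustration.

```python
import math

def online_softmax_attention(q, keys, values, valid_len, block=4):
    """Attention for one query, reading KV in blocks with a running max and sum."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, valid_len, block):
        stop = min(start + block, valid_len)  # entries past valid_len are never read
        scores = [scale * sum(a * b for a, b in zip(q, keys[i]))
                  for i in range(start, stop)]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new running max.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [x * correction for x in acc]
        for offset, score in enumerate(scores):
            weight = math.exp(score - new_max)
            running_sum += weight
            acc = [x + weight * v for x, v in zip(acc, values[start + offset])]
        running_max = new_max
    return [x / running_sum for x in acc]

def full_softmax_attention(q, keys, values, valid_len):
    """Reference: materialize every score, then normalize once."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in range(valid_len)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * values[i][d] for i, w in enumerate(weights)) / total
            for d in range(len(values[0]))]
```

Both functions should agree to floating-point precision; the blocked version only ever holds one block of scores at a time, which is the property a decode kernel relies on.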
38
+
39
+ ================================================================================
40
+ CURRENT STATUS & REALISTIC EXPECTATIONS
41
+ ================================================================================
42
+
43
+ WHAT WORKS (production-quality):
44
+ [x] Full Qwen3-Omni-30B inference at 45+ tok/s on RTX 5090
45
+ [x] Goliath FP4 quantization (20 GB VRAM, FP16-quality output)
46
+ [x] Packed MoE with fused dequant-matmul (zero CPU sync)
47
+ [x] FlashDecode attention (online softmax, valid_len masking)
48
+ [x] CUDA Graph decode (graph-captured forward pass)
49
+ [x] Flat KV cache (pre-allocated, zero allocation per token)
50
+ [x] FP8 KV cache (50% VRAM savings on attention)
51
+ [x] FE-XC cold expert demotion (codebook 2-bit, 5.3x faster kernel)
52
+ [x] INT2 ultra-cold expert demotion (scalar 2-bit)
53
+ [x] Hebbian persistent memory (learns during inference)
54
+ [x] Atlas gatekeeper (expert banning + MoDES skipping)
55
+ [x] Streaming shard loader (110s cold start, 3.1 GB CPU RAM)
56
+
57
+ WHAT'S RESEARCH/EXPERIMENTAL:
58
+ [ ] EAGLE-3 speculative decoding (infrastructure done, head needs training)
59
+ [ ] FE-XT tree speculation (code done, needs trained draft head)
60
+ [ ] FE-H Hayabusa async prefetch (code done, needs benchmarking)
61
+ [ ] Batched speculative decode (infrastructure done)
62
+ [ ] Multi-GPU (not implemented; single-GPU is the design philosophy)
63
+
64
+ WILL NOT WORK ON:
65
+ - GPUs with < 24 GB VRAM (model is 20 GB + KV cache)
66
+ - CUDA < 12.4 (BF16 atomics, FP8 support needed)
67
+ - CPU-only (Triton compiles to GPU targets)
68
+
69
+ ================================================================================
70
+ HARDWARE & SOFTWARE REQUIREMENTS
71
+ ================================================================================
72
+
+ Component          Minimum                Recommended
+ ─────────────────  ─────────────────────  ──────────────────────────
+ GPU                RTX 4090 (24 GB)*      RTX 5090 (32 GB)
+ GPU VRAM           24 GB                  32 GB
+ CPU                Any modern x86_64      Ryzen 9 9950X / i9-14900K
+ System RAM         32 GB                  64 GB
+ CUDA               12.4+                  12.8+
+ Python             3.10 - 3.12            3.12
+ PyTorch            2.4.0+                 2.6.0+cu128
+ Triton             3.0+                   3.2+
+ OS                 Linux (x86_64)         Arch Linux / Ubuntu 22.04+
84
+
85
+ * RTX 4090: Will work but FP4 kernels may be slower (no Blackwell tensor cores)
86
+ * RTX 3090: Marginal. 24 GB VRAM is tight, and FP8 is not supported
87
+ * AMD GPUs: Triton compiles to ROCm; untested but architecturally supported
88
+
89
+ Tested configuration (author's machine):
90
+ AMD Ryzen 9 9950X + NVIDIA RTX 5090 32 GB + 64 GB DDR5
91
+ Arch Linux, CUDA 12.8, Python 3.12, PyTorch 2.6.0+cu128, Triton 3.2
92
+
93
+ ================================================================================
94
+ INSTALLATION
95
+ ================================================================================
96
+
97
+ Step 1: Clone the repository
98
+ ─────────────────────────────
99
+ git clone https://github.com/Joysulem/FireEcho.git
100
+ cd FireEcho
101
+
102
+ Step 2: Create a Python virtual environment
103
+ ────────────────────────────────────────────
104
+ python3.12 -m venv .venv
105
+ source .venv/bin/activate
106
+
107
+ Step 3: Install dependencies
108
+ ─────────────────────────────
109
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
110
+ pip install triton transformers tokenizers safetensors sentencepiece
111
+
112
+ Step 4: Verify installation
113
+ ────────────────────────────
114
+ python -c "import torch; print('CUDA:', torch.cuda.is_available(), '|', torch.version.cuda)"
115
+ python -c "import triton; print('Triton:', triton.__version__)"
116
+
117
+ Expected output:
118
+ CUDA: True | 12.8
119
+ Triton: 3.2.0
120
+
121
+ Step 5: Download a model (Qwen3-Omni-30B recommended)
122
+ ──────────────────────────────────────────────────────
123
+ # Option A: Via huggingface-cli
124
+ pip install huggingface-hub
125
+ huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./model/Qwen3-Omni
126
+
127
+ # Option B: Via git lfs
128
+ git lfs install
129
+ git clone https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct ./model/Qwen3-Omni
130
+
131
+ ================================================================================
132
+ QUICK SMOKE TEST (run this first!)
133
+ ================================================================================
134
+
135
+ cd FireEcho/kernel/FireEcho\ Engine/
136
+
137
+ python -c "
138
+ from fireecho_kernel import FireEchoEngine
139
+ import torch
140
+
141
+ # Load model (takes ~110 seconds, streams layer-by-layer)
142
+ engine = FireEchoEngine.from_pretrained('./model/Qwen3-Omni')
143
+
144
+ # Quick generation test
145
+ tokens = engine.tokenizer.encode('The capital of France is', return_tensors='pt').cuda()
146
+ output = engine.generate(tokens, max_new_tokens=20, temperature=0.0)
147
+ print(engine.tokenizer.decode(output[0]))
148
+ print(f'VRAM used: {torch.cuda.max_memory_allocated()/1e9:.1f} GB')
149
+ "
150
+
151
+ Expected output:
152
+ The capital of France is Paris. Paris is the largest city in France...
153
+ VRAM used: 23.1 GB
154
+
155
+ If this works, your setup is correct. If not, check:
156
+ - CUDA version matches PyTorch build (torch.version.cuda)
157
+ - GPU has enough VRAM (nvidia-smi)
158
+ - Model path is correct
159
+
160
+ ================================================================================
161
+ BASIC INFERENCE USAGE
162
+ ================================================================================
163
+
164
+ ─── Minimal example ───
165
+
166
+ from fireecho_kernel import FireEchoEngine
167
+
168
+ # Load model with FP4 quantization (automatic for Qwen3-Omni)
169
+ engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
170
+
171
+ # Encode input
172
+ input_ids = engine.tokenizer.encode(
173
+ "Explain quantum computing in simple terms",
174
+ return_tensors='pt'
175
+ ).cuda()
176
+
177
+ # Generate
178
+ output = engine.generate(
179
+ input_ids,
180
+ max_new_tokens=200,
181
+ temperature=0.7,
182
+ top_p=0.9,
183
+ use_cache=True
184
+ )
185
+
186
+ # Decode and print
187
+ print(engine.tokenizer.decode(output[0], skip_special_tokens=True))
188
+
189
+
190
+ ─── High-performance decode (all optimizations) ───
191
+
192
+ engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
193
+
194
+ # Enable flat KV cache (eliminates torch.cat overhead)
195
+ engine.enable_flat_decode() # +403 MB VRAM, BF16 KV
196
+
197
+ # Or FP8 KV cache (half the VRAM, same speed)
198
+ # engine.enable_flat_decode(kv_dtype='fp8') # +208 MB VRAM (use instead of the BF16 call above)
199
+
200
+ # Enable CUDA Graph decode (captures forward pass as graph)
201
+ engine.enable_cuda_graph_decode() # +0 VRAM, ~10% faster
202
+
203
+ # Enable Atlas gatekeeper (prunes cold experts at runtime)
204
+ engine.enable_atlas(
205
+ profile_prompts=8,
206
+ ban_pct=0.25, # Ban bottom 25% of experts per layer
207
+ modes_threshold=2.0 # MoDES: skip MoE for uncertain tokens
208
+ )
209
+
210
+ # Enable FE-XC cold expert demotion (2-bit codebook)
211
+ engine.enable_auto_fexc_demotion(cold_threshold=0.10)
212
+
213
+ # Enable INT2 ultra-cold expert demotion
214
+ engine.enable_auto_int2_demotion(cold_threshold=0.05)
215
+
216
+ # Generate with everything enabled
217
+ output = engine.generate(input_ids, max_new_tokens=500)
218
+
219
+
220
+ ─── Interactive chat loop ───
221
+
222
+ engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
223
+ engine.enable_flat_decode()
224
+ engine.enable_cuda_graph_decode()
225
+
226
+ print("FireEcho Chat (type 'quit' to exit)")
+ while True:
+     user_input = input("\nYou: ")
+     if user_input.lower() == 'quit':
+         break
+
+     # Format as chat (Qwen3 format)
+     prompt = f"<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant\n"
+     input_ids = engine.tokenizer.encode(prompt, return_tensors='pt').cuda()
+
+     output = engine.generate(
+         input_ids,
+         max_new_tokens=500,
+         temperature=0.7,
+         top_p=0.9
+     )
+
+     # Slice off the prompt tokens so only the reply is decoded
+     response = engine.tokenizer.decode(
+         output[0][input_ids.shape[1]:],
+         skip_special_tokens=True
+     )
+     print(f"\nFireEcho: {response}")
248
+
249
+ ================================================================================
250
+ BENCHMARKING
251
+ ================================================================================
252
+
253
+ ─── Quick speed test ───
254
+
255
+ python benchmark_fullstack.py
256
+
257
+ This runs 7 optimization layers, stacking each one:
+ L0: Baseline (FP4 + packed MoE + flat KV BF16)   ~45 tok/s
+ L1: + FP8 KV cache                               ~42 tok/s
+ L2: + L2 layer prefetch                          ~42 tok/s
+ L3: + Atlas Ban & Pick (8->~5 experts)           ~40 tok/s
+ L4: + FE-XC cold experts (2-bit codebook)        ~39 tok/s
+ L5: + INT2 coldest experts (2-bit scalar)        ~38 tok/s
+ L6: + CUDA Graph decode                          TBD
265
+
266
+ Note: L1-L5 are slightly slower than L0 due to overhead from
267
+ additional dispatch logic. The REAL benefit comes when combined
268
+ with speculative decoding (EAGLE-3): the bandwidth savings from
269
+ FE-XC/INT2 allow more tokens to be verified per unit time.
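A tok/s figure like the ones above can be reproduced with a small timing harness. This is a generic sketch, not the contents of benchmark_fullstack.py; `measure_tokens_per_second` and `fake_decode_step` are hypothetical names, and the decode step is simulated with a sleep so the example runs without a GPU.

```python
import time

def measure_tokens_per_second(step_fn, num_tokens, warmup=10):
    """Call step_fn once per token; the first `warmup` calls are not timed."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(num_tokens):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Stand-in for one decode step. On a GPU the timed region should also end with
# torch.cuda.synchronize() so that queued kernels are included in the measurement.
def fake_decode_step():
    time.sleep(0.001)

rate = measure_tokens_per_second(fake_decode_step, num_tokens=20, warmup=2)
print(f"{rate:.0f} tok/s")
```

The untimed warmup calls matter here because the first forward passes include Triton compilation and autotuning.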
270
+
271
+
272
+ ─── EAGLE-3 benchmark (speculative decode) ───
273
+
274
+ python benchmark_eagle.py --checkpoint eagle_checkpoints/eagle_best.pt
275
+
276
+ Note: Requires a trained draft head. See "EAGLE-3 Training" section.
277
+
278
+ ================================================================================
279
+ FEATURE REFERENCE (Cheat Sheet)
280
+ ================================================================================
281
+
+ Feature                 How to enable                              VRAM cost
+ ──────────────────────  ─────────────────────────────────────────  ─────────
+ Flat KV cache (BF16)    engine.enable_flat_decode()                +403 MB
+ Flat KV cache (FP8)     engine.enable_flat_decode(kv_dtype='fp8')  +208 MB
+ CUDA Graph decode       engine.enable_cuda_graph_decode()          ~0
+ Atlas gatekeeper        engine.enable_atlas()                      ~0
+ FE-XC cold demotion     engine.enable_auto_fexc_demotion()         ~0*
+ INT2 cold demotion      engine.enable_auto_int2_demotion()         ~0*
+ L2 layer prefetch       engine.enable_l2_prefetch()                ~0
+ Hebbian memory          engine.enable_hebbian()                    +50 MB
+ EAGLE-3 speculation     engine.enable_eagle(checkpoint)            +200 MB
293
+
294
+ * FE-XC/INT2 actually SAVES VRAM by compressing cold expert weights
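The flat KV cache costs can be sanity-checked with simple arithmetic. The sketch below assumes 48 layers (stated elsewhere in this README) plus three values that are NOT stated here and are assumptions for illustration: 4 KV heads, head dimension 128, and a 4096-token cache. Under those assumptions the BF16 size lands close to the +403 MB entry.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_tokens, bytes_per_value):
    """Size of a pre-allocated KV cache; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * max_tokens * bytes_per_value

# Assumed shape: 48 layers, 4 KV heads, head_dim 128, 4096 tokens.
bf16_bytes = kv_cache_bytes(48, 4, 128, 4096, bytes_per_value=2)
fp8_bytes = kv_cache_bytes(48, 4, 128, 4096, bytes_per_value=1)

print(f"BF16: {bf16_bytes / 1e6:.0f} MB")
print(f"FP8:  {fp8_bytes / 1e6:.0f} MB (before any scale factors or metadata)")
```

The FP8 result is exactly half the BF16 one; the slightly larger FP8 figure in the table presumably includes extra per-block data, which this sketch does not model.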
295
+
296
+ Quantization formats available:
297
+ - Goliath FP4: 4-bit fused dequant (default for MoE experts)
298
+ - Goliath FP8: 8-bit fused dequant (optional for attention)
299
+ - Goliath INT2: 2-bit scalar quantization (coldest experts)
300
+ - FE-XC: 2-bit codebook (2x8 AQLM-style, near-FP16 quality)
301
+ - FE-XVQ: Hessian-weighted 2-bit codebook (VPTQ-inspired)
302
+ - FE-MX: Block floating point (FEMX4/FEMX6/FEMX8 for Hebbian)
303
+
304
+ ================================================================================
305
+ HOW THE ENGINE WORKS (Architecture Overview)
306
+ ================================================================================
307
+
308
+ FireEcho loads a model and replaces standard PyTorch operations with
309
+ custom Triton kernels at every level:
310
+
311
+ 1. LOADING (from_pretrained)
312
+ - Streams model shards one layer at a time (3.1 GB CPU RAM peak)
313
+ - Quantizes each layer to Goliath FP4 on GPU as it loads
314
+ - Packs all 128 MoE experts into contiguous buffers per layer
315
+ - Total: 61 GB BF16 -> 20 GB FP4 in 110 seconds
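The 61 GB starting point follows directly from the parameter count. A minimal sketch of the arithmetic, with `weight_gigabytes` as a hypothetical helper name; the 4-bit line is only a lower bound, since attention weights stay in BF16 and quantized formats carry scale factors.

```python
def weight_gigabytes(num_params, bits_per_param):
    """Storage for num_params weights at a given bit width, in decimal GB."""
    return num_params * bits_per_param / 8 / 1e9

total_params = 30.5e9  # total parameter count quoted in this README

print(f"BF16:      {weight_gigabytes(total_params, 16):.2f} GB")
print(f"all 4-bit: {weight_gigabytes(total_params, 4):.2f} GB (lower bound)")
```

The reported 20 GB sits between the all-4-bit bound and the BF16 size, which is consistent with only the expert weights being stored in FP4.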
316
+
317
+ 2. PREFILL (processing the input prompt)
318
+ - Standard attention + MoE forward pass
319
+ - Uses FlashAttention-style Triton kernel for long sequences
320
+ - Builds KV cache for all layers
321
+
322
+ 3. DECODE (generating tokens one at a time)
323
+ - Each token goes through 48 transformer layers:
324
+
325
+ For each layer:
326
+ a) RMSNorm
327
+ b) Attention: Q/K/V projection (BF16 matmul) -> RoPE -> FlashDecode
328
+ (custom Triton kernel, M=1, online softmax, reads only valid KV)
329
+ c) RMSNorm
330
+ d) MoE Router: softmax over 128 experts -> top-8 selection
331
+ e) Expert FFN: Goliath FP4 packed matmul (gate_up + down)
332
+ - Hot experts: FP4 (highest quality)
333
+ - Cold experts: FE-XC 2-bit codebook (5.3x faster kernel)
334
+ - Coldest experts: INT2 2-bit scalar
335
+ f) Residual connection
336
+
337
+ - With CUDA Graph: entire 48-layer forward captured as one graph
338
+ launch -> ~15.8ms per token
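Step (d) above, softmax over 128 experts followed by top-8 selection, can be sketched in a few lines of plain Python. This is an illustration of the routing idea, not the engine's router; `route_top_k` is a hypothetical name, and renormalizing the selected weights is a common convention assumed here.

```python
import math

def route_top_k(router_logits, k=8):
    """Softmax over all experts, then keep the k most probable ones."""
    top_logit = max(router_logits)
    exps = [math.exp(x - top_logit) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Renormalize so the selected weights sum to 1 (assumed convention).
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Toy logits for 128 experts.
logits = [((i * 37) % 128) / 10.0 for i in range(128)]
for expert_id, weight in route_top_k(logits):
    print(expert_id, round(weight, 4))
```

In the engine the resulting expert IDs stay on the GPU; here they are ordinary Python integers.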
339
+
340
+ 4. SPECULATIVE DECODE (EAGLE-3, when draft head is trained)
341
+ - Draft head predicts next K tokens (K=5 default)
342
+ - Target model verifies all K+1 tokens in one forward pass
343
+ - Accepts matching tokens, rejects and rolls back on mismatch
344
+ - Expected: 3-5x speedup with 70%+ acceptance rate
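The accept/reject step can be sketched for the greedy (temperature 0) case. This is a generic illustration of draft-then-verify, not FireEcho's implementation; `verify_draft` is a hypothetical name, and sampling with temperature would need a probabilistic acceptance rule instead of exact matching.

```python
def verify_draft(draft_tokens, target_tokens):
    """
    draft_tokens:  K tokens proposed by the draft head.
    target_tokens: K+1 greedy tokens from one target forward pass;
                   target_tokens[i] is the target's choice at draft position i,
                   and the last entry is the bonus token after a full match.
    Returns the tokens to emit for this verify step.
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        if token == target_tokens[i]:
            emitted.append(token)
        else:
            # First mismatch: keep the target's token and discard the rest.
            emitted.append(target_tokens[i])
            return emitted
    emitted.append(target_tokens[len(draft_tokens)])  # all K accepted, plus bonus
    return emitted

print(verify_draft([1, 2, 3, 4, 5], [1, 2, 3, 9, 7, 8]))
print(verify_draft([1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6]))
```

Every verify step emits at least one token, so output matches plain greedy decoding; the speedup comes from how often several draft tokens are accepted at once.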
345
+
346
+ Memory layout during decode:
+ ┌────────────────────────────────────────────────┐
+ │ GPU VRAM (32 GB total)                         │
+ ├────────────────────────────────────────────────┤
+ │ Model weights (FP4 quantized)          19.6 GB │
+ │ KV cache (flat, FP8)                    0.2 GB │
+ │ Hebbian memory                         0.05 GB │
+ │ CUDA Graph buffers                      0.1 GB │
+ │ Activations + workspace                 1.0 GB │
+ │ ────────────────────────────────────────────── │
+ │ Total                                 ~21.0 GB │
+ │ Free                                  ~11.0 GB │
+ └────────────────────────────────────────────────┘
359
+
360
+ ================================================================================
361
+ FILE STRUCTURE
362
+ ================================================================================
363
+
+ FireEcho Engine/
+ ├── fireecho_kernel.py          Main engine (9000+ lines)
+ │                               - FireEchoEngine: load, generate, speculate
+ │                               - FireEchoConfig: model configuration
+ │                               - MoEFFN: mixture-of-experts with packed dispatch
+ │                               - HebbianMemory: biologically-inspired fast weights
+ │                               - FireEchoEagleHead: EAGLE-3 draft head
+ │                               - FlashDecode Triton kernel
+ │                               - CUDA Graph capture/replay
+ │
+ ├── goliath_kernel.py           Quantized GEMM kernels (3000+ lines)
+ │                               - GoliathFP4Weights: FP4 fused dequant-matmul
+ │                               - GoliathFP8Weights: FP8 fused dequant-matmul
+ │                               - GoliathINT2Weights: INT2 scalar quantization
+ │                               - GoliathFEXCWeights: FE-XC codebook 2-bit
+ │                               - GoliathFEXVQWeights: Hessian-weighted codebook
+ │                               - Packed MoE kernels (FP4, INT2, FE-XC)
+ │                               - Fused SwiGLU+Down kernel
+ │                               - GoliathQuantumLinear (training)
+ │
+ ├── triton_hebbian.py           Fused Triton kernels for Hebbian memory
+ │                               - fused_competition, fused_soft_hebbian
+ │                               - fused_traces_update, fused_gate_output
+ │
+ ├── femx_storage.py             FE-MX block floating point storage
+ │                               - FEMX2, FEMX4, FEMX6, FEMX8 tiers
+ │                               - Stochastic rounding, age-adaptive precision
+ │
+ ├── persistent_memory.py        AGI-like persistent memory
+ │                               - EpisodicLog: raw experience buffer
+ │                               - SemanticJournal: compressed knowledge
+ │                               - ReflectionEngine: self-evaluation
+ │
+ ├── benchmark_fullstack.py      Full-stack benchmark (L0-L6)
+ ├── benchmark_eagle.py          EAGLE-3 speculative decode benchmark
+ ├── train_eagle_head.py         EAGLE-3 draft head training script
+ └── calibrate_fexc.py           FE-XC codebook calibration
401
+
402
+ ================================================================================
403
+ THE GOLIATH KERNEL (What Makes It Fast)
404
+ ================================================================================
405
+
406
+ Standard quantized inference:
407
+ 1. Load FP4 weights from VRAM
408
+ 2. Dequantize to BF16 in global memory (writes 61 GB!)
409
+ 3. Run matmul on the BF16 weights
410
+ Problem: Step 2 doubles memory traffic and VRAM usage
411
+
412
+ Goliath approach:
413
+ 1. Load FP4 weights directly into Triton registers
414
+ 2. Dequantize INSIDE the matmul tile loop (in registers, zero global write)
415
+ 3. Accumulate in FP32
416
+ Problem: None for memory-bound decode; the extra in-register ALU work is hidden behind memory latency.
417
+
418
+ Code path (simplified):
+ for k_block in range(0, K, BLOCK_K):
+     # Load FP4 packed bytes (2 values per byte)
+     w_packed = tl.load(weight_ptr + offsets)
+
+     # Dequantize in-register (simplified: nibbles treated as integers times a scale)
+     w_lo = (w_packed & 0xF).to(tl.float32) * scale   # low nibble
+     w_hi = (w_packed >> 4).to(tl.float32) * scale    # high nibble
+
+     # Matmul tiles (tensor core). a_lo / a_hi stand for the activation columns
+     # that pair with the low / high nibbles (illustrative names, loads omitted)
+     acc += tl.dot(a_lo, w_lo)
+     acc += tl.dot(a_hi, w_hi)
429
+
430
+ Result: 4x less memory traffic, same numerical quality.
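The "two values per byte, decoded on the fly" idea can be shown in plain Python. This sketch is not the Goliath kernel: it assumes the standard E2M1 FP4 value table and low-nibble-first packing, neither of which is confirmed for FireEcho, and `pack_nibbles` / `dequant_dot` are hypothetical names.

```python
# E2M1 magnitudes for codes 0..7; codes 8..15 are the same values negated.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_TABLE = FP4_E2M1 + [-v for v in FP4_E2M1]

def pack_nibbles(codes):
    """Pack pairs of 4-bit codes (0..15) into bytes, low nibble first."""
    assert len(codes) % 2 == 0 and all(0 <= c <= 15 for c in codes)
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))

def dequant_dot(packed, activations, scale, table=FP4_TABLE):
    """Dot product that decodes each byte as it is read; no decoded copy is stored."""
    acc = 0.0
    for j, byte in enumerate(packed):
        w_lo = table[byte & 0xF] * scale  # low nibble
        w_hi = table[byte >> 4] * scale   # high nibble
        acc += activations[2 * j] * w_lo + activations[2 * j + 1] * w_hi
    return acc

packed = pack_nibbles([1, 2, 9, 7])
print(len(packed), dequant_dot(packed, [1.0, 1.0, 1.0, 1.0], scale=2.0))
```

Four weights occupy two bytes, and the decoded values exist only as temporaries inside the loop, which mirrors the in-register dequantization described above.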
431
+
432
+ Packed MoE:
433
+ Standard approach: Loop over 8 active experts, one matmul each = 16 kernel
434
+ launches per layer (gate_up + down per expert).
435
+
436
+ Goliath Packed MoE: All 128 experts packed into one [128, K//2, N] buffer.
437
+ Single kernel launch reads expert_id from GPU tensor, indexes into buffer.
438
+ Result: 2 kernel launches per layer (gate_up + down), expert selection
439
+ stays entirely on GPU.
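The packed layout comes down to offset arithmetic into one contiguous buffer. A minimal sketch using a flat Python list in place of GPU memory; `expert_slice` and `launches_per_layer` are hypothetical helpers, and the tiny shapes are for illustration only.

```python
def expert_slice(packed_buffer, expert_id, rows, cols):
    """Return one expert's rows*cols block from a flat, contiguous buffer."""
    stride = rows * cols
    start = expert_id * stride
    return packed_buffer[start:start + stride]

def launches_per_layer(active_experts, packed):
    """gate_up + down: once per expert when looping, once in total when packed."""
    return 2 if packed else 2 * active_experts

# 4 toy experts, each a 2 x 3 block, stored back to back.
buffer = list(range(4 * 2 * 3))
print(expert_slice(buffer, expert_id=2, rows=2, cols=3))
print(launches_per_layer(8, packed=False), launches_per_layer(8, packed=True))
```

Because the expert ID is only used to compute an offset, a kernel can read it from a GPU tensor and index the buffer without any host round trip.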
440
+
441
+ ================================================================================
442
+ HEBBIAN MEMORY (What Makes It Smart)
443
+ ================================================================================
444
+
445
+ Standard LLMs: Frozen weights after training. Context window is the only memory.
446
+
447
+ FireEcho Hebbian Memory:
448
+ - Fast weights that update DURING inference (no backpropagation)
449
+ - Inspired by biological synaptic plasticity (Hebb's rule: "neurons that
450
+ fire together wire together")
451
+ - Stores patterns from the current conversation
452
+ - Retrieves relevant patterns to augment generation
453
+
454
+ How it works:
455
+ 1. Input token embedding is projected to query/key/value
456
+ 2. Query matches against stored memory slots (competitive retrieval)
457
+ 3. Top-K most relevant memories are retrieved
458
+ 4. Retrieved context is mixed with transformer hidden state
459
+ 5. Memory slots are updated via Hebbian learning rule
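Steps 2, 3, and 5 above can be sketched with a generic competitive Hebbian update. This is a textbook-style illustration under stated assumptions, not FireEcho's learning rule: slots are plain vectors, matching uses a dot product, and lower-ranked winners learn with a smaller step ("winner-take-most"). `retrieve` and `hebbian_update` are hypothetical names.

```python
def retrieve(slots, query, top_k=2):
    """Score every slot against the query and return the top_k slot indices."""
    scores = [sum(q * s for q, s in zip(query, slot)) for slot in slots]
    winners = sorted(range(len(slots)), key=lambda i: scores[i], reverse=True)[:top_k]
    return winners, scores

def hebbian_update(slots, query, winners, lr=0.1):
    """Move each winning slot toward the input pattern; no gradients involved."""
    for rank, i in enumerate(winners):
        step = lr / (rank + 1)  # the best match learns most
        slots[i] = [s + step * (q - s) for s, q in zip(slots[i], query)]
    return slots

slots = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.2]
winners, _ = retrieve(slots, query)
hebbian_update(slots, query, winners)
print(winners, slots)
```

Slots that did not win are left untouched, so repeated inputs sharpen a few slots instead of overwriting the whole memory.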
460
+
461
+ Updates use:
462
+ - Soft competitive learning (winner-take-most)
463
+ - Three-factor STDP (spike-timing dependent plasticity)
464
+ - Intrinsic plasticity (per-slot gain adaptation)
465
+ - PMI correction (pointwise mutual information bias)
466
+ - GHA decorrelation (prevent redundant memories)
467
+ - Kappa switching (amplified encoding for novel patterns)
468
+
469
+ Enable:
470
+ engine.enable_hebbian()
471
+
472
+ The memory persists within a session and can be saved/loaded:
473
+ engine.save_persistent_memory("memory.pt")
474
+ engine.load_persistent_memory("memory.pt")
475
+
476
+ ================================================================================
477
+ COMPRESSION STACK (Why 30B Fits in 20 GB)
478
+ ================================================================================
479
+
+ Level    Format            Bits  Compression  Quality       Used For
+ ───────  ────────────────  ────  ───────────  ────────────  ───────────────────
+ Base     BF16              16    1x           Perfect       Attention Q/K/V/O
+ Hot      Goliath FP4       4     4x           Near-perfect  Active MoE experts
+ Cold     FE-XC (codebook)  2     8x           Very good     Rarely-used experts
+ Coldest  INT2 (scalar)     2     8x           Acceptable    Least-used experts
489
+
490
+ Combined with MoE sparsity (8/128 active = 6.25%):
491
+ Effective model size per token:
492
+ Attention: 8 x (4 projections x 2048 x 128 x 2 bytes) = 16 MB
493
+ MoE: 8 experts x 3 projections x 768 x 2048 x 0.5 bytes = 18.9 MB
494
+ Other: embeddings, norms, router = ~13 MB
495
+ Total per token: ~48 MB
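The byte counts above can be reproduced directly. The sketch below only re-evaluates the two formulas as written; the function names are hypothetical, and the ~13 MB "other" term is taken as given. One detail it makes visible: the 16 MB attention figure is in binary megabytes (MiB), while 18.9 MB for MoE is in decimal megabytes.

```python
def attention_bytes():
    # 8 x (4 projections x 2048 x 128 x 2 bytes), as written above
    return 8 * (4 * 2048 * 128 * 2)

def moe_bytes(active_experts=8, projections=3, inter=768, hidden=2048,
              bytes_per_weight=0.5):
    # 0.5 bytes per weight corresponds to 4-bit storage
    return int(active_experts * projections * inter * hidden * bytes_per_weight)

attn = attention_bytes()
moe = moe_bytes()
print(f"attention: {attn} bytes = {attn / 2**20:.1f} MiB = {attn / 1e6:.1f} MB")
print(f"moe:       {moe} bytes = {moe / 2**20:.1f} MiB = {moe / 1e6:.1f} MB")
```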
496
+
497
+ RTX 5090 bandwidth: 1.79 TB/s
498
+ Theoretical max: 1,790,000 MB/s / 48 MB = 37,291 tok/s (bandwidth-bound limit)
499
+ Measured: ~45 tok/s (memory-bound, current result)
500
+
501
+ With FE-XC/INT2 cold experts replacing 80%+ of inactive expert weights:
502
+ MoE bandwidth: 18.9 MB * 0.5 (half are 2-bit) = ~10 MB
503
+ Total per token: ~39 MB
504
+ At the same utilization as the current result: ~55 tok/s
505
+
506
+ With EAGLE-3 (70% acceptance, K=5 draft):
507
+ Effective throughput: 55 * 3.5 (average accepted tokens per verify) = ~193 tok/s
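The "average accepted tokens per verify" factor can be estimated with a simple model. The sketch below assumes each draft token is accepted independently with the same probability, which is a simplification and not a measured property of the draft head; `expected_tokens_per_verify` is a hypothetical name.

```python
def expected_tokens_per_verify(accept_prob, draft_len):
    """
    Expected tokens emitted per verify pass under independent acceptance.
    The sum 1 + p + p^2 + ... + p^K counts the token the target always
    contributes (correction or bonus) plus each draft token that survives.
    """
    return sum(accept_prob ** i for i in range(draft_len + 1))

for p in (0.6, 0.7, 0.8):
    print(p, round(expected_tokens_per_verify(p, draft_len=5), 2))
```

Under this model, p = 0.7 with K = 5 gives roughly 2.9 tokens per verify; a factor of 3.5 corresponds to a somewhat higher per-token acceptance rate. The model also ignores the cost of running the draft head itself.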
508
+
509
+ ================================================================================
510
+ EAGLE-3 SPECULATIVE DECODING
511
+ ================================================================================
512
+
513
+ EAGLE-3 is a draft-then-verify acceleration technique:
514
+
515
+ Normal decode: 1 token per forward pass through 48 MoE layers
516
+ EAGLE-3: Draft head predicts 5 tokens cheaply, target model verifies all 6
517
+ in one forward pass. If 4/5 match -> 5 tokens for the cost of ~2.
518
+
519
+ Architecture of draft head:
520
+ - Takes hidden states from layers 8, 24, 47 + token embedding
521
+ - Compresses via FC layer (8192 -> 2048)
522
+ - Passes through D transformer layers (D=2 to D=50)
523
+ - Shares lm_head with target model
524
+ - Total params: 115M (D=2) to 2.12B (D=50)
525
+
526
+ Training:
+ python train_eagle_head.py \
+     --offline \
+     --num_head_layers 50 \
+     --draft_depth 5 \
+     --lr 5e-4 \
+     --epochs 5 \
+     --loss_type ce \
+     --batch_positions \
+     --use_quantum_linear \
+     --compile
+
+ Flags:
+     --offline               Use precomputed hidden states
+     --num_head_layers 50    D=50 layers
+     --draft_depth 5         K=5 draft steps
+     --lr 5e-4               Learning rate
+     --epochs 5              Training epochs
+     --loss_type ce          Cross-entropy loss
+     --batch_positions       Batched M=64 (10x faster)
+     --use_quantum_linear    Goliath FP8 forward + Quantum Gold backward
+     --compile               torch.compile the head
537
+
538
+ Usage after training:
539
+ engine.enable_eagle("eagle_checkpoints/eagle_best.pt")
540
+ output = engine.speculative_generate(input_ids, max_new_tokens=500)
541
+
542
+ ================================================================================
543
+ SPEED OPTIMIZATION HISTORY
544
+ ================================================================================
545
+
+ Step  Optimization                               tok/s  Speedup
+ ────  ─────────────────────────────────────────  ─────  ───────
+ 0     Baseline (128-expert Python loop)          0.4    1x
+ 1     Grouped dispatch + TF32 + Triton autotune  7.7    19x
+ 2     Fused gate_up_proj (2->1 matmul/expert)    9.5    24x
+ 3     Single-token decode fast path              12.6   32x
+ 4     Multi-expert Goliath kernel (2 launches)   18.8   47x
+ 5     Packed MoE (contiguous buffer, GPU IDs)    30.8   77x
+ 6     Flat decode KV cache (zero torch.cat)      40.9   102x
+ 7     CUDA Graph + FlashDecode                   49.4   124x
556
+
557
+ Where the time goes at 45 tok/s (22ms per token):
558
+ Attention (FlashDecode): 0.28ms/layer x 48 = 13.4ms (61%)
559
+ MoE (Goliath FP4): 0.17ms/layer x 48 = 8.2ms (37%)
560
+ Other (norms, router): 0.4ms (2%)
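The breakdown above is per-layer cost times 48 layers plus a fixed remainder. A short sketch that redoes the arithmetic from the quoted per-layer timings; `decode_budget` is a hypothetical helper name.

```python
def decode_budget(attn_ms_per_layer, moe_ms_per_layer, other_ms, num_layers=48):
    """Per-token latency budget and the implied tokens per second."""
    attn = attn_ms_per_layer * num_layers
    moe = moe_ms_per_layer * num_layers
    total = attn + moe + other_ms
    return {
        "attention_ms": attn,
        "moe_ms": moe,
        "total_ms": total,
        "tok_per_s": 1000.0 / total,
    }

budget = decode_budget(0.28, 0.17, 0.4)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

With these inputs the total comes to about 22 ms per token, which matches the ~45 tok/s figure in the heading.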
561
+
562
+ ================================================================================
563
+ KNOWN LIMITATIONS & GOTCHAS
564
+ ================================================================================
565
+
566
+ - Single-GPU only (by design; multi-GPU adds complexity for marginal gain)
567
+ - Minimum 24 GB VRAM (model alone is 20 GB)
568
+ - FP4 quantization has ~0.05-0.15 relative error vs BF16 (negligible in practice)
569
+ - First 10+ forward passes are slow (Triton kernel compilation/autotuning)
570
+ - CUDA Graph capture requires fixed tensor shapes (only decode, not prefill)
571
+ - Hebbian memory adds ~50 MB VRAM and slight latency
572
+ - FE-XC codebook learning takes 1-2 minutes on first enable
573
+ - No pip package yet (source install only)
574
+ - Tested primarily on RTX 5090; other GPUs may need a Triton autotune re-run
575
+ - MoDES expert skipping can hurt quality if threshold is too aggressive
576
+
577
+ ================================================================================
578
+ TROUBLESHOOTING
579
+ ================================================================================
580
+
581
+ Problem: "CUDA out of memory"
582
+ Fix: Check nvidia-smi for other processes using VRAM. Kill them.
583
+ Or reduce max_kv_blocks in config (default 256 = 4K tokens = 3.1 GB).
584
+
585
+ Problem: Very slow first few generations
586
+ Fix: Normal. Triton is compiling and autotuning kernels. Wait ~10 forward
587
+ passes for warmup. Subsequent runs use cached kernels.
588
+
589
+ Problem: "No module named 'triton'"
590
+ Fix: pip install triton (requires CUDA toolkit installed)
591
+
592
+ Problem: "RuntimeError: Triton compilation failed"
593
+ Fix: Check CUDA version matches PyTorch: python -c "import torch; print(torch.version.cuda)"
594
+ Triton 3.0+ needs CUDA 12.0+.
595
+
596
+ Problem: NaN in output
597
+ Fix: Check if using prefill with >20 tokens (packed MoE kernel needs 3D grid).
598
+ This bug has been fixed; update to the latest code.
599
+
600
+ Problem: CUDA Graph capture crashes
601
+ Fix: Atlas .item() calls conflict with graph capture. The engine auto-skips
602
+ these during capture (fixed). Update to latest code.
603
+
604
+ ================================================================================
605
+ RESEARCH PAPERS & REFERENCES
606
+ ================================================================================
607
+
608
+ FireEcho builds on ideas from:
609
+
610
+ Quantization:
611
+ - AQLM (arxiv 2401.06118): Additive quantization for LLMs -> FE-XC codebook
612
+ - VPTQ (Hessian-weighted): Second-order optimal codebooks -> FE-XVQ
613
+ - FP4 Training (arxiv 2501.17116): Gradient flow through FP4
614
+
615
+ Speculative Decoding:
616
+ - EAGLE-3 (Li et al.): Draft-then-verify with shared lm_head
617
+ - Scylla (arxiv 2505.07858): Tree-based multi-candidate verification -> FE-XT
618
+ - Medusa: Multi-head parallel drafting
619
+
620
+ MoE Optimization:
621
+ - SP-MoE (arxiv 2510.10302): Async expert prefetch -> FE-H Hayabusa
622
+ - MoE-Inference-Bench: Expert sizing analysis
623
+
624
+ Hebbian/Neuroscience:
625
+ - Lansner BCPNN: Bayesian confidence propagation neural networks
626
+ - Triesch 2005: Intrinsic plasticity
627
+ - Sanger's GHA: Generalized Hebbian algorithm
628
+ - McClelland et al. 1995: Complementary learning systems
629
+
630
+ Tensor Decomposition:
631
+ - MPS/TT decomposition: Quantum-inspired weight compression
632
+
633
+ ================================================================================
634
+ WHERE TO GET HELP
635
+ ================================================================================
636
+
637
+ GitHub Issues: https://github.com/Joysulem/FireEcho/issues
638
+ Include: GPU model, CUDA version, PyTorch version, full error traceback
639
+
640
+ X / Twitter: @Joysulem
641
+ Tag me with questions, benchmarks, or usage reports
642
+
643
+ Email: (floresluise1988@gmail.com)
644
+
645
+ ================================================================================
646
+ LICENSE
647
+ ================================================================================
648
+
649
+ Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
650
+
651
+ You are free to:
652
+ - Share: copy and redistribute the material in any medium or format
653
+ - Adapt: remix, transform, and build upon the material
654
+
655
+ Under the following terms:
656
+ - Attribution: You must give appropriate credit to Luis E. Davila Flores,
657
+ provide a link to the license, and indicate if changes were made.
658
+ - NonCommercial: You may not use the material for commercial purposes.
659
+
660
+ Full license: https://creativecommons.org/licenses/by-nc/4.0/
661
+
662
+ For commercial licensing inquiries, contact: @Joysulem on X/Twitter
663
+
664
+ ================================================================================
665
+ FireEcho Engine - Created by Luis E. Davila Flores
666
+ "One GPU. One file. One import. Full pipeline."
667
+ ================================================================================