Create Bible Readme

Bible Readme +667 -0

	@@ -0,0 +1,667 @@

+================================================================================
+                          FIREECHO ENGINE
+     High-Performance Single-GPU Inference Kernel for 30B+ MoE Models
+================================================================================
+   Creator & Sole Author: Luis E. Davila Flores (@Joysulem)
+   License: CC BY-NC 4.0 (free for research, attribution required)
+   Status: Production-quality single-GPU decode, research extensions active
+================================================================================
+ WHAT IS FIREECHO?
+================================================================================
+FireEcho is a custom inference engine that runs a 30 BILLION parameter
+Mixture-of-Experts model (Qwen3-Omni-30B) on a SINGLE consumer GPU at
+45+ tokens/second — using only 20 GB VRAM.
+No multi-GPU. No cloud. No NVIDIA proprietary libraries.
+Just Triton + PyTorch + one GPU.
+Key numbers:
+  - 30.5B total params, ~3.3B active per token (128 experts, top-8 routing)
+  - 4x expert-weight compression via Goliath FP4 fused dequant-matmul; whole model
+    61 GB -> 20 GB (~3x overall, since attention weights stay in BF16)
+  - 124x speedup from baseline (0.4 -> 49.4 tok/s) through 7 optimization layers
+  - No dependency on NVIDIA-specific libraries (no cuQuantizer, CUTLASS, TensorRT)
+  - Portable by design to any Triton target (NVIDIA CUDA, AMD ROCm, Intel XPU);
+    only NVIDIA has been tested so far
+What makes FireEcho different from vLLM/TGI/llama.cpp:
+  - Goliath kernel: FP4 dequantization INSIDE the Triton matmul loop (no separate
+    dequant step, no global memory materialization)
+  - Packed MoE: All 128 experts packed into one contiguous buffer per layer,
+    expert IDs stay on GPU — zero CPU-GPU sync during decode
+  - FlashDecode: Custom Triton attention kernel with online softmax for M=1 GQA
+  - Hebbian Memory: Biologically-inspired fast weights that learn at inference time
+  - FE-XC/INT2: Cold experts auto-demote to 2-bit (codebook or scalar) — further
+    bandwidth savings without touching hot experts
+  - CUDA Graph decode: Entire decode step captured as a graph, ~15.8ms/step
+================================================================================
+ CURRENT STATUS & REALISTIC EXPECTATIONS
+================================================================================
+WHAT WORKS (production-quality):
+  [x] Full Qwen3-Omni-30B inference at 45+ tok/s on RTX 5090
+  [x] Goliath FP4 quantization (20 GB VRAM, near-FP16-quality output)
+  [x] Packed MoE with fused dequant-matmul (zero CPU sync)
+  [x] FlashDecode attention (online softmax, valid_len masking)
+  [x] CUDA Graph decode (graph-captured forward pass)
+  [x] Flat KV cache (pre-allocated, zero allocation per token)
+  [x] FP8 KV cache (50% VRAM savings on attention)
+  [x] FE-XC cold expert demotion (codebook 2-bit, 5.3x faster kernel)
+  [x] INT2 ultra-cold expert demotion (scalar 2-bit)
+  [x] Hebbian persistent memory (learns during inference)
+  [x] Atlas gatekeeper (expert banning + MoDES skipping)
+  [x] Streaming shard loader (110s cold start, 3.1 GB CPU RAM)
+WHAT'S RESEARCH/EXPERIMENTAL:
+  [ ] EAGLE-3 speculative decoding (infrastructure done, head needs training)
+  [ ] FE-XT tree speculation (code done, needs trained draft head)
+  [ ] FE-H Hayabusa async prefetch (code done, needs benchmarking)
+  [ ] Batched speculative decode (infrastructure done)
+  [ ] Multi-GPU (not implemented — single-GPU is the design philosophy)
+WILL NOT WORK ON:
+  - GPUs with < 24 GB VRAM (model is 20 GB + KV cache)
+  - CUDA < 12.4 (BF16 atomics, FP8 support needed)
+  - CPU-only (Triton compiles to GPU targets)
+================================================================================
+ HARDWARE & SOFTWARE REQUIREMENTS
+================================================================================
+  Component          Minimum              Recommended
+  ─────────────────  ───────────────────  ────────────────────────
+  GPU                RTX 4090 (24 GB)*    RTX 5090 (32 GB)
+  GPU VRAM           24 GB                32 GB
+  CPU                Any modern x86_64    Ryzen 9 9950X / i9-14900K
+  System RAM         32 GB                64 GB
+  CUDA               12.4+                12.8+
+  Python             3.10 - 3.12          3.12
+  PyTorch            2.4.0+               2.6.0+cu128
+  Triton             3.0+                 3.2+
+  OS                 Linux (x86_64)      Arch Linux / Ubuntu 22.04+
+  * RTX 4090: Will work but FP4 kernels may be slower (no Blackwell tensor cores)
+  * RTX 3090: Marginal — 24 GB VRAM is tight, FP8 not supported
+  * AMD GPUs: Triton compiles to ROCm — untested but architecturally supported
+  Tested configuration (author's machine):
+    AMD Ryzen 9 9950X + NVIDIA RTX 5090 32 GB + 64 GB DDR5
+    Arch Linux, CUDA 12.8, Python 3.12, PyTorch 2.6.0+cu128, Triton 3.2
+================================================================================
+ INSTALLATION
+================================================================================
+Step 1: Clone the repository
+─────────────────────────────
+  git clone https://github.com/Joysulem/FireEcho.git
+  cd FireEcho
+Step 2: Create a Python virtual environment
+────────────────────────────────────────────
+  python3.12 -m venv .venv
+  source .venv/bin/activate
+Step 3: Install dependencies
+─────────────────────────────
+  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+  pip install triton transformers tokenizers safetensors sentencepiece
+Step 4: Verify installation
+────────────────────────────
+  python -c "import torch; print('CUDA:', torch.cuda.is_available(), '|', torch.version.cuda)"
+  python -c "import triton; print('Triton:', triton.__version__)"
+  Expected output:
+    CUDA: True | 12.8
+    Triton: 3.2.0
+Step 5: Download a model (Qwen3-Omni-30B recommended)
+──────────────────────────────────────────────────────
+  # Option A: Via huggingface-cli
+  pip install huggingface-hub
+  huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./model/Qwen3-Omni
+  # Option B: Via git lfs
+  git lfs install
+  git clone https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct ./model/Qwen3-Omni
+================================================================================
+ QUICK SMOKE TEST (run this first!)
+================================================================================
+  cd FireEcho/kernel/FireEcho\ Engine/
+  python -c "
+  from fireecho_kernel import FireEchoEngine
+  import torch
+  # Load model (takes ~110 seconds, streams layer-by-layer)
+  engine = FireEchoEngine.from_pretrained('./model/Qwen3-Omni')
+  # Quick generation test
+  tokens = engine.tokenizer.encode('The capital of France is', return_tensors='pt').cuda()
+  output = engine.generate(tokens, max_new_tokens=20, temperature=0.0)
+  print(engine.tokenizer.decode(output[0]))
+  print(f'VRAM used: {torch.cuda.max_memory_allocated()/1e9:.1f} GB')
+  "
+  Expected output:
+    The capital of France is Paris. Paris is the largest city in France...
+    VRAM used: 23.1 GB
+  If this works, your setup is correct. If not, check:
+    - CUDA version matches PyTorch build (torch.version.cuda)
+    - GPU has enough VRAM (nvidia-smi)
+    - Model path is correct
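Part of the checklist above can be automated before loading the model. The following is a minimal sketch using only the standard library; `check_setup` and its report keys are hypothetical names for illustration and are not part of FireEcho.

```python
import importlib.util
from pathlib import Path


def check_setup(model_dir, required=("torch", "triton", "transformers", "safetensors")):
    """Report which packages are importable and whether the model directory
    holds any .safetensors shards. Hypothetical helper, not part of FireEcho."""
    report = {}
    for name in required:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    path = Path(model_dir)
    report["model_dir_exists"] = path.is_dir()
    report["has_safetensors"] = path.is_dir() and any(path.glob("*.safetensors"))
    return report


if __name__ == "__main__":
    for key, ok in check_setup("./model/Qwen3-Omni").items():
        print(f"{key:20s} {'OK' if ok else 'MISSING'}")
```

The model path is resolved relative to the current directory, so run the check from the same directory as the smoke test.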
+================================================================================
+ BASIC INFERENCE USAGE
+================================================================================
+─── Minimal example ───
+  from fireecho_kernel import FireEchoEngine
+  # Load model with FP4 quantization (automatic for Qwen3-Omni)
+  engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
+  # Encode input
+  input_ids = engine.tokenizer.encode(
+      "Explain quantum computing in simple terms",
+      return_tensors='pt'
+  ).cuda()
+  # Generate
+  output = engine.generate(
+      input_ids,
+      max_new_tokens=200,
+      temperature=0.7,
+      top_p=0.9,
+      use_cache=True
+  )
+  # Decode and print
+  print(engine.tokenizer.decode(output[0], skip_special_tokens=True))
+─── High-performance decode (all optimizations) ───
+  engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
+  # Enable flat KV cache (eliminates torch.cat overhead)
+  engine.enable_flat_decode()          # +403 MB VRAM, BF16 KV
+  # Alternative: FP8 KV cache (about half the VRAM; ~42 vs ~45 tok/s in the
+  # benchmark section). Use this call instead of the one above, not both.
+  # engine.enable_flat_decode(kv_dtype='fp8')  # +208 MB VRAM
+  # Enable CUDA Graph decode (captures forward pass as graph)
+  engine.enable_cuda_graph_decode()    # +0 VRAM, ~10% faster
+  # Enable Atlas gatekeeper (prunes cold experts at runtime)
+  engine.enable_atlas(
+      profile_prompts=8,
+      ban_pct=0.25,              # Ban bottom 25% of experts per layer
+      modes_threshold=2.0        # MoDES: skip MoE for uncertain tokens
+  )
+  # Enable FE-XC cold expert demotion (2-bit codebook)
+  engine.enable_auto_fexc_demotion(cold_threshold=0.10)
+  # Enable INT2 ultra-cold expert demotion
+  engine.enable_auto_int2_demotion(cold_threshold=0.05)
+  # Generate with everything enabled
+  output = engine.generate(input_ids, max_new_tokens=500)
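The two demotion thresholds above split experts into hot, cold and coldest tiers. The README does not say how `cold_threshold` is measured, so the sketch below assumes one possible reading: an expert is cold when its routing count falls below that fraction of the busiest expert's count. `assign_tiers` is a hypothetical helper, not an engine API.

```python
def assign_tiers(route_counts, fexc_threshold=0.10, int2_threshold=0.05):
    """Assign each expert a storage tier from its routing count.

    Assumption (not confirmed by the README): a threshold is the fraction of
    the busiest expert's hit count below which an expert counts as cold.
    """
    peak = max(route_counts) or 1  # avoid dividing by zero if nothing was routed
    tiers = []
    for count in route_counts:
        share = count / peak
        if share < int2_threshold:
            tiers.append("int2")   # coldest: scalar 2-bit
        elif share < fexc_threshold:
            tiers.append("fexc")   # cold: codebook 2-bit
        else:
            tiers.append("fp4")    # hot: stays in FP4
    return tiers


counts = [1000, 400, 80, 30, 0]
print(assign_tiers(counts))
```

Keeping the INT2 cutoff below the FE-XC cutoff means only the least-used experts get the lossier scalar format.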
+─── Interactive chat loop ───
+  engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
+  engine.enable_flat_decode()
+  engine.enable_cuda_graph_decode()
+  print("FireEcho Chat (type 'quit' to exit)")
+  while True:
+      user_input = input("\nYou: ")
+      if user_input.lower() == 'quit':
+          break
+      # Format as chat (Qwen3 format)
+      prompt = f"<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant\n"
+      input_ids = engine.tokenizer.encode(prompt, return_tensors='pt').cuda()
+      output = engine.generate(
+          input_ids,
+          max_new_tokens=500,
+          temperature=0.7,
+          top_p=0.9
+      )
+      response = engine.tokenizer.decode(
+          output[0][input_ids.shape[1]:],
+          skip_special_tokens=True
+      )
+      print(f"\nFireEcho: {response}")
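The loop above builds each prompt from the latest message only, so earlier turns are not part of the prompt. A minimal sketch of a multi-turn prompt in the same `<|im_start|>` / `<|im_end|>` layout follows; `build_chat_prompt` is a hypothetical helper for illustration.

```python
def build_chat_prompt(messages):
    """Render (role, text) pairs in the <|im_start|>/<|im_end|> layout used by
    the chat loop, ending with an open assistant turn."""
    parts = [f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


history = [("user", "Hi"), ("assistant", "Hello!"), ("user", "What is MoE?")]
prompt = build_chat_prompt(history)
print(prompt)
```

After each reply, append `("assistant", response)` to the history so the next prompt carries the whole conversation. Longer histories need more KV cache.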
+================================================================================
+ BENCHMARKING
+================================================================================
+─── Quick speed test ───
+  python benchmark_fullstack.py
+  This runs 7 optimization layers, stacking each one:
+    L0: Baseline (FP4 + packed MoE + flat KV BF16)    ~45 tok/s
+    L1: + FP8 KV cache                                ~42 tok/s
+    L2: + L2 layer prefetch                            ~42 tok/s
+    L3: + Atlas Ban & Pick (8->~5 experts)             ~40 tok/s
+    L4: + FE-XC cold experts (2-bit codebook)          ~39 tok/s
+    L5: + INT2 coldest experts (2-bit scalar)           ~38 tok/s
+    L6: + CUDA Graph decode                             ~TBD
+  Note: L1-L5 are slightly slower than L0 due to overhead from
+  additional dispatch logic. The REAL benefit comes when combined
+  with speculative decoding (EAGLE-3) — the bandwidth savings from
+  FE-XC/INT2 allow more tokens to be verified per unit time.
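To compare individual layers on your own hardware, a small timing harness is enough. This is a generic sketch and not the code in benchmark_fullstack.py; `measure_tok_per_s` and the stand-in workload are hypothetical.

```python
import time


def measure_tok_per_s(step_fn, num_tokens=50, warmup=10):
    """Time `step_fn` (one call = one decoded token) and return tokens/second.
    Warmup calls are excluded so one-time compilation does not skew the result."""
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(num_tokens):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed


# Stand-in workload; replace with a real single-token decode call.
rate = measure_tok_per_s(lambda: sum(range(1000)), num_tokens=20, warmup=2)
print(f"{rate:.1f} tok/s")
```

GPU kernels are launched asynchronously, so call `torch.cuda.synchronize()` before reading the clock at both ends of the timed region.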
+─── EAGLE-3 benchmark (speculative decode) ───
+  python benchmark_eagle.py --checkpoint eagle_checkpoints/eagle_best.pt
+  Note: Requires a trained draft head. See "EAGLE-3 Training" section.
+================================================================================
+ FEATURE REFERENCE (Cheat Sheet)
+================================================================================
+  Feature                  How to enable                        VRAM cost
+  ───────────────────────  ─────────────────────────────────    ──────────
+  Flat KV cache (BF16)     engine.enable_flat_decode()          +403 MB
+  Flat KV cache (FP8)      engine.enable_flat_decode('fp8')     +208 MB
+  CUDA Graph decode        engine.enable_cuda_graph_decode()    ~0
+  Atlas gatekeeper         engine.enable_atlas()                ~0
+  FE-XC cold demotion      engine.enable_auto_fexc_demotion()   ~0*
+  INT2 cold demotion       engine.enable_auto_int2_demotion()   ~0*
+  L2 layer prefetch        engine.enable_l2_prefetch()          ~0
+  Hebbian memory           engine.enable_hebbian()              +50 MB
+  EAGLE-3 speculation      engine.enable_eagle(checkpoint)      +200 MB
+  * FE-XC/INT2 actually SAVES VRAM by compressing cold expert weights
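The flat KV cache costs in the table can be estimated from the cache shape. The sketch below is a back-of-envelope check, not the engine's allocator. The 48 layers, the 128 head dimension and the 4K-token window all appear elsewhere in this README; the 4 KV heads are an assumption.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_tokens, bytes_per_value):
    """Bytes for a pre-allocated KV cache: one K and one V vector per layer,
    per KV head, per token position."""
    return 2 * num_layers * num_kv_heads * head_dim * max_tokens * bytes_per_value


# 4 KV heads is an assumption chosen for illustration.
bf16 = kv_cache_bytes(48, 4, 128, 4096, 2)
fp8 = kv_cache_bytes(48, 4, 128, 4096, 1)
print(f"BF16: {bf16 / 1e6:.0f} MB, FP8: {fp8 / 1e6:.0f} MB")
```

Under these assumptions the BF16 estimate is about 403 MB, matching the table. The FP8 estimate is about 201 MB; this formula does not account for the remaining few MB in the +208 MB figure.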
+  Quantization formats available:
+    - Goliath FP4: 4-bit fused dequant (default for MoE experts)
+    - Goliath FP8: 8-bit fused dequant (optional for attention)
+    - Goliath INT2: 2-bit scalar quantization (coldest experts)
+    - FE-XC: 2-bit codebook (2x8 AQLM-style, near-FP16 quality)
+    - FE-XVQ: Hessian-weighted 2-bit codebook (VPTQ-inspired)
+    - FE-MX: Block floating point (FEMX4/FEMX6/FEMX8 for Hebbian)
+================================================================================
+ HOW THE ENGINE WORKS (Architecture Overview)
+================================================================================
+  FireEcho loads a model and replaces standard PyTorch operations with
+  custom Triton kernels at every level:
+  1. LOADING (from_pretrained)
+     - Streams model shards one layer at a time (3.1 GB CPU RAM peak)
+     - Quantizes each layer to Goliath FP4 on GPU as it loads
+     - Packs all 128 MoE experts into contiguous buffers per layer
+     - Total: 61 GB BF16 -> 20 GB FP4 in 110 seconds
+  2. PREFILL (processing the input prompt)
+     - Standard attention + MoE forward pass
+     - Uses FlashAttention-style Triton kernel for long sequences
+     - Builds KV cache for all layers
+  3. DECODE (generating tokens one at a time)
+     - Each token goes through 48 transformer layers:
+       For each layer:
+         a) RMSNorm
+         b) Attention: Q/K/V projection (BF16 matmul) -> RoPE -> FlashDecode
+            (custom Triton kernel, M=1, online softmax, reads only valid KV)
+         c) RMSNorm
+         d) MoE Router: softmax over 128 experts -> top-8 selection
+         e) Expert FFN: Goliath FP4 packed matmul (gate_up + down)
+            - Hot experts: FP4 (highest quality)
+            - Cold experts: FE-XC 2-bit codebook (5.3x faster kernel)
+            - Coldest experts: INT2 2-bit scalar
+         f) Residual connection
+     - With CUDA Graph: entire 48-layer forward captured as one graph
+       launch -> ~15.8ms per token
+  4. SPECULATIVE DECODE (EAGLE-3, when draft head is trained)
+     - Draft head predicts next K tokens (K=5 default)
+     - Target model verifies all K+1 tokens in one forward pass
+     - Accepts matching tokens, rejects and rolls back on mismatch
+     - Expected: 3-5x speedup with 70%+ acceptance rate
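Step 3b relies on online softmax: attention for one query is accumulated key by key with a running maximum and normalizer, so the full score vector is never stored. The pure-Python reference below illustrates the technique only. It is not the Triton kernel, and `flash_decode_reference` is a hypothetical name.

```python
import math


def flash_decode_reference(q, keys, values, valid_len):
    """Single-query attention with online softmax over the first valid_len
    cache entries. Entries past valid_len are never read."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys[:valid_len], values[:valid_len]):
        score = scale * sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, score)
        # Rescale everything accumulated under the previous maximum
        correction = math.exp(running_max - new_max)
        weight = math.exp(score - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]       # third entry is stale cache
values = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
print(flash_decode_reference(q, keys, values, valid_len=2))
```

The result equals an ordinary softmax over the valid scores. The rescaling step is what allows the keys to be processed in tiles.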
+  Memory layout during decode:
+    ┌──────────────────────────────────────────────────┐
+    │ GPU VRAM (32 GB total)                           │
+    ├──────────────────────────────────────────────────┤
+    │ Model weights (FP4 quantized)        19.6 GB     │
+    │ KV cache (flat, FP8)                  0.2 GB     │
+    │ Hebbian memory                        0.05 GB    │
+    │ CUDA Graph buffers                    0.1 GB     │
+    │ Activations + workspace               1.0 GB     │
+    │ ─────────────────────────────────────────────    │
+    │ Total                                ~21.0 GB    │
+    │ Free                                 ~11.0 GB    │
+    └──────────────────────────────────────────────────┘
+================================================================================
+ FILE STRUCTURE
+================================================================================
+  FireEcho Engine/
+  ├── fireecho_kernel.py      Main engine (9000+ lines)
+  │                           - FireEchoEngine: load, generate, speculate
+  │                           - FireEchoConfig: model configuration
+  │                           - MoEFFN: mixture-of-experts with packed dispatch
+  │                           - HebbianMemory: biologically-inspired fast weights
+  │                           - FireEchoEagleHead: EAGLE-3 draft head
+  │                           - FlashDecode Triton kernel
+  │                           - CUDA Graph capture/replay
+  │
+  ├── goliath_kernel.py       Quantized GEMM kernels (3000+ lines)
+  │                           - GoliathFP4Weights: FP4 fused dequant-matmul
+  │                           - GoliathFP8Weights: FP8 fused dequant-matmul
+  │                           - GoliathINT2Weights: INT2 scalar quantization
+  │                           - GoliathFEXCWeights: FE-XC codebook 2-bit
+  │                           - GoliathFEXVQWeights: Hessian-weighted codebook
+  │                           - Packed MoE kernels (FP4, INT2, FE-XC)
+  │                           - Fused SwiGLU+Down kernel
+  │                           - GoliathQuantumLinear (training)
+  │
+  ├── triton_hebbian.py       Fused Triton kernels for Hebbian memory
+  │                           - fused_competition, fused_soft_hebbian
+  │                           - fused_traces_update, fused_gate_output
+  │
+  ├── femx_storage.py         FE-MX block floating point storage
+  │                           - FEMX2, FEMX4, FEMX6, FEMX8 tiers
+  │                           - Stochastic rounding, age-adaptive precision
+  │
+  ├── persistent_memory.py    AGI-like persistent memory
+  │                           - EpisodicLog: raw experience buffer
+  │                           - SemanticJournal: compressed knowledge
+  │                           - ReflectionEngine: self-evaluation
+  │
+  ├── benchmark_fullstack.py  Full-stack benchmark (L0-L6)
+  ├── benchmark_eagle.py      EAGLE-3 speculative decode benchmark
+  ├── train_eagle_head.py     EAGLE-3 draft head training script
+  └── calibrate_fexc.py       FE-XC codebook calibration
+================================================================================
+ THE GOLIATH KERNEL (What Makes It Fast)
+================================================================================
+Standard quantized inference:
+  1. Load FP4 weights from VRAM
+  2. Dequantize to BF16 in global memory (61 GB if the whole model is materialized)
+  3. Run matmul on the BF16 weights
+  Problem: Step 2 doubles memory traffic and VRAM usage
+Goliath approach:
+  1. Load FP4 weights directly into Triton registers
+  2. Dequantize INSIDE the matmul tile loop (in registers, zero global write)
+  3. Accumulate in FP32
+  Cost: a few extra in-register ALU ops per tile; no added global-memory traffic.
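+  Code path (simplified pseudocode; a_tile loading and pointer math omitted):
+    for k_block in range(0, K, BLOCK_K):
+        # Load FP4 packed bytes (2 values per byte)
+        w_packed = tl.load(weight_ptr + offsets)
+        # Dequantize in-register
+        w_lo = (w_packed & 0xF).to(tl.float32) * scale  # low nibble
+        w_hi = (w_packed >> 4).to(tl.float32) * scale   # high nibble
+        # Recombine both nibbles into one [BLOCK_K, BLOCK_N] weight tile
+        w_tile = unpack(w_lo, w_hi)   # placeholder: depends on packing layout
+        # Matmul tile (tensor core)
+        acc += tl.dot(a_tile, w_tile)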
+  Result: 4x less weight traffic than BF16, with the same numerical result as
+  dequantizing in a separate step.
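The mask-and-shift unpacking in the code path can be checked on the CPU. The sketch below packs two 4-bit codes per byte and recovers them as the pseudocode does. Like the pseudocode, it treats each nibble as an unsigned integer times a scale, whereas a real FP4 format would map each code through a value table. Both function names are hypothetical.

```python
def pack_nibbles(codes):
    """Pack 4-bit codes (0..15) two per byte: even index -> low nibble."""
    if len(codes) % 2:
        raise ValueError("need an even number of codes")
    return bytes((hi << 4) | lo for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_and_scale(packed, scale):
    """Recover both values from each byte with the same mask/shift as the
    kernel pseudocode, then apply the scale."""
    out = []
    for byte in packed:
        out.append((byte & 0xF) * scale)   # low nibble
        out.append((byte >> 4) * scale)    # high nibble
    return out


codes = [3, 12, 0, 15]
packed = pack_nibbles(codes)
print(len(packed), unpack_and_scale(packed, 0.5))
```

Four codes occupy two bytes, which is the source of the 4x saving over 16-bit weights.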
+Packed MoE:
+  Standard approach: Loop over 8 active experts, two matmuls each (gate_up +
+  down) = 16 kernel launches per layer.
+  Goliath Packed MoE: All 128 experts packed into one [128, K//2, N] buffer.
+  Single kernel launch reads expert_id from GPU tensor, indexes into buffer.
+  Result: 2 kernel launches per layer (gate_up + down), expert selection
+  stays entirely on GPU.
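The packed layout amounts to indexing one flat buffer by expert ID. The CPU sketch below shows top-k routing followed by that indexing. It uses unquantized floats and one matrix per expert for brevity, and renormalizing the top-k gate weights is an assumption. Both function names are hypothetical.

```python
import math


def top_k_route(logits, k):
    """Softmax over expert logits, then keep the k largest and renormalize
    (renormalization is an assumption, not confirmed by the README)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def packed_expert_forward(x, packed, in_dim, out_dim, routes):
    """Weighted sum of expert outputs, reading each expert's row-major matrix
    from one flat buffer at offset expert_id * in_dim * out_dim."""
    y = [0.0] * out_dim
    stride = in_dim * out_dim
    for expert_id, gate in routes:
        base = expert_id * stride
        for j in range(out_dim):
            dot = sum(x[i] * packed[base + i * out_dim + j] for i in range(in_dim))
            y[j] += gate * dot
    return y


# Four toy experts, each a scaled 2x2 identity, stored back to back.
IN_DIM = OUT_DIM = 2
packed = []
for e in range(4):
    packed += [e + 1.0, 0.0, 0.0, e + 1.0]
routes = top_k_route([0.0, 2.0, 2.0, -1.0], k=2)
print(routes, packed_expert_forward([1.0, 2.0], packed, IN_DIM, OUT_DIM, routes))
```

Because the expert ID is only used as an offset, a GPU kernel can read it from a tensor without the host ever seeing the routing decision.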
+================================================================================
+ HEBBIAN MEMORY (What Makes It Smart)
+================================================================================
+Standard LLMs: Frozen weights after training. Context window is the only memory.
+FireEcho Hebbian Memory:
+  - Fast weights that update DURING inference (no backpropagation)
+  - Inspired by biological synaptic plasticity (Hebb's rule: "neurons that
+    fire together wire together")
+  - Stores patterns from the current conversation
+  - Retrieves relevant patterns to augment generation
+How it works:
+  1. Input token embedding is projected to query/key/value
+  2. Query matches against stored memory slots (competitive retrieval)
+  3. Top-K most relevant memories are retrieved
+  4. Retrieved context is mixed with transformer hidden state
+  5. Memory slots are updated via Hebbian learning rule
+  Updates use:
+    - Soft competitive learning (winner-take-most)
+    - Three-factor STDP (spike-timing dependent plasticity)
+    - Intrinsic plasticity (per-slot gain adaptation)
+    - PMI correction (pointwise mutual information bias)
+    - GHA decorrelation (prevent redundant memories)
+    - Kappa switching (amplified encoding for novel patterns)
+  Enable:
+    engine.enable_hebbian()
+  The memory persists within a session and can be saved/loaded:
+    engine.save_persistent_memory("memory.pt")
+    engine.load_persistent_memory("memory.pt")
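The retrieve-then-update cycle described above can be sketched without gradients. The toy below implements only soft competitive retrieval and a winner-take-most update. It omits STDP, intrinsic plasticity, PMI correction, GHA decorrelation and kappa switching, and it is not the code in triton_hebbian.py. All names are hypothetical.

```python
import math


def soft_competition(slots, x, temperature=1.0):
    """Responsibility of each memory slot for input x (softmax over dot products)."""
    scores = [sum(s * xi for s, xi in zip(slot, x)) / temperature for slot in slots]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hebbian_step(slots, x, lr=0.5):
    """Retrieve a blended memory, then move each slot toward x in proportion
    to its responsibility (winner-take-most). No backpropagation is involved."""
    resp = soft_competition(slots, x)
    retrieved = [sum(r * slot[d] for r, slot in zip(resp, slots)) for d in range(len(x))]
    for r, slot in zip(resp, slots):
        for d in range(len(x)):
            slot[d] += lr * r * (x[d] - slot[d])
    return retrieved, resp


memory = [[1.0, 0.0], [0.0, 1.0]]
retrieved, resp = hebbian_step(memory, [2.0, 0.0])
print(resp, memory)
```

The slot that best matches the input takes most of the update, so repeated inputs sharpen one slot and leave the rest largely untouched.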
+================================================================================
+ COMPRESSION STACK (Why 30B Fits in 20 GB)
+================================================================================
+  Level    Format     Bits   Compression   Quality        Used For
+  ──────   ─────────  ────   ───────────   ────────────   ────────────────
+  Base     BF16       16     1x            Perfect        Attention Q/K/V/O
+  Hot      Goliath    4      4x            Near-perfect   Active MoE experts
+           FP4
+  Cold     FE-XC      2      8x            Very good      Rarely-used experts
+                                           (codebook)
+  Coldest  INT2       2      8x            Acceptable     Least-used experts
+                                           (scalar)
+  Combined with MoE sparsity (8/128 active = 6.25%):
+    Effective model size per token:
+      Attention: 8 × (4 projections × 2048 × 128 × 2 bytes) = 16 MB
+      MoE: 8 experts × 3 projections × 768 × 2048 × 0.5 bytes = 18.9 MB
+      Other: embeddings, norms, router = ~13 MB
+      Total per token: ~48 MB
+    RTX 5090 bandwidth: 1.79 TB/s
+    Bandwidth ceiling for 48 MB/token: 1,790,000 / 48 = ~37,300 tok/s (memory-bound limit)
+    Measured: ~45 tok/s. The 48 MB figure undercounts real traffic (the attention
+    and MoE terms appear to be per-layer sizes and the model has 48 layers), and
+    each step also pays kernel-launch and latency overhead.
+  With FE-XC/INT2 cold experts replacing 80%+ of inactive expert weights:
+    MoE bandwidth: 18.9 MB * 0.5 = ~10 MB (best case: every routed expert read at 2-bit)
+    Total per token: ~39 MB
+    Projected from the measured rate: 45 * 48/39 = ~55 tok/s
+  With EAGLE-3 (70% acceptance, K=5 draft):
+    Projected throughput: 55 * 3.5 (average accepted tokens per verify) = ~193 tok/s
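The EAGLE-3 projection above depends on how many tokens one verify pass emits on average. The sketch below is a simple estimate that assumes each drafted token is accepted independently with the same probability and that drafting stops at the first rejection. This is a common simplification, not a measurement from FireEcho.

```python
def expected_tokens_per_verify(accept_rate, draft_len):
    """Expected tokens emitted per target forward pass when each drafted token
    is accepted independently with probability accept_rate and drafting stops
    at the first rejection. The +1 is the token the target model supplies itself."""
    return 1.0 + sum(accept_rate ** i for i in range(1, draft_len + 1))


for p in (0.6, 0.7, 0.8):
    print(p, round(expected_tokens_per_verify(p, 5), 2))
```

Under this model a 70% acceptance rate with K=5 gives about 2.9 tokens per verify, and 3.5 tokens per verify corresponds to an acceptance rate between 0.7 and 0.8. The estimate also ignores the cost of running the draft head, so treat the projected throughput as an upper bound.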
+================================================================================
+ EAGLE-3 SPECULATIVE DECODING
+================================================================================
+EAGLE-3 is a draft-then-verify acceleration technique:
+  Normal decode: 1 token per forward pass through 48 MoE layers
+  EAGLE-3: Draft head predicts 5 tokens cheaply, target model verifies all 6
+            in one forward pass. If 4/5 match -> 5 tokens for the cost of ~2.
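The accept-or-roll-back rule can be written in a few lines for greedy decoding. This is a sketch of the general draft-then-verify idea with hypothetical names, not the engine's `speculative_generate`.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification. target_tokens[i] is the target model's choice at
    position i given the prefix plus draft_tokens[:i]; it has one more entry
    than draft_tokens. Returns the tokens to emit for this step."""
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First mismatch: keep the target's token, discard the rest
            accepted.append(wanted)
            return accepted
        accepted.append(drafted)
    # Every draft matched: the extra target token comes for free
    accepted.append(target_tokens[len(draft_tokens)])
    return accepted


# 4 of 5 drafts match -> 5 tokens emitted from one verify pass
print(verify_draft([1, 2, 3, 4, 5], [1, 2, 3, 4, 9, 7]))
```

Every step emits at least one token, so in the worst case a greedy speculative decoder produces the same output as normal decoding, only with wasted draft work.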
+  Architecture of draft head:
+    - Takes hidden states from layers 8, 24, 47 + token embedding
+    - Compresses via FC layer (8192 -> 2048)
+    - Passes through D transformer layers (D=2 to D=50)
+    - Shares lm_head with target model
+    - Total params: 115M (D=2) to 2.12B (D=50)
+  Training:
+    python train_eagle_head.py \
+        --offline \
+        --num_head_layers 50 \
+        --draft_depth 5 \
+        --lr 5e-4 \
+        --epochs 5 \
+        --loss_type ce \
+        --batch_positions \
+        --use_quantum_linear \
+        --compile
+    Flags:
+      --offline             Use precomputed hidden states
+      --num_head_layers 50  D=50 layers
+      --draft_depth 5       K=5 draft steps
+      --lr 5e-4             Learning rate
+      --epochs 5            Training epochs
+      --loss_type ce        Cross-entropy loss
+      --batch_positions     Batched M=64 (10x faster)
+      --use_quantum_linear  Goliath FP8 forward + Quantum Gold backward
+      --compile             torch.compile the head
+  Usage after training:
+    engine.enable_eagle("eagle_checkpoints/eagle_best.pt")
+    output = engine.speculative_generate(input_ids, max_new_tokens=500)
+================================================================================
+ SPEED OPTIMIZATION HISTORY
+================================================================================
+  Step  Optimization                              tok/s   Speedup
+  ────  ────────────────────────────────────────  ──────  ───────
+  0     Baseline (128-expert Python loop)           0.4   1x
+  1     Grouped dispatch + TF32 + Triton autotune   7.7   19x
+  2     Fused gate_up_proj (2->1 matmul/expert)     9.5   24x
+  3     Single-token decode fast path              12.6   32x
+  4     Multi-expert Goliath kernel (2 launches)   18.8   47x
+  5     Packed MoE (contiguous buffer, GPU IDs)    30.8   77x
+  6     Flat decode KV cache (zero torch.cat)      40.9   102x
+  7     CUDA Graph + FlashDecode                   49.4   124x
+  Where the time goes at 45 tok/s (22ms per token):
+    Attention (FlashDecode):  0.28ms/layer x 48 = 13.4ms (61%)
+    MoE (Goliath FP4):        0.17ms/layer x 48 =  8.2ms (37%)
+    Other (norms, router):                         0.4ms  (2%)
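The breakdown above is plain arithmetic over per-layer kernel times, and the sketch below reproduces it. The per-layer figures come from this README; `decode_budget` is a hypothetical helper.

```python
def decode_budget(per_layer_ms, num_layers, other_ms):
    """Sum per-layer kernel times into per-token latency, tok/s and shares."""
    parts = {name: ms * num_layers for name, ms in per_layer_ms.items()}
    parts["other"] = other_ms
    total_ms = sum(parts.values())
    shares = {name: ms / total_ms for name, ms in parts.items()}
    return total_ms, 1000.0 / total_ms, shares


total_ms, tok_s, shares = decode_budget({"attention": 0.28, "moe": 0.17}, 48, 0.4)
print(f"{total_ms:.1f} ms/token, {tok_s:.1f} tok/s")
for name, share in shares.items():
    print(f"  {name}: {share:.0%}")
```

Attention takes the larger share, so speeding up only the MoE kernels is bounded by the attention time. Try lowering one entry to see how much the total moves.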
+================================================================================
+ KNOWN LIMITATIONS & GOTCHAS
+================================================================================
+  - Single-GPU only (by design — multi-GPU adds complexity for marginal gain)
+  - Minimum 24 GB VRAM (model alone is 20 GB)
+  - FP4 quantization has ~0.05-0.15 relative error vs BF16 (negligible in practice)
+  - First 10+ forward passes are slow (Triton kernel compilation/autotuning)
+  - CUDA Graph capture requires fixed tensor shapes (only decode, not prefill)
+  - Hebbian memory adds ~50 MB VRAM and slight latency
+  - FE-XC codebook learning takes 1-2 minutes on first enable
+  - No pip package yet (source install only)
+  - Tested primarily on RTX 5090 — other GPUs may need Triton autotune re-run
+  - MoDES expert skipping can hurt quality if threshold is too aggressive
+================================================================================
+ TROUBLESHOOTING
+================================================================================
+  Problem: "CUDA out of memory"
+  Fix: Check nvidia-smi for other processes using VRAM. Kill them.
+       Or reduce max_kv_blocks in config (default 256 = 4K tokens = 3.1 GB).
+  Problem: Very slow first few generations
+  Fix: Normal — Triton is compiling and autotuning kernels. Wait ~10 forward
+       passes for warmup. Subsequent runs use cached kernels.
+  Problem: "No module named 'triton'"
+  Fix: pip install triton (requires CUDA toolkit installed)
+  Problem: "RuntimeError: Triton compilation failed"
+  Fix: Check CUDA version matches PyTorch: python -c "import torch; print(torch.version.cuda)"
+       Triton 3.0+ needs CUDA 12.0+.
+  Problem: NaN in output
+  Fix: Update to the latest code. Older versions produced NaN on prefill with
+       >20 tokens (the packed MoE kernel needs a 3D grid); this bug is fixed.
+  Problem: CUDA Graph capture crashes
+  Fix: Update to the latest code. Atlas .item() calls conflict with graph capture;
+       the engine now skips them automatically during capture.
+================================================================================
+ RESEARCH PAPERS & REFERENCES
+================================================================================
+  FireEcho builds on ideas from:
+  Quantization:
+    - AQLM (arxiv 2401.06118): Additive quantization for LLMs -> FE-XC codebook
+    - VPTQ (Hessian-weighted): Second-order optimal codebooks -> FE-XVQ
+    - FP4 Training (arxiv 2501.17116): Gradient flow through FP4
+  Speculative Decoding:
+    - EAGLE-3 (Li et al.): Draft-then-verify with shared lm_head
+    - Scylla (arxiv 2505.07858): Tree-based multi-candidate verification -> FE-XT
+    - Medusa: Multi-head parallel drafting
+  MoE Optimization:
+    - SP-MoE (arxiv 2510.10302): Async expert prefetch -> FE-H Hayabusa
+    - MoE-Inference-Bench: Expert sizing analysis
+  Hebbian/Neuroscience:
+    - Lansner BCPNN: Bayesian confidence propagation neural networks
+    - Triesch 2005: Intrinsic plasticity
+    - Sanger's GHA: Generalized Hebbian algorithm
+    - McClelland et al. 1995: Complementary learning systems
+  Tensor Decomposition:
+    - MPS/TT decomposition: Quantum-inspired weight compression
+================================================================================
+ WHERE TO GET HELP
+================================================================================
+  GitHub Issues: https://github.com/Joysulem/FireEcho/issues
+    Include: GPU model, CUDA version, PyTorch version, full error traceback
+  X / Twitter: @Joysulem
+    Tag me with questions, benchmarks, or usage reports
+  Email: floresluise1988@gmail.com
+================================================================================
+ LICENSE
+================================================================================
+  Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
+  You are free to:
+    - Share: copy and redistribute the material in any medium or format
+    - Adapt: remix, transform, and build upon the material
+  Under the following terms:
+    - Attribution: You must give appropriate credit to Luis E. Davila Flores,
+      provide a link to the license, and indicate if changes were made.
+    - NonCommercial: You may not use the material for commercial purposes.
+  Full license: https://creativecommons.org/licenses/by-nc/4.0/
+  For commercial licensing inquiries, contact: @Joysulem on X/Twitter
+================================================================================
+  FireEcho Engine — Created by Luis E. Davila Flores
+  "One GPU. One file. One import. Full pipeline."
+================================================================================