Create Bible Readme
================================================================================
FIREECHO ENGINE
High-Performance Single-GPU Inference Kernel for 30B+ MoE Models
================================================================================

Creator & Sole Author: Luis E. Davila Flores (@Joysulem)
License: CC BY-NC 4.0 (free for research, attribution required)
Status: Production-quality single-GPU decode, research extensions active

================================================================================
WHAT IS FIREECHO?
================================================================================

FireEcho is a custom inference engine that runs a 30 BILLION parameter
Mixture-of-Experts model (Qwen3-Omni-30B) on a SINGLE consumer GPU at
45+ tokens/second, using only 20 GB VRAM.

No multi-GPU. No cloud. No NVIDIA proprietary libraries.
Just Triton + PyTorch + one GPU.

Key numbers:
  - 30.5B total params, ~3.3B active per token (128 experts, top-8 routing)
  - 4x compression of expert weights via Goliath FP4 fused dequant-matmul
    (61 GB -> 20 GB overall)
  - 124x speedup from baseline (0.4 -> 49.4 tok/s) through 7 optimization layers
  - Zero NVIDIA proprietary dependencies (no cuQuantizer, CUTLASS, TensorRT)
  - Runs anywhere Triton compiles: NVIDIA CUDA, AMD ROCm, Intel XPU

What makes FireEcho different from vLLM/TGI/llama.cpp:
  - Goliath kernel: FP4 dequantization INSIDE the Triton matmul loop (no separate
    dequant step, no global memory materialization)
  - Packed MoE: All 128 experts packed into one contiguous buffer per layer,
    expert IDs stay on GPU, so there is zero CPU-GPU sync during decode
  - FlashDecode: Custom Triton attention kernel with online softmax for M=1 GQA
  - Hebbian Memory: Biologically-inspired fast weights that learn at inference time
  - FE-XC/INT2: Cold experts auto-demote to 2-bit (codebook or scalar) for further
    bandwidth savings without touching hot experts
  - CUDA Graph decode: Entire decode step captured as a graph, ~15.8ms/step

================================================================================
CURRENT STATUS & REALISTIC EXPECTATIONS
================================================================================

WHAT WORKS (production-quality):
  [x] Full Qwen3-Omni-30B inference at 45+ tok/s on RTX 5090
  [x] Goliath FP4 quantization (20 GB VRAM, FP16-quality output)
  [x] Packed MoE with fused dequant-matmul (zero CPU sync)
  [x] FlashDecode attention (online softmax, valid_len masking)
  [x] CUDA Graph decode (graph-captured forward pass)
  [x] Flat KV cache (pre-allocated, zero allocation per token)
  [x] FP8 KV cache (50% VRAM savings on attention)
  [x] FE-XC cold expert demotion (codebook 2-bit, 5.3x faster kernel)
  [x] INT2 ultra-cold expert demotion (scalar 2-bit)
  [x] Hebbian persistent memory (learns during inference)
  [x] Atlas gatekeeper (expert banning + MoDES skipping)
  [x] Streaming shard loader (110s cold start, 3.1 GB CPU RAM)

WHAT'S RESEARCH/EXPERIMENTAL:
  [ ] EAGLE-3 speculative decoding (infrastructure done, head needs training)
  [ ] FE-XT tree speculation (code done, needs trained draft head)
  [ ] FE-H Hayabusa async prefetch (code done, needs benchmarking)
  [ ] Batched speculative decode (infrastructure done)
  [ ] Multi-GPU (not implemented; single-GPU is the design philosophy)

WILL NOT WORK ON:
  - GPUs with < 24 GB VRAM (model is 20 GB + KV cache)
  - CUDA < 12.4 (BF16 atomics, FP8 support needed)
  - CPU-only (Triton compiles to GPU targets)

================================================================================
HARDWARE & SOFTWARE REQUIREMENTS
================================================================================

Component     Minimum                  Recommended
───────────── ──────────────────────── ───────────────────────────
GPU           RTX 4090 (24 GB)*        RTX 5090 (32 GB)
GPU VRAM      24 GB                    32 GB
CPU           Any modern x86_64        Ryzen 9 9950X / i9-14900K
System RAM    32 GB                    64 GB
CUDA          12.4+                    12.8+
Python        3.10 - 3.12              3.12
PyTorch       2.4.0+                   2.6.0+cu128
Triton        3.0+                     3.2+
OS            Linux (x86_64)           Arch Linux / Ubuntu 22.04+

* RTX 4090: Will work but FP4 kernels may be slower (no Blackwell tensor cores)
* RTX 3090: Marginal; 24 GB VRAM is tight, FP8 not supported
* AMD GPUs: Triton compiles to ROCm; untested but architecturally supported

Tested configuration (author's machine):
  AMD Ryzen 9 9950X + NVIDIA RTX 5090 32 GB + 64 GB DDR5
  Arch Linux, CUDA 12.8, Python 3.12, PyTorch 2.6.0+cu128, Triton 3.2

================================================================================
INSTALLATION
================================================================================

Step 1: Clone the repository
────────────────────────────
git clone https://github.com/Joysulem/FireEcho.git
cd FireEcho

Step 2: Create a Python virtual environment
───────────────────────────────────────────
python3.12 -m venv .venv
source .venv/bin/activate

Step 3: Install dependencies
────────────────────────────
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install triton transformers tokenizers safetensors sentencepiece

Step 4: Verify installation
───────────────────────────
python -c "import torch; print('CUDA:', torch.cuda.is_available(), '|', torch.version.cuda)"
python -c "import triton; print('Triton:', triton.__version__)"

Expected output:
  CUDA: True | 12.8
  Triton: 3.2.0
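
The two checks in Step 4 can also be folded into one script that compares the
reported versions against the minimums from the requirements table (CUDA 12.4,
Triton 3.0). This is a minimal sketch and not part of the FireEcho repository:
the helper names parse_version, meets_minimum, and check_environment are
hypothetical.

```python
import importlib.util


def parse_version(text):
    """Turn '12.8' or '2.6.0+cu128' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(found, minimum):
    """True if version string `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


def check_environment():
    """Print a short report; torch and triton are imported only if installed."""
    if importlib.util.find_spec("torch") is None:
        print("torch: not installed")
    else:
        import torch
        cuda = torch.version.cuda  # None on CPU-only builds
        ok = cuda is not None and meets_minimum(cuda, "12.4")
        print("CUDA:", torch.cuda.is_available(), "|", cuda, "|", "ok" if ok else "needs 12.4+")
    if importlib.util.find_spec("triton") is None:
        print("triton: not installed")
    else:
        import triton
        ok = meets_minimum(triton.__version__, "3.0")
        print("Triton:", triton.__version__, "|", "ok" if ok else "needs 3.0+")


if __name__ == "__main__":
    check_environment()
```

Comparing parsed tuples rather than raw strings avoids the usual pitfall where
"12.10" sorts before "12.4" as text.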

Step 5: Download a model (Qwen3-Omni-30B recommended)
─────────────────────────────────────────────────────
# Option A: Via huggingface-cli
pip install huggingface-hub
huggingface-cli download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./model/Qwen3-Omni

# Option B: Via git lfs
git lfs install
git clone https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct ./model/Qwen3-Omni

================================================================================
QUICK SMOKE TEST (run this first!)
================================================================================

cd FireEcho/kernel/FireEcho\ Engine/

python -c "
from fireecho_kernel import FireEchoEngine
import torch

# Load model (takes ~110 seconds, streams layer-by-layer).
# The path is relative to the current directory; point it at the
# directory where Step 5 saved the model.
engine = FireEchoEngine.from_pretrained('./model/Qwen3-Omni')

# Quick generation test
tokens = engine.tokenizer.encode('The capital of France is', return_tensors='pt').cuda()
output = engine.generate(tokens, max_new_tokens=20, temperature=0.0)
print(engine.tokenizer.decode(output[0]))
print(f'VRAM used: {torch.cuda.max_memory_allocated()/1e9:.1f} GB')
"

Expected output:
  The capital of France is Paris. Paris is the largest city in France...
  VRAM used: 23.1 GB

If this works, your setup is correct. If not, check:
  - CUDA version matches PyTorch build (torch.version.cuda)
  - GPU has enough VRAM (nvidia-smi)
  - Model path is correct

================================================================================
BASIC INFERENCE USAGE
================================================================================

─── Minimal example ───

from fireecho_kernel import FireEchoEngine

# Load model with FP4 quantization (automatic for Qwen3-Omni)
engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")

# Encode input
input_ids = engine.tokenizer.encode(
    "Explain quantum computing in simple terms",
    return_tensors='pt'
).cuda()

# Generate
output = engine.generate(
    input_ids,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    use_cache=True
)

# Decode and print
print(engine.tokenizer.decode(output[0], skip_special_tokens=True))


─── High-performance decode (all optimizations) ───

engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")

# Enable flat KV cache (eliminates torch.cat overhead)
engine.enable_flat_decode()                  # +403 MB VRAM, BF16 KV

# Alternative to the call above: FP8 KV cache (half the VRAM, similar speed)
engine.enable_flat_decode(kv_dtype='fp8')    # +208 MB VRAM

# Enable CUDA Graph decode (captures forward pass as graph)
engine.enable_cuda_graph_decode()            # +0 VRAM, ~10% faster

# Enable Atlas gatekeeper (prunes cold experts at runtime)
engine.enable_atlas(
    profile_prompts=8,
    ban_pct=0.25,           # Ban bottom 25% of experts per layer
    modes_threshold=2.0     # MoDES: skip MoE for uncertain tokens
)

# Enable FE-XC cold expert demotion (2-bit codebook)
engine.enable_auto_fexc_demotion(cold_threshold=0.10)

# Enable INT2 ultra-cold expert demotion
engine.enable_auto_int2_demotion(cold_threshold=0.05)

# Generate with everything enabled (input_ids encoded as in the minimal example)
output = engine.generate(input_ids, max_new_tokens=500)
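
To illustrate what a setting like ban_pct=0.25 could mean, the sketch below
ranks experts by how often profiling prompts routed to them, bans the least-used
quarter, and then routes only among the rest. This is an assumption about the
idea, not the Atlas implementation: ban_cold_experts, route_top_k, and the usage
counts are hypothetical.

```python
def ban_cold_experts(usage_counts, ban_pct=0.25):
    """Return the set of expert ids in the bottom `ban_pct` by usage count."""
    num_banned = int(len(usage_counts) * ban_pct)
    # Sort by (count, id) so ties are broken deterministically.
    ranked = sorted(range(len(usage_counts)), key=lambda e: (usage_counts[e], e))
    return set(ranked[:num_banned])


def route_top_k(scores, banned, k):
    """Pick the k highest-scoring experts, skipping banned ones."""
    allowed = [e for e in range(len(scores)) if e not in banned]
    return sorted(allowed, key=lambda e: scores[e], reverse=True)[:k]


# Hypothetical profiling result for 8 experts in one layer.
counts = [40, 3, 25, 0, 18, 7, 31, 12]
banned = ban_cold_experts(counts, ban_pct=0.25)   # 2 of 8 experts

# Router scores for one token; experts 1 and 3 score highest but are banned.
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.6, 0.4]
chosen = route_top_k(scores, banned, k=2)
print("banned:", sorted(banned), "chosen:", chosen)
```

The example also shows the quality risk: a banned expert can be the router's
first choice for a given token, which is why the ban fraction deserves care.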


─── Interactive chat loop ───

engine = FireEchoEngine.from_pretrained("/path/to/Qwen3-Omni-30B")
engine.enable_flat_decode()
engine.enable_cuda_graph_decode()

print("FireEcho Chat (type 'quit' to exit)")
while True:
    user_input = input("\nYou: ")
    if user_input.lower() == 'quit':
        break

    # Format as chat (Qwen3 format)
    prompt = f"<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant\n"
    input_ids = engine.tokenizer.encode(prompt, return_tensors='pt').cuda()

    output = engine.generate(
        input_ids,
        max_new_tokens=500,
        temperature=0.7,
        top_p=0.9
    )

    # Decode only the newly generated tokens, not the prompt
    response = engine.tokenizer.decode(
        output[0][input_ids.shape[1]:],
        skip_special_tokens=True
    )
    print(f"\nFireEcho: {response}")

================================================================================
BENCHMARKING
================================================================================

─── Quick speed test ───

python benchmark_fullstack.py

This runs 7 optimization layers, stacking each one:
  L0: Baseline (FP4 + packed MoE + flat KV BF16)    ~45 tok/s
  L1: + FP8 KV cache                                ~42 tok/s
  L2: + L2 layer prefetch                           ~42 tok/s
  L3: + Atlas Ban & Pick (8->~5 experts)            ~40 tok/s
  L4: + FE-XC cold experts (2-bit codebook)         ~39 tok/s
  L5: + INT2 coldest experts (2-bit scalar)         ~38 tok/s
  L6: + CUDA Graph decode                           ~TBD

Note: L1-L5 are slightly slower than L0 due to overhead from
additional dispatch logic. The REAL benefit comes when combined
with speculative decoding (EAGLE-3): the bandwidth savings from
FE-XC/INT2 allow more tokens to be verified per unit time.
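
For ad-hoc measurements outside benchmark_fullstack.py, a tok/s figure can be
taken with a small timing helper. The sketch below is generic and makes no
claim about how the repository's benchmark is written: measure_tok_per_s and
fake_generate are hypothetical, and fake_generate merely sleeps to stand in for
a real call such as engine.generate. Warmup runs are excluded because the first
passes include Triton compilation and autotuning.

```python
import time


def measure_tok_per_s(generate_fn, num_tokens, warmup_runs=2, timed_runs=3):
    """Time `generate_fn(num_tokens)` and return tokens per second.

    `generate_fn` must return the number of tokens it produced. On a GPU,
    it should also block until the work is finished (for example by calling
    torch.cuda.synchronize()), otherwise only launch time is measured.
    """
    for _ in range(warmup_runs):
        generate_fn(num_tokens)          # not timed: compilation, autotuning
    produced = 0
    start = time.perf_counter()
    for _ in range(timed_runs):
        produced += generate_fn(num_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed


def fake_generate(num_tokens):
    """Stand-in for a real generate call: 0.5 ms per token."""
    time.sleep(0.0005 * num_tokens)
    return num_tokens


rate = measure_tok_per_s(fake_generate, num_tokens=20)
print(f"{rate:.0f} tok/s")
```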


─── EAGLE-3 benchmark (speculative decode) ───

python benchmark_eagle.py --checkpoint eagle_checkpoints/eagle_best.pt

Note: Requires a trained draft head. See "EAGLE-3 Training" section.

================================================================================
FEATURE REFERENCE (Cheat Sheet)
================================================================================

Feature                How to enable                               VRAM cost
────────────────────── ─────────────────────────────────────────── ─────────
Flat KV cache (BF16)   engine.enable_flat_decode()                 +403 MB
Flat KV cache (FP8)    engine.enable_flat_decode(kv_dtype='fp8')   +208 MB
CUDA Graph decode      engine.enable_cuda_graph_decode()           ~0
Atlas gatekeeper       engine.enable_atlas()                       ~0
FE-XC cold demotion    engine.enable_auto_fexc_demotion()          ~0*
INT2 cold demotion     engine.enable_auto_int2_demotion()          ~0*
L2 layer prefetch      engine.enable_l2_prefetch()                 ~0
Hebbian memory         engine.enable_hebbian()                     +50 MB
EAGLE-3 speculation    engine.enable_eagle(checkpoint)             +200 MB

* FE-XC/INT2 actually SAVES VRAM by compressing cold expert weights
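
The +403 MB figure for the BF16 flat KV cache can be reproduced with a short
calculation. The sketch below is a back-of-the-envelope check under stated
assumptions: 48 layers and a 4K-token cache are taken from this README, while
4 KV heads and a head dimension of 128 are assumed from the published Qwen3
30B-A3B configuration and are not stated here. The function name
kv_cache_bytes is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_tokens, bytes_per_value):
    """Size of a pre-allocated KV cache: one K and one V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * max_tokens * bytes_per_value


# Assumed shape: 48 layers, 4 KV heads (GQA), head_dim 128, 4096 positions.
bf16_bytes = kv_cache_bytes(48, 4, 128, 4096, bytes_per_value=2)
fp8_bytes = kv_cache_bytes(48, 4, 128, 4096, bytes_per_value=1)

print(f"BF16 KV cache: {bf16_bytes / 1e6:.0f} MB")
print(f"FP8 KV cache:  {fp8_bytes / 1e6:.0f} MB")
```

Under these assumptions the BF16 cache comes to about 403 MB, matching the
table. The FP8 result is about 201 MB, slightly below the table's 208 MB; the
difference may be extra state such as scale factors, which this sketch does not
model.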

Quantization formats available:
  - Goliath FP4:   4-bit fused dequant (default for MoE experts)
  - Goliath FP8:   8-bit fused dequant (optional for attention)
  - Goliath INT2:  2-bit scalar quantization (coldest experts)
  - FE-XC:         2-bit codebook (2x8 AQLM-style, near-FP16 quality)
  - FE-XVQ:        Hessian-weighted 2-bit codebook (VPTQ-inspired)
  - FE-MX:         Block floating point (FEMX4/FEMX6/FEMX8 for Hebbian)

================================================================================
HOW THE ENGINE WORKS (Architecture Overview)
================================================================================

FireEcho loads a model and replaces standard PyTorch operations with
custom Triton kernels at every level:

1. LOADING (from_pretrained)
   - Streams model shards one layer at a time (3.1 GB CPU RAM peak)
   - Quantizes each layer to Goliath FP4 on GPU as it loads
   - Packs all 128 MoE experts into contiguous buffers per layer
   - Total: 61 GB BF16 -> 20 GB FP4 in 110 seconds

2. PREFILL (processing the input prompt)
   - Standard attention + MoE forward pass
   - Uses FlashAttention-style Triton kernel for long sequences
   - Builds KV cache for all layers

3. DECODE (generating tokens one at a time)
   - Each token goes through 48 transformer layers:

     For each layer:
       a) RMSNorm
       b) Attention: Q/K/V projection (BF16 matmul) -> RoPE -> FlashDecode
          (custom Triton kernel, M=1, online softmax, reads only valid KV)
       c) RMSNorm
       d) MoE Router: softmax over 128 experts -> top-8 selection
       e) Expert FFN: Goliath FP4 packed matmul (gate_up + down)
          - Hot experts: FP4 (highest quality)
          - Cold experts: FE-XC 2-bit codebook (5.3x faster kernel)
          - Coldest experts: INT2 2-bit scalar
       f) Residual connection

   - With CUDA Graph: entire 48-layer forward captured as one graph
     launch -> ~15.8ms per token

4. SPECULATIVE DECODE (EAGLE-3, when draft head is trained)
   - Draft head predicts next K tokens (K=5 default)
   - Target model verifies all K+1 tokens in one forward pass
   - Accepts matching tokens, rejects and rolls back on mismatch
   - Expected: 3-5x speedup with 70%+ acceptance rate
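
Step 3b relies on online softmax for a single query (M=1). The sketch below
shows that technique in plain Python so the bookkeeping is visible: the KV cache
is read block by block, and a running maximum, running sum, and running weighted
value are rescaled whenever a larger score appears. It is an illustration of the
general algorithm, not the FlashDecode Triton kernel; the function names and the
block size are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference: full softmax over all scores, then a weighted sum of values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def online_softmax_attention(q, keys, values, valid_len, block=2):
    """Single-query attention over the first `valid_len` KV entries, one block at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, valid_len, block):
        end = min(start + block, valid_len)
        scores = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in range(start, end)]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for i, s in zip(range(start, end), scores):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * v for a, v in zip(acc, values[i])]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0, 0.25]
keys = [[1.0, 0.0, 0.5, 2.0], [0.2, 0.3, 3.0, -1.0], [4.0, 1.0, 0.0, 0.0],
        [0.0, -2.0, 1.0, 1.0], [9.0, 9.0, 9.0, 9.0]]   # last entry lies beyond valid_len
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [100.0, 100.0]]
valid_len = 4

streamed = online_softmax_attention(q, keys, values, valid_len)
reference = naive_attention(q, keys[:valid_len], values[:valid_len])
print(streamed, reference)
```

The two results agree up to floating-point rounding, while the streamed version
never holds more than one block of scores at a time and never reads past
valid_len.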

Memory layout during decode:
  ┌──────────────────────────────────────────────────┐
  │ GPU VRAM (32 GB total)                           │
  ├──────────────────────────────────────────────────┤
  │ Model weights (FP4 quantized)        19.6 GB     │
  │ KV cache (flat, FP8)                  0.2 GB     │
  │ Hebbian memory                       0.05 GB     │
  │ CUDA Graph buffers                    0.1 GB     │
  │ Activations + workspace               1.0 GB     │
  │ ──────────────────────────────────────────────── │
  │ Total                               ~21.0 GB     │
  │ Free                                ~11.0 GB     │
  └──────────────────────────────────────────────────┘

================================================================================
FILE STRUCTURE
================================================================================

FireEcho Engine/
├── fireecho_kernel.py          Main engine (9000+ lines)
│                               - FireEchoEngine: load, generate, speculate
│                               - FireEchoConfig: model configuration
│                               - MoEFFN: mixture-of-experts with packed dispatch
│                               - HebbianMemory: biologically-inspired fast weights
│                               - FireEchoEagleHead: EAGLE-3 draft head
│                               - FlashDecode Triton kernel
│                               - CUDA Graph capture/replay
│
├── goliath_kernel.py           Quantized GEMM kernels (3000+ lines)
│                               - GoliathFP4Weights: FP4 fused dequant-matmul
│                               - GoliathFP8Weights: FP8 fused dequant-matmul
│                               - GoliathINT2Weights: INT2 scalar quantization
│                               - GoliathFEXCWeights: FE-XC codebook 2-bit
│                               - GoliathFEXVQWeights: Hessian-weighted codebook
│                               - Packed MoE kernels (FP4, INT2, FE-XC)
│                               - Fused SwiGLU+Down kernel
│                               - GoliathQuantumLinear (training)
│
├── triton_hebbian.py           Fused Triton kernels for Hebbian memory
│                               - fused_competition, fused_soft_hebbian
│                               - fused_traces_update, fused_gate_output
│
├── femx_storage.py             FE-MX block floating point storage
│                               - FEMX2, FEMX4, FEMX6, FEMX8 tiers
│                               - Stochastic rounding, age-adaptive precision
│
├── persistent_memory.py        AGI-like persistent memory
│                               - EpisodicLog: raw experience buffer
│                               - SemanticJournal: compressed knowledge
│                               - ReflectionEngine: self-evaluation
│
├── benchmark_fullstack.py      Full-stack benchmark (L0-L6)
├── benchmark_eagle.py          EAGLE-3 speculative decode benchmark
├── train_eagle_head.py         EAGLE-3 draft head training script
└── calibrate_fexc.py           FE-XC codebook calibration

================================================================================
THE GOLIATH KERNEL (What Makes It Fast)
================================================================================

Standard quantized inference:
  1. Load FP4 weights from VRAM
  2. Dequantize to BF16 in global memory (writes 61 GB!)
  3. Run matmul on the BF16 weights
  Problem: Step 2 doubles memory traffic and VRAM usage

Goliath approach:
  1. Load FP4 weights directly into Triton registers
  2. Dequantize INSIDE the matmul tile loop (in registers, zero global write)
  3. Accumulate in FP32
  Problem: None. This is strictly better.

Code path (simplified):
  for k_block in range(0, K, BLOCK_K):
      # Load FP4 packed bytes (2 values per byte)
      w_packed = tl.load(weight_ptr + offsets)

      # Dequantize in-register
      w_lo = (w_packed & 0xF).to(tl.float32) * scale   # low nibble
      w_hi = (w_packed >> 4).to(tl.float32) * scale    # high nibble

      # Matmul tile (tensor core); w_tile is the tile assembled from
      # w_lo and w_hi (assembly step omitted in this simplified listing)
      acc += tl.dot(a_tile, w_tile)

Result: 4x less memory traffic, same numerical quality.
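
The nibble handling in the listing above can be tried out on the CPU. The
sketch below packs two 4-bit codes per byte (low nibble first) and dequantizes
them on the fly inside a dot product, so no full-precision copy of the weights
is ever stored. It is a plain-Python illustration, not the Goliath kernel: the
E2M1 value table is the common FP4 layout and is an assumption here, since the
README does not spell out the encoding, and all function names are
hypothetical.

```python
# Assumed FP4 (E2M1) magnitudes; codes 8..15 carry the sign bit.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_TABLE = E2M1_VALUES + [-v for v in E2M1_VALUES]


def quantize_block(values):
    """Map floats to the nearest 4-bit code, using one shared scale per block."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    codes = []
    for v in values:
        target = v / scale
        codes.append(min(range(16), key=lambda c: abs(FP4_TABLE[c] - target)))
    return codes, scale


def pack_nibbles(codes):
    """Pack pairs of codes into bytes: even index in the low nibble, odd in the high."""
    assert len(codes) % 2 == 0, "this sketch expects an even number of weights"
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def dequant_dot(activations, packed, scale):
    """Dot product that unpacks and dequantizes each byte as it is read."""
    acc = 0.0
    for i, byte in enumerate(packed):
        w_lo = FP4_TABLE[byte & 0xF] * scale    # low nibble
        w_hi = FP4_TABLE[byte >> 4] * scale     # high nibble
        acc += activations[2 * i] * w_lo + activations[2 * i + 1] * w_hi
    return acc


weights = [0.6, -0.3, 0.1, 0.0]
codes, scale = quantize_block(weights)
packed = pack_nibbles(codes)
result = dequant_dot([1.0, 1.0, 1.0, 1.0], packed, scale)
print(codes, packed.hex(), result)
```

Four weights occupy two bytes here instead of eight in BF16, which is where the
4x reduction in weight traffic comes from.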

Packed MoE:
  Standard approach: Loop over 8 active experts, one matmul each = 16 kernel
  launches per layer (gate_up + down per expert).

  Goliath Packed MoE: All 128 experts packed into one [128, K//2, N] buffer.
  Single kernel launch reads expert_id from GPU tensor, indexes into buffer.
  Result: 2 kernel launches per layer (gate_up + down), expert selection
  stays entirely on GPU.
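
The indexing idea behind the packed buffer can be sketched without a GPU. In
the example below, every expert's matrix lives in one flat list, and an expert
is selected purely by computing an offset from its id, which is what lets a
single kernel serve all experts. The router follows the softmax then top-k
order described earlier; renormalizing the selected weights is an assumption.
All names are hypothetical, and the sizes are tiny for readability.

```python
import math


def top_k_experts(logits, k):
    """Softmax over router logits, keep the k largest, renormalize their weights."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ids = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    norm = sum(probs[e] for e in ids)
    return ids, [probs[e] / norm for e in ids]


def packed_expert_matvec(packed, expert_id, rows, cols, x):
    """Multiply x by one expert's [rows, cols] matrix stored inside the flat buffer."""
    base = expert_id * rows * cols          # offset of this expert's weights
    return [sum(packed[base + r * cols + c] * x[c] for c in range(cols))
            for r in range(rows)]


def moe_forward(packed, logits, x, rows, cols, k):
    """Weighted sum of the top-k experts' outputs, all read from one buffer."""
    ids, weights = top_k_experts(logits, k)
    out = [0.0] * rows
    for expert_id, weight in zip(ids, weights):
        y = packed_expert_matvec(packed, expert_id, rows, cols, x)
        out = [o + weight * v for o, v in zip(out, y)]
    return ids, out


# 4 toy experts with 2x2 weights: expert e is (e + 1) times the identity matrix.
packed = []
for e in range(4):
    packed += [e + 1.0, 0.0, 0.0, e + 1.0]

ids, out = moe_forward(packed, logits=[0.0, 2.0, 1.0, 3.0], x=[1.0, 2.0],
                       rows=2, cols=2, k=2)
print(ids, out)
```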

================================================================================
HEBBIAN MEMORY (What Makes It Smart)
================================================================================

Standard LLMs: Frozen weights after training. Context window is the only memory.

FireEcho Hebbian Memory:
  - Fast weights that update DURING inference (no backpropagation)
  - Inspired by biological synaptic plasticity (Hebb's rule: "neurons that
    fire together wire together")
  - Stores patterns from the current conversation
  - Retrieves relevant patterns to augment generation

How it works:
  1. Input token embedding is projected to query/key/value
  2. Query matches against stored memory slots (competitive retrieval)
  3. Top-K most relevant memories are retrieved
  4. Retrieved context is mixed with transformer hidden state
  5. Memory slots are updated via Hebbian learning rule

Updates use:
  - Soft competitive learning (winner-take-most)
  - Three-factor STDP (spike-timing dependent plasticity)
  - Intrinsic plasticity (per-slot gain adaptation)
  - PMI correction (pointwise mutual information bias)
  - GHA decorrelation (prevent redundant memories)
  - Kappa switching (amplified encoding for novel patterns)
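
The first item in the list above, soft competitive learning, can be shown in a
few lines. In the toy sketch below, each memory slot moves toward an incoming
pattern in proportion to how strongly it responds to that pattern, so the best
match learns the most and no gradients are involved. This covers only that one
rule; STDP, intrinsic plasticity, PMI, GHA, and kappa switching are not
modelled. The class ToyHebbianMemory and its parameters are hypothetical and
are not the HebbianMemory class in fireecho_kernel.py.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyHebbianMemory:
    def __init__(self, slots, lr=0.5, temperature=0.1):
        self.slots = [list(s) for s in slots]
        self.lr = lr
        self.temperature = temperature

    def retrieve(self, query, top_k=1):
        """Return the indices of the top_k slots most similar to the query."""
        sims = [dot(query, s) for s in self.slots]
        return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:top_k]

    def update(self, pattern):
        """Winner-take-most update: slot += lr * responsibility * (pattern - slot)."""
        sims = [dot(pattern, s) / self.temperature for s in self.slots]
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]
        total = sum(exps)
        for slot, e in zip(self.slots, exps):
            share = e / total               # soft competition between slots
            for d in range(len(slot)):
                slot[d] += self.lr * share * (pattern[d] - slot[d])


memory = ToyHebbianMemory(slots=[[0.5, 0.1], [0.0, 0.6]])
for _ in range(5):
    memory.update([1.0, 0.0])               # the same pattern seen five times

print(memory.slots, memory.retrieve([1.0, 0.0]))
```

After a few repetitions the first slot has nearly converged on the pattern while
the second is almost untouched, which is the winner-take-most behaviour.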

Enable:
  engine.enable_hebbian()

The memory persists within a session and can be saved/loaded:
  engine.save_persistent_memory("memory.pt")
  engine.load_persistent_memory("memory.pt")

================================================================================
COMPRESSION STACK (Why 30B Fits in 20 GB)
================================================================================

Level     Format       Bits  Compression  Quality       Used For
───────── ──────────── ────  ───────────  ────────────  ───────────────────
Base      BF16         16    1x           Perfect       Attention Q/K/V/O
Hot       Goliath FP4  4     4x           Near-perfect  Active MoE experts
Cold      FE-XC        2     8x           Very good     Rarely-used experts
          (codebook)
Coldest   INT2         2     8x           Acceptable    Least-used experts
          (scalar)

Combined with MoE sparsity (8/128 active = 6.25%):
  Effective model size per token:
    Attention: 8 × (4 projections × 2048 × 128 × 2 bytes) = 16 MB
    MoE: 8 experts × 3 projections × 768 × 2048 × 0.5 bytes = 18.9 MB
    Other: embeddings, norms, router = ~13 MB
    Total per token: ~48 MB

  RTX 5090 bandwidth: 1.79 TB/s
  Theoretical max: 1,790,000 MB/s / 48 MB = 37,291 tok/s (memory-bandwidth ceiling)
  Measured: ~45 tok/s (memory-bound, current result)

With FE-XC/INT2 cold experts replacing 80%+ of inactive expert weights:
  MoE bandwidth: 18.9 MB * 0.5 (half are 2-bit) = ~10 MB
  Total per token: ~39 MB
  Scaled from the measured 45 tok/s (45 * 48 / 39): ~55 tok/s

With EAGLE-3 (70% acceptance, K=5 draft):
  Effective throughput: 55 * 3.5 (average accepted tokens per verify) = ~193 tok/s
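
The arithmetic in this section can be rechecked with a few lines. The sketch
below only replays the README's own numbers (48 MB and 39 MB of traffic,
1.79 TB/s, 45 tok/s measured, 3.5 accepted tokens per verify); it is not a
performance model, and the variable names are hypothetical. The 45 tok/s figure
is treated as a measurement rather than derived from the bandwidth ceiling.

```python
BANDWIDTH_MB_PER_S = 1_790_000            # 1.79 TB/s expressed in MB/s


def bandwidth_ceiling(traffic_mb):
    """Upper bound on passes per second if `traffic_mb` must be read each time."""
    return BANDWIDTH_MB_PER_S / traffic_mb


# MoE traffic: 8 experts x 3 projections x 768 x 2048 weights at 0.5 bytes each.
moe_mb = 8 * 3 * 768 * 2048 * 0.5 / 1e6

ceiling_fp4 = bandwidth_ceiling(48)       # README total with FP4 experts

measured_tok_s = 45
# If decode stays memory-bound, throughput scales inversely with bytes read.
scaled_tok_s = measured_tok_s * 48 / 39   # README total with 2-bit cold experts

# README estimate: 3.5 tokens accepted per verification pass on average.
eagle_tok_s = 55 * 3.5

print(f"MoE traffic:       {moe_mb:.1f} MB")
print(f"Bandwidth ceiling: {ceiling_fp4:,.0f} per second")
print(f"With cold experts: {scaled_tok_s:.1f} tok/s")
print(f"With EAGLE-3:      {eagle_tok_s:.1f} tok/s")
```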

================================================================================
EAGLE-3 SPECULATIVE DECODING
================================================================================

EAGLE-3 is a draft-then-verify acceleration technique:

  Normal decode: 1 token per forward pass through 48 MoE layers
  EAGLE-3: Draft head predicts 5 tokens cheaply, target model verifies all 6
  in one forward pass. If 4/5 match -> 5 tokens for the cost of ~2.
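
The accept-or-roll-back rule can be written down compactly. The sketch below
shows greedy verification, the simplest form of draft-then-verify: drafted
tokens are kept while they match what the target model would have produced, and
the target's own token is appended at the first mismatch (or as a bonus token if
everything matched). It illustrates the general technique and is not the
engine's speculative_generate code; verify_draft and the token ids are
hypothetical.

```python
def verify_draft(draft_tokens, target_tokens):
    """Return the tokens to emit after one verification pass.

    `target_tokens` holds the target model's own choice at each of the K+1
    positions covered by the pass, so it has len(draft_tokens) + 1 entries.
    """
    assert len(target_tokens) == len(draft_tokens) + 1
    accepted = []
    for i, token in enumerate(draft_tokens):
        if token != target_tokens[i]:
            break                          # first mismatch: discard the rest
        accepted.append(token)
    # The target's token at the stop position is always valid output.
    accepted.append(target_tokens[len(accepted)])
    return accepted


# K=5 draft where the first 4 tokens match and the 5th does not.
draft = [5, 9, 2, 7, 4]
target = [5, 9, 2, 7, 8, 1]
emitted = verify_draft(draft, target)
print(emitted)
```

With 4 of 5 drafted tokens matching, one pass emits 5 tokens, which is the case
described above.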

Architecture of draft head:
  - Takes hidden states from layers 8, 24, 47 + token embedding
  - Compresses via FC layer (8192 -> 2048)
  - Passes through D transformer layers (D=2 to D=50)
  - Shares lm_head with target model
  - Total params: 115M (D=2) to 2.12B (D=50)

Training:
  # Flags, in order: use precomputed hidden states; D=50 head layers;
  # K=5 draft steps; learning rate; training epochs; cross-entropy loss;
  # batched M=64 positions (10x faster); Goliath FP8 forward + Quantum Gold
  # backward; torch.compile the head.
  python train_eagle_head.py \
      --offline \
      --num_head_layers 50 \
      --draft_depth 5 \
      --lr 5e-4 \
      --epochs 5 \
      --loss_type ce \
      --batch_positions \
      --use_quantum_linear \
      --compile

Usage after training:
  engine.enable_eagle("eagle_checkpoints/eagle_best.pt")
  output = engine.speculative_generate(input_ids, max_new_tokens=500)

================================================================================
SPEED OPTIMIZATION HISTORY
================================================================================

Step  Optimization                                tok/s   Speedup
────  ──────────────────────────────────────────  ──────  ───────
0     Baseline (128-expert Python loop)           0.4     1x
1     Grouped dispatch + TF32 + Triton autotune   7.7     19x
2     Fused gate_up_proj (2->1 matmul/expert)     9.5     24x
3     Single-token decode fast path               12.6    32x
4     Multi-expert Goliath kernel (2 launches)    18.8    47x
5     Packed MoE (contiguous buffer, GPU IDs)     30.8    77x
6     Flat decode KV cache (zero torch.cat)       40.9    102x
7     CUDA Graph + FlashDecode                    49.4    124x

Where the time goes at 45 tok/s (22ms per token):
  Attention (FlashDecode):  0.28ms/layer x 48 = 13.4ms (61%)
  MoE (Goliath FP4):        0.17ms/layer x 48 =  8.2ms (37%)
  Other (norms, router):                         0.4ms (2%)

================================================================================
|
| 563 |
+
KNOWN LIMITATIONS & GOTCHAS
|
| 564 |
+
================================================================================
|
| 565 |
+
|
| 566 |
+
- Single-GPU only (by design; multi-GPU adds complexity for marginal gain)
|
| 567 |
+
- Minimum 24 GB VRAM (model alone is 20 GB)
|
| 568 |
+
- FP4 quantization has ~0.05-0.15 relative error vs BF16 (negligible in practice)
|
| 569 |
+
- First 10+ forward passes are slow (Triton kernel compilation/autotuning)
|
| 570 |
+
- CUDA Graph capture requires fixed tensor shapes (only decode, not prefill)
|
| 571 |
+
- Hebbian memory adds ~50 MB VRAM and slight latency
|
| 572 |
+
- FE-XC codebook learning takes 1-2 minutes on first enable
|
| 573 |
+
- No pip package yet (source install only)
|
| 574 |
+
- Tested primarily on RTX 5090; other GPUs may need a Triton autotune re-run
|
| 575 |
+
- MoDES expert skipping can hurt quality if threshold is too aggressive
|
| 576 |
+
|
| 577 |
+
================================================================================
|
| 578 |
+
TROUBLESHOOTING
|
| 579 |
+
================================================================================
|
| 580 |
+
|
| 581 |
+
Problem: "CUDA out of memory"
|
| 582 |
+
Fix: Check nvidia-smi for other processes using VRAM. Kill them.
|
| 583 |
+
Or reduce max_kv_blocks in config (default 256 = 4K tokens = 3.1 GB).
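To pick a smaller max_kv_blocks value, it helps to see what the default figures imply. The sketch below assumes that "4K tokens" means 4096 and that KV cache memory scales linearly with the block count; neither is stated in this README, and the helper name is hypothetical:

```python
DEFAULT_KV_BLOCKS = 256   # default max_kv_blocks quoted in the fix
DEFAULT_TOKENS = 4096     # "4K tokens", assumed to mean 4096
DEFAULT_KV_GB = 3.1       # KV cache size quoted for the default

TOKENS_PER_BLOCK = DEFAULT_TOKENS // DEFAULT_KV_BLOCKS   # 16 tokens per block
GB_PER_BLOCK = DEFAULT_KV_GB / DEFAULT_KV_BLOCKS


def kv_budget(max_kv_blocks: int):
    """Return (context tokens, estimated KV cache GB), assuming linear scaling."""
    return max_kv_blocks * TOKENS_PER_BLOCK, max_kv_blocks * GB_PER_BLOCK


for blocks in (256, 128, 64):
    tokens, gb = kv_budget(blocks)
    print(f"max_kv_blocks={blocks:3d}: {tokens} tokens, ~{gb:.2f} GB")
```

Under these assumptions, halving max_kv_blocks halves both the usable context and the KV cache footprint.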
|
| 584 |
+
|
| 585 |
+
Problem: Very slow first few generations
|
| 586 |
+
Fix: This is normal. Triton is compiling and autotuning kernels. Wait ~10 forward
|
| 587 |
+
passes for warmup. Subsequent runs use cached kernels.
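One way to confirm that warmup has finished is to time a fixed number of calls and watch the per-call latency settle. This is a generic sketch: `warmup` is a hypothetical helper, not a FireEcho API, and `step` stands for any zero-argument callable that runs one short generation and blocks until its output is ready.

```python
import time
from typing import Callable, List


def warmup(step: Callable[[], object], passes: int = 10) -> List[float]:
    """Call `step` `passes` times and return each call's wall-clock seconds."""
    timings = []
    for _ in range(passes):
        start = time.perf_counter()
        step()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; replace with a short generation call on your engine.
    durations = warmup(lambda: sum(range(100_000)), passes=10)
    print(" ".join(f"{d * 1000:.1f}ms" for d in durations))
```

When the last few timings are close to each other, the kernels are compiled and later measurements reflect steady-state speed.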
|
| 588 |
+
|
| 589 |
+
Problem: "No module named 'triton'"
|
| 590 |
+
Fix: pip install triton (requires CUDA toolkit installed)
|
| 591 |
+
|
| 592 |
+
Problem: "RuntimeError: Triton compilation failed"
|
| 593 |
+
Fix: Check CUDA version matches PyTorch: python -c "import torch; print(torch.version.cuda)"
|
| 594 |
+
Triton 3.0+ needs CUDA 12.0+.
|
| 595 |
+
|
| 596 |
+
Problem: NaN in output
|
| 597 |
+
Fix: Check whether prefill is running with >20 tokens (the packed MoE kernel needs a 3D grid).
|
| 598 |
+
This bug has been fixed; update to the latest code.
|
| 599 |
+
|
| 600 |
+
Problem: CUDA Graph capture crashes
|
| 601 |
+
Fix: Atlas .item() calls conflict with graph capture. The engine auto-skips
|
| 602 |
+
these during capture (fixed). Update to latest code.
|
| 603 |
+
|
| 604 |
+
================================================================================
|
| 605 |
+
RESEARCH PAPERS & REFERENCES
|
| 606 |
+
================================================================================
|
| 607 |
+
|
| 608 |
+
FireEcho builds on ideas from:
|
| 609 |
+
|
| 610 |
+
Quantization:
|
| 611 |
+
- AQLM (arxiv 2401.06118): Additive quantization for LLMs -> FE-XC codebook
|
| 612 |
+
- VPTQ (Hessian-weighted): Second-order optimal codebooks -> FE-XVQ
|
| 613 |
+
- FP4 Training (arxiv 2501.17116): Gradient flow through FP4
|
| 614 |
+
|
| 615 |
+
Speculative Decoding:
|
| 616 |
+
- EAGLE-3 (Li et al.): Draft-then-verify with shared lm_head
|
| 617 |
+
- Scylla (arxiv 2505.07858): Tree-based multi-candidate verification -> FE-XT
|
| 618 |
+
- Medusa: Multi-head parallel drafting
|
| 619 |
+
|
| 620 |
+
MoE Optimization:
|
| 621 |
+
- SP-MoE (arxiv 2510.10302): Async expert prefetch -> FE-H Hayabusa
|
| 622 |
+
- MoE-Inference-Bench: Expert sizing analysis
|
| 623 |
+
|
| 624 |
+
Hebbian/Neuroscience:
|
| 625 |
+
- Lansner BCPNN: Bayesian confidence propagation neural networks
|
| 626 |
+
- Triesch 2005: Intrinsic plasticity
|
| 627 |
+
- Sanger's GHA: Generalized Hebbian algorithm
|
| 628 |
+
- McClelland et al. 1995: Complementary learning systems
|
| 629 |
+
|
| 630 |
+
Tensor Decomposition:
|
| 631 |
+
- MPS/TT decomposition: Quantum-inspired weight compression
|
| 632 |
+
|
| 633 |
+
================================================================================
|
| 634 |
+
WHERE TO GET HELP
|
| 635 |
+
================================================================================
|
| 636 |
+
|
| 637 |
+
GitHub Issues: https://github.com/Joysulem/FireEcho/issues
|
| 638 |
+
Include: GPU model, CUDA version, PyTorch version, full error traceback
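The version details requested above can be gathered with a short script. This is a sketch rather than a FireEcho utility: `collect_env` is a hypothetical helper that uses only standard PyTorch attributes and degrades gracefully when PyTorch is not installed. The error traceback still has to be copied by hand.

```python
import platform


def collect_env() -> dict:
    """Gather the version details requested for a bug report."""
    info = {
        "python": platform.python_version(),
        "pytorch": "not installed",
        "cuda": "unknown",
        "gpu": "unknown",
    }
    try:
        import torch
    except ImportError:
        return info
    info["pytorch"] = str(torch.__version__)
    info["cuda"] = str(torch.version.cuda)  # "None" on CPU-only builds
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    else:
        info["gpu"] = "no CUDA device visible"
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Paste the printed lines at the top of the issue, followed by the full traceback.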
|
| 639 |
+
|
| 640 |
+
X / Twitter: @Joysulem
|
| 641 |
+
Tag me with questions, benchmarks, or usage reports
|
| 642 |
+
|
| 643 |
+
Email: floresluise1988@gmail.com
|
| 644 |
+
|
| 645 |
+
================================================================================
|
| 646 |
+
LICENSE
|
| 647 |
+
================================================================================
|
| 648 |
+
|
| 649 |
+
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
|
| 650 |
+
|
| 651 |
+
You are free to:
|
| 652 |
+
- Share: copy and redistribute the material in any medium or format
|
| 653 |
+
- Adapt: remix, transform, and build upon the material
|
| 654 |
+
|
| 655 |
+
Under the following terms:
|
| 656 |
+
- Attribution: You must give appropriate credit to Luis E. Davila Flores,
|
| 657 |
+
provide a link to the license, and indicate if changes were made.
|
| 658 |
+
- NonCommercial: You may not use the material for commercial purposes.
|
| 659 |
+
|
| 660 |
+
Full license: https://creativecommons.org/licenses/by-nc/4.0/
|
| 661 |
+
|
| 662 |
+
For commercial licensing inquiries, contact: @Joysulem on X/Twitter
|
| 663 |
+
|
| 664 |
+
================================================================================
|
| 665 |
+
FireEcho Engine - Created by Luis E. Davila Flores
|
| 666 |
+
"One GPU. One file. One import. Full pipeline."
|
| 667 |
+
================================================================================
|