# Axion1-350K-A250K

DeepSeek-V3 architecture scaled down to 344k total parameters (~160k active per token) that runs entirely on CPU.

Built from scratch as a proof of concept that the real DeepSeek-V3 architectural innovations (MLA + DeepSeekMoE + auxiliary-loss-free load balancing) work correctly even at extreme miniaturization.
## Architecture
This is not a distilled or quantized version of DeepSeek. Every component was implemented from scratch in pure PyTorch, faithfully following the DeepSeek-V3 technical report (arXiv:2412.19437).
| Component | DeepSeek-V3 | Axion1 |
|---|---|---|
| Attention | MLA (Multi-head Latent Attention) | Identical MLA |
| FFN | DeepSeekMoE (256 routed experts) | MoE (4 routed, top-2) |
| Load balancing | Auxiliary-loss-free (dynamic bias) | Same (Section 2.3.2) |
| Position | RoPE | RoPE |
| Normalization | RMSNorm | RMSNorm |
| Activation | SwiGLU | SwiGLU |
| Total params | 671B | 344k |
| Active params/token | 37B | ~160k |
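The auxiliary-loss-free load balancing referenced above (DeepSeek-V3, Section 2.3.2) can be sketched as follows: a per-expert bias is added to affinity scores only when selecting the top-k experts, while the gating weights still use the raw scores; the bias is nudged after each step to steer tokens away from overloaded experts. This is a minimal illustration, not the repository's implementation; the function names and the update rule's learning rate are assumptions.

```python
import torch

def biased_top2_routing(scores: torch.Tensor, bias: torch.Tensor, top_k: int = 2):
    """Select top-k experts with bias-adjusted scores; gate with raw scores.

    scores: (tokens, n_experts) router affinities (e.g. sigmoid of logits)
    bias:   (n_experts,) load-balancing bias, used ONLY for expert selection
    """
    _, idx = torch.topk(scores + bias, top_k, dim=-1)      # selection uses bias
    weights = torch.gather(scores, -1, idx)                # gating ignores bias
    weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize over chosen
    return idx, weights

def update_bias(bias: torch.Tensor, expert_load: torch.Tensor, lr: float = 1e-3):
    """Push bias down for overloaded experts, up for underloaded ones."""
    avg = expert_load.float().mean()
    return bias - lr * torch.sign(expert_load.float() - avg)
```

Because the bias never touches the gating weights, load is balanced without the gradient interference an auxiliary balancing loss would introduce.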
## Model Details

```text
d_model            : 64
n_layers           : 4
n_heads            : 4 (MLA)
d_head             : 16
kv_lora_rank       : 8  (MLA KV compression)
q_lora_rank        : 16 (MLA Q compression)
n_shared_experts   : 1
n_routed_experts   : 4 (top-2 activated)
d_ff               : 64 (per expert)
vocab_size         : 1024 (BPE, trained on GSM8K)
max_seq_len        : 512
total_params       : 343,616
active_params/tok  : ~160,000
```
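The `kv_lora_rank` above is the core of MLA: hidden states are down-projected into a small shared latent that is what gets cached, then up-projected into per-head keys and values. A minimal sketch with the dimensions listed above (the class and weight names are illustrative; the full model also carries a decoupled RoPE key, omitted here):

```python
import torch
import torch.nn as nn

d_model, n_heads, d_head, kv_lora_rank = 64, 4, 16, 8  # dims from the config above

class MLAKVCompression(nn.Module):
    """Compress hidden states into a small shared KV latent (MLA core idea)."""
    def __init__(self):
        super().__init__()
        self.down = nn.Linear(d_model, kv_lora_rank, bias=False)           # W^DKV
        self.up_k = nn.Linear(kv_lora_rank, n_heads * d_head, bias=False)  # W^UK
        self.up_v = nn.Linear(kv_lora_rank, n_heads * d_head, bias=False)  # W^UV

    def forward(self, h):            # h: (batch, seq, d_model)
        c_kv = self.down(h)          # (batch, seq, kv_lora_rank) - this is cached
        b, s, _ = h.shape
        k = self.up_k(c_kv).view(b, s, n_heads, d_head)
        v = self.up_v(c_kv).view(b, s, n_heads, d_head)
        return c_kv, k, v
```

Caching the 8-dim latent instead of the 4x16 keys and values per token is what shrinks the KV cache by 8x here.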
## Training

- Dataset: GSM8K (grade school math), converted to plain text with a question / reasoning / answer format
- Tokenizer: BPE trained from scratch, vocab size 1024
- Hardware: AMD Ryzen 5 5600G, CPU only, 12 threads, 32 GB RAM
- Speed: ~1,000–1,100 tokens/sec on CPU
- Epochs: 20 | Final val loss: ~3.2 | Total time: ~115 minutes
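The GSM8K-to-plain-text conversion can be sketched as below. GSM8K solutions embed the final number after a `#### ` marker; the exact headers and separator used here are an assumption, not the released preprocessing script:

```python
def gsm8k_to_text(question: str, solution: str) -> str:
    """Flatten one GSM8K example into a question / reasoning / answer layout."""
    # GSM8K answers place the final numeric answer after '#### '
    reasoning, _, answer = solution.partition("#### ")
    return (
        f"# Question:\n{question.strip()}\n--\n"
        f"# Reasoning:\n{reasoning.strip()}\n"
        f"# Answer:\n{answer.strip()}\n"
    )
```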
## Training Curve
| Epoch | Val Loss |
|---|---|
| 1 | 5.49 |
| 2 | 4.59 |
| 3 | 4.30 |
| 5 | 3.88 |
| 7 | 3.66 |
| 9 | 3.54 |
| 20 | ~3.2 |
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, LogitsProcessor, LogitsProcessorList
from tokenizer import BPETokenizer

model = AutoModelForCausalLM.from_pretrained(
    "AxionLab-official/Axion1-350k-A250k",
    trust_remote_code=True,
)
model.eval()

tok = BPETokenizer.load("model.vocab", "model.model")

# Block EOS and PAD for the first min_tokens generated tokens
class MinNewTokens(LogitsProcessor):
    def __init__(self, min_tokens: int, eos_id: int, pad_id: int):
        self.min_tokens = min_tokens
        self.bad = [eos_id, pad_id]
        self.generated = 0

    def __call__(self, input_ids, scores):
        if self.generated < self.min_tokens:
            for bid in self.bad:
                scores[:, bid] = float("-inf")
        self.generated += 1
        return scores

eos_id = tok.token2id["<eos>"]
pad_id = tok.token2id["<pad>"]

prompt = "# Question:\nWhat is 5 + 3?\n--\n# Answer:\n"
ids = tok.encode(prompt, add_bos=True, add_eos=False)
input_ids = torch.tensor([ids])

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=80,
        temperature=0.9,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        eos_token_id=eos_id,
        pad_token_id=pad_id,
        use_cache=False,
        logits_processor=LogitsProcessorList([
            MinNewTokens(min_tokens=5, eos_id=eos_id, pad_id=pad_id)
        ]),
    )

new_tokens = output[0][len(ids):].tolist()
# Strip the trailing EOS if present
if new_tokens and new_tokens[-1] == eos_id:
    new_tokens = new_tokens[:-1]
print("Answer:", tok.decode(new_tokens))
```
## Scaling Roadmap

| Version | Params | Status |
|---|---|---|
| Axion1-v0.1 (this) | 344k | Released |
| Axion1-v0.2 | ~1.5M | Next |
| Axion1-v0.3 | ~6M | Planned |
| Axion1-v0.4 | ~24M | Planned |
| Axion1-v0.5 | ~100M | Planned |
## Files

```text
├── model.py           # Full DeepSeek-V3 architecture (MLA + MoE)
├── modeling_axion.py  # HuggingFace wrapper
├── config.json        # Model configuration
├── model.safetensors  # Trained weights
├── model.vocab        # BPE vocabulary
└── model.model        # BPE merge rules
```
## Limitations

With only 344k parameters, the model has learned mathematical vocabulary and co-occurrence patterns from GSM8K, but it cannot reliably solve problems or maintain syntactic coherence. This is expected: the purpose of this release is to demonstrate that the DeepSeek-V3 architectural components work correctly at any scale, and to serve as a foundation for the scaling roadmap above.
## Citation

```bibtex
@article{deepseekv3,
  title  = {DeepSeek-V3 Technical Report},
  author = {DeepSeek-AI},
  year   = {2024},
  url    = {https://arxiv.org/abs/2412.19437}
}
```
## License

MIT: free to use, modify, and build upon.

Made by AxionLab