Vortex Scientific

Vortex Scientific is a from-scratch AI model family designed for deep scientific reasoning. It is built from the ground up around a novel hybrid state-space + attention architecture and optimized for consumer laptop hardware (Apple Silicon MacBooks and NVIDIA RTX 4060 Laptop GPUs).

🌟 Features

  • Novel Architecture: Hybrid State-Space Model (SSM) + Local Attention blocks
  • Science-Specialized: Custom tokenizer, domain-aware gating, and specialized modules for equations, numerical reasoning, citations, and molecular structures
  • Hardware Optimized: Runs smoothly on 8GB VRAM (4060 laptop) and 16GB unified memory (MacBook Pro M2/M3)
  • Two Model Sizes:
    • Vortex-7B: 7 billion parameters, fits in 8GB VRAM with INT8 quantization
    • Vortex-13B: 13 billion parameters, fits in 16GB of memory with quantization
  • HuggingFace Compatible: Full integration with transformers library
  • From Scratch: No base model; everything is built bottom-up, including the tokenizer and weights

πŸ—οΈ Architecture

Vortex uses a two-block hybrid architecture:

  1. SSM-Only Blocks: State-space layers for efficient long-context processing (O(n) complexity)
  2. Attention+Science Blocks: Local windowed attention + science modules + SciGate FFN

Layer ratios:

  • 7B: 60% SSM, 40% Attention (pattern: SSM, SSM, Attn, ...)
  • 13B: 50% SSM, 50% Attention (pattern: SSM, Attn, SSM, Attn, ...)
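The interleaving these ratios imply can be sketched with a small greedy helper. `layer_pattern` is a hypothetical name; the real configs may simply hard-code the sequence:

```python
def layer_pattern(num_layers: int, ssm_ratio: float) -> list[str]:
    """Interleave SSM and attention blocks to track a target SSM ratio.

    Greedy rule: place an SSM block whenever the running SSM count is
    below the target fraction of layers placed so far.
    """
    pattern = []
    ssm_count = 0
    for i in range(num_layers):
        if ssm_count < ssm_ratio * (i + 1):
            pattern.append("ssm")
            ssm_count += 1
        else:
            pattern.append("attn")
    return pattern
```

For the 7B config (32 layers, ratio 0.6) this yields the SSM, SSM, Attn, ... opening shown above; for 13B (40 layers, ratio 0.5) it alternates SSM, Attn exactly.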

Science Modules

  • EquationModule: LaTeX equation detection and structural understanding
  • NumericalReasoningModule: Digit-level encoding, scientific notation, unit awareness
  • CitationModule: Citation span detection, provenance tracking, confidence scoring
  • MolecularModule: Element embeddings, SMILES understanding, amino acid sequences
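As an illustration of the digit-level encoding idea behind the NumericalReasoningModule, here is a minimal sketch. The token names (`<pos>`, `<d6>`, `<e23>`, ...) are assumptions for illustration, not the model's actual vocabulary:

```python
def digit_tokens(value: float) -> list[str]:
    """Encode a number as sign, per-digit, and exponent tokens.

    Illustrative only: the real module may instead use learned digit
    embeddings inside the network rather than surface tokens.
    """
    # Normalize to scientific notation, e.g. 6.022e23 -> "6.022e+23".
    mantissa, _, exponent = f"{value:.3e}".partition("e")
    tokens = ["<neg>"] if mantissa.startswith("-") else ["<pos>"]
    tokens += [f"<d{ch}>" for ch in mantissa if ch.isdigit()]
    tokens.append(f"<e{int(exponent)}>")
    return tokens
```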

📦 Project Structure

Vortex/
├── configs/
│   ├── vortex_7b_config.py      # 7B model configuration
│   ├── vortex_13b_config.py     # 13B model configuration
│   └── training_config.py       # Training hyperparameters
├── models/
│   ├── ssm_layer.py             # State-space layer
│   ├── attention_layer.py       # Local windowed attention
│   ├── scigate_ffn.py           # Science-gated feed-forward
│   ├── vortex_model.py          # Main model class
│   └── science_modules/         # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py      # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py        # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py        # Multi-stage quality filtering
│   ├── domain_classifier.py     # 7-domain classifier
│   ├── deduplication.py         # MinHash LSH deduplication
│   └── scraper.py               # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py               # Main training loop
│   ├── losses.py                # Science-aware loss functions
│   └── curriculum.py            # Curriculum learning scheduler
├── inference/
│   ├── inference.py             # Inference entry point
│   ├── cuda_optimize.py         # CUDA optimizations (Flash Attention, INT8)
│   └── mps_optimize.py          # MPS optimizations for Apple Silicon
├── evaluation/                  # Science benchmarks (coming soon)
├── configuration_vortex.py      # HF config class
├── tokenization_vortex.py       # HF tokenizer wrapper
├── modeling_vortex.py           # HF model integration
├── train.py                     # Training entry point
└── requirements.txt

🚀 Quick Start

Installation

# Set up from the repository root
cd Vortex
pip install -r requirements.txt

# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes

Training

# Train 7B model on CUDA
python train.py \
    --model_size 7b \
    --device cuda \
    --data_dir ./data/processed \
    --output_dir ./checkpoints \
    --max_steps 100000

# Train 13B model with INT8 quantization (reduced memory footprint)
python train.py \
    --model_size 13b \
    --device cuda \
    --quantization int8 \
    --data_dir ./data/processed \
    --output_dir ./checkpoints_13b

Inference

# Generate text with 7B model
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --prompt "The equation E = mc^2 describes" \
    --max_new_tokens 100

# Interactive mode
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --interactive

# On Apple Silicon (MPS)
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --use_mps \
    --prompt "Explain quantum mechanics"

HuggingFace Integration

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("./checkpoints", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./checkpoints", trust_remote_code=True)

# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

📊 Data Pipeline

  1. Open Datasets: Automatically download from HuggingFace (Pile, S2ORC, Math datasets, PubMed QA)
  2. Quality Filtering: Multi-stage checks (length, language, equations, repetition, citations)
  3. Deduplication: MinHash LSH for near-duplicate detection
  4. Domain Classification: Classify into 7 science domains
  5. Tokenization: Custom science-aware BPE tokenizer
  6. Sharding: Write to Parquet with statistics

Example pipeline usage:

from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH

# Load and process data
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()

# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])
        # Tokenize and save
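The deduplication step relies on MinHash signatures. Here is a minimal self-contained sketch of the core idea; the project's `MinHashLSH` class presumably adds LSH banding on top, which this sketch omits:

```python
import hashlib

def minhash_signature(text: str, num_hashes: int = 64, k: int = 5) -> list[int]:
    """MinHash over character k-shingles: per seed, keep the smallest hash."""
    shingles = {text[i:i + k] for i in range(len(text) - k + 1)}
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # vary the hash function per slot
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Near-duplicate documents score close to 1.0 and can be dropped once the estimate exceeds a chosen threshold.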

🎯 Training Strategy

Curriculum Learning

Training progresses through 4 stages:

  1. Foundation (0-20%): Basic science text, simple equations, definitions
  2. Domain (20-50%): Domain-specific deep content per science area
  3. Reasoning (50-80%): Scientific problem solving, multi-step derivations
  4. Integration (80-100%): Cross-domain science, full dataset
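The stage boundaries above map directly to a progress check. `curriculum_stage` is a hypothetical name for what `training/curriculum.py` presumably implements:

```python
def curriculum_stage(step: int, max_steps: int) -> str:
    """Return the curriculum stage for the current training progress."""
    progress = step / max_steps
    if progress < 0.2:
        return "foundation"
    if progress < 0.5:
        return "domain"
    if progress < 0.8:
        return "reasoning"
    return "integration"
```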

Science-Aware Loss

total_loss = (
    lm_loss * 1.0              # Standard next token prediction
    + equation_loss * 0.3      # Equation reconstruction accuracy
    + domain_loss * 0.1        # Domain classification head
    + citation_loss * 0.1      # Citation detection accuracy
    + numerical_loss * 0.2     # Numerical reasoning accuracy
)

⚙️ Configuration

7B Config (VORTEX_7B_CONFIG)

  • d_model: 4096
  • num_layers: 32
  • num_heads: 32
  • d_state: 16
  • ssm_ratio: 0.6
  • vocab_size: 50000
  • max_seq_len: 16384

13B Config (VORTEX_13B_CONFIG)

  • d_model: 5120
  • num_layers: 40
  • num_heads: 40
  • d_state: 32
  • ssm_ratio: 0.5
  • vocab_size: 50000
  • max_seq_len: 16384
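A quick back-of-envelope check that these configs land near their advertised sizes, assuming transformer-style sizing of roughly 12·d_model² weights per block plus the embedding table. SSM blocks and the science modules shift the true count, so treat this as an order-of-magnitude check only:

```python
def approx_params_billions(d_model: int, num_layers: int, vocab_size: int) -> float:
    """Rough parameter count in billions under transformer-style sizing."""
    per_block = 12 * d_model ** 2       # attention + FFN weights per layer
    embedding = vocab_size * d_model    # token embedding table
    return (num_layers * per_block + embedding) / 1e9
```

With the numbers above, the 7B config comes out near 6.6B and the 13B config near 12.8B.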

🔧 Hardware Targets

Nvidia 4060 Laptop (8GB VRAM)

  • 7B: INT8 quantization (BF16 weights alone are ~14 GB), Flash Attention 2, torch.compile
  • 13B: INT8 quantization (INT8 weights are ~13 GB, so expect some offloading on 8 GB cards), Flash Attention 2, torch.compile
  • Target throughput: 25-40 tokens/s (7B), 15-25 tokens/s (13B)

Apple Silicon (M2/M3)

  • 7B on M3: FP16 (BF16 weights cast to float16), SDPA, no torch.compile
  • 13B on M3 Max: BF16, unified memory, SDPA
  • Target throughput: 20-35 tokens/s (7B), 12-20 tokens/s (13B)
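These budgets can be sanity-checked with a weights-only footprint estimate; activations and the KV cache add overhead on top, so the real requirement is somewhat higher:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight footprint in GB; excludes activations and KV cache."""
    return num_params * bits_per_param / 8 / 1e9
```

For example, 7B parameters at INT8 (8 bits) take about 7 GB, while the same weights in BF16 (16 bits) take about 14 GB.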

🧪 Science Domains

  1. Physics ([PHYS])
  2. Mathematics ([MATH])
  3. Chemistry ([CHEM])
  4. Biology ([BIO])
  5. Earth Science ([EARTH])
  6. Space Science ([SPACE])
  7. Zoology ([ZOO])

Domain tags can be included in training data to guide the SciGate FFN routing.

πŸ“ Tokenizer

Custom BPE tokenizer with:

  • 40,000 base BPE tokens trained on scientific corpus
  • 10,000 science-specific tokens:
    • 500 LaTeX math symbols (\alpha, \sum, \int, etc.)
    • 118 chemical element symbols
    • 200 SI and derived units
    • 300 scientific abbreviations (DNA, RNA, ATP, etc.)
    • 500 mathematical operators
    • Amino acid codes
    • Greek alphabet (α, β, γ, etc.)
  • Special tokens: [EQUATION], [CITATION], [MOLECULE], [FIGURE], [TABLE], domain tags
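One way the special tokens and LaTeX commands listed above could be kept intact is a science-aware pre-tokenization pass before BPE. This regex is a sketch of that idea, not the tokenizer's actual rule:

```python
import re

# Protect special tokens, LaTeX commands, and numbers from being split
# before BPE runs (illustrative pattern, not the real tokenizer's).
SCI_PATTERN = re.compile(
    r"\[(?:EQUATION|CITATION|MOLECULE|FIGURE|TABLE|PHYS|MATH|CHEM|BIO|EARTH|SPACE|ZOO)\]"
    r"|\\[A-Za-z]+"       # LaTeX commands such as \alpha, \sum, \int
    r"|\d+(?:\.\d+)?"     # integers and decimals
    r"|\w+|[^\w\s]"       # words, then single punctuation marks
)

def pre_tokenize(text: str) -> list[str]:
    """Split text into science-aware chunks for the BPE stage."""
    return SCI_PATTERN.findall(text)
```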

🧪 Evaluation

Science benchmarks across all 7 domains will be added. Planned benchmarks:

  • Physics: Feynman Questions, Physics GRE
  • Math: MATH dataset, GSM8K
  • Chemistry: Chemistry problem-solving, molecular property prediction
  • Biology: PubMed QA, bioinformatics tasks
  • Earth Science: Climate modeling questions
  • Space Science: Astronomy problem sets
  • Zoology: Species classification, ecological reasoning

📄 License

This is a school science project. Code is provided for educational purposes.

πŸ™ Acknowledgments

  • Mamba (Gu et al.) for SSM architecture inspiration
  • Flash Attention (Dao et al.) for efficient attention
  • HuggingFace for transformers library
  • All open scientific data sources: arXiv, PubMed, S2ORC, etc.

📧 Contact

For questions or issues, please open an issue on GitHub.


Built with ❤️ for scientific AI research
