Vortex V1 Collection

All models from the Vortex V1 collection, made for science • 2 items
Vortex Scientific is an AI model family built from scratch for deep scientific reasoning. It uses a novel hybrid state-space + attention architecture, optimized for consumer laptop hardware (Apple Silicon MacBooks and Nvidia 4060 laptop GPUs).
Vortex uses a two-block hybrid architecture that interleaves state-space (SSM) layers with local windowed attention layers.

Layer ratios: roughly 60% SSM / 40% attention in the 7B model (`ssm_ratio: 0.6`) and 50/50 in the 13B model (`ssm_ratio: 0.5`).
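As an illustration, the interleaving implied by `ssm_ratio` can be sketched as follows. This is a hypothetical sketch of the idea, not the actual scheme in `models/vortex_model.py`:

```python
def build_layer_pattern(num_layers: int, ssm_ratio: float) -> list:
    """Spread SSM and attention layers evenly through the stack.

    Hypothetical sketch: the real spacing scheme in
    models/vortex_model.py may differ.
    """
    pattern = []
    ssm_used = 0
    for i in range(num_layers):
        # Place an SSM layer whenever we are behind the target ratio.
        if ssm_used < round((i + 1) * ssm_ratio):
            pattern.append("ssm")
            ssm_used += 1
        else:
            pattern.append("attn")
    return pattern

# 7B config: 32 layers at ssm_ratio 0.6 -> 19 SSM + 13 attention layers
print(build_layer_pattern(32, 0.6))
```

Spreading the SSM layers evenly (rather than stacking them in one block) keeps attention available at regular depths for local token mixing.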
```
Vortex/
├── configs/
│   ├── vortex_7b_config.py    # 7B model configuration
│   ├── vortex_13b_config.py   # 13B model configuration
│   └── training_config.py     # Training hyperparameters
├── models/
│   ├── ssm_layer.py           # State-space layer
│   ├── attention_layer.py     # Local windowed attention
│   ├── scigate_ffn.py         # Science-gated feed-forward
│   ├── vortex_model.py        # Main model class
│   └── science_modules/       # Specialized science modules
├── tokenizer/
│   └── vortex_tokenizer.py    # Custom BPE tokenizer with science vocab
├── data/
│   ├── dataset_loader.py      # Open dataset loading (Pile, S2ORC, etc.)
│   ├── quality_filter.py      # Multi-stage quality filtering
│   ├── domain_classifier.py   # 7-domain classifier
│   ├── deduplication.py       # MinHash LSH deduplication
│   └── scraper.py             # Web scraping (arXiv, PubMed, etc.)
├── training/
│   ├── trainer.py             # Main training loop
│   ├── losses.py              # Science-aware loss functions
│   └── curriculum.py          # Curriculum learning scheduler
├── inference/
│   ├── cuda_optimize.py       # CUDA optimizations (Flash Attention, INT8)
│   ├── mps_optimize.py        # MPS optimizations for Apple Silicon
│   └── inference.py           # Inference entry point
├── evaluation/                # Science benchmarks (coming soon)
├── configuration_vortex.py    # HF config class
├── tokenization_vortex.py     # HF tokenizer wrapper
├── modeling_vortex.py         # HF model integration
├── train.py                   # Training entry point
└── requirements.txt
```
```bash
# Clone and setup
cd Vortex
pip install -r requirements.txt

# For CUDA optimizations
pip install flash-attn
pip install bitsandbytes
```
```bash
# Train 7B model on CUDA
python train.py \
    --model_size 7b \
    --device cuda \
    --data_dir ./data/processed \
    --output_dir ./checkpoints \
    --max_steps 100000

# Train 13B model with INT8 quantization (for 8GB VRAM)
python train.py \
    --model_size 13b \
    --device cuda \
    --quantization int8 \
    --data_dir ./data/processed \
    --output_dir ./checkpoints_13b
```
```bash
# Generate text with 7B model
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --prompt "The equation E = mc^2 describes" \
    --max_new_tokens 100

# Interactive mode
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --device cuda \
    --interactive

# On Apple Silicon (MPS)
python inference/inference.py \
    --model_path ./checkpoints/latest.pt \
    --model_size 7b \
    --use_mps \
    --prompt "Explain quantum mechanics"
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer; the custom Vortex classes
# (modeling_vortex.py etc.) require trust_remote_code
model = AutoModelForCausalLM.from_pretrained("./checkpoints", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("./checkpoints", trust_remote_code=True)

# Generate
input_text = "The energy of a photon is given by"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```
```python
from data.dataset_loader import VortexDatasetLoader
from data.quality_filter import ScienceQualityFilter
from data.deduplication import MinHashLSH

# Load and process data
loader = VortexDatasetLoader()
quality_filter = ScienceQualityFilter()
lsh = MinHashLSH()

# Stream datasets, filter, deduplicate, and shard
for sample in loader.load_multiple_datasets(["pile_scientific", "automath"]):
    if quality_filter.filter(sample["text"]):
        lsh.add_document(sample["id"], sample["text"])

# Tokenize and save
```
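The deduplication step is built on MinHash: documents whose word shingles produce similar minimum hash values are flagged as near-duplicates. The following self-contained sketch illustrates the technique only; it is not the API of `data/deduplication.py`:

```python
import hashlib

def minhash_signature(text: str, num_perm: int = 64, shingle_size: int = 3) -> list:
    """Compute a MinHash signature over word shingles (illustrative sketch)."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle_size])
                for i in range(max(1, len(words) - shingle_size + 1))}
    sig = []
    for seed in range(num_perm):
        # One hash function per "permutation", derived from the seed;
        # the signature keeps the minimum hash per function.
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in shingles))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Near-duplicate texts share most shingles, so their signatures mostly agree
a = minhash_signature("the energy of a photon is given by plancks constant")
b = minhash_signature("the energy of a photon is given by planck constant")
```

An LSH index then buckets signatures by bands so that candidate duplicate pairs can be found without comparing every pair of documents.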
Training progresses through 4 curriculum stages (see `training/curriculum.py`). The science-aware loss in `training/losses.py` combines five weighted terms:

```python
total_loss = (
    lm_loss * 1.0            # Standard next-token prediction
    + equation_loss * 0.3    # Equation reconstruction accuracy
    + domain_loss * 0.1      # Domain classification head
    + citation_loss * 0.1    # Citation detection accuracy
    + numerical_loss * 0.2   # Numerical reasoning accuracy
)
```
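A stage-based curriculum scheduler can be sketched as a simple step-to-stage lookup. The stage names and step boundaries below are placeholders, not the actual schedule in `training/curriculum.py`:

```python
# Hypothetical stage boundaries; the real schedule in
# training/curriculum.py differs.
CURRICULUM = [
    (10_000, "general_text"),
    (40_000, "scientific_text"),
    (80_000, "equations_and_citations"),
    (100_000, "long_context_reasoning"),
]

def current_stage(step: int) -> str:
    """Return the curriculum stage active at a given training step."""
    for end_step, name in CURRICULUM:
        if step < end_step:
            return name
    # Past the last boundary, stay in the final stage
    return CURRICULUM[-1][1]
```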
Model configurations:

| Parameter | 7B | 13B |
|---|---|---|
| d_model | 4096 | 5120 |
| num_layers | 32 | 40 |
| num_heads | 32 | 40 |
| d_state | 16 | 32 |
| ssm_ratio | 0.6 | 0.5 |
| vocab_size | 50000 | 50000 |
| max_seq_len | 16384 | 16384 |

Seven domain tags are supported: `[PHYS]`, `[MATH]`, `[CHEM]`, `[BIO]`, `[EARTH]`, `[SPACE]`, `[ZOO]`. Domain tags can be included in training data to guide the SciGate FFN routing.
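For example, a domain tag could be prepended to a training sample like this. The sketch assumes a simple "tag plus space" convention; the exact tagging format used by the data pipeline is an assumption:

```python
# Maps human-readable domain names to the 7 domain tags.
# The key names here are illustrative, not taken from the codebase.
DOMAIN_TAGS = {
    "physics": "[PHYS]", "math": "[MATH]", "chemistry": "[CHEM]",
    "biology": "[BIO]", "earth_science": "[EARTH]",
    "space": "[SPACE]", "zoology": "[ZOO]",
}

def tag_sample(text: str, domain: str) -> str:
    """Prepend the domain tag so the SciGate FFN can route on it."""
    return f"{DOMAIN_TAGS[domain]} {text}"

print(tag_sample("F = ma relates force, mass, and acceleration.", "physics"))
# -> [PHYS] F = ma relates force, mass, and acceleration.
```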
Custom BPE tokenizer with:
- LaTeX-aware science vocabulary (`\alpha`, `\sum`, `\int`, etc.)
- Special tokens: `[EQUATION]`, `[CITATION]`, `[MOLECULE]`, `[FIGURE]`, `[TABLE]`, and the domain tags

Science benchmarks across all 7 domains will be added to `evaluation/`.
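One way special tokens are typically kept atomic is to split them out before BPE merges run. This is a generic sketch of that pre-tokenization step, not the `vortex_tokenizer.py` API:

```python
import re

# Subset of the special tokens, plus two domain tags, for illustration
SPECIAL_TOKENS = ["[EQUATION]", "[CITATION]", "[MOLECULE]",
                  "[FIGURE]", "[TABLE]", "[PHYS]", "[MATH]"]

def split_on_special(text: str) -> list:
    """Split text so special tokens survive as atomic pieces.

    Applied before BPE merges, so a token like [EQUATION] is never
    broken into subwords. Sketch only; not the vortex_tokenizer API.
    """
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    return [piece for piece in re.split(pattern, text) if piece.strip()]

print(split_on_special("See [EQUATION] E = mc^2 [CITATION] for details."))
# -> ['See ', '[EQUATION]', ' E = mc^2 ', '[CITATION]', ' for details.']
```

The plain-text pieces between special tokens then go through normal BPE, while each special token maps directly to its reserved vocabulary id.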
This is a school science project. Code is provided for educational purposes.
For questions or issues, please open an issue on GitHub.
Built with ❤️ for scientific AI research