
Celestia Mark 1

Hybrid Multilingual Autoregressive Language Model (model file and usage code will be uploaded soon).


Overview

Celestia Mark 1 is a leading-edge, mid-sized autoregressive language model built with a novel hybrid architecture that fuses Transformer, Mixture of Experts (MoE), and Chain of Experts (CoE) layers. It is designed for multi-domain and multilingual tasks, supporting code, math, education, and general reasoning. Celestia Mark 1 is currently undergoing incremental training and has already processed over 4 billion tokens (target: 10B tokens).

  • Model Size: ~360M parameters
  • Architecture: Hybrid (Transformer + MoE + CoE)
  • Training Approach: Autoregressive (completion-ready), with fine-tuning support for classification, code, math, multilingual tasks, and more
  • License: Apache 2.0

Training Domains and Languages

Celestia Mark 1 is trained on a rich and diverse set of datasets, featuring both human and programming languages:

Human Languages Used:

  • English
  • Hindi (Latin script)
  • Arabic
  • French
  • German
  • Spanish
  • Italian
  • Polish
  • Greek
  • Latin

Programming Languages Used (13 total):

  • Python
  • JavaScript
  • TypeScript
  • Java
  • C
  • C++
  • C#
  • Go
  • Shell
  • Bash
  • HTML
  • CSS
  • SQL

Other Domains:

  • Math (symbolic, numeric, and educational datasets)
  • Education (FineWeb-Edu, Finemath-4plus)
  • General web text (Common Corpus, FineWeb-2)

Performance Benchmarks

| Model | Params | Tokens Trained | Loss | Perplexity | Accuracy | Architecture | Multilingual | Domains |
|---|---|---|---|---|---|---|---|---|
| Celestia Mark 1 | 360M | 4B (ongoing) | 2.9 | 25 | 47% | Transformer + MoE + CoE (Hybrid) | ✅ Yes | General |
| GPT-2 Medium | 345M | 40B | 3.3 | 28–35 | 35–43% | Dense Transformer | ❌ No | English |
| GPT-2 Large | 774M | 40B | 3.2 | 27–33 | 38–44% | Dense Transformer | ❌ No | English |
| Pythia-410M | 410M | 300B | 2.9 | 30 | ~42% | Dense Transformer | ❌ No | English |
| Pythia-1B | 1B | 300B | 2.7 | 27 | ~45% | Dense Transformer | ❌ No | English |
| CodeParrot | 110M | 22B | 2.7 | 30–35 | 37% | Dense Transformer (code-focused) | ❌ No | Python code |
| Qwen-1B | 1B | ~15B | 2.8 | 27 | 45% | Dense Transformer | ✅ Yes | General |
| Jamba-1.1B | 1.1B | 20B | 2.7 | 23 | 48% | Hybrid Transformer-Mamba | ✅ Yes | General |
| Phi-2 | 2.7B | 1.4T | 2.5 | 21 | ~52% | Dense Transformer, curated data | ✅ Yes | General |
| Llama-2 7B | 7B | 2T | 2.7 | 21 | ~52% | Dense Transformer | ✅ Yes | General |
| Mistral 7B | 7B | 1.5T | 2.6 | 19 | ~54% | Dense Transformer | ✅ Yes | General |

Sources: Official model papers, leaderboards, OpenReview, Datawizz, DataCamp, Microsoft Research.
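
For reference when reading the Loss and Perplexity columns: perplexity for an autoregressive model is conventionally the exponential of the mean per-token cross-entropy loss, though reported values also depend on the evaluation corpus and tokenizer. Below is a minimal evaluation sketch; the model and dataloader objects are placeholders for illustration, not part of this repository.

```python
# Minimal sketch of how loss and perplexity are conventionally computed for an
# autoregressive LM. `model` and `dataloader` are placeholders (assumptions),
# not objects shipped with this repository.
import math
import torch
import torch.nn.functional as F


@torch.no_grad()
def evaluate(model, dataloader, device="cpu"):
    model.eval()
    total_loss, total_tokens = 0.0, 0
    for batch in dataloader:                   # batch: (batch_size, seq_len) token ids
        batch = batch.to(device)
        logits = model(batch)                  # (batch_size, seq_len, vocab_size)
        # Next-token prediction: compare position t's logits with token t+1.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            batch[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += batch[:, 1:].numel()
    mean_loss = total_loss / total_tokens      # mean cross-entropy per token
    return mean_loss, math.exp(mean_loss)      # perplexity = exp(mean loss)
```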


Why Celestia Mark 1 Is Superior

  • Hybrid Architecture: Celestia Mark 1 alternates Transformer layers with Mixture of Experts (MoE) and Chain of Experts (CoE) blocks, enabling dynamic routing, specialization, and iterative reasoning. This hybrid design delivers better accuracy and generalization for a given model size compared to pure Transformer models.
  • Multilingual & Multi-Domain: Trained on 10 human languages and 13 programming languages, as well as math and educational data, Celestia Mark 1 covers a vastly broader scope than similarly-sized models.
  • Efficient Learning: Achieves competitive or superior loss, perplexity, and accuracy compared to much larger models trained on more data, due to efficient expert routing and architectural innovation.
  • Generalization & Adaptability: Performs robustly on code, math, multilingual, and web text, while remaining easy to fine-tune for classification, translation, and symbolic reasoning.
  • Open Weights & License: Released under Apache 2.0 for free research and commercial use.

Hybrid Architecture Explained

Celestia Mark 1’s architecture is designed for maximal flexibility and specialization:

  • Transformer Layers: Provide standard attention-based modeling for generalization.
  • Mixture of Experts (MoE): Multiple expert networks are selectively activated for each token, increasing model capacity and specialization without increasing compute for all tokens.
  • Chain of Experts (CoE): Allows iterative refinement and multi-step reasoning, particularly beneficial for symbolic, mathematical, and code tasks.

This hybrid approach enables Celestia Mark 1 to outperform pure Transformers in multilingual, code, and math domains, even with fewer parameters and less data.
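
For readers who want a concrete picture of this layer interleaving, the sketch below is a minimal, self-contained PyTorch illustration. The layer sizes, top-k routing, chain depth, and alternation pattern are assumptions made for the example; they are not the released Celestia Mark 1 implementation, and causal masking is omitted for brevity.

```python
# Illustrative sketch of a Transformer + MoE + CoE hybrid stack.
# All module names and hyperparameters are assumptions for this example only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A single feed-forward expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.net(x)


class MoELayer(nn.Module):
    """Mixture of Experts: each token is routed to its top-k experts."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 4, k: int = 1):
        super().__init__()
        self.experts = nn.ModuleList([ExpertFFN(d_model, d_hidden) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x):                                # x: (batch, seq, d_model)
        gate = F.softmax(self.router(x), dim=-1)         # routing probabilities
        topk_w, topk_idx = gate.topk(self.k, dim=-1)     # top-k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[..., slot] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += topk_w[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


class CoELayer(nn.Module):
    """Chain of Experts: experts applied sequentially for iterative refinement."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 2):
        super().__init__()
        self.chain = nn.ModuleList([ExpertFFN(d_model, d_hidden) for _ in range(n_experts)])

    def forward(self, x):
        for expert in self.chain:              # each step refines the previous output
            x = x + expert(x)                  # residual refinement
        return x


class HybridBlock(nn.Module):
    """Self-attention followed by a dense FFN, an MoE layer, or a CoE layer."""
    def __init__(self, d_model: int, n_heads: int, ffn: nn.Module):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = ffn
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm2(x))


# Alternate dense Transformer, MoE, and CoE feed-forward blocks, as described above.
d_model, d_hidden, n_heads = 512, 2048, 8
layers = nn.ModuleList([
    HybridBlock(d_model, n_heads, ExpertFFN(d_model, d_hidden)),   # plain Transformer layer
    HybridBlock(d_model, n_heads, MoELayer(d_model, d_hidden)),    # MoE layer
    HybridBlock(d_model, n_heads, CoELayer(d_model, d_hidden)),    # CoE layer
])

x = torch.randn(2, 16, d_model)                # dummy activations: (batch, seq, d_model)
for layer in layers:
    x = layer(x)
print(x.shape)                                 # torch.Size([2, 16, 512])
```

The MoE block adds capacity because only the routed experts run per token, while the CoE block reuses its experts in sequence so each step can refine the previous one.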


Limitations

Celestia Mark 1 is still undergoing incremental training. As such:

  • Some factual outputs may be inaccurate or incomplete.
  • Performance will continue to improve as training progresses toward the full token target.
  • For highly factual, up-to-date, or specialized knowledge, verification is recommended.

Usage

Celestia Mark 1 can be used for:

  • Completions (default autoregressive)
  • Fine-tuning: Classification, code generation, math, translation, and more
  • Multilingual & multi-domain applications

See usage.py for quick-start instructions.
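
Until usage.py is available, loading should look roughly like the standard Hugging Face flow below. This is a hypothetical quick-start: the repository id is taken from this page, and trust_remote_code=True is an assumption in case the hybrid architecture ships as a custom model class.

```python
# Hypothetical quick-start for when the model files are published. The repo id
# is taken from this page; trust_remote_code=True is an assumption in case the
# hybrid architecture is implemented as a custom model class.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Naqeeb-2424/Celestia-1.0"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Fine-tuning for the tasks listed above (classification, code, math, translation) should follow the usual transformers training workflow once the weights are published.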


License

Apache 2.0: free for research and commercial use.


Contact

For support or questions, contact: naqeeb.ajk63@gmail.com
