Quasar Foundation Model

Quasar-10B: Fully Linear Foundation Model

Quasar-10B is a high-performance foundation model developed by SILX AI. It is built upon the Qwen3.5-9B-Base architecture, fundamentally re-engineered to support extreme long-context reasoning (2 Million+ tokens) while maintaining high computational efficiency.

This model marks a major shift in the Quasar training stack, moving from traditional Softmax-based attention to a Hybrid Gated Linear Attention (GLA) architecture.


Model Overview

Model Name: Quasar-10B
Organization: SILX AI
Base Model: Qwen3.5-9B-Base

Architecture Evolution

The original Qwen3.5 architecture uses a combination of Gated Delta Attention and Softmax Gated Attention. To support the Quasar design requirements for infinite scaling and efficient state management, we performed a deep architectural swap:

  • GLA Integration: Replaced the target attention layers with Gated Linear Attention (GLA).
  • NOPE (No Positional Embeddings): Removed traditional RoPE (Rotary Positional Embeddings) to eliminate positional bias and enable native extrapolation to millions of tokens.

    GLA was chosen as the core linear mechanism to maintain exact architectural parity with the Quasar 22B MoE design. This model is a direct evolution of silx-ai/Quasar-V1-Base-Stage1, utilizing Quasar Continuous Time Attention for state-trajectory optimization.
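The architectural swap above can be illustrated with a minimal sketch of the gated linear attention recurrence: the model keeps a fixed-size state matrix that is decayed by a data-dependent gate and updated with a key/value outer product each step, with no positional embedding anywhere. The dimensions, the sigmoid gate, and the function names here are illustrative assumptions, not the actual Quasar-10B implementation.

```python
# Minimal GLA recurrence sketch (illustrative; not the Quasar-10B code).
import numpy as np

def gla_step(state, q, k, v, gate):
    """One recurrent step: decay the state with a per-dimension gate,
    add the new key/value outer product, then read out with the query.
    state has fixed shape (d_k, d_v) regardless of sequence length."""
    state = gate[:, None] * state + np.outer(k, v)  # gated state update
    out = q @ state                                  # query readout
    return out, state

rng = np.random.default_rng(0)
d_k, d_v, T = 8, 8, 16
state = np.zeros((d_k, d_v))
outputs = []
for t in range(T):
    q, k, v = rng.standard_normal((3, d_k))
    # Sigmoid keeps the decay gate in (0, 1), as in typical GLA formulations.
    gate = 1.0 / (1.0 + np.exp(-rng.standard_normal(d_k)))
    o, state = gla_step(state, q, k, v, gate)
    outputs.append(o)

# The state stays a fixed (d_k, d_v) matrix no matter how long T grows,
# which is what enables linear-time, constant-memory-per-step processing.
print(state.shape)  # (8, 8)
```

Because the state never grows with sequence length, there is no KV cache to scale with context, which is the property the 2M-token design relies on.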


Training Methodology

The development of Quasar-10B followed a rigorous two-stage process:

Stage 1: Structural Distillation (10B Tokens)

To ensure the new GLA layers correctly inherited the capabilities of the original Qwen heads:

  • Process: Layer-wise structural distillation. We initialized the student with Qwen3.5 weights and replaced specific layers with GLA units.
  • Loss: Hybrid loss combining MSE (Hidden State Mimicry) and Cross-Entropy (Language Modeling).
  • Volume: 10 Billion tokens of high-quality reasoning data.
  • Goal: Minimize structural divergence and transfer pretrained world knowledge into the new linear state.
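The Stage 1 objective described above can be sketched as a weighted sum of an MSE term on hidden states (mimicry of the teacher) and a cross-entropy language-modeling term. The weighting `alpha` and the exact layers compared are assumptions for illustration; the card does not publish those values.

```python
# Hedged sketch of the Stage-1 hybrid distillation loss (assumed form).
import numpy as np

def hybrid_distill_loss(student_hidden, teacher_hidden,
                        student_logits, labels, alpha=0.5):
    # Hidden-state mimicry: MSE between student and teacher hidden states.
    mse = np.mean((student_hidden - teacher_hidden) ** 2)
    # Language modeling: token-level cross-entropy over the vocabulary,
    # computed with a numerically stable log-softmax.
    shifted = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(labels)), labels])
    # alpha is an assumed mixing weight, not a published hyperparameter.
    return alpha * mse + (1 - alpha) * ce

rng = np.random.default_rng(0)
B, d, V = 4, 8, 32  # batch of tokens, hidden size, vocab size (toy values)
loss = hybrid_distill_loss(rng.standard_normal((B, d)),
                           rng.standard_normal((B, d)),
                           rng.standard_normal((B, V)),
                           rng.integers(0, V, size=B))
print(float(loss) > 0.0)  # True
```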

Stage 2: Native 2M Context Expansion (20B Tokens)

Once structurally sound, the model was pushed to extreme sequence lengths:

  • Positionality: RoPE was fully removed and replaced with NOPE (No Positional Embedding).
  • Context Length: Native training at a sequence length of 2,097,152 (2M) tokens.
  • Volume: 20 Billion tokens.
  • Hardware: Optimized for B200 HBM efficiency, utilizing sub-chunked sequential processing to maintain a 2M token active state.
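The sub-chunked sequential processing mentioned above can be sketched as follows: the long sequence is consumed in fixed-size chunks while a single recurrent state is carried across chunk boundaries, so activation memory scales with the chunk size rather than the full 2M-token window. The chunk size, decay constant, and state shape below are toy assumptions, not Quasar-10B's actual configuration.

```python
# Illustrative sub-chunked sequential processing with a carried state.
import numpy as np

def process_chunked(embedded_tokens, chunk_size, d):
    """Process a long sequence chunk by chunk, carrying one (d, d)
    recurrent state across chunk boundaries. The math is identical to
    processing the whole sequence at once; only memory pressure differs."""
    state = np.zeros((d, d))
    for start in range(0, len(embedded_tokens), chunk_size):
        chunk = embedded_tokens[start:start + chunk_size]
        for x in chunk:
            # Simple linear-attention-style update (toy decay of 0.99).
            state = 0.99 * state + np.outer(x, x)
    return state

rng = np.random.default_rng(0)
seq = rng.standard_normal((4096, 16))  # stand-in for a much longer sequence
chunked = process_chunked(seq, chunk_size=512, d=16)
whole = process_chunked(seq, chunk_size=4096, d=16)
print(np.allclose(chunked, whole))  # True: chunking does not change the result
```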

Features

  • Infinite Recurrence: The GLA architecture allows the model to process sequences far beyond its training window with linear complexity.
  • Reasoning Excellence: Trained on the Nemotron-Pretraining-Specialized-v1 mix, focusing on Math, STEM, and code-centric reasoning.
  • B200 Optimized: Specifically tuned for maximum throughput on NVIDIA Blackwell hardware.

Technical Notes

Quasar-10B represents the first "recurrent foundation model" in our stack that successfully bridges the gap between Transformer-scale pretraining and RNN-style linear efficiency. By removing positional embeddings, we allow the model to rely entirely on its internal state trajectories for temporal coherence.
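A toy illustration of why explicit positional embeddings can be dropped under this design: a decaying recurrent state is inherently order-sensitive, so two permutations of the same tokens produce different final states even though no position is ever encoded. This is purely didactic and not the Quasar-10B code.

```python
# Toy demonstration: a decaying recurrence encodes token order by itself.
import numpy as np

def run(tokens, decay=0.9):
    """Fold a token sequence into a state with exponential decay.
    Later tokens are weighted more heavily, so order matters."""
    d = tokens.shape[1]
    state = np.zeros((d, d))
    for x in tokens:
        state = decay * state + np.outer(x, x)
    return state

rng = np.random.default_rng(1)
tokens = rng.standard_normal((5, 4))
s_fwd = run(tokens)
s_rev = run(tokens[::-1])
print(np.allclose(s_fwd, s_rev))  # False: order is encoded in the state
```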


Next Steps

The Quasar roadmap continues toward even larger scales and deeper MoE integrations. For technical research and integration support, contact the SILX AI team.


Strategic Purpose: Quasar-10B is designed as a foundational high-context engine. It will be used exclusively to distill knowledge and generate synthetic reasoning data for the upcoming Quasar 22B MoE, ensuring that the larger mixture-of-experts model inherits superior long-context coherence and refined logical state-trajectories from this fully linear base.
