# Hecto: FFNN + GRU Mixture-of-Experts for AG News
Hecto is a lightweight, interpretable Mixture-of-Experts (MoE) architecture combining:

- A feedforward expert for static feature abstraction, and
- A GRU expert for sequential reasoning.

These experts are gated by a sparse, learnable Top-1 router conditioned on the [CLS] token embedding.
## Model Architecture

- Base Encoder: DistilBERT (`distilbert-base-uncased`)
- Experts:
  - Expert 0: 2-layer FFNN (256 → 128 → 4, Tanh activation)
  - Expert 1: GRU (256 → 128 → 4)
- Gating:
  - Top-1 sparse routing
  - Temperature-controlled softmax (τ = 1.5)
  - Entropy and load-balancing regularization (sketched below)
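To make the routing concrete, here is a minimal, self-contained sketch of how a Top-1 gated FFNN/GRU head over DistilBERT could be wired up. This is an illustrative reconstruction from the description above, not the repository's `Hecto` class: the shared 256-d projection, the class name `HectoSketch`, and the hard `argmax` selection are assumptions, and the actual implementation may differ (for example, by scaling the selected expert's output by its gate probability so the router receives gradients).

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class HectoSketch(nn.Module):
    """Illustrative FFNN + GRU Mixture-of-Experts head (not the official Hecto code)."""

    def __init__(self, num_classes: int = 4, temperature: float = 1.5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("distilbert-base-uncased")
        self.proj = nn.Linear(self.encoder.config.dim, 256)  # 768 -> 256 (assumed)
        # Expert 0: 2-layer FFNN, 256 -> 128 -> 4, Tanh activation
        self.ffnn = nn.Sequential(
            nn.Linear(256, 128), nn.Tanh(), nn.Linear(128, num_classes)
        )
        # Expert 1: GRU over the token sequence, 256 -> 128 -> 4
        self.gru = nn.GRU(256, 128, batch_first=True)
        self.gru_head = nn.Linear(128, num_classes)
        # Router: conditioned on the [CLS] embedding, temperature-scaled softmax
        self.gate = nn.Linear(256, 2)
        self.temperature = temperature

    def forward(self, input_ids, attention_mask=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        hidden = self.proj(hidden)            # (batch, seq_len, 256)
        cls = hidden[:, 0]                    # [CLS] token embedding
        gate_probs = torch.softmax(self.gate(cls) / self.temperature, dim=-1)
        top1 = gate_probs.argmax(dim=-1)      # sparse Top-1 routing decision

        ffnn_logits = self.ffnn(cls)          # static feature abstraction
        _, gru_state = self.gru(hidden)       # sequential reasoning; (1, batch, 128)
        gru_logits = self.gru_head(gru_state.squeeze(0))

        # Pick each sample's logits from the expert the router selected.
        logits = torch.where(top1.unsqueeze(-1) == 1, gru_logits, ffnn_logits)
        return logits, top1, gate_probs
```

Note that hard Top-1 selection alone gives the router no gradient; training also relies on the soft `gate_probs` through the regularizers in the loss sketched in the next section.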
## Training Setup
| Detail | Value | 
|---|---|
| Dataset | AG News (5k sampled) | 
| Loss Function | Cross-Entropy + Entropy + Diversity | 
| Optimizer | AdamW | 
| Epochs | 5 | 
| Batch Size | 16 | 
| Learning Rate | 2e-5 | 
| Seeds Used | [0, 1, 2] (averaged) | 
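The "Cross-Entropy + Entropy + Diversity" loss combines task cross-entropy with routing regularizers. A hedged sketch of one plausible formulation follows; the helper name `hecto_loss`, the coefficients `lambda_ent` / `lambda_div`, and the exact signs and forms of the entropy and diversity terms are assumptions, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def hecto_loss(logits, labels, gate_probs, lambda_ent=0.01, lambda_div=0.01):
    """Cross-Entropy + Entropy + Diversity, per the training table (illustrative)."""
    ce = F.cross_entropy(logits, labels)
    # Per-sample routing entropy H(gate); rewarding it discourages early collapse
    # onto a single expert. The paper's sign convention may differ.
    ent = -(gate_probs * gate_probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    # Diversity / load-balancing: push batch-average expert usage toward uniform.
    mean_usage = gate_probs.mean(dim=0)
    uniform = torch.full_like(mean_usage, 1.0 / mean_usage.numel())
    div = F.mse_loss(mean_usage, uniform)
    return ce - lambda_ent * ent + lambda_div * div
```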
## Performance (Averaged over 3 Seeds)
| Metric | Value | 
|---|---|
| Accuracy | 90.02% | 
| F1 Score | 89.91% | 
| Inference Time | 0.0083 sec/sample | 
| Expert Usage | FFNN = 20.1%, GRU = 79.9% | 
The model routes the majority of samples to the GRU expert, especially for classes like "Sports" and "Sci/Tech". This suggests stronger reliance on sequential reasoning across AG News categories.
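The expert-usage numbers above can be reproduced by counting Top-1 routing decisions over a dataset. A minimal sketch, assuming a dataloader that yields tokenizer-style keyword batches and a model with the three-output forward shown in the usage example below:

```python
import torch

@torch.no_grad()
def expert_usage(model, dataloader):
    """Fraction of samples routed to each expert (index 0 = FFNN, 1 = GRU)."""
    counts = torch.zeros(2)
    for batch in dataloader:
        _, _, gate_probs = model(**batch)      # (batch, 2) routing probabilities
        top1 = gate_probs.argmax(dim=-1)       # Top-1 expert per sample
        counts += torch.bincount(top1, minlength=2).float()
    return counts / counts.sum()               # e.g. roughly tensor([0.20, 0.80])
```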
## Files Included

- `pytorch_model.bin`: Model weights
- `config.json`: Custom MoE architecture config
- `tokenizer_config.json`, `tokenizer.json`, `vocab.txt`, `special_tokens_map.json`: Tokenizer files (DistilBERT)
## Example Usage
```python
from transformers import AutoTokenizer
from your_model_file import Hecto  # Replace with your local Hecto class definition
import torch
from torch.nn.functional import softmax

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("ruhzi/hecto-ffnn-gru")

# Reconstruct the model architecture and load the trained weights
model = Hecto("ff", "gru", frozen=False)
model.load_state_dict(torch.load("pytorch_model.bin"))
model.eval()

# Tokenize input
inputs = tokenizer("NASA launches new satellite to study space weather.", return_tensors="pt")

# Run inference
with torch.no_grad():
    logits, _, gate_probs = model(**inputs)
    probs = softmax(logits, dim=-1)

print("Predicted class:", probs.argmax().item())
print("Gate routing probabilities:", gate_probs)
```
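Continuing the snippet above, the predicted index can be mapped to a readable label. The list below uses the standard AG News class order; confirm it matches this model's `config.json` before relying on it:

```python
# Standard AG News label order (assumed; verify against config.json).
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]
pred = probs.argmax(dim=-1).item()
print(f"Predicted label: {AG_NEWS_LABELS[pred]} (p={probs.max().item():.3f})")
```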
Note: `Hecto` is a custom model and must be defined in your environment before loading weights. To make your model easily reusable, consider including a `modeling_hecto.py` file in your repository.
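Until such a file is bundled, the checkpoint can still be fetched directly from the Hub rather than downloaded by hand; a minimal sketch using `huggingface_hub`:

```python
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub, then load it into a locally defined Hecto.
weights_path = hf_hub_download(repo_id="ruhzi/hecto-ffnn-gru", filename="pytorch_model.bin")
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
```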
## Citation
If you use this model or architecture in your research, please cite:
```bibtex
@article{pandey2025hecto,
  title   = {Hecto: Modular Sparse Experts for Adaptive and Interpretable Reasoning},
  author  = {Pandey, Sanskar and Chopra, Ruhaan and Bhat, Saad Murtaza and Abhyudaya, Ark},
  journal = {arXiv preprint arXiv:2506.22919},
  year    = {2025},
  month   = {June},
  note    = {Version 1 submitted June 28, 2025; version 2 updated July 1, 2025}
}
```