---
library_name: transformers
license: apache-2.0
datasets:
  - HuggingFaceTB/smollm-corpus
language:
  - en
pipeline_tag: text-generation
tags:
  - pt
  - doge
---

# Doge 60M

SmallDoge

Doge uses Dynamic Mask Attention for sequence transformation and can use either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and the Cross Domain Mixture of Experts can directly inherit the weights of the Multi-Layer Perceptron for further training. This model is trained by the SmallDoge community. A paper with the detailed algorithm and model architecture is coming soon; all training details and code are available in the small-doge repository.
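
The exact formulation of Dynamic Mask Attention lives in the small-doge repository; as a rough intuition only, the sketch below assumes the dynamic mask is an input-dependent additive bias on the attention logits that softly suppresses some key positions. Module and parameter names are illustrative, not the model's actual implementation, and the training-time/inference-time mode switch mentioned above is not shown.

```python
# Rough PyTorch sketch of the idea behind a dynamic (input-dependent) attention
# mask. Illustrative only; see the small-doge repository for the real layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMaskAttentionSketch(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)
        # Hypothetical gate producing one mask logit per head and key position.
        self.mask_gate = nn.Linear(hidden_size, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape

        def split(proj):  # (b, t, h*d) -> (b, h, t, d)
            return proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q_proj), split(self.k_proj), split(self.v_proj)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (b, h, t, t)
        # Standard causal mask plus a learned, input-dependent mask over keys.
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        dyn = self.mask_gate(x).transpose(1, 2).unsqueeze(2)  # (b, h, 1, t)
        attn = F.softmax(scores + causal + F.logsigmoid(dyn), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)
```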

## Uses

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-60M")
>>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-60M", trust_remote_code=True)
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")

>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
```
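
Equivalently, generation can go through the `text-generation` pipeline; a short usage sketch with default generation settings:

```python
# Alternative: the text-generation pipeline (same model, default settings assumed).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="SmallDoge/Doge-60M",
    trust_remote_code=True,
)
print(generator("Hey how are you doing?", max_new_tokens=100)[0]["generated_text"])
```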

## Model Details

We build Doge by pre-training on SmolLM-Corpus. If you want to continue pre-training this model, you can find the unconverged checkpoint here. These models have not been fine-tuned for instruction following; the instruction-tuned model is here.
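
For continued pre-training, a minimal sketch with the Hugging Face `Trainer` might look like the following; the corpus subset, batch shape, and hyperparameters below are placeholders rather than the authors' exact recipe:

```python
# Minimal continued pre-training sketch (illustrative settings only).
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_name = "SmallDoge/Doge-60M"  # or the unconverged checkpoint linked above
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Stream one subset of smollm-corpus; "fineweb-edu-dedup" is one of its configs.
dataset = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup",
                       split="train", streaming=True)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./doge-60m-continued",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=64,
        learning_rate=6e-3,
        bf16=True,
        max_steps=1000,       # streaming datasets need an explicit step budget
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```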

**Pre-Training**:

| Model | Training Data | Steps | Content Length | Tokens | LR | Batch Size | Precision | RTX 4090 GPU hours |
|---|---|---|---|---|---|---|---|---|
| Doge-20M | HuggingFaceTB/smollm-corpus | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 | 14 |
| Doge-60M | HuggingFaceTB/smollm-corpus | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 | 128 |
| Doge-160M | HuggingFaceTB/smollm-corpus | 24k | 2048 | 32B | 4e-3 | 1.5M | bfloat16 | 522 |
| Doge-320M | HuggingFaceTB/smollm-corpus | 32k | 2048 | 64B | 2e-3 | 2M | bfloat16 | 1856 |

**Evaluation**:

| Model | MMLU | TriviaQA | ARC | PIQA | HellaSwag | OBQA | Winogrande | tokens / s on i7-11 CPU |
|---|---|---|---|---|---|---|---|---|
| Doge-20M | 25.4 | 0.03 | 29.8 | 58.4 | 27.3 | 25.6 | 50.2 | 142 |
| Doge-60M | 26.4 | 0.2 | 37.9 | 61.4 | 31.5 | 28.0 | 50.8 | 62 |
| Doge-160M | 29.2 | 4.8 | 44.4 | 70.1 | 43.4 | 34.4 | 52.2 | 28 |
| Doge-320M | 33.8 | 9.4 | 52.1 | 73.9 | 52.7 | 37.9 | 55.0 | 16 |

All evaluations are done using five-shot settings, without additional training on the benchmarks.
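
The exact evaluation harness is not specified here; purely as an illustration of five-shot multiple-choice scoring with plain `transformers`, one can rank answer choices by the log-likelihood the model assigns to them given a few-shot prompt. The prompt format and question below are made-up placeholders.

```python
# Illustrative few-shot multiple-choice scoring, not the authors' evaluation code.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-60M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-60M", trust_remote_code=True)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)     # predict next token
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_choice = full_ids.shape[1] - prompt_ids.shape[1]         # tokens of the choice
    return token_lp[:, -n_choice:].sum().item()

# A real run would build the prefix from 5 solved benchmark examples.
few_shot_prefix = "Question: ...\nAnswer: ...\n\n" * 5  # placeholder
question = "Question: Which gas do plants absorb from the atmosphere?\nAnswer:"
choices = [" Carbon dioxide", " Oxygen", " Nitrogen", " Helium"]
scores = [choice_logprob(few_shot_prefix + question, c) for c in choices]
print(choices[max(range(len(choices)), key=scores.__getitem__)])
```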

**Procedure**:

Visualize in Weights & Biases

**Environment**:

  • Image: nvcr.io/nvidia/pytorch:24.12-py3
  • Hardware: 1x NVIDIA RTX 4090
  • Software: Transformers

## Citation

```bibtex
@misc{smalldoges,
  title={SmallDoges},
  author={SmallDoge Team and Jingze, Shi and Yifan, Wu and Bingheng, Wu},
  year={2025},
  month={March},
}
```