---
library_name: transformers
license: apache-2.0
datasets:
  - HuggingFaceTB/smollm-corpus
language:
  - en
pipeline_tag: text-generation
tags:
  - pt
  - doge
---

# Doge 60M

SmallDoge

Doge uses Dynamic Mask Attention for sequence transformation and can use either a Multi-Layer Perceptron or a Cross Domain Mixture of Experts for state transformation. Dynamic Mask Attention allows the Transformer to use self-attention during training and state space during inference, and the Cross Domain Mixture of Experts can directly inherit the weights of the Multi-Layer Perceptron for further training. This model is trained by the SmallDoge community. A paper with the detailed algorithm and model architecture is coming soon; all training details and code are available in the small-doge repository.
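
The exact formulation of Dynamic Mask Attention lives in the small-doge repository; as a rough intuition only, the sketch below assumes the dynamic mask is an input-dependent additive bias on the attention logits that softly suppresses some key positions. Module and parameter names are illustrative, not the model's actual implementation, and the training-time/inference-time mode switch mentioned above is not shown.

```python
# Rough PyTorch sketch of the idea behind a dynamic (input-dependent) attention
# mask. Illustrative only; see the small-doge repository for the real layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicMaskAttentionSketch(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)
        # Hypothetical gate producing one mask logit per head and key position.
        self.mask_gate = nn.Linear(hidden_size, num_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape

        def split(proj):  # (b, t, h*d) -> (b, h, t, d)
            return proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q_proj), split(self.k_proj), split(self.v_proj)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (b, h, t, t)
        # Standard causal mask plus a learned, input-dependent mask over keys.
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        dyn = self.mask_gate(x).transpose(1, 2).unsqueeze(2)  # (b, h, 1, t)
        attn = F.softmax(scores + causal + F.logsigmoid(dyn), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)
```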

## Uses

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-60M")
>>> model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-60M", trust_remote_code=True)
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")

>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
```
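
Equivalently, generation can go through the `text-generation` pipeline; a short usage sketch with default generation settings:

```python
# Alternative: the text-generation pipeline (same model, default settings assumed).
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="SmallDoge/Doge-60M",
    trust_remote_code=True,
)
print(generator("Hey how are you doing?", max_new_tokens=100)[0]["generated_text"])
```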

## Model Details

We build Doge by pre-training on SmolLM-Corpus. If you want to continue pre-training this model, you can find the unconverged checkpoint here. These models have not been fine-tuned for instruction following; the instruction-tuned model is here.
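
For continued pre-training, a minimal sketch with the Hugging Face `Trainer` might look like the following; the corpus subset, batch shape, and hyperparameters below are placeholders rather than the authors' exact recipe:

```python
# Minimal continued pre-training sketch (illustrative settings only).
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_name = "SmallDoge/Doge-60M"  # or the unconverged checkpoint linked above
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Stream one subset of smollm-corpus; "fineweb-edu-dedup" is one of its configs.
dataset = load_dataset("HuggingFaceTB/smollm-corpus", "fineweb-edu-dedup",
                       split="train", streaming=True)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./doge-60m-continued",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=64,
        learning_rate=6e-3,
        bf16=True,
        max_steps=1000,       # streaming datasets need an explicit step budget
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```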

**Pre-Training**:

| Model | Training Data | Steps | Content Length | Tokens | LR | Batch Size | Precision | RTX 4090 GPU hours |
|---|---|---|---|---|---|---|---|---|
| Doge-20M | HuggingFaceTB/smollm-corpus | 8k | 2048 | 4B | 8e-3 | 0.5M | bfloat16 | 14 |
| Doge-60M | HuggingFaceTB/smollm-corpus | 16k | 2048 | 16B | 6e-3 | 1M | bfloat16 | 128 |
| Doge-160M | HuggingFaceTB/smollm-corpus | 24k | 2048 | 32B | 4e-3 | 1.5M | bfloat16 | 522 |
| Doge-320M | HuggingFaceTB/smollm-corpus | 32k | 2048 | 64B | 2e-3 | 2M | bfloat16 | 1856 |

**Evaluation**:

| Model | MMLU | TriviaQA | ARC | PIQA | HellaSwag | OBQA | Winogrande | tokens / s on i7-11 CPU |
|---|---|---|---|---|---|---|---|---|
| Doge-20M | 25.4 | 0.03 | 29.8 | 58.4 | 27.3 | 25.6 | 50.2 | 142 |
| Doge-60M | 26.4 | 0.2 | 37.9 | 61.4 | 31.5 | 28.0 | 50.8 | 62 |
| Doge-160M | 29.2 | 4.8 | 44.4 | 70.1 | 43.4 | 34.4 | 52.2 | 28 |
| Doge-320M | 33.8 | 9.4 | 52.1 | 73.9 | 52.7 | 37.9 | 55.0 | 16 |

All evaluations are done using five-shot settings, without additional training on the benchmarks.
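
The exact evaluation harness is not specified here; purely as an illustration of five-shot multiple-choice scoring with plain `transformers`, one can rank answer choices by the log-likelihood the model assigns to them given a few-shot prompt. The prompt format and question below are made-up placeholders.

```python
# Illustrative few-shot multiple-choice scoring, not the authors' evaluation code.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-60M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-60M", trust_remote_code=True)
model.eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)     # predict next token
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    n_choice = full_ids.shape[1] - prompt_ids.shape[1]         # tokens of the choice
    return token_lp[:, -n_choice:].sum().item()

# A real run would build the prefix from 5 solved benchmark examples.
few_shot_prefix = "Question: ...\nAnswer: ...\n\n" * 5  # placeholder
question = "Question: Which gas do plants absorb from the atmosphere?\nAnswer:"
choices = [" Carbon dioxide", " Oxygen", " Nitrogen", " Helium"]
scores = [choice_logprob(few_shot_prefix + question, c) for c in choices]
print(choices[max(range(len(choices)), key=scores.__getitem__)])
```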

**Procedure**:

Visualize in Weights & Biases

**Environment**:

  • Image: nvcr.io/nvidia/pytorch:24.12-py3
  • Hardware: 1x NVIDIA RTX 4090
  • Software: Transformers

## Citation

```bibtex
@misc{smalldoges,
  title={SmallDoges},
  author={SmallDoge Team and Jingze, Shi and Yifan, Wu and Bingheng, Wu},
  year={2025},
  month={March},
}
```