license: apache-2.0

TokenFormer is a fully attention-based architecture. It unifies token-token and token-parameter interactions by using the attention mechanism for both, which maximizes the flexibility of the neural network (see paper). The family contains four models with 150M, 450M, 900M, and 1.5B parameters. Each model is trained with the gpt-neox code base on 300B tokens of the Pile. All four model sizes are trained on exactly the same data, in exactly the same order.
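To make "token-parameter interaction" concrete, here is a minimal sketch in which input tokens act as queries and attend over a set of learnable key and value parameter tokens. This is an illustration only, not the released implementation: the function name `token_parameter_attention` is hypothetical, and the plain softmax used here is a stand-in assumption for whatever normalization the paper and the GitHub repository actually use.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_parameter_attention(tokens, key_params, value_params):
    """Attend from input tokens to parameter tokens (hypothetical sketch).

    tokens:       list of n_tokens vectors, each of width d_in
    key_params:   list of n_param vectors, each of width d_in (learnable)
    value_params: list of n_param vectors, each of width d_out (learnable)
    Returns a list of n_tokens vectors, each of width d_out.
    """
    d_out = len(value_params[0])
    out = []
    for token in tokens:
        # Score the token against every key parameter token.
        scores = [sum(t * k for t, k in zip(token, key)) for key in key_params]
        # Assumption: plain softmax; the paper may normalize differently.
        weights = softmax(scores)
        # Weighted sum of the value parameter tokens.
        out.append([
            sum(w * value[j] for w, value in zip(weights, value_params))
            for j in range(d_out)
        ])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
key_params = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
value_params = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
out = token_parameter_attention(tokens, key_params, value_params)
print(len(out), len(out[0]))  # 3 tokens in, 3 output vectors of width 3
```

In this sketch the number of parameter tokens (4) is independent of the input and output widths, so it can be changed without altering the shape of the result.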

TokenFormer-150M

Model Details

Developed by: Haiyang Wang
Model type: TokenFormer-based Language Model
Language: English
Learn more: See TokenFormer's GitHub repository for the training procedure, config files, and usage details. See the paper for more evaluations and implementation details.
Library: GPT-NeoX
License: Apache 2.0
Contact: to ask questions about this model, please email Haiyang Wang.

TokenFormer model	Layers	#QKV Param Tokens	#Output Param Tokens	#FFN Param Tokens	Model Dim	Heads	Batch Size	Learning Rate	Training Iterations
150M	12	768	768	3072	768	12	2M	6.0 x 10^-4	143000
450M	24	1024	1024	4096	1024	16	2M	6.0 x 10^-4	143000
900M	32	1280	1280	5120	1280	16	2M	6.0 x 10^-4	143000
1.5B	40	1536	1536	6144	1536	16	2M	6.0 x 10^-4	143000

Engineering details for the TokenFormer.
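As a quick sanity check on the table, the sketch below multiplies the iteration count by the batch size and derives two ratios from the listed dimensions. It assumes that a batch size of "2M" means 2,097,152 tokens per step, which the table does not state explicitly. The dictionary layout and key names are illustrative and are not taken from the repository's config files.

```python
# Values copied from the table above; key names are illustrative only.
CONFIGS = {
    "150M": {"layers": 12, "model_dim": 768, "heads": 12, "ffn_param_tokens": 3072},
    "450M": {"layers": 24, "model_dim": 1024, "heads": 16, "ffn_param_tokens": 4096},
    "900M": {"layers": 32, "model_dim": 1280, "heads": 16, "ffn_param_tokens": 5120},
    "1.5B": {"layers": 40, "model_dim": 1536, "heads": 16, "ffn_param_tokens": 6144},
}

ITERATIONS = 143_000
# Assumption: "2M" in the table means 2 * 1024 * 1024 tokens per batch.
TOKENS_PER_BATCH = 2 * 1024 * 1024

total_tokens = ITERATIONS * TOKENS_PER_BATCH
print(f"{total_tokens:,}")  # 299,892,736,000, i.e. roughly 300B

for name, cfg in CONFIGS.items():
    head_dim = cfg["model_dim"] // cfg["heads"]
    ffn_ratio = cfg["ffn_param_tokens"] // cfg["model_dim"]
    print(name, head_dim, ffn_ratio)
```

Under that assumption the total comes to about 300B tokens, which matches the training budget stated in the introduction, and every model size uses four times as many FFN parameter tokens as its model dimension.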