license: apache-2.0
TokenFormer is a fully attention-based architecture that unifies token-token and token-parameter interactions by using the attention mechanism for both, which maximizes the flexibility of the neural network (see paper). The suite contains four models with 150M, 450M, 900M, and 1.5B parameters. Each model is trained with the gpt-neox code base on 300B tokens from the Pile. All four models are trained on exactly the same data, in exactly the same order.
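To make the idea of token-parameter interaction concrete, here is a minimal NumPy sketch in which input tokens attend over a set of learnable parameter tokens instead of passing through a fixed linear projection. It is an illustration, not the released implementation: the function names are hypothetical, and a plain scaled softmax stands in for whatever normalization the paper actually uses.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=axis, keepdims=True)


def token_parameter_attention(x, key_params, value_params):
    """Attend from input tokens to learnable parameter tokens.

    x:            (seq_len, d_in)  input tokens (the queries)
    key_params:   (n_params, d_in) learnable key parameter tokens
    value_params: (n_params, d_out) learnable value parameter tokens

    Assumption: scaled softmax is used here only as a stand-in for the
    normalization described in the paper.
    """
    scores = x @ key_params.T / np.sqrt(x.shape[-1])  # (seq_len, n_params)
    weights = softmax(scores, axis=-1)
    return weights @ value_params                     # (seq_len, d_out)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 768))                       # 5 input tokens, model dim 768
key_params = rng.normal(size=(768, 768)) * 0.02     # 768 parameter tokens
value_params = rng.normal(size=(768, 768)) * 0.02

out = token_parameter_attention(x, key_params, value_params)
print(out.shape)
```

In this sketch the output shape depends only on the input length and the value dimension, so the number of parameter tokens can change without touching the surrounding layers.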
TokenFormer-150M
Model Details
- Developed by: Haiyang Wang
- Model type: TokenFormer-based Language Model
- Language: English
- Learn more: see TokenFormer's GitHub repository for the training procedure, config files, and usage details, and the paper for more evaluations and implementation details.
- Library: GPT-NeoX
- License: Apache 2.0
- Contact: to ask questions about this model, please email Haiyang Wang.
TokenFormer model | Layers | #QKV Param Tokens | #Output Param Tokens | #FFN Param Tokens | Model Dim | Heads | Batch Size | Learning Rate | Training Iterations |
---|---|---|---|---|---|---|---|---|---|
150M | 12 | 768 | 768 | 3072 | 768 | 12 | 2M | 6.0 × 10⁻⁴ | 143000 |
450M | 24 | 1024 | 1024 | 4096 | 1024 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
900M | 32 | 1280 | 1280 | 5120 | 1280 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
1.5B | 40 | 1536 | 1536 | 6144 | 1536 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
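The table can be sanity-checked with a little arithmetic. The sketch below encodes the four rows and derives two rough figures: the total number of training tokens and an estimate of the parameters held in the layer stack. Both rest on assumptions that the table does not state: that "2M" means 2,097,152 tokens per batch, that the query, key, and value projections each have their own set of QKV parameter tokens, and that every parameter token stores one key vector and one value vector of length Model Dim. Embeddings, normalization weights, and biases are ignored, so this is not an official parameter count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenFormerConfig:
    """One row of the table above (class and field names are illustrative)."""
    name: str
    layers: int
    qkv_param_tokens: int
    output_param_tokens: int
    ffn_param_tokens: int
    model_dim: int


CONFIGS = [
    TokenFormerConfig("150M", 12, 768, 768, 3072, 768),
    TokenFormerConfig("450M", 24, 1024, 1024, 4096, 1024),
    TokenFormerConfig("900M", 32, 1280, 1280, 5120, 1280),
    TokenFormerConfig("1.5B", 40, 1536, 1536, 6144, 1536),
]

# Assumption: "2M" in the Batch Size column means 2**21 tokens per step.
BATCH_TOKENS = 2**21
TRAINING_ITERATIONS = 143_000


def estimate_layer_stack_params(cfg: TokenFormerConfig) -> int:
    """Rough parameter estimate for the layer stack only.

    Assumptions: Q, K, and V each use their own qkv_param_tokens, and each
    parameter token holds a key vector plus a value vector of size model_dim.
    """
    tokens_per_layer = (
        3 * cfg.qkv_param_tokens
        + cfg.output_param_tokens
        + cfg.ffn_param_tokens
    )
    return cfg.layers * tokens_per_layer * 2 * cfg.model_dim


training_tokens = BATCH_TOKENS * TRAINING_ITERATIONS
print(f"training tokens: {training_tokens / 1e9:.1f}B")

for cfg in CONFIGS:
    millions = estimate_layer_stack_params(cfg) / 1e6
    print(f"{cfg.name}: ~{millions:.0f}M parameters in the layer stack")
```

Under these assumptions the token budget comes out just under 300B, consistent with the training description.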