---
license: apache-2.0
---

The *TokenFormer* is a **fully attention-based architecture** that unifies the computations of token-token and token-parameter interactions by employing the attention mechanism throughout, **maximizing the flexibility of the neural network** [(see paper)](https://github.com/Haiyang-W/TokenFormer). The family contains four models of sizes 150M, 450M, 900M, and 1.5B. Each size is trained with the [gpt-neox](https://github.com/EleutherAI/gpt-neox) code base on 300B tokens of the [Pile](https://huggingface.co/datasets/EleutherAI/pile). All 4 model sizes are trained on the exact same data, in the exact same order.

# TokenFormer-150M

## Model Details

- Developed by: [Haiyang Wang](https://haiyang-w.github.io/)
- Model type: TokenFormer-based Language Model
- Language: English
- Learn more: [TokenFormer's GitHub repository](https://github.com/Haiyang-W/TokenFormer) for training procedure, config files, and details on how to use. [See paper](https://github.com/Haiyang-W/TokenFormer) for more evals and implementation details.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, please email Haiyang Wang.
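To make the token-parameter interaction described above concrete, the following is a minimal PyTorch sketch: input tokens attend over a set of learnable parameter tokens (key/value pairs) in place of a fixed linear projection. This is an illustrative simplification, not the reference implementation; the plain softmax normalization, random initialization, and the `Pattention` class name as written here are assumptions, so see the GitHub repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Pattention(nn.Module):
    """Toy token-parameter attention: tokens attend to learnable parameter tokens."""

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Learnable parameter tokens acting as keys and values.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim_in))
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in)
        scores = x @ self.key_params.t() / self.key_params.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)  # attention over parameter tokens
        return weights @ self.value_params   # (batch, seq_len, dim_out)


# Example with 150M-style dimensions (see the table below): 768-dim tokens,
# 768 QKV parameter tokens.
layer = Pattention(dim_in=768, dim_out=768, num_param_tokens=768)
out = layer(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```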
| TokenFormer model | Layers | #QKV Param Tokens | #Output Param Tokens | #FFN Param Tokens | Model Dim | Heads | Batch Size | Learning Rate | Training Iterations |
| ----------------: | -----: | :---------------: | :------------------: | :---------------: | :-------: | :---: | :--------: | :-----------: | :-----------------: |
| 150M | 12 | 768 | 768 | 3072 | 768 | 12 | 2M | 6.0 × 10⁻⁴ | 143000 |
| 450M | 24 | 1024 | 1024 | 4096 | 1024 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
| 900M | 32 | 1280 | 1280 | 5120 | 1280 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
| 1.5B | 40 | 1536 | 1536 | 6144 | 1536 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
*Engineering details for the TokenFormer model family; this card describes the 150M configuration.*
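A minimal loading sketch with 🤗 Transformers is shown below. It assumes the checkpoint is published on the Hugging Face Hub under a repo id like `Haiyang-W/TokenFormer-150M` (the exact id is an assumption) and ships custom modeling code, hence `trust_remote_code=True`; if the checkpoint is only distributed in GPT-NeoX format, use the scripts in the GitHub repository instead.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Haiyang-W/TokenFormer-150M"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Generate a short continuation as a smoke test.
inputs = tokenizer("TokenFormer is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```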