---
license: apache-2.0
---

The *TokenFormer* is a **fully attention-based architecture** that unifies the computations of token-token and token-parameter interactions by employing the attention mechanism throughout, **maximizing the flexibility of the neural network** [(see paper)](https://github.com/Haiyang-W/TokenFormer). The family contains four models of sizes 150M, 450M, 900M, and 1.5B. Each size is trained with the [gpt-neox](https://github.com/EleutherAI/gpt-neox) code base on 300B tokens of the [Pile](https://huggingface.co/datasets/EleutherAI/pile). All 4 model sizes are trained on the exact same data, in the exact same order.

# TokenFormer-150M

## Model Details

- Developed by: [Haiyang Wang](https://haiyang-w.github.io/)
- Model type: TokenFormer-based Language Model
- Language: English
- Learn more: [TokenFormer's GitHub repository](https://github.com/Haiyang-W/TokenFormer) for training procedure, config files, and details on how to use. [See paper](https://github.com/Haiyang-W/TokenFormer) for more evals and implementation details.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, please email Haiyang Wang.
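To make the token-parameter interaction described above concrete, the following is a minimal PyTorch sketch: input tokens attend over a set of learnable parameter tokens (key/value pairs) in place of a fixed linear projection. This is an illustrative simplification, not the reference implementation; the plain softmax normalization, random initialization, and the `Pattention` class name as written here are assumptions, so see the GitHub repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Pattention(nn.Module):
    """Toy token-parameter attention: tokens attend to learnable parameter tokens."""

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Learnable parameter tokens acting as keys and values.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim_in))
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim_out))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in)
        scores = x @ self.key_params.t() / self.key_params.shape[-1] ** 0.5
        weights = F.softmax(scores, dim=-1)  # attention over parameter tokens
        return weights @ self.value_params   # (batch, seq_len, dim_out)


# Example with 150M-style dimensions (see the table below): 768-dim tokens,
# 768 QKV parameter tokens.
layer = Pattention(dim_in=768, dim_out=768, num_param_tokens=768)
out = layer(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```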
| TokenFormer model | Layers | #QKV Param Tokens | #Output Param Tokens | #FFN Param Tokens | Model Dim | Heads | Batch Size | Learning Rate | Training Iterations |
| ----------------: | -----: | :---------------: | :------------------: | :---------------: | :-------: | :---: | :--------: | :-----------: | :-----------------: |
| 150M | 12 | 768 | 768 | 3072 | 768 | 12 | 2M | 6.0 × 10⁻⁴ | 143000 |
| 450M | 24 | 1024 | 1024 | 4096 | 1024 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
| 900M | 32 | 1280 | 1280 | 5120 | 1280 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
| 1.5B | 40 | 1536 | 1536 | 6144 | 1536 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
*Engineering details for the TokenFormer model family; this card describes the 150M configuration.*
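A minimal loading sketch with 🤗 Transformers is shown below. It assumes the checkpoint is published on the Hugging Face Hub under a repo id like `Haiyang-W/TokenFormer-150M` (the exact id is an assumption) and ships custom modeling code, hence `trust_remote_code=True`; if the checkpoint is only distributed in GPT-NeoX format, use the scripts in the GitHub repository instead.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Haiyang-W/TokenFormer-150M"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

# Generate a short continuation as a smoke test.
inputs = tokenizer("TokenFormer is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```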