---
license: apache-2.0
---

TokenFormer is a fully attention-based architecture that unifies the computations of token-token and token-parameter interactions by employing the attention mechanism throughout, maximizing the flexibility of the neural network (see the paper). It comes in four sizes: 150M, 450M, 900M, and 1.5B parameters. Each size is trained with the GPT-NeoX codebase on 300B tokens of the Pile, and all four model sizes are trained on the exact same data in the exact same order.
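The key idea is that every linear projection is replaced by attention over a set of learnable parameter tokens (called Pattention in the paper). Below is a minimal PyTorch sketch of that token-parameter attention, intended only to illustrate the mechanism: it uses a plain softmax to normalize the scores (the paper uses a modified normalization), and all names are illustrative rather than taken from the codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pattention(nn.Module):
    """Token-parameter attention: input tokens act as queries, and
    learnable parameter tokens act as the keys and values, replacing
    a fixed linear projection."""

    def __init__(self, dim_in: int, dim_out: int, num_param_tokens: int):
        super().__init__()
        # Learnable key/value parameter tokens.
        self.key_params = nn.Parameter(torch.randn(num_param_tokens, dim_in) * 0.02)
        self.value_params = nn.Parameter(torch.randn(num_param_tokens, dim_out) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim_in)
        scores = x @ self.key_params.t()        # (batch, seq_len, num_param_tokens)
        weights = F.softmax(scores, dim=-1)     # plain softmax here; the paper uses a modified variant
        return weights @ self.value_params      # (batch, seq_len, dim_out)

if __name__ == "__main__":
    # Dimensions loosely matching the 150M config (768-dim model, 768 QKV param tokens).
    layer = Pattention(dim_in=768, dim_out=768, num_param_tokens=768)
    x = torch.randn(1, 16, 768)
    print(layer(x).shape)  # torch.Size([1, 16, 768])
```

Because model capacity lives in the number of parameter tokens rather than in fixed weight-matrix shapes, the model can be scaled by appending new parameter tokens instead of retraining from scratch.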

# TokenFormer-150M

## Model Details

- **Developed by:** Haiyang Wang
- **Model type:** TokenFormer-based Language Model
- **Language:** English
- **Learn more:** TokenFormer's GitHub repository for the training procedure, config files, and details on how to use it. See the paper for more evaluations and implementation details.
- **Library:** GPT-NeoX
- **License:** Apache 2.0
- **Contact:** to ask questions about this model, please email Haiyang Wang.
| TokenFormer model | Layers | #QKV Param Tokens | #Output Param Tokens | #FFN Param Tokens | Model Dim | Heads | Batch Size | Learning Rate | Training Iterations |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 150M | 12 | 768 | 768 | 3072 | 768 | 12 | 2M | 6.0 × 10⁻⁴ | 143000 |
| 450M | 24 | 1024 | 1024 | 4096 | 1024 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
| 900M | 32 | 1280 | 1280 | 5120 | 1280 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
| 1.5B | 40 | 1536 | 1536 | 6144 | 1536 | 16 | 2M | 6.0 × 10⁻⁴ | 143000 |
*Engineering details for the TokenFormer.*
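For reference, a minimal sketch of loading this checkpoint with the Hugging Face `transformers` library. This assumes the hub repository id is `Haiyang-W/TokenFormer-150M` and that it ships a transformers-compatible implementation via `trust_remote_code`; both are assumptions, and the GitHub repository documents the canonical GPT-NeoX-based loading path.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed hub repo id; adjust to the actual path if it differs.
model_name = "Haiyang-W/TokenFormer-150M"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

inputs = tokenizer("TokenFormer is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```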