Haiyang-W/TokenFormer-150M

Haiyang-W committed on Oct 30

Commit 410ec70 • 1 Parent(s): 094679d

Update README.md

Files changed (1)

README.md +52 -1

@@ -15,7 +15,7 @@ same data, in the exact same order.
 ## Model Details
 - Developed by: [Haiyang Wang](https://haiyang-w.github.io/)
-- Model type: ToeknFormer-based Language Model
+- Model type: TokenFormer-based Language Model
 - Language: English
 - Learn more: [TokenFormer's GitHub repository](https://github.com/Haiyang-W/TokenFormer)
  for training procedure, config files, and details on how to use.
@@ -36,3 +36,54 @@ same data, in the exact same order.
 <figcaption>Engineering details for the <i>TokenFormer</i>. </figcaption>
 </figure>
+## Training
+### Training data
+[The Pile](https://pile.eleuther.ai/) is an 825 GiB general-purpose dataset in
+English. It was created by EleutherAI specifically for training large language
+models. It contains texts from 22 diverse sources, roughly broken down into
+five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
+prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
+miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
+paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
+methodology, and a discussion of ethical implications. Consult [the
+datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
+about the Pile and its component datasets. The Pile can be downloaded from
+the [official website](https://pile.eleuther.ai/), or from a [community
+mirror](https://the-eye.eu/public/AI/pile/).<br>
+### Training procedure
+We follow the default training strategy of [Pythia](https://arxiv.org/abs/2304.01373) in [gpt-neox](https://github.com/EleutherAI/gpt-neox),
+including the dataset processing, hyperparameters, and code base.
+All models were trained on the exact same data, in the exact same order. Each
+model saw 299,892,736,000 tokens during training.
+All *TokenFormer* models were trained for 143,000 steps at a batch size
+of 2M (2,097,152 tokens).<br>
+See [GitHub](https://github.com/Haiyang-W/TokenFormer) for more details on the training
+procedure.<br>
+TokenFormer uses the same tokenizer as
+[GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
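
The total token count quoted in the training procedure follows directly from the step count and the batch size. A quick sanity check, as a minimal sketch (the variable names are illustrative and not taken from the TokenFormer or gpt-neox code base):

```python
# Values quoted in the training procedure section of the README.
steps = 143_000               # optimizer steps per model
tokens_per_batch = 2_097_152  # the "2M" batch size, i.e. 2**21 tokens

# Tokens seen during training = steps * tokens per step.
total_tokens = steps * tokens_per_batch
print(f"{total_tokens:,}")    # 299,892,736,000
```

The product matches the 299,892,736,000 tokens stated for each model.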
+## Evaluations
+All *TokenFormer* models were evaluated using the [LM Evaluation
+Harness](https://github.com/EleutherAI/lm-evaluation-harness).
+You can run the evaluation by following our [instructions](https://github.com/Haiyang-W/TokenFormer?tab=readme-ov-file#evaluations).<br>
+The table below compares zero-shot evaluation results for TokenFormer models
+with open-source Transformer-based LLMs.
+<figure>
+| Model        | #Param   |  LAMBADA | HellaSwag | PIQA | Arc-E  | Arc-C | WinoGrande | Average  |
+| ----:        | -------: | :------: | :-------: | :--: | :---:  | :---: | :--------: | :------: |
+| Pythia       |    150M  | 35.4     |     30.3  | 62.3 |  43.6  | 23.6  |  51.3      |   40.1   |
+| TokenFormer  |    150M  | 45.0     |     35.5  | 64.9 |  47.3  | 24.9  |  50.4      |   44.7   |
+| Pythia       |    410M  | 51.4     |     40.6  | 66.9 |  52.1  | 24.6  |  53.8      |   48.2   |
+| TokenFormer  |    450M  | 57.3     |     47.5  | 69.5 |  56.2  | 26.7  |  54.6      |   52.0   |
+| Pythia       |    1B    | 56.1     |     47.2  | 70.7 |  57.0  | 27.1  |  53.5      |   51.9   |
+| TokenFormer  |    900M  | 64.0     |     55.3  | 72.4 |  59.9  | 30.6  |  56.4      |   56.4   |
+<figcaption>Zero-shot evaluation of Language Modeling. </figcaption>
+</figure>
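
The Average column appears to be the unweighted mean of the six task scores, rounded to one decimal place. That reading is an assumption, since the README does not state how the column is computed. A minimal sketch that recomputes it from the per-task values in the table (the `rows` dictionary and `mean_score` helper are illustrative names, not part of any TokenFormer tooling):

```python
# Per-task scores copied from the table, in column order:
# LAMBADA, HellaSwag, PIQA, Arc-E, Arc-C, WinoGrande.
rows = {
    ("Pythia", "150M"):      [35.4, 30.3, 62.3, 43.6, 23.6, 51.3],
    ("TokenFormer", "150M"): [45.0, 35.5, 64.9, 47.3, 24.9, 50.4],
    ("Pythia", "410M"):      [51.4, 40.6, 66.9, 52.1, 24.6, 53.8],
    ("TokenFormer", "450M"): [57.3, 47.5, 69.5, 56.2, 26.7, 54.6],
    ("Pythia", "1B"):        [56.1, 47.2, 70.7, 57.0, 27.1, 53.5],
    ("TokenFormer", "900M"): [64.0, 55.3, 72.4, 59.9, 30.6, 56.4],
}


def mean_score(scores):
    """Unweighted mean of the task scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


for (model, size), scores in rows.items():
    print(f"{model:<12} {size:>5}  {mean_score(scores)}")
```

Under this assumption the recomputed means agree with the table for five of the six rows. The Pythia 150M row comes out at 41.1 against the listed 40.1, so one entry in that row may have been transcribed incorrectly; the sketch cannot tell which.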