## Model Details

- Developed by: [Haiyang Wang](https://haiyang-w.github.io/)
- Model type: TokenFormer-based Language Model
- Language: English
- Learn more: [TokenFormer's GitHub repository](https://github.com/Haiyang-W/TokenFormer)
  for the training procedure, config files, and details on how to use the
  models (a hedged loading sketch follows this list).
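
For quick experimentation only, the snippet below is a minimal, unofficial sketch. It assumes the checkpoint can be loaded through the Hugging Face `transformers` auto classes with `trust_remote_code=True`, and the repository id `Haiyang-W/TokenFormer-150M` is a placeholder assumption; refer to the GitHub repository above for the supported workflow.

```python
# Unofficial usage sketch -- not the documented TokenFormer loading recipe.
# Assumes the checkpoint exposes a causal-LM interface that the transformers
# auto classes can load with trust_remote_code=True; see the GitHub repository
# linked above for the supported workflow.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Haiyang-W/TokenFormer-150M"  # placeholder repository id (assumption)

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "TokenFormer treats model parameters as tokens, so"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```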

<figcaption>Engineering details for the <i>TokenFormer</i>.</figcaption>
</figure>

## Training

### Training data

[The Pile](https://pile.eleuther.ai/) is an 825GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).<br>
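
If you have downloaded a Pile shard locally, the sketch below shows one way to inspect it. It assumes the usual `.jsonl.zst` shard format with a `text` field per line and uses the third-party `zstandard` package; `00.jsonl.zst` is a hypothetical local filename, not an official download link.

```python
# Sketch: stream documents from a locally downloaded Pile shard.
# Assumes the JSON-lines-in-zstd format ({"text": ..., "meta": ...});
# "00.jsonl.zst" is a hypothetical local path.
import io
import json

import zstandard as zstd  # pip install zstandard


def iter_pile_documents(path):
    """Yield the raw text of each document in a .jsonl.zst shard."""
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            yield json.loads(line)["text"]


for i, text in enumerate(iter_pile_documents("00.jsonl.zst")):
    print(text[:80].replace("\n", " "))
    if i == 2:  # peek at the first few documents only
        break
```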

### Training procedure

We follow the default training strategy of [Pythia](https://arxiv.org/abs/2304.01373)
implemented in [gpt-neox](https://github.com/EleutherAI/gpt-neox), including the
dataset processing, hyper-parameters, and code base.
All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training.

All *TokenFormer* models were trained for 143,000 steps at a batch size
of 2M (2,097,152 tokens).<br>
See [GitHub](https://github.com/Haiyang-W/TokenFormer) for more details on the
training procedure.<br>
TokenFormer uses the same tokenizer as
[GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
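
For reference, the quoted token budget is simply the step count times the per-step batch size, and the shared tokenizer can be loaded from the public GPT-NeoX-20B repository. The short sketch below only reproduces those two facts; it is not part of the training code.

```python
# Sanity-check the training token budget and load the shared tokenizer.
from transformers import AutoTokenizer

steps = 143_000               # optimizer steps
tokens_per_step = 2_097_152   # "2M" batch size, in tokens
assert steps * tokens_per_step == 299_892_736_000  # tokens seen per model

# Same tokenizer as GPT-NeoX-20B (also used by Pythia).
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
print(tokenizer("TokenFormer treats model parameters as tokens.").input_ids)
```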

## Evaluations

All 16 *TokenFormer* models were evaluated using the [LM Evaluation
Harness](https://github.com/EleutherAI/lm-evaluation-harness).
You can run the evaluation by following our [instructions](https://github.com/Haiyang-W/TokenFormer?tab=readme-ov-file#evaluations).<br>
The table below compares zero-shot results for *TokenFormer* against
open-source Transformer-based LLMs; a sketch of one way to run these tasks
appears after the table.

<figure>

| Model | #Param | LAMBADA | HellaSwag | PIQA | Arc-E | Arc-C | WinoGrande | Average |
| ----: | -----: | :-----: | :-------: | :--: | :---: | :---: | :--------: | :-----: |
| Pythia | 150M | 35.4 | 30.3 | 62.3 | 43.6 | 23.6 | 51.3 | 40.1 |
| TokenFormer | 150M | 45.0 | 35.5 | 64.9 | 47.3 | 24.9 | 50.4 | 44.7 |
| Pythia | 410M | 51.4 | 40.6 | 66.9 | 52.1 | 24.6 | 53.8 | 48.2 |
| TokenFormer | 450M | 57.3 | 47.5 | 69.5 | 56.2 | 26.7 | 54.6 | 52.0 |
| Pythia | 1B | 56.1 | 47.2 | 70.7 | 57.0 | 27.1 | 53.5 | 51.9 |
| TokenFormer | 900M | 64.0 | 55.3 | 72.4 | 59.9 | 30.6 | 56.4 | 56.4 |

<figcaption>Zero-shot evaluation of Language Modeling.</figcaption>
</figure>
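
The repository's evaluation instructions are the supported path. Purely as an illustration, the sketch below assumes a Hugging Face checkpoint compatible with the lm-evaluation-harness v0.4 Python API and uses a placeholder repository id to score the six tasks from the table above.

```python
# Illustrative sketch of scoring the tasks above with lm-evaluation-harness
# (v0.4+ Python API). The pretrained id is a placeholder assumption; follow
# the TokenFormer repository's instructions for the supported evaluation path.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Haiyang-W/TokenFormer-150M,trust_remote_code=True",
    tasks=[
        "lambada_openai",
        "hellaswag",
        "piqa",
        "arc_easy",
        "arc_challenge",
        "winogrande",
    ],
    num_fewshot=0,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```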