lintang committed
Commit 8905117
1 Parent(s): 461e6a2

Create README.md

Files changed (1):
  1. README.md +148 -0

README.md ADDED

---
datasets:
- EleutherAI/pile
language:
- en
pipeline_tag: text2text-generation
tags:
- t5x
- encoder-decoder
---

Pile-T5 Base is an encoder-decoder model trained on [the Pile](https://pile.eleuther.ai/) using the [T5x](https://github.com/google-research/t5x) library. The model was trained for 2 million steps, or roughly 2 trillion tokens, with an MLM objective similar to that of the original T5 model.

### Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Blogpost](). For details about the training dataset,
  see [the Pile paper](https://arxiv.org/abs/2101.00027) and [its data
  sheet](https://arxiv.org/abs/2201.07311).
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
  Discord](https://discord.gg/zBGx3azzUn) and post them in `#release-discussion`.
  Please read the existing Pile-T5 documentation before asking about the model
  on Discord. For general correspondence, email [contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure style="width:30em">

| Hyperparameter             | Value       |
| -------------------------- | ----------- |
| n<sub>parameters</sub>     |             |
| n<sub>encoder layers</sub> | 24          |
| n<sub>decoder layers</sub> | 24          |
| d<sub>model</sub>          | 2816        |
| d<sub>emb</sub>            | 1024        |
| n<sub>heads</sub>          | 16          |
| d<sub>head</sub>           | 64          |
| n<sub>vocab</sub>          | 32128       |
| Sequence Length            | 512         |

</figure>
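
The parameter count above is left blank in this card. If needed, the architecture details and total parameter count can be read directly from the released checkpoint; the following is a minimal sketch using the Transformers `AutoConfig` and `AutoModelForSeq2SeqLM` APIs (the exact field names printed depend on the config class the checkpoint resolves to):

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Inspect the architecture hyperparameters stored with the checkpoint.
config = AutoConfig.from_pretrained("EleutherAI/pile-t5-base")
print(config)

# Count parameters directly from the instantiated model.
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")
n_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {n_params:,}")
```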

### Uses and limitations

#### Intended use

Pile-T5 was developed primarily for research purposes. It learns an inner
representation of the English language that can be used to extract features
useful for downstream tasks.

In addition to scientific uses, you may also further fine-tune and adapt
Pile-T5 for deployment, as long as your use is in accordance with the
Apache 2.0 license. This model works with the [Transformers
Library](https://huggingface.co/docs/transformers/index). If you decide to use
pre-trained Pile-T5 as a basis for your fine-tuned model, please note that
you need to conduct your own risk and bias assessment.
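
As a concrete illustration that the checkpoint plugs into standard Transformers fine-tuning workflows, here is a minimal sketch using `Seq2SeqTrainer` on a toy text-to-text task. The task format, hyperparameters, and tiny in-memory dataset are illustrative assumptions only, not a recommended training recipe:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "EleutherAI/pile-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy input/target pairs in a generic text-to-text format (illustrative only).
pairs = [
    ("summarize: The Pile is an 825GiB dataset of diverse English text.",
     "A large, diverse English text dataset."),
    ("summarize: Pile-T5 is an encoder-decoder model trained on the Pile.",
     "An encoder-decoder model trained on the Pile."),
]

def encode(source, target):
    # Tokenize the source; the tokenized target ids become the labels.
    features = tokenizer(source, truncation=True, max_length=512)
    features["labels"] = tokenizer(target, truncation=True, max_length=128)["input_ids"]
    return features

train_dataset = [encode(s, t) for s, t in pairs]

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="pile-t5-base-finetuned",  # hypothetical output path
        per_device_train_batch_size=2,
        num_train_epochs=1,
        report_to=[],
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```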

#### Out-of-scope use

Pile-T5 is **not** intended for deployment as-is. It is not a product
and cannot be used for human-facing interactions without supervision.

Pile-T5 has not been fine-tuned for downstream tasks for which language
models are commonly deployed, such as writing genre prose or commercial
chatbots. This means Pile-T5 will likely **not** respond to a given prompt
the way products such as ChatGPT do. This is because, unlike Pile-T5,
ChatGPT was fine-tuned using methods such as Reinforcement Learning from Human
Feedback (RLHF) to better “understand” human instructions and dialogue.

This model is English-language only, and thus cannot be used for translation
or generating text in other languages.

#### Limitations and biases

The core functionality of Pile-T5 is to take a string of text in which some
spans have been replaced with mask tokens and predict the sequence of tokens
that would fill in those masks. Remember that the statistically most likely
sequence of tokens need not be the most “accurate” text. Never rely on Pile-T5
to produce factually accurate output.

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regards to gender, religion, and race.
Pile-T5 may produce socially unacceptable or undesirable text, *even if*
the prompt itself does not include anything explicitly offensive.

We recommend curating the outputs of this model before presenting them to a
human reader. Please inform your audience that you are using artificially
generated text.

#### How to use

Pile-T5 can be loaded using the `AutoModelForSeq2SeqLM` functionality:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")
```
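
Once loaded, the model can be given a masked (span-corrupted) input and asked to generate the missing spans. The sketch below assumes the original T5 sentinel-token convention (`<extra_id_0>`, `<extra_id_1>`, ...); the exact mask-token format used by this tokenizer is an assumption, so inspect `tokenizer.additional_special_tokens` before relying on it:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")

# "<extra_id_0>" / "<extra_id_1>" are assumed T5-style sentinels; verify them
# against tokenizer.additional_special_tokens for this checkpoint.
text = "The Pile is a large, <extra_id_0> dataset of English text for <extra_id_1> language models."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```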

### Training

#### Training dataset

The Pile is an 825GiB general-purpose dataset in English. It was created by
EleutherAI specifically for training large language models. It contains texts
from 22 diverse sources, roughly broken down into five categories: academic
writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project
Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub,
Enron Emails). See [the Pile paper](https://arxiv.org/abs/2101.00027) for
a breakdown of all data sources, methodology, and a discussion of ethical
implications. Consult [the datasheet](https://arxiv.org/abs/2201.07311) for
more detailed documentation about the Pile and its component datasets. The
Pile can be downloaded from the [official website](https://pile.eleuther.ai/),
or from a [community mirror](https://the-eye.eu/public/AI/pile/).

The Pile was deduplicated before being used to train Pile-T5.

#### Training procedure

Pile-T5 was trained with a batch size of approximately 1M tokens
(2048 sequences of 512 tokens each) for a total of 2,000,000 steps, using the
span-corruption objective.
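
For reference, 2048 sequences × 512 tokens per sequence is 1,048,576 (≈1M) tokens per step, so 2,000,000 steps correspond to roughly 2.1 trillion tokens, consistent with the "roughly 2 trillion tokens" figure quoted above.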

#### Training checkpoints

Intermediate checkpoints for Pile-T5 are accessible within this repository.
There are 200 checkpoints in total, spaced 10,000 steps apart. For T5x-native
checkpoints that can be used for finetuning with the T5x library, refer to
[this repository](https://huggingface.co/lintang/pile-t5-base-t5x/tree/main).
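
If the intermediate checkpoints are exposed as revisions (branches) of this repository, a specific checkpoint can be loaded with the `revision` argument of `from_pretrained`. The revision name used below (`step10000`) is a hypothetical example; check the repository's branch list for the actual names:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# "step10000" is a hypothetical revision name; list the repository's branches
# to find the actual intermediate-checkpoint revisions.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "EleutherAI/pile-t5-base",
    revision="step10000",
)
```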

### Evaluations

TBD

### BibTeX

```
@article{2024t5v2,
  author = {Lintang Sutawika and Aran Komatsuzaki and Colin Raffel},
  title  = {Pile T5, an update of T5},
  year   = {2024},
  url    = {}
}
```