---
datasets:
- EleutherAI/pile
language:
- en
pipeline_tag: text2text-generation
tags:
- t5x
- encode-decoder
---

Pile-T5 Base is an encoder-decoder model trained on [the Pile](https://pile.eleuther.ai/) using the [T5x](https://github.com/google-research/t5x) library. The model was trained for 2 million steps, or roughly 2 trillion tokens, with a masked language modelling (MLM) objective similar to that of the original T5 model.

### Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Blogpost](). For details about the training dataset,
  see [the Pile paper](https://arxiv.org/abs/2101.00027) and [its data
  sheet](https://arxiv.org/abs/2201.07311).
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
  Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
  Please read the existing Pile-T5 documentation before asking about the model
  on Discord. For general correspondence: [contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure style="width:30em">

| Hyperparameter             | Value |
| -------------------------- | ----- |
| n<sub>parameters</sub>     |       |
| n<sub>encoder layers</sub> | 24    |
| n<sub>decoder layers</sub> | 24    |
| d<sub>model</sub>          | 2816  |
| d<sub>emb</sub>            | 1024  |
| n<sub>heads</sub>          | 16    |
| d<sub>head</sub>           | 64    |
| n<sub>vocab</sub>          | 32128 |
| Sequence Length            | 512   |
</figure>

### Uses and limitations

#### Intended use

Pile-T5 was developed primarily for research purposes. It learns an inner
representation of the English language that can be used to extract features
useful for downstream tasks.

In addition to scientific uses, you may also further fine-tune and adapt
Pile-T5 for deployment, as long as your use is in accordance with the
Apache 2.0 license. This model works with the [Transformers
Library](https://huggingface.co/docs/transformers/index). If you decide to use
pre-trained Pile-T5 as a basis for your fine-tuned model, please note that
you need to conduct your own risk and bias assessment.
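
As a rough illustration of what such fine-tuning could look like (this is not the
recipe used by EleutherAI; the toy dataset, hyperparameters, and output path below
are placeholders):

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")

# Toy input/target pairs purely for illustration; substitute your own task data.
raw = Dataset.from_dict({
    "input": ["The Pile is an 825GiB general-purpose dataset in English."],
    "target": ["A large English text dataset."],
})

def preprocess(batch):
    features = tokenizer(batch["input"], truncation=True, max_length=512)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=128)
    features["labels"] = labels["input_ids"]
    return features

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="pile-t5-base-finetuned", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```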

#### Out-of-scope use

Pile-T5 is **not** intended for deployment as-is. It is not a product
and cannot be used for human-facing interactions without supervision.

Pile-T5 has not been fine-tuned for downstream tasks for which language
models are commonly deployed, such as writing genre prose or commercial
chatbots. This means Pile-T5 will likely **not** respond to a given prompt
the way products such as ChatGPT do. This is because, unlike Pile-T5,
ChatGPT was fine-tuned using methods such as Reinforcement Learning from Human
Feedback (RLHF) to better “understand” human instructions and dialogue.

This model is English-language only, and thus cannot be used for translation
or generating text in other languages.

#### Limitations and biases

The core functionality of Pile-T5 is to take a string of text in which some
spans have been replaced with mask tokens and to predict the sequence of tokens
that would fill those masks. Remember that the statistically most likely sequence
of tokens need not be the most “accurate” text. Never rely on Pile-T5 to produce
factually accurate output.

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regards to gender, religion, and race.
Pile-T5 may produce socially unacceptable or undesirable text, *even if*
the prompt itself does not include anything explicitly offensive.

We recommend curating the outputs of this model before presenting them to a
human reader. Please inform your audience that you are using artificially
generated text.

#### How to use

Pile-T5 can be loaded using the `AutoModelForSeq2SeqLM` functionality:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")
```
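
Once loaded, the model can be used for T5-style span infilling. The snippet below is
a minimal sketch that assumes the tokenizer exposes `<extra_id_0>`-style sentinel
tokens; if it does not, check `tokenizer.additional_special_tokens` for the exact
sentinel strings used by this checkpoint.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-base")

# Mask a span with a sentinel token and let the model fill it in.
# (Assumption: T5-style <extra_id_N> sentinels; verify against the tokenizer.)
prompt = "The capital of France is <extra_id_0>."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)

# The decoder output interleaves sentinel tokens with the predicted spans.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```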

### Training

#### Training dataset

The Pile is an 825GiB general-purpose dataset in English. It was created by
EleutherAI specifically for training large language models. It contains texts
from 22 diverse sources, roughly broken down into five categories: academic
writing (e.g. arXiv), internet (e.g. CommonCrawl), prose (e.g. Project
Gutenberg), dialogue (e.g. YouTube subtitles), and miscellaneous (e.g. GitHub,
Enron Emails). See [the Pile paper](https://arxiv.org/abs/2101.00027) for
a breakdown of all data sources, methodology, and a discussion of ethical
implications. Consult [the datasheet](https://arxiv.org/abs/2201.07311) for
more detailed documentation about the Pile and its component datasets. The
Pile can be downloaded from the [official website](https://pile.eleuther.ai/),
or from a [community mirror](https://the-eye.eu/public/AI/pile/).

The Pile was deduplicated before being used to train Pile-T5.

#### Training procedure

Pile-T5 was trained with a batch size of approximately 1M tokens
(2048 sequences of 512 tokens each), for a total of 2,000,000 steps, using
the span-corruption objective.
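
As a quick back-of-the-envelope check (using only the numbers quoted above), this
batch size and step count are consistent with the roughly 2 trillion training
tokens mentioned earlier:

```python
# 2048 sequences of 512 tokens per batch, for 2,000,000 steps.
tokens_per_step = 2048 * 512                 # 1,048,576 ≈ 1M tokens per batch
total_tokens = tokens_per_step * 2_000_000
print(f"{total_tokens:,}")                   # 2,097,152,000,000 ≈ 2.1 trillion tokens
```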

#### Training checkpoints

Intermediate checkpoints for Pile-T5 are accessible within this repository.
There are 200 checkpoints in total, spaced 10,000 steps apart. For T5x-native
checkpoints that can be used for finetuning with the T5x library, refer to
[this repository](https://huggingface.co/lintang/pile-t5-base-t5x/tree/main).
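
With the Transformers library, an intermediate checkpoint can typically be loaded by
passing the corresponding repository revision to `from_pretrained`. The revision name
below is a placeholder assumption; list the repository's branches to find the exact
identifiers used for the intermediate checkpoints.

```python
from transformers import AutoModelForSeq2SeqLM

# "step10000" is a hypothetical revision name; check the repository's branch
# list for the actual naming scheme of the intermediate checkpoints.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "EleutherAI/pile-t5-base",
    revision="step10000",
)
```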

### Evaluations

TBD

### BibTeX

```
@article{2024t5v2,
  author = {Lintang Sutawika and Aran Komatsuzaki and Colin Raffel},
  title  = {Pile T5, an update of T5},
  year   = {2024},
  url    = {}
}
```