---
language:
- en
tags:
- text2text-generation
widget:
- text: "The <extra_id_0> walks in <extra_id_1> park"
  example_title: "Masked Language Modeling"
datasets:
- c4
- xsum
license: apache-2.0
---
# Model Card for Switch Transformers Base - 256 experts
![model image](https://s3.amazonaws.com/moonup/production/uploads/1666966931908-62441d1d9fdefb55a0b7d12c.png)
# Table of Contents
0. [TL;DR](#tldr)
1. [Model Details](#model-details)
2. [Usage](#usage)
3. [Uses](#uses)
4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
5. [Training Details](#training-details)
6. [Evaluation](#evaluation)
7. [Environmental Impact](#environmental-impact)
8. [Citation](#citation)
9. [Model Card Authors](#model-card-authors)
# TL;DR
Switch Transformers is a Mixture of Experts (MoE) model trained on a Masked Language Modeling (MLM) task. The architecture is similar to the classic T5, but the feed-forward layers are replaced by sparse MLP layers that contain several "expert" MLPs. According to the [original paper](https://arxiv.org/pdf/2101.03961.pdf), the model trains faster (better scaling properties) while outperforming T5 on fine-tuned tasks.
As mentioned in the first few lines of the abstract:
> we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus”, and achieve a 4x speedup over the T5-XXL model.
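
To make the "sparse MLP with experts" idea concrete, here is a minimal NumPy sketch of top-1 ("switch") routing, where a router picks one expert MLP per token and scales that expert's output by the router probability. It is an illustration only, not the `t5x` or `transformers` implementation: the function name `switch_ffn`, the ReLU activation, the tiny dimensions, and the random weights are all assumptions, and the capacity limits and auxiliary load-balancing loss described in the paper are left out.

```python
import numpy as np

def switch_ffn(tokens, router_w, experts_w1, experts_w2):
    """Top-1 routing: each token is processed by exactly one expert MLP.

    tokens:     (n_tokens, d_model)
    router_w:   (d_model, n_experts)
    experts_w1: (n_experts, d_model, d_ff)
    experts_w2: (n_experts, d_ff, d_model)
    """
    # Router: softmax over experts for every token.
    logits = tokens @ router_w
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)

    expert_idx = probs.argmax(axis=-1)                # chosen expert per token
    gate = probs[np.arange(len(tokens)), expert_idx]  # its router probability

    out = np.zeros_like(tokens)
    for e in np.unique(expert_idx):
        mask = expert_idx == e
        hidden = np.maximum(tokens[mask] @ experts_w1[e], 0.0)  # ReLU (assumed)
        out[mask] = hidden @ experts_w2[e]
    # Scale each expert output by the gate value of its token.
    return out * gate[:, None], expert_idx

# Hypothetical toy sizes; the real checkpoint uses 256 experts.
rng = np.random.default_rng(0)
n_tokens, d_model, d_ff, n_experts = 6, 8, 16, 4
x = rng.normal(size=(n_tokens, d_model))
y, chosen = switch_ffn(
    x,
    rng.normal(size=(d_model, n_experts)),
    rng.normal(size=(n_experts, d_model, d_ff)),
    rng.normal(size=(n_experts, d_ff, d_model)),
)
print(y.shape, chosen)
```

Because only one expert runs per token, the compute per token stays roughly that of a single feed-forward layer even as the number of experts (and therefore parameters) grows.
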
**Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copied and pasted from the [original paper](https://arxiv.org/pdf/2101.03961.pdf).
# Model Details
## Model Description
- **Model type:** Language model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Related Models:** [All Switch Transformers Checkpoints](https://huggingface.co/models?search=switch)
- **Original Checkpoints:** [All Original Switch Transformers Checkpoints](https://github.com/google-research/t5x/blob/main/docs/models.md#mixture-of-experts-moe-checkpoints)
- **Resources for more information:**
  - [Research paper](https://arxiv.org/pdf/2101.03961.pdf)
  - [GitHub Repo](https://github.com/google-research/t5x)
  - [Hugging Face Switch Transformers Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/switch_transformers)
# Usage
Note that these checkpoints have been trained on a Masked Language Modeling (MLM) task. Therefore, the checkpoints are not "ready-to-use" for downstream tasks. You may want to check `FLAN-T5` for running fine-tuned weights, or fine-tune your own MoE following [this notebook](https://colab.research.google.com/drive/1aGGVHZmtKmcNBbAwa9hbu58DDpIuB5O4?usp=sharing).
Find below some example scripts on how to use the model in `transformers`:
## Using the PyTorch model
### Running the model on a CPU
<details>
<summary> Click to expand </summary>
```python
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-256")
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-256")

# Sentinel tokens (<extra_id_N>) mark the spans the model should fill in.
input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
```
</details>
### Running the model on a GPU
<details>
<summary> Click to expand </summary>
```python
# pip install accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-256")
# device_map="auto" lets accelerate place the weights on the available device(s).
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-256", device_map="auto")

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
# Move the inputs to the first GPU.
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
```
</details>
### Running the model on a GPU using different precisions
#### FP16
<details>
<summary> Click to expand </summary>
```python
# pip install accelerate
import torch
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-256")
# Load the weights in half precision to reduce GPU memory use.
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-256", device_map="auto", torch_dtype=torch.float16)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
```
</details>
#### INT8
<details>
<summary> Click to expand </summary>
```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, BitsAndBytesConfig, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-256")
# Request 8-bit weights; without a quantization config the model loads in full precision.
model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-base-256",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output:
# <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
```
</details>
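
The decoded strings in the examples above interleave sentinel tokens with the predicted spans, which can be hard to read. The following is a small sketch of how the predictions might be merged back into the masked input using only the standard library. The helper names `parse_spans` and `fill_in` are hypothetical and are not part of `transformers`; the two strings are copied verbatim from the examples above.

```python
import re

SENTINEL = re.compile(r"<extra_id_(\d+)>")

def parse_spans(generated):
    """Map each sentinel id to the text generated right after it."""
    text = generated.replace("<pad>", "").replace("</s>", "")
    # With one capture group, split() yields [prefix, id0, span0, id1, span1, ...].
    parts = SENTINEL.split(text)
    return {int(i): span.strip() for i, span in zip(parts[1::2], parts[2::2])}

def fill_in(masked_input, generated):
    """Replace every sentinel in the input with its predicted span."""
    spans = parse_spans(generated)
    # Sentinels without a prediction are left untouched.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), masked_input)

masked = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
generated = "<pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>"

print(fill_in(masked, generated))
# A man walks into a bar a orders a beer with a pinch of salt.
```

Text generated after the final sentinel (here `<extra_id_4>`) has no matching slot in the input, so this sketch simply ignores it.
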
# Uses
## Direct Use and Downstream Use
The authors write in [the original paper's model card](https://arxiv.org/pdf/2210.11416.pdf) that:
> The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models
See the [research paper](https://arxiv.org/pdf/2210.11416.pdf) for further details.
## Out-of-Scope Use
More information needed.
# Bias, Risks, and Limitations
More information needed.
## Ethical considerations and risks
More information needed.
## Known Limitations
More information needed.
## Sensitive Use
> SwitchTransformers should not be applied for any unacceptable use cases, e.g., generation of abusive speech.
# Training Details
## Training Data
The model was trained on a Masked Language Modeling task on the Colossal Clean Crawled Corpus (C4) dataset, following the same procedure as `T5`.
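
As a rough illustration of the T5-style objective, the sketch below builds one (input, target) pair by replacing chosen word spans with sentinel tokens. It is a simplification under stated assumptions: the function name `span_corrupt` is hypothetical, the spans are picked by hand rather than sampled at random as in actual pre-training, and it works on whitespace-separated words instead of tokenizer IDs.

```python
def span_corrupt(words, spans):
    """Build a T5-style (input, target) pair of strings.

    words: list of whitespace-separated words
    spans: sorted, non-overlapping (start, end) index pairs to mask, end exclusive
    """
    inputs, targets, cursor = [], [], 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(words[cursor:start])  # keep the text before the span
        inputs.append(sentinel)             # the span collapses to one sentinel
        targets.append(sentinel)            # the target repeats the sentinel...
        targets.extend(words[start:end])    # ...followed by the hidden words
        cursor = end
    inputs.extend(words[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)

words = "The cute dog walks in the park".split()
src, tgt = span_corrupt(words, [(1, 3), (5, 6)])
print(src)  # The <extra_id_0> walks in <extra_id_1> park
print(tgt)  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>
```
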
## Training Procedure
According to the model card from the [original paper](https://arxiv.org/pdf/2210.11416.pdf):
> These models are based on pretrained SwitchTransformers and are not fine-tuned. It is normal if they perform well on zero-shot tasks.
The model was trained on TPU v3 or TPU v4 pods, using the [`t5x`](https://github.com/google-research/t5x) codebase together with [`jax`](https://github.com/google/jax).
# Evaluation
## Testing Data, Factors & Metrics
The authors evaluated the model on various tasks and compared the results against T5. See the table below for some quantitative evaluation:
![image.png](https://s3.amazonaws.com/moonup/production/uploads/1666967660372-62441d1d9fdefb55a0b7d12c.png)
For full details, please check the [research paper](https://arxiv.org/pdf/2101.03961.pdf).
## Results
For full results for Switch Transformers, see the [research paper](https://arxiv.org/pdf/2101.03961.pdf), Table 5.
# Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
- **Hours used:** More information needed
- **Cloud Provider:** GCP
- **Compute Region:** More information needed
- **Carbon Emitted:** More information needed
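
For readers who want to plug in their own numbers, the estimate behind such calculators can be sketched as energy consumed multiplied by the carbon intensity of the grid. The function below is a back-of-the-envelope illustration, not the calculator's actual code: the name `estimate_co2_kg`, its parameters, and the optional `pue` (data-center overhead) factor are assumptions, and the values in the call are placeholders rather than measurements for this model.

```python
def estimate_co2_kg(num_chips, watts_per_chip, hours, kg_co2_per_kwh, pue=1.0):
    """Estimate emissions as energy (kWh) times carbon intensity (kg CO2eq per kWh)."""
    energy_kwh = num_chips * watts_per_chip * hours / 1000.0 * pue
    return energy_kwh * kg_co2_per_kwh

# Placeholder inputs only; these are NOT figures for Switch Transformers.
example = estimate_co2_kg(num_chips=4, watts_per_chip=200.0, hours=10.0, kg_co2_per_kwh=0.5)
print(f"{example:.1f} kg CO2eq")
```
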
# Citation
**BibTeX:**
```bibtex
@misc{https://doi.org/10.48550/arxiv.2101.03961,
doi = {10.48550/ARXIV.2101.03961},
url = {https://arxiv.org/abs/2101.03961},
author = {Fedus, William and Zoph, Barret and Shazeer, Noam},
keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
publisher = {arXiv},
year = {2021},
copyright = {arXiv.org perpetual, non-exclusive license}
}
```