Model Card for Switch Transformers Base - 256 experts

Table of Contents

TL;DR
Model Details
Usage
Uses
Bias, Risks, and Limitations
Training Details
Evaluation
Environmental Impact
Citation
Model Card Authors

TL;DR

Switch Transformers is a Mixture of Experts (MoE) model trained on a Masked Language Modeling (MLM) task. The architecture is similar to the classic T5, but the feed-forward layers are replaced by sparse MLP layers containing "expert" MLPs. According to the original paper, the model enables faster training (better scaling properties) while outperforming T5 on fine-tuned tasks. As mentioned in the first few lines of the abstract:

we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus”, and achieve a 4x speedup over the T5-XXL model.

Disclaimer: Content from this model card has been written by the Hugging Face team, and parts of it were copied from the original paper.
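To make the "sparse MLP with experts" idea concrete, here is a minimal pure-Python sketch of top-1 ("switch") routing as described in the paper: a router scores each token against every expert, the token is sent to the single highest-scoring expert, and the expert output is scaled by the router probability. The function names, the toy router weights, and the toy experts are all illustrative assumptions, not the transformers implementation; capacity limits and the load-balancing loss are omitted.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def switch_route(token: Vector, router_weights: Sequence[Vector]) -> Tuple[int, float]:
    """Return (expert index, gate probability) for one token (top-1 routing)."""
    # One logit per expert: dot product of the token with that expert's router row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def switch_layer(
    tokens: Sequence[Vector],
    router_weights: Sequence[Vector],
    experts: Sequence[Callable[[Vector], Vector]],
) -> List[Vector]:
    """Apply exactly one expert per token and scale its output by the gate value."""
    outputs = []
    for token in tokens:
        expert, gate = switch_route(token, router_weights)
        outputs.append([gate * value for value in experts[expert](token)])
    return outputs


# Toy setup (hypothetical): two experts, two-dimensional tokens.
toy_router = [[1.0, 0.0], [0.0, 1.0]]
toy_experts = [lambda t: [2.0 * v for v in t], lambda t: [-v for v in t]]
print(switch_layer([[2.0, 0.0], [0.0, 3.0]], toy_router, toy_experts))
```

Because only one expert runs per token, the compute per token stays roughly constant as the number of experts grows, which is the property the paper relies on when scaling parameter count.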

Model Details

Model Description

Model type: Language model
Language(s) (NLP): English
License: Apache 2.0
Related Models: All Switch Transformers Checkpoints
Original Checkpoints: All Original Switch Transformers Checkpoints
Resources for more information:

Usage

Note that these checkpoints have been trained on a Masked Language Modeling (MLM) task. Therefore, the checkpoints are not "ready-to-use" for downstream tasks. You may want to check FLAN-T5 for running fine-tuned weights, or fine-tune your own MoE following this notebook.

Find below some example scripts on how to use the model in transformers:

Using the PyTorch model

Running the model on a CPU

from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-256")
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-256")

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output: <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
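The decoded output interleaves sentinel tokens with the text predicted for each masked span. As a rough sketch of how one might read that output back into the prompt, the helpers below split the decoded string on `<extra_id_N>` markers and substitute each span into the masked input. `parse_sentinel_spans` and `fill_sentinels` are hypothetical helper names, not part of transformers.

```python
import re
from typing import Dict

SENTINEL = re.compile(r"<extra_id_(\d+)>")


def parse_sentinel_spans(decoded: str) -> Dict[int, str]:
    """Map each sentinel index to the text generated after it."""
    # Remove the special tokens that are not sentinels.
    cleaned = decoded.replace("<pad>", "").replace("</s>", "")
    # With one capture group, split() yields [prefix, id0, text0, id1, text1, ...].
    parts = SENTINEL.split(cleaned)
    return {int(idx): text.strip() for idx, text in zip(parts[1::2], parts[2::2])}


def fill_sentinels(masked: str, spans: Dict[int, str]) -> str:
    """Replace each sentinel in the masked input with its predicted span."""
    # Sentinels without a prediction are left untouched.
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), m.group(0)), masked)


masked_input = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
decoded_output = "<pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>"

spans = parse_sentinel_spans(decoded_output)
print(fill_sentinels(masked_input, spans))
# A man walks into a bar a orders a beer with a pinch of salt.
```

The final sentinel (`<extra_id_4>` here) only closes the last span, so it has no counterpart in the input and is ignored when filling.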

Running the model on a GPU

# pip install accelerate
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-256")
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-256", device_map="auto")

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output: <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

Running the model on a GPU using different precisions

FP16

# pip install accelerate
import torch  # needed for torch.float16 below
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-256")
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-256", device_map="auto", torch_dtype=torch.float16)

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output: <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>

INT8

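# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, BitsAndBytesConfig, SwitchTransformersForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/switch-base-256")
# Request 8-bit weights explicitly; without a quantization config the model loads in its default precision.
model = SwitchTransformersForConditionalGeneration.from_pretrained("google/switch-base-256", device_map="auto", quantization_config=BitsAndBytesConfig(load_in_8bit=True))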

input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
# Example output: <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
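As a back-of-the-envelope aid for choosing between the precisions above, the sketch below estimates the memory taken by the weights alone: parameter count times bytes per parameter. The parameter count used in the example is a placeholder, not the size of this checkpoint, and real footprints will differ because activations, buffers, and any modules kept in higher precision are not counted.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_bytes(num_params: int, precision: str) -> int:
    """Rough lower bound on weight storage; ignores activations and overhead."""
    return num_params * BYTES_PER_PARAM[precision]


# Placeholder parameter count, chosen only to illustrate the arithmetic.
hypothetical_params = 1_000_000_000
for precision in ("fp32", "fp16", "int8"):
    gib = weight_memory_bytes(hypothetical_params, precision) / 2**30
    print(f"{precision}: {gib:.2f} GiB")
```

To apply this to a loaded model, the real parameter count can be obtained with `sum(p.numel() for p in model.parameters())`.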

Uses

Direct Use and Downstream Use

See the research paper for further details.

Out-of-Scope Use

More information needed.

Bias, Risks, and Limitations

More information needed.

Ethical considerations and risks

More information needed.

Known Limitations

More information needed.

Sensitive Use:

More information needed.

Training Details

Training Data

The model was trained on a Masked Language Modeling task on the Colossal Clean Crawled Corpus (C4) dataset, following the same procedure as T5.
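For readers unfamiliar with the T5-style objective, the following sketch shows how a training pair can be built by span corruption: each masked span in the input is replaced by a sentinel, and the target lists each sentinel followed by the tokens it hides, closed by one final sentinel. The function name, the whitespace tokenization, and the hand-picked spans are simplifying assumptions; the actual pipeline samples spans randomly over subword tokens.

```python
from typing import List, Tuple


def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
    """Build (input, target) strings from sorted, non-overlapping half-open spans."""
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        # Keep the unmasked tokens, then mark the masked span with a sentinel.
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        # The target repeats the sentinel and reveals the hidden tokens.
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the last span in the target.
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


words = "A man walks into a bar and orders a beer".split()
model_input, model_target = span_corrupt(words, [(1, 2), (9, 10)])
print(model_input)   # A <extra_id_0> walks into a bar and orders a <extra_id_1>
print(model_target)  # <extra_id_0> man <extra_id_1> beer <extra_id_2>
```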

Training Procedure

According to the model card from the original paper, the model was trained on TPU v3 or TPU v4 pods, using the t5x codebase together with JAX.

Evaluation

Testing Data, Factors & Metrics

The authors evaluated the model on various tasks and compared the results against T5. For full details, including the quantitative comparison, please check the research paper.

Results

For full results for Switch Transformers, see the research paper, Table 5.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
Hours used: More information needed
Cloud Provider: GCP
Compute Region: More information needed
Carbon Emitted: More information needed
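Since the hours and region are not reported, no figure can be given here, but the arithmetic behind such an estimate is simple: energy used (power times time, scaled by data-center overhead) multiplied by the carbon intensity of the local grid. The sketch below assumes that formulation; the function name and every input value in the example are hypothetical placeholders, not measurements for this model.

```python
def estimate_emissions_kg(
    power_watts_per_chip: float,
    num_chips: int,
    hours: float,
    carbon_intensity_kg_per_kwh: float,
    pue: float = 1.0,
) -> float:
    """Estimate kg of CO2-equivalent as energy (kWh) times grid carbon intensity."""
    # Total energy drawn by the accelerators, converted from Wh to kWh.
    energy_kwh = power_watts_per_chip * num_chips * hours / 1000.0
    # PUE scales the energy to account for data-center overhead such as cooling.
    return energy_kwh * pue * carbon_intensity_kg_per_kwh


# All inputs below are made-up placeholders that only illustrate the calculation.
print(estimate_emissions_kg(200, 4, 100, 0.5))  # 40.0
```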

Citation

BibTeX:

@misc{https://doi.org/10.48550/arxiv.2101.03961,
  doi = {10.48550/ARXIV.2101.03961},
  url = {https://arxiv.org/abs/2101.03961},
  author = {Fedus, William and Zoph, Barret and Shazeer, Noam},
  keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
  title = {Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
