# Lightweight Transformer Search (LTS) GPT-2-Small
The Lightweight Transformer Search (LTS) method identifies transformer architectures on the Pareto frontier of several objectives, such as latency and memory usage, where no objective can be improved without worsening another. It combines architecture search with performance prediction, and is capable of finding models that are both high-performing and highly efficient.
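As an illustration of the Pareto-frontier idea, a minimal dominance check over two objectives (lower is better for both) looks like the sketch below. The candidate points are hypothetical, not outputs of the actual search:

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    A point dominates another if it is no worse on every objective
    and strictly better on at least one (lower is better here).
    """
    front = []
    for p in points:
        dominated = any(
            all(q[i] <= p[i] for i in range(len(p)))
            and any(q[i] < p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (latency_s, validation_loss) candidates
candidates = [(0.008, 3.2), (0.013, 2.9), (0.010, 3.5), (0.021, 2.5)]
print(pareto_front(candidates))
```

The point (0.010, 3.5) is dropped because (0.008, 3.2) beats it on both objectives; the remaining three each trade latency for quality.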
LTS GPT-2-Small is a family of variants of the GPT-2 language model, which is known for generating coherent, fluent text. LTS was used to optimize the GPT-2 architecture for both memory and latency, yielding models that are smaller and faster than the original while maintaining high language-generation quality.
For additional information, please refer to the GitHub repository.
## Model Description
LTS was applied to the base GPT-2 architecture to find the best-performing models under a given set of constraints. The table below summarizes the search results:
Model | Non-Embedding Parameters (M) | Latency (s) | Memory (MB) |
---|---|---|---|
gpt2_a9e3147996070fda25af4b39ed95b6a18d6d0402 | 1.06 | 0.008 | 29.06 |
gpt2_80fabe4acddff0dc796e287588e40d86e79df4b2 | 2.08 | 0.013 | 45.46 |
gpt2_90682823835acabd965294775983a1d5a2c2fa43 | 3.13 | 0.021 | 74.50 |
gpt2_c76bdddb5cf59275711672daa5b8c70e6c78bf4e | 3.95 | 0.024 | 77.62 |
gpt2_8f5159304179c77ecdc69c953b71a3f8fa528564 | 5.13 | 0.030 | 94.64 |
gpt2_131845381012a68c3a358514fdffc12b09db1ed8 | 6.44 | 0.036 | 112.16 |
gpt2_917c2f9601a1c29d1f280bb172015e5fb210b6b3 | 7.41 | 0.042 | 90.76 |
gpt2_538d4b101df48595a935d90dbf4a7fb2ac09ac01 | 8.23 | 0.047 | 93.88 |
gpt2_c679fa01f00dd6f584614c6d9784eb233b047283 | 9.46 | 0.053 | 148.71 |
gpt2_39563367097004cfd771d76d8822e51ad79b56d6 | 10.65 | 0.051 | 190.77 |
gpt2_ddf63c1125f1fed5a7dd3537f640834187719996 | 13.32 | 0.069 | 125.78 |
gpt2_0e1b5a3c867d6473da270799061f3089a1df5afd | 16.04 | 0.084 | 173.74 |
gpt2_3b30c85ac08c6b12b0ea46cb832270ba52b7fcd8 | 18.97 | 0.096 | 209.94 |
gpt2_1e9d92f0fed7288facc68cb448863e8120ccca9c | 20.96 | 0.105 | 217.50 |
gpt2_0e8c86e6babd924ff8b511c94cc1647bf61f81a2 | 24.83 | 0.121 | 244.77 |
gpt2_5fea22df661ad91676709da7a334505f15765659 | 26.89 | 0.131 | 252.65 |
gpt2_46e7c68a025417e20a7e13bd4c1ee71438d28069 | 30.07 | 0.146 | 252.23 |
gpt2_98b0196b5a865ba76f31723646f33e0461dc910d | 33.24 | 0.160 | 314.39 |
gpt2_4352a56f3fa9e7ba6d291867d356a08022753658 | 40.34 | 0.195 | 328.88 |
gpt2_6c6e63116ff74ba444ff5a08cef54380073ebea3 | 49.85 | 0.230 | 377.68 |
## How to Get Started with the Model

Use the code below to get started with the `gpt2_0e1b5a3c867d6473da270799061f3089a1df5afd` model. For other models, change the `subfolder` argument to the appropriate identifier.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the shared tokenizer and one specific architecture from its subfolder
tokenizer = AutoTokenizer.from_pretrained("microsoft/lts-gpt2-sm")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/lts-gpt2-sm",
    subfolder="gpt2_0e1b5a3c867d6473da270799061f3089a1df5afd",
)

# Generate up to 128 tokens continuing the prompt
text = "# Halo Infinite Review"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```
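Latency figures like those in the table above are typically obtained by timing repeated calls after a warmup. The generic helper below is a sketch of that pattern; the warmup and run counts are arbitrary choices, not the procedure actually used to produce the table:

```python
import time

def median_latency(fn, *args, warmup=3, runs=20):
    """Median wall-clock time of fn(*args), ignoring a few warmup calls."""
    for _ in range(warmup):
        fn(*args)  # warmup: first calls often pay one-time setup costs
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]

# e.g. median_latency(model.generate, input_ids) for the model loaded above
```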
## Training Details

### Training Data
The models were trained on only 7.8B tokens from The Pile dataset, which is roughly 10-15x the amount suggested by "Training Compute-Optimal Large Language Models". However, since these are small models (fewer than 100M parameters), they may produce repetitive text when generating many tokens from a short context.
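As a rough sanity check on that ratio, assuming the ~20 tokens-per-parameter rule of thumb commonly drawn from the compute-optimal scaling results (the exact figure is an approximation, and the ratio varies with model size):

```python
# ~20 training tokens per parameter is a common rule of thumb read off
# the compute-optimal ("Chinchilla") scaling results; an approximation.
TOKENS_PER_PARAM = 20
TRAINED_TOKENS = 7.8e9

# Non-embedding parameter counts (millions) for some larger table entries
for params_m in (24.83, 26.89, 33.24):
    suggested = TOKENS_PER_PARAM * params_m * 1e6
    print(f"{params_m:.2f}M params: "
          f"{TRAINED_TOKENS / suggested:.1f}x the suggested token budget")
```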
### Training Procedure
Please refer to the training script.
## Limitations
Comparing smaller transformers to large language models can be misleading, as they differ fundamentally in parameter count, computational power, and capability. While smaller models may perform well on certain tasks, they lack the complexity and depth of larger models, which can lead to significant differences in overall performance.
That said, smaller models have their advantages: they require fewer computational resources, have a smaller memory footprint, and offer lower inference latency, which is beneficial for real-time applications and devices with limited computing power. Additionally, research on smaller transformers may uncover more efficient architectures that better utilize computational resources and provide insights for training and deploying larger models.
## Bias and Risks
Pre-training a language model on The Pile dataset carries several limitations and potential biases. The following are some of the technical and sociotechnical concerns associated with using this dataset for pre-training:
- **Domain-Specific Bias**: The Pile is a large, diverse dataset containing text from various sources, including academic papers, news articles, and social media. It may nonetheless be biased towards certain domains, such as technology or politics.
- **Quality Control**: The Pile is a collection of smaller datasets, each with its own quality-control measures. Data quality may vary across these datasets, and some may contain low-quality or biased text.
- **Societal Biases**: The dataset may contain biases related to race, gender, and other societal factors. These biases may be present in the text itself or introduced through the quality-control measures used during dataset creation.
- **Data Privacy**: The dataset contains a large amount of user-generated content, which raises privacy concerns. While the dataset is anonymized, it may still be possible to identify individuals from the content of their text.
## BibTeX Entry and Citation Info

```bibtex
@misc{Archai:22,
  title   = {Archai: Platform for Neural Architecture Search},
  url     = {https://www.microsoft.com/en-us/research/project/archai-platform-for-neural-architecture-search},
  journal = {Microsoft Research},
  year    = {2022},
  month   = {Jul}
}
```