Lightweight Transformer Search GPT-2-Small

The Lightweight Transformer Search (LTS) method is designed to identify transformer architectures that lie on the Pareto frontier, where trade-offs are made between several objectives, such as latency and memory usage. It combines architecture search with performance prediction and can find high-performing models that are also highly efficient.

LTS GPT-2-Small is a variant of the GPT-2 language model, which is known for its ability to generate coherent and fluent text. The LTS method was used to optimize the GPT-2 architecture for both memory and latency, resulting in a model that is smaller and faster than its counterparts, while still maintaining a high level of language generation quality.

For additional information, please refer to the GitHub repository.

Model Description

LTS was applied to the base GPT-2 architecture to find the best-performing models that satisfy a specific set of constraints. The table below displays the outcome of the search:

Model	Non-Embedding Parameters (M)	Latency (s)	Memory (MB)
gpt2_a9e3147996070fda25af4b39ed95b6a18d6d0402	1.06	0.008	29.06
gpt2_80fabe4acddff0dc796e287588e40d86e79df4b2	2.08	0.013	45.46
gpt2_90682823835acabd965294775983a1d5a2c2fa43	3.13	0.021	74.50
gpt2_c76bdddb5cf59275711672daa5b8c70e6c78bf4e	3.95	0.024	77.62
gpt2_8f5159304179c77ecdc69c953b71a3f8fa528564	5.13	0.030	94.64
gpt2_131845381012a68c3a358514fdffc12b09db1ed8	6.44	0.036	112.16
gpt2_917c2f9601a1c29d1f280bb172015e5fb210b6b3	7.41	0.042	90.76
gpt2_538d4b101df48595a935d90dbf4a7fb2ac09ac01	8.23	0.047	93.88
gpt2_c679fa01f00dd6f584614c6d9784eb233b047283	9.46	0.053	148.71
gpt2_39563367097004cfd771d76d8822e51ad79b56d6	10.65	0.051	190.77
gpt2_ddf63c1125f1fed5a7dd3537f640834187719996	13.32	0.069	125.78
gpt2_0e1b5a3c867d6473da270799061f3089a1df5afd	16.04	0.084	173.74
gpt2_3b30c85ac08c6b12b0ea46cb832270ba52b7fcd8	18.97	0.096	209.94
gpt2_1e9d92f0fed7288facc68cb448863e8120ccca9c	20.96	0.105	217.50
gpt2_0e8c86e6babd924ff8b511c94cc1647bf61f81a2	24.83	0.121	244.77
gpt2_5fea22df661ad91676709da7a334505f15765659	26.89	0.131	252.65
gpt2_46e7c68a025417e20a7e13bd4c1ee71438d28069	30.07	0.146	252.23
gpt2_98b0196b5a865ba76f31723646f33e0461dc910d	33.24	0.160	314.39
gpt2_4352a56f3fa9e7ba6d291867d356a08022753658	40.34	0.195	328.88
gpt2_6c6e63116ff74ba444ff5a08cef54380073ebea3	49.85	0.230	377.68
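
As a rough illustration of what "on the Pareto frontier" means for the rows above, the sketch below filters out any row that another row beats on every objective. It assumes that a higher non-embedding parameter count is desirable (as a stand-in for model quality) and that lower latency and memory are desirable. That choice of objectives is an assumption made for this example, not a statement of the exact objectives LTS optimizes. The helper names `dominates` and `pareto_front` are hypothetical.

```python
# Each row: (non-embedding params in M, latency in s, memory in MB),
# copied from the table above in the same order.
ROWS = [
    (1.06, 0.008, 29.06), (2.08, 0.013, 45.46), (3.13, 0.021, 74.50),
    (3.95, 0.024, 77.62), (5.13, 0.030, 94.64), (6.44, 0.036, 112.16),
    (7.41, 0.042, 90.76), (8.23, 0.047, 93.88), (9.46, 0.053, 148.71),
    (10.65, 0.051, 190.77), (13.32, 0.069, 125.78), (16.04, 0.084, 173.74),
    (18.97, 0.096, 209.94), (20.96, 0.105, 217.50), (24.83, 0.121, 244.77),
    (26.89, 0.131, 252.65), (30.07, 0.146, 252.23), (33.24, 0.160, 314.39),
    (40.34, 0.195, 328.88), (49.85, 0.230, 377.68),
]


def dominates(a, b):
    """Return True if row `a` is no worse than `b` everywhere and better somewhere.

    Assumed objectives: maximize parameters, minimize latency and memory.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(rows):
    """Keep only the rows that no other row dominates."""
    return [r for r in rows if not any(dominates(other, r) for other in rows)]


front = pareto_front(ROWS)
print(f"{len(front)} of {len(ROWS)} rows are non-dominated")
```

Under these assumed objectives, every row in the table is non-dominated: each larger model costs more latency or memory than the smaller ones, so none of them makes another redundant.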

How to Get Started with the Model

Use the code below to get started with the gpt2_0e1b5a3c867d6473da270799061f3089a1df5afd model. For other models, please change the subfolder argument to the proper identifier.

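from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is shared by all models in the repository;
# the subfolder selects one specific architecture from the table.
tokenizer = AutoTokenizer.from_pretrained("microsoft/lts-gpt2-sm")
model = AutoModelForCausalLM.from_pretrained("microsoft/lts-gpt2-sm", subfolder="gpt2_0e1b5a3c867d6473da270799061f3089a1df5afd")

# Encode the prompt as PyTorch tensors.
text = "# Halo Infinite Review"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Generate up to 128 tokens (prompt included) and decode them back to text.
generated_ids = model.generate(input_ids, max_length=128)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))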
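
When choosing which identifier to pass as `subfolder`, one option is to pick the largest model that fits a latency and memory budget. The following is a minimal sketch of that idea, using only an excerpt of the table above. The helper names `pick_subfolder` and `load_model` are hypothetical, and the budgets are only meaningful relative to the table's figures, whose measurement setup is not described here.

```python
# Excerpt of the table above: subfolder -> (params in M, latency in s, memory in MB).
TABLE = {
    "gpt2_a9e3147996070fda25af4b39ed95b6a18d6d0402": (1.06, 0.008, 29.06),
    "gpt2_8f5159304179c77ecdc69c953b71a3f8fa528564": (5.13, 0.030, 94.64),
    "gpt2_0e1b5a3c867d6473da270799061f3089a1df5afd": (16.04, 0.084, 173.74),
    "gpt2_0e8c86e6babd924ff8b511c94cc1647bf61f81a2": (24.83, 0.121, 244.77),
    "gpt2_6c6e63116ff74ba444ff5a08cef54380073ebea3": (49.85, 0.230, 377.68),
}


def pick_subfolder(max_latency_s, max_memory_mb, table=TABLE):
    """Return the subfolder of the largest model within both budgets, or None."""
    fitting = {
        name: row
        for name, row in table.items()
        if row[1] <= max_latency_s and row[2] <= max_memory_mb
    }
    if not fitting:
        return None
    # Among the models that fit, prefer the one with the most parameters.
    return max(fitting, key=lambda name: fitting[name][0])


def load_model(subfolder):
    """Load the chosen architecture (downloads weights on first call)."""
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        "microsoft/lts-gpt2-sm", subfolder=subfolder
    )


choice = pick_subfolder(max_latency_s=0.09, max_memory_mb=200.0)
print(choice)
```

The result of `pick_subfolder` can be passed to `load_model`, or directly as the `subfolder` argument in the snippet above. A return value of `None` means no listed model fits the budget.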

Training Details

Training Data

The models were trained on 7.8B tokens from The Pile dataset, which is roughly 10-15x more than the amount suggested in "Training Compute-Optimal Large Language Models". However, since these are small models (fewer than 100M parameters), they may produce repetitive text when generating a large number of tokens from a short context.
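
If repetitive output becomes a problem, the standard decoding options of `generate` in the transformers library can help. The sketch below collects a few of them. The specific values are untuned assumptions chosen for illustration, and the helper names `anti_repetition_kwargs` and `generate` are hypothetical.

```python
def anti_repetition_kwargs(max_length=128):
    """Decoding settings that tend to reduce repetition (illustrative values)."""
    return {
        "max_length": max_length,
        "do_sample": True,           # sample instead of greedy decoding
        "top_p": 0.95,               # nucleus sampling
        "temperature": 0.8,
        "repetition_penalty": 1.2,   # penalize tokens that were already generated
        "no_repeat_ngram_size": 3,   # never repeat the same 3-gram
    }


def generate(prompt, subfolder, **overrides):
    """Generate text with the settings above; downloads the model on first call."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/lts-gpt2-sm")
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/lts-gpt2-sm", subfolder=subfolder
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    kwargs = {**anti_repetition_kwargs(), **overrides}
    output_ids = model.generate(**inputs, **kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For example, `generate("# Halo Infinite Review", "gpt2_0e1b5a3c867d6473da270799061f3089a1df5afd")` would sample a continuation with these settings. Because sampling is enabled, the output differs between runs.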

Training Procedure

Please refer to the training script.

Limitations

Comparing smaller-sized transformers to large language models can be misleading and inaccurate, as they are fundamentally different in terms of the number of parameters, computational power, and capabilities. While smaller models may perform well on certain tasks, they lack the complexity and depth of larger models, which can lead to significant differences in their overall performance.

It is important to note that smaller models have their advantages. They require fewer computational resources, have a smaller memory footprint, and offer lower inference latency, which can be beneficial for real-time applications and devices with limited computing power. Additionally, research with smaller transformers may lead to the discovery of more efficient architectures, which make better use of computational resources and provide insights for training and deploying larger models.

Bias and Risks

Pre-training a language model using The Pile dataset may have several limitations and potential biases that need to be considered. The following are some of the technical and sociotechnical limitations associated with using this dataset for pre-training:

Domain-Specific Bias: The Pile is a large and diverse dataset that contains text from various sources, including academic papers, news articles, and social media. However, it may still be biased towards certain domains, such as technology or politics.
Quality Control: The Pile is a collection of smaller datasets, each with its own quality control measures. Data quality may therefore vary across these datasets, and some of them may contain low-quality or biased text.
Societal Biases: The dataset may also contain biases related to race, gender, and other societal factors. These biases may be present in the text itself or may be introduced through the quality control measures used in the dataset creation process.
Data Privacy: The dataset contains a large amount of user-generated content, which may raise privacy concerns. While the dataset is anonymized, it may still be possible to identify individuals based on the content of their text.

BibTeX entry and citation info

@misc{Archai:22,
   title={Archai: Platform for Neural Architecture Search},
   url={https://www.microsoft.com/en-us/research/project/archai-platform-for-neural-architecture-search},
   journal={Microsoft Research},
   year={2022},
   month={Jul}
}
