Update README.md
### Training Data

The models have been trained on only 7.8B tokens from [The Pile](https://huggingface.co/datasets/the_pile) dataset. Such a small training budget may result in repetitive text when generating longer outputs (32+ tokens).
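
If that happens, standard decoding options in the `transformers` `generate()` call usually help. The snippet below is a minimal illustrative sketch rather than part of the original card, and the model identifier is a placeholder for this checkpoint's name on the Hub.

```python
# Illustrative sketch only: decoding options that typically reduce verbatim
# repetition. The model identifier is a placeholder, not a real checkpoint id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "<this-checkpoint>"  # replace with the checkpoint shown above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Machine learning is", return_tensors="pt")
generated_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,             # sampling tends to repeat less than greedy decoding
    top_p=0.9,
    repetition_penalty=1.2,     # penalize tokens that were already generated
    no_repeat_ngram_size=3,     # block exact 3-gram repeats
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```

The longer the requested output, the more these penalties matter, since the repetition described above tends to grow with `max_new_tokens`.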
### Training Procedure
Please refer to the [training script](https://github.com/microsoft/archai/blob/main/tasks/text_generation/train.py).
## Limitations
Comparing smaller-sized transformers to large language models can be misleading and inaccurate, as they are fundamentally different in terms of number of parameters, computational requirements, and capabilities. While smaller models may perform well on certain tasks, they lack the capacity and depth of larger models, which can lead to significant differences in overall performance.

It is important to note that smaller models have their advantages: they require fewer computational resources, have a smaller memory footprint, and offer lower inference latency, which can be beneficial for real-time applications and for devices with limited computing power. Additionally, research on smaller-sized transformers may lead to the discovery of more efficient architectures that make better use of computational resources and provide insights for training and deploying larger models.
## Bias and Risks
Pre-training a language model using The Pile dataset may have several limitations and potential biases that need to be considered. The following are some of the technical and sociotechnical limitations associated with using this dataset for pre-training: