SweCTRL-Mini

SweCTRL-Mini is a large Swedish language model that can be used for inference and fine-tuning on a single consumer-grade GPU. The model is based on the CTRL architecture by Keskar, McCann, Varshney, Xiong, and Socher (2019), which means that users of the SweCTRL-Mini model can control the genre of the generated text by inserting special tokens in the generation prompts. Crucially, note that this model is:

NOT trained on following GPT-like instructions,
NOT trained for conversations, like ChatGPT,
NOT trained on any multi-modal data. It saw only one modality, text, more than 99% of which is in Swedish.

Note on using the Inference API (text box to the right): There are a number of presets that start the text with appropriate control codes to control the genre, e.g., :wiki: for texts from Wikipedia. You can add your own prompt after these control codes. For instance, if you want a Wikipedia article about Stockholm, you could write :wiki: Stockholm. The generation in the example is limited to at most 75 new tokens. Normally, generation should stop after reaching the ending control code, which has a $ symbol at the end, e.g., :wiki:$ for Wikipedia texts. However, I couldn't configure that here, so please ignore all text after such tokens if they are generated. Additionally, note that there are no filters or other mechanisms for making the text safe from biases or for prohibiting it from generating texts on any topic.
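The prompt format and the manual truncation described above can be sketched in a few lines of Python. This is only an illustration: the helper names build_prompt and truncate_at_end_code are hypothetical and are not part of the SweCTRL-Mini repository. The sketch assumes that the ending control code is the starting code followed by $, as in the :wiki: / :wiki:$ example.

```python
def build_prompt(control_code: str, text: str = "") -> str:
    """Put the control code first, then the optional user prompt."""
    return f"{control_code} {text}".strip()


def truncate_at_end_code(generated: str, control_code: str) -> str:
    """Drop everything from the ending control code onwards.

    Assumption: the ending code is the starting code plus "$",
    e.g. ":wiki:$" for ":wiki:".
    """
    end_code = control_code + "$"
    idx = generated.find(end_code)
    if idx == -1:
        # No ending code was generated, so keep the text as it is.
        return generated
    return generated[:idx].rstrip()


prompt = build_prompt(":wiki:", "Stockholm")
print(prompt)  # :wiki: Stockholm

# A made-up model output, used only to show the truncation.
raw = ":wiki: Stockholm är Sveriges huvudstad. :wiki:$ mer text"
print(truncate_at_end_code(raw, ":wiki:"))
```

The starting code is never mistaken for the ending code here, because the search looks for the code together with the trailing $.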

Model Details

Model Description

Developed by: Dmytro Kalpakchi (with supervision from Johan Boye)
Shared by: Dmytro Kalpakchi
Model type: Transformer-based language model trained by predicting the next token
Language(s) (NLP): Swedish
License: BigScience Open RAIL-M
Finetuned from model: None, trained from scratch

Model Sources

Website: https://swectrl.dev/
Repository: https://github.com/dkalpakchi/SweCTRL-Mini
Paper: https://arxiv.org/pdf/2304.13994.pdf
Technical note: https://zenodo.org/record/7868205

Uses

Direct Use

The model should be used for generating texts of various genres in Swedish.

Out-of-Scope Use

Please refer to Appendix A of the License file for information on use restrictions. The model has a limited context window of 256 tokens, so it will most probably not work well for text summarization. Additionally, the vast majority of its training data was in Swedish, although it contains tokens in other languages as well, so tasks like Machine Translation would require further fine-tuning.
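To make the 256-token limit concrete, the following minimal sketch computes how many new tokens still fit after a prompt of a given length. The helper remaining_token_budget is hypothetical, and the prompt length is assumed to have been measured with the model's own tokenizer.

```python
CONTEXT_WINDOW = 256  # context window stated above, in tokens


def remaining_token_budget(prompt_token_count: int,
                           context_window: int = CONTEXT_WINDOW) -> int:
    """Return how many new tokens fit after a prompt of the given length."""
    if prompt_token_count < 0:
        raise ValueError("prompt_token_count must be non-negative")
    # A prompt that already fills the window leaves no room for generation.
    return max(0, context_window - prompt_token_count)


print(remaining_token_budget(10))   # 246
print(remaining_token_budget(300))  # 0
```

A long source document therefore leaves little or no room for a generated summary, which is why summarization is listed as out of scope.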

Bias, Risks, and Limitations

To mitigate the inclusion of personally identifiable data, we attempted to remove sources that could contain such data to the best of our ability (see the Technical note for more details on the data filtering process). However, we have still noted that the model can generate text that includes various forms of bias, which is why we strongly recommend human curation of the generated texts. So far we have conducted no systematic investigation of either the kinds of biases present in the generated texts or how frequently they occur. Contributions from the community on this matter would be very welcome.

Recommendations

For further recommendations on the use of the model, please see the associated paper.

How to Get Started with the Model

The fastest way to start with the model is using the code below:

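from transformers import pipeline
# Load the model from the Hugging Face Hub; the task is inferred from the model's metadata.
pipe = pipeline(model="dkalpakchi/SweCTRL-Mini")
# Start from the ":nyheter:" ("news") control code; max_length counts the prompt plus the generated tokens.
print(pipe(":nyheter:", max_length=256, repetition_penalty=1.1, top_p=0.9))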

For more advanced uses and other code examples, please see the associated GitHub repository (https://github.com/dkalpakchi/SweCTRL-Mini).

Training Details

Training Data

The training data includes a subset of cleaned Swedish mC4, as well as some documents from Project Runeberg. Extensive information on the training data is provided in Section 1 of the Technical note. An interface for partially mining the training data is available at: https://swectrl.dev/data

Training Procedure

Preprocessing

See Section 1 of the Technical note.

Training Hyperparameters

Training regime: fp32

Evaluation

See Sections 5.3, 6, and 7 in the associated paper, and Section 3 of the Technical note.

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: 8 A100 GPUs
Hours used: 11907.6 GPU-hours for training and experimentation
Provider: BerzeLiUs supercomputer
Carbon Emitted: No public data on carbon efficiency, so hard to estimate
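Although the carbon efficiency of the provider is not publicly known, the arithmetic behind such an estimate is simple: energy is GPU-hours times power draw (times a data-centre overhead factor), and emissions are energy times the carbon intensity of the electricity. The sketch below follows that general recipe and is not the calculator's actual code. Only the GPU-hours figure comes from this card. The per-GPU power draw of 0.4 kW and the overhead factor of 1.0 are assumptions, and the carbon intensity is deliberately left as a parameter because no value is available.

```python
GPU_HOURS = 11907.6          # from the list above
ASSUMED_GPU_POWER_KW = 0.4   # assumption: roughly 400 W per A100 under full load
ASSUMED_PUE = 1.0            # assumption: no data-centre overhead included


def energy_kwh(gpu_hours: float, gpu_power_kw: float, pue: float) -> float:
    """Energy used, in kWh: GPU-hours x power per GPU x overhead factor."""
    return gpu_hours * gpu_power_kw * pue


def emissions_kg(energy: float, carbon_intensity_kg_per_kwh: float) -> float:
    """Emissions in kg CO2-equivalent for a given carbon intensity."""
    return energy * carbon_intensity_kg_per_kwh


energy = energy_kwh(GPU_HOURS, ASSUMED_GPU_POWER_KW, ASSUMED_PUE)
print(f"Estimated energy under the stated assumptions: {energy:.2f} kWh")
# Plug in the provider's carbon intensity here once it is known:
# print(emissions_kg(energy, carbon_intensity_kg_per_kwh=...))
```

Under these assumptions the energy estimate is an upper-bound style figure for the GPUs alone, since not every GPU-hour runs at full power and other hardware is ignored.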

Technical Specifications

See Section 3 of the associated paper.

Citation

BibTeX:

@article{kalpakchi2023swectrl,
  title={SweCTRL-Mini: a data-transparent Transformer-based large language model for controllable text generation in Swedish},
  author={Kalpakchi, Dmytro and Boye, Johan},
  journal={arXiv preprint arXiv:2304.13994},
  year={2023}
}

APA:

Kalpakchi, D., & Boye, J. (2023). SweCTRL-Mini: a data-transparent Transformer-based large language model for controllable text generation in Swedish. arXiv preprint arXiv:2304.13994.

Model Card Authors

Dmytro Kalpakchi (dmytroka@kth.se)

Model Card Contact