---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
language:
  - bg
  - ca
  - code
  - cs
  - cy
  - da
  - de
  - el
  - en
  - es
  - et
  - eu
  - fi
  - fr
  - ga
  - gl
  - hr
  - hu
  - it
  - lt
  - lv
  - mt
  - nl
  - nn
  - 'no'
  - oc
  - pl
  - pt
  - ro
  - ru
  - sh
  - sk
  - sl
  - sr
  - sv
  - uk
---

# Salamandra Model Card

Salamandra comes in three different sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants. This model card corresponds to the 7B version.

To visit the model cards of other Salamandra versions, please refer to the Model Index.

The entire Salamandra family is released under a permissive Apache 2.0 license, allowing both research and commercial use. Along with the open weights, all training scripts and configuration files are made publicly available in this GitHub repository.


## Model Details

### Description

Transformer-based decoder-only language model that has been pre-trained on 7.5 trillion tokens of highly curated data. The pre-training corpus contains text in 35 European languages and code.

### Hyperparameters

The full list of hyperparameters for each model can be found here.

### Architecture

|                         |               |
|-------------------------|---------------|
| Total Parameters        | 7,768,117,248 |
| Embedding Parameters    | 1,048,576,000 |
| Layers                  | 32            |
| Hidden size             | 4,096         |
| Attention heads         | 32            |
| Context length          | 8,192         |
| Vocabulary size         | 256,000       |
| Precision               | bfloat16      |
| Embedding type          | RoPE          |
| Activation Function     | SwiGLU        |
| Layer normalization     | RMS Norm      |
| Flash attention         | Yes           |
| Grouped Query Attention | Yes           |
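As a sanity check, the embedding parameter count reported above follows directly from the vocabulary and hidden sizes:

```python
# The embedding matrix has one hidden-size vector per vocabulary entry,
# so its parameter count is vocabulary size x hidden size.
vocab_size = 256_000
hidden_size = 4_096
embedding_params = vocab_size * hidden_size
print(f"{embedding_params:,}")  # 1,048,576,000 — matches the table
```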

## Intended Use

### Direct Use

The models are intended for both research and commercial use in any of the languages included in the training data. The base models are intended either for language generation or to be further fine-tuned for specific use-cases. The instruction-tuned variants can be used as general-purpose assistants, as long as the user is fully aware of the model’s limitations.

### Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.


## Hardware and Software

### Training Framework

Pre-training was conducted using NVIDIA’s NeMo Framework, which leverages PyTorch Lightning for efficient model training in highly distributed settings.

The instruction-tuned versions were produced with FastChat.

### Compute Infrastructure

All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:

- 4x NVIDIA Hopper GPUs with 64 GB of HBM2 memory each
- 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz with 32 cores each (64 cores in total)
- 4x NDR200 network adapters (800 Gb/s of bandwidth per node)
- 512 GB of main memory (DDR5)
- 460 GB of NVMe storage

| Model | Nodes     | GPUs          |
|-------|-----------|---------------|
| 2B    | 64        | 256           |
| 7B    | 128       | 512           |
| 40B   | 256 / 512 | 1,024 / 2,048 |
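Since each accelerated node carries 4 GPUs, the GPU counts in the table above follow directly from the node allocations:

```python
# GPU totals per training run: nodes x 4 Hopper GPUs per node.
GPUS_PER_NODE = 4
allocations = [("2B", 64), ("7B", 128), ("40B", 256), ("40B", 512)]
for model, nodes in allocations:
    print(f"{model}: {nodes} nodes -> {nodes * GPUS_PER_NODE} GPUs")
```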

## How to use

TODO
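While official usage instructions are pending, a minimal generation sketch with 🤗 Transformers is shown below. The repository id `BSC-LT/salamandra-7b` is an assumption and may differ from the final published checkpoint name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-7b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in bfloat16
    device_map="auto",
)

# As a base (non-instruct) model, it is used for plain text continuation.
inputs = tokenizer("El mercat del barri és", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```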


## Data

TODO


## Evaluation

TODO

## Ethical Considerations and Limitations

We examine the presence of undesired societal and cognitive biases in this model using different benchmarks. For societal biases, we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019). We report that while performance is high (accuracies between 0.69 and 0.87 depending on the social category) in disambiguated settings, the model performs very poorly in ambiguous settings, which is indicative of societal biases that need to be addressed in post-training phases.

We additionally analyse model generations using the Regard dataset and classifier in Catalan, Spanish, and English using backtranslation and manual revision of the translations. We find no statistically significant difference in regard between majority and minority groups for any regard types, with the exception of negative regard in Catalan where model generations are actually slightly worse for social majorities. Our analyses on societal biases show that while these biases are capable of interfering with model performance as expressed in the results on the BBQ dataset, their tendency for representational harm is limited given the results of the Regard dataset. We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.

Our cognitive bias analysis focuses on positional effects in 0-shot settings and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe moderate to strong primacy effects, whereby the model prefers answers towards the beginning of the list of provided options. We measure majority class effects in few-shot settings using SST-2 (Socher et al., 2013) and detect moderate effects, implying that outputs can be influenced by the prompts.

We highlight that these results can be expected from a pretrained model that has not yet been instruction-tuned or aligned. These tests are performed in order to show the biases the model may contain. We urge developers to take them into account and perform safety testing and tuning tailored to their specific applications of the model.


## Additional information

### Author

The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact

For further information, please send an email to langtech@bsc.es.

### Copyright

Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### Funding

This work has been promoted and financed by the Government of Catalonia through the Aina Project.

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of ILENIA Project with reference 2022/TL22/00215337, 2022/TL22/00215336, 2022/TL22/00215335, 2022/TL22/00215334.

### Acknowledgements

This project benefited from the contributions of many teams and institutions, including: Senado de España, Parlament de Catalunya, Òmnium Cultural, Dialnet, Institut d’Estudis Aranesos, Fundación Elcano, Universidad de Las Palmas de Gran Canaria, Occiglot, Common Crawl, the Welsh Government, the German Research Center for Artificial Intelligence (DFKI) and the partners of Proyecto ILENIA. Their valuable efforts have been instrumental in the development of this work.

A special acknowledgment is reserved for the NVIDIA Team with whom we have been meeting on a regular basis. Their consistent support has been particularly appreciated throughout the process.

### Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

### Citation

Work in progress, paper coming soon.

```bibtex
@article{salamandra,
  title  = {Salamandra Technical Report},
  author = {LangTech@BSC},
  year   = {2024},
  url    = {}
}
```

### License

Apache License, Version 2.0

## Model Index

| Model | Base | Instruct |
|-------|------|----------|
| 2B    | WiP  | WiP      |
| 7B    | Link | Link     |
| 40B   | WiP  | WiP      |