Update README with model details, intended use, hardware and software
README.md

---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
language:
- bg
- ca
- code
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- gl
- hr
- hu
- it
- lt
- lv
- mt
- nl
- nn
- no
- oc
- pl
- pt
- ro
- ru
- sh
- sk
- sl
- sr
- sv
- uk
---

![](./images/salamandra_header.png)

# Salamandra Model Card

Salamandra comes in three different sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants.
This model card corresponds to the 7B version.

To visit the model cards of other Salamandra versions, please refer to the [Model Index](#model-index).

The entire Salamandra family is released under a permissive [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0), allowing both research and commercial use.
Along with the open weights, all training scripts and configuration files are made publicly available in [this GitHub repository](https://github.com/projecte-aina/salamandra).

---

## Model Details

### Description

Transformer-based decoder-only language model that has been pre-trained on 7.5 trillion tokens of highly curated data.
The pre-training corpus contains text in 35 European languages and code.
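
Given the `library_name: transformers` and `pipeline_tag: text-generation` metadata above, the base model should be loadable through the standard Transformers pipeline. A minimal sketch is shown below; the repository id is a guess based on the GitHub organization, not something stated in this card, so check the Hub for the actual id:

```python
# Minimal text-generation sketch using the declared pipeline tag.
# The repository id below is hypothetical — verify it on the Hub.
from transformers import pipeline

generator = pipeline("text-generation", model="projecte-aina/salamandra-7b")
print(generator("The capital of Ireland is", max_new_tokens=20)[0]["generated_text"])
```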

### Hyperparameters

The full list of hyperparameters for each model can be found [here](https://github.com/projecte-aina/salamandra/tree/main/configs).

### Architecture

|                         |               |
|-------------------------|:--------------|
| Total Parameters        | 7,768,117,248 |
| Embedding Parameters    | 1,048,576,000 |
| Layers                  | 32            |
| Hidden size             | 4,096         |
| Attention heads         | 32            |
| Context length          | 8,192         |
| Vocabulary size         | 256,000       |
| Precision               | bfloat16      |
| Positional encoding     | RoPE          |
| Activation function     | SwiGLU        |
| Layer normalization     | RMS Norm      |
| Flash attention         | ✅            |
| Grouped Query Attention | ✅            |
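
As a quick sanity check on the table, the embedding parameter count is exactly the vocabulary size times the hidden size (assuming a single input embedding matrix; whether input and output embeddings are tied is not stated here):

```python
vocab_size, hidden_size = 256_000, 4_096
# One hidden_size-dimensional vector per vocabulary entry:
assert vocab_size * hidden_size == 1_048_576_000  # matches "Embedding Parameters"
```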

---

## Intended Use

### Direct Use

The models are intended for both research and commercial use in any of the languages included in the training data.
The base models are intended either for language generation or to be further fine-tuned for specific use cases.
The instruction-tuned variants can be used as general-purpose assistants, as long as the user is fully aware of the model’s limitations.
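
For the instruction-tuned variants, a chat-style call through the tokenizer’s chat template is the natural interface. The sketch below assumes the instructed model ships a chat template with its tokenizer, and again uses a hypothetical repository id:

```python
# Hedged sketch for the instruction-tuned variant; the repo id is hypothetical
# and the chat template is assumed to be bundled with the tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "projecte-aina/salamandra-7b-instruct"  # hypothetical id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize what RoPE does in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt:
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```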

### Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights.
Any downstream application must comply with current laws and regulations.
Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.

---

## Hardware and Software

### Training Framework

Pre-training was conducted using NVIDIA’s [NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html),
which leverages PyTorch Lightning for efficient model training in highly distributed settings.

The instruction-tuned versions were produced with [FastChat](https://github.com/lm-sys/FastChat).

### Compute Infrastructure

All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and
operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x NVIDIA Hopper GPUs with 64GB of HBM2 memory each
- 2x Intel Sapphire Rapids 8460Y+ at 2.3GHz, 32 cores each (64 cores per node)
- 4x NDR200 interconnect links (800Gb/s of bandwidth per node)
- 512GB of main memory (DDR5)
- 460GB of NVMe storage

|Model|Nodes|GPUs|
|:---:|:---:|:---:|
|2B|64|256|
|7B|128|512|
|40B|256 / 512|1,024 / 2,048|
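
Since each accelerated node carries four Hopper GPUs, the GPU column is simply four times the node count (the 40B model was run in two configurations); a quick consistency check:

```python
# GPUs per run = nodes x 4 GPUs per node (see the node specifications above).
for nodes, gpus in [(64, 256), (128, 512), (256, 1024), (512, 2048)]:
    assert gpus == nodes * 4
```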

---