|
--- |
|
license: apache-2.0 |
|
library_name: transformers |
|
pipeline_tag: text-generation |
|
language: |
|
- bg |
|
- ca |
|
- code |
|
- cs |
|
- cy |
|
- da |
|
- de |
|
- el |
|
- en |
|
- es |
|
- et |
|
- eu |
|
- fi |
|
- fr |
|
- ga |
|
- gl |
|
- hr |
|
- hu |
|
- it |
|
- lt |
|
- lv |
|
- mt |
|
- nl |
|
- nn |
|
- no |
|
- oc |
|
- pl |
|
- pt |
|
- ro |
|
- ru |
|
- sh |
|
- sk |
|
- sl |
|
- sr |
|
- sv |
|
- uk |
|
--- |
|
|
|
![](./images/salamandra_header.png) |
|
|
|
# Salamandra Model Card |
|
|
|
Salamandra comes in three different sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants. |
|
This model card corresponds to the 7B version. |
|
|
|
To visit the model cards of other Salamandra versions, please refer to the [Model Index](#model-index). |
|
|
|
The entire Salamandra family is released under a permissive [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0), allowing both research and commercial use.
|
Along with the open weights, all training scripts and configuration files are made publicly available in [this GitHub repository](https://github.com/projecte-aina/salamandra). |
|
|
|
--- |
|
|
|
## Model Details |
|
|
|
### Description |
|
|
|
Salamandra is a transformer-based, decoder-only language model that has been pre-trained on 7.5 trillion tokens of highly curated data.
|
The pre-training corpus contains text in 35 European languages and code. |
|
|
|
### Hyperparameters |
|
|
|
The full list of hyperparameters for each model can be found [here](https://github.com/projecte-aina/salamandra/tree/main/configs). |
|
|
|
### Architecture |
|
|
|
| | | |
|
|-------------------------|:--------------| |
|
| Total Parameters | 7,768,117,248 | |
|
| Embedding Parameters | 1,048,576,000 | |
|
| Layers | 32 | |
|
| Hidden size | 4,096 | |
|
| Attention heads | 32 | |
|
| Context length | 8,192 | |
|
| Vocabulary size | 256,000 | |
|
| Precision | bfloat16 | |
|
| Positional embeddings   | RoPE          |
|
| Activation Function | SwiGLU | |
|
| Layer normalization | RMS Norm | |
|
| Flash attention | ✅ | |
|
| Grouped Query Attention | ✅ | |
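
As a quick sanity check on the table above, the embedding parameter count follows directly from the vocabulary size and hidden size. The short sketch below reproduces it (assuming a single, tied embedding matrix, which is what the listed figure implies):

```python
# Reproduce the embedding parameter count from the architecture table.
vocab_size = 256_000
hidden_size = 4_096

embedding_params = vocab_size * hidden_size
print(f"{embedding_params:,}")                  # 1,048,576,000 -> matches the table

total_params = 7_768_117_248
print(f"{total_params - embedding_params:,}")   # ~6.72B parameters in the transformer blocks
```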
|
|
|
--- |
|
|
|
## Intended Use |
|
|
|
### Direct Use |
|
|
|
The models are intended for both research and commercial use in any of the languages included in the training data. |
|
The base models are intended either for language generation or to be further fine-tuned for specific use-cases. |
|
The instruction-tuned variants can be used as general-purpose assistants, as long as the user is fully aware of the model’s limitations. |
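
For reference, the minimal sketch below loads a checkpoint with the `transformers` library and runs plain text generation. The Hub repository ID shown is an assumption and should be replaced with the identifier of the Salamandra checkpoint you actually want to use; the `bfloat16` precision matches the value listed in the architecture table.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository ID for illustration only; substitute the checkpoint you intend to use.
model_id = "BSC-LT/salamandra-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # training precision reported in the architecture table
    device_map="auto",
)

# Plain language generation, the primary direct use of the base model.
inputs = tokenizer("El mercat del barri és", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```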
|
|
|
### Out-of-scope Use |
|
|
|
The model is not intended for malicious activities, such as harming others or violating human rights. |
|
Any downstream application must comply with current laws and regulations. |
|
Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged. |
|
|
|
--- |
|
|
|
## Hardware and Software |
|
|
|
### Training Framework |
|
|
|
Pre-training was conducted using NVIDIA’s [NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html), which leverages PyTorch Lightning for efficient model training in highly distributed settings.
|
|
|
The instruction-tuned versions were produced with [FastChat](https://github.com/lm-sys/FastChat). |
|
|
|
### Compute Infrastructure |
|
|
|
All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and operated by the Barcelona Supercomputing Center.
|
|
|
The accelerated partition is composed of 1,120 nodes with the following specifications: |
|
- 4x NVIDIA Hopper GPUs with 64GB of HBM2 memory

- 2x Intel Sapphire Rapids 8460Y+ at 2.3GHz, 32 cores each (64 cores in total)

- 4x NDR200 links (800Gb/s aggregate bandwidth per node)

- 512GB of main memory (DDR5)

- 460GB of local NVMe storage
|
|
|
|Model|Nodes|GPUs| |
|
|:---:|:---:|:---:| |
|
|2B|64|256| |
|
|7B|128|512| |
|
|40B|256 / 512|1,024 / 2,048| |
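
The GPU counts in this table are consistent with the node description above, i.e. four accelerators per allocated node:

```python
# GPUs per training run = allocated nodes x 4 GPUs per node
gpus_per_node = 4
for nodes in (64, 128, 256, 512):
    print(f"{nodes} nodes -> {nodes * gpus_per_node} GPUs")
```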
|
|
|
--- |