---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
language:
- bg
- ca
- code
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- gl
- hr
- hu
- it
- lt
- lv
- mt
- nl
- nn
- no
- oc
- pl
- pt
- ro
- ru
- sh
- sk
- sl
- sr
- sv
- uk
---
![](./images/salamandra_header.png)
# Salamandra Model Card
Salamandra comes in three different sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants.
This model card corresponds to the 7B version.
To visit the model cards of other Salamandra versions, please refer to the [Model Index](#model-index).
The entire Salamandra family is released under a permissive [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0), allowing both research and commercial use.
Along with the open weights, all training scripts and configuration files are made publicly available in [this GitHub repository](https://github.com/projecte-aina/salamandra).
---
## Model Details
### Description
The model is a Transformer-based, decoder-only language model pre-trained on 7.5 trillion tokens of highly curated data.
The pre-training corpus contains text in 35 European languages and code.
### Hyperparameters
The full list of hyperparameters for each model can be found [here](https://github.com/projecte-aina/salamandra/tree/main/configs).
### Architecture
| | |
|-------------------------|:--------------|
| Total Parameters | 7,768,117,248 |
| Embedding Parameters | 1,048,576,000 |
| Layers | 32 |
| Hidden size | 4,096 |
| Attention heads | 32 |
| Context length | 8,192 |
| Vocabulary size | 256,000 |
| Precision | bfloat16 |
| Position embedding      | RoPE          |
| Activation Function | SwiGLU |
| Layer normalization | RMS Norm |
| Flash attention | ✅ |
| Grouped Query Attention | ✅ |
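
The parameter counts in the table can be cross-checked with a little arithmetic. The sketch below is only a consistency check, not the official configuration: the vocabulary size, hidden size, layer count and head count come from the table, while the number of key/value heads (8), the feed-forward size (11,008), untied input/output embeddings and the absence of bias terms are assumptions that this card does not state.

```python
# Values marked "table" come from the Architecture table;
# values marked "assumed" are NOT stated in this model card.
VOCAB_SIZE = 256_000     # table
HIDDEN_SIZE = 4_096      # table
NUM_LAYERS = 32          # table
NUM_HEADS = 32           # table
NUM_KV_HEADS = 8         # assumed (grouped-query attention)
FFN_SIZE = 11_008        # assumed (SwiGLU intermediate size)
TIED_EMBEDDINGS = False  # assumed (separate input and output embeddings)


def embedding_params() -> int:
    """Parameters in one token-embedding matrix."""
    return VOCAB_SIZE * HIDDEN_SIZE


def layer_params() -> int:
    """Parameters in one decoder layer, assuming no bias terms."""
    head_dim = HIDDEN_SIZE // NUM_HEADS
    kv_dim = NUM_KV_HEADS * head_dim
    # Query and output projections are hidden x hidden;
    # key and value projections are hidden x kv_dim under GQA.
    attention = 2 * HIDDEN_SIZE * HIDDEN_SIZE + 2 * HIDDEN_SIZE * kv_dim
    # SwiGLU uses three matrices: gate, up and down projections.
    mlp = 3 * HIDDEN_SIZE * FFN_SIZE
    # Two RMSNorm weight vectors per layer.
    norms = 2 * HIDDEN_SIZE
    return attention + mlp + norms


def total_params() -> int:
    """Embeddings + all decoder layers + final RMSNorm."""
    embeddings = embedding_params() * (1 if TIED_EMBEDDINGS else 2)
    final_norm = HIDDEN_SIZE
    return embeddings + NUM_LAYERS * layer_params() + final_norm


print(f"Embedding parameters: {embedding_params():,}")
print(f"Total parameters:     {total_params():,}")
```

Under these assumptions both figures agree with the table, which suggests the listed total counts the input embedding and the output projection separately.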
---
## Intended Use
### Direct Use
The models are intended for both research and commercial use in any of the languages included in the training data.
The base models are intended either for language generation or to be further fine-tuned for specific use-cases.
The instruction-tuned variants can be used as general-purpose assistants, as long as the user is fully aware of the model’s limitations.
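
As an illustration of language generation with the base model, here is a minimal sketch using the `transformers` library named in this card's metadata. The repository id `BSC-LT/salamandra-7b` is an assumption (this card does not state the full id), as are the helper names `clamp_new_tokens` and `generate`; `device_map="auto"` additionally requires the `accelerate` package.

```python
CONTEXT_LENGTH = 8192  # context length from the Architecture table


def clamp_new_tokens(prompt_tokens: int, requested: int,
                     context_length: int = CONTEXT_LENGTH) -> int:
    """Largest number of new tokens that keeps prompt + output in the context window."""
    return max(0, min(requested, context_length - prompt_tokens))


def generate(prompt: str, model_id: str = "BSC-LT/salamandra-7b",
             max_new_tokens: int = 50) -> str:
    """Greedy continuation of `prompt`. `model_id` is an assumed repository id."""
    # Imported lazily so the helper above can be used without these packages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # matches the precision listed in the table
        device_map="auto",           # needs `accelerate`; places weights on available devices
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    budget = clamp_new_tokens(inputs["input_ids"].shape[1], max_new_tokens)
    output_ids = model.generate(**inputs, max_new_tokens=budget)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("El mercat del barri és")` downloads the weights on first use, so it needs enough disk space and accelerator memory for a 7B-parameter model in bfloat16.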
### Out-of-scope Use
The model is not intended for malicious activities, such as harming others or violating human rights.
Any downstream application must comply with current laws and regulations.
Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.
---
## Hardware and Software
### Training Framework
Pre-training was conducted using NVIDIA’s [NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html),
which leverages PyTorch Lightning for efficient model training in highly distributed settings.
The instruction-tuned versions were produced with [FastChat](https://github.com/lm-sys/FastChat).
### Compute Infrastructure
All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and
operated by Barcelona Supercomputing Center.
The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x NVIDIA Hopper GPUs with 64 GB of HBM2 memory each
- 2x Intel Sapphire Rapids 8460Y+ CPUs at 2.3 GHz with 32 cores each (64 cores in total)
- 4x NDR200 InfiniBand links (800 Gb/s of bandwidth per node)
- 512 GB of main memory (DDR5)
- 460 GB of NVMe storage

Nodes and GPUs used for each model size:

|Model|Nodes|GPUs|
|:---:|:---:|:---:|
|2B|64|256|
|7B|128|512|
|40B|256 / 512|1,024 / 2,048|
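
The GPU counts follow directly from the node counts, since each accelerated node carries four GPUs. The short sketch below reproduces that arithmetic and also reports what share of the 1,120-node partition each configuration occupies; the function and variable names are illustrative only.

```python
GPUS_PER_NODE = 4        # from the node specification list
PARTITION_NODES = 1_120  # size of the accelerated partition

# Node counts per model size, as listed in the table.
TRAINING_NODES = {"2B": [64], "7B": [128], "40B": [256, 512]}


def gpus_for(nodes: int) -> int:
    """Total GPUs available on the given number of nodes."""
    return nodes * GPUS_PER_NODE


def partition_share(nodes: int) -> float:
    """Fraction of the accelerated partition occupied by `nodes` nodes."""
    return nodes / PARTITION_NODES


for model, node_counts in TRAINING_NODES.items():
    for nodes in node_counts:
        print(f"{model}: {nodes} nodes -> {gpus_for(nodes)} GPUs "
              f"({partition_share(nodes):.1%} of the partition)")
```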
---