---
license: apache-2.0
library_name: transformers
pipeline_tag: translation
language:
- it
- pt
- de
- en
- es
- eu
- gl
- fr
- bg
- cs
- lt
- hr
- ca
- nl
- ro
- da
- el
- fi
- hu
- sk
- sl
- et
- pl
- lv
- mt
- ga
- sv
- an
- ast
- arn
base_model:
- BSC-LT/salamandra-2b
---

![](./images/salamandra_header.png)

# SalamandraTA Model Card


SalamandraTA-2B is a machine translation model continually pre-trained from Salamandra-2B on 70 billion tokens of parallel data in 30 languages: Catalan, Italian, Portuguese, German, English, Spanish, Basque, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, and Asturian. SalamandraTA-2B is the first model in the **SalamandraTA** series and is trained to handle sentence- and paragraph-level machine translation.

- **Developed by:** The Language Technologies Unit of the Barcelona Supercomputing Center (BSC).
- **Model type:** A 2B-parameter model continually pre-trained on 70 billion tokens.
- **Languages:** Catalan, Italian, Portuguese, German, English, Spanish, Basque, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, and Asturian.
- **License:** Apache License, Version 2.0


## Model Details

### Description

A model continually pre-trained from Salamandra-2B on 70 billion tokens of highly curated parallel data.

### Hyperparameters

The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).

### Architecture

|                         |               |
|-------------------------|:--------------|
| Total Parameters        | 2,253,490,176 |
| Embedding Parameters    | 524,288,000   |
| Layers                  | 24            |
| Hidden size             | 2,048         |
| Attention heads         | 16            |
| Context length          | 8,192         |
| Vocabulary size         | 256,000       |
| Precision               | bfloat16      |
| Embedding type          | RoPE          |
| Activation Function     | SwiGLU        |
| Layer normalization     | RMS Norm      |
| Flash attention         | ✅            |
| Grouped Query Attention | ❌            |
| Num. query groups       | N/A           |
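
As a quick sanity check, most of these figures can be read back from the model configuration. A minimal sketch, assuming the standard Llama-style configuration fields exposed by `transformers`:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('BSC-LT/salamandraTA-2b')

# Embedding parameters = vocabulary size x hidden size:
# 256,000 x 2,048 = 524,288,000
print(config.vocab_size * config.hidden_size)

print(config.num_hidden_layers)        # 24 layers
print(config.num_attention_heads)      # 16 attention heads
print(config.max_position_embeddings)  # 8,192 context length
```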

---

## Intended Use

### Direct Use

The model is intended for both research and commercial use in any of the languages included in the training data.
It is a base model intended for general machine translation tasks.

### Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. 
Any downstream application must comply with current laws and regulations. 
Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged. 

---

## Hardware and Software

### Training Framework

Continual pre-training was conducted using the [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) framework.

### Compute Infrastructure

All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and operated by the Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x NVIDIA Hopper GPUs with 64 GB of HBM2 memory each
- 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz, 32 cores each (64 cores per node)
- 4x NDR200 interconnects (800 Gb/s of bandwidth per node)
- 512 GB of main memory (DDR5)
- 460 GB of NVMe storage

---

## How to use
This section offers examples of how to perform inference with the model.

### Inference

To run inference on a single sentence with Hugging Face's `AutoModel` classes, you can use the following code. The model expects prompts of the form `[<source language>] <text> \n[<target language>]`.

<details>
<summary>Show code</summary>

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'BSC-LT/salamandraTA-2b'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

src_lang_code = 'Spanish'
tgt_lang_code = 'Catalan'
sentence = 'Ayer se fue, tomó sus cosas y se puso a navegar.'

# Build the translation prompt: '[<source language>] <text> \n[<target language>]'
prompt = f'[{src_lang_code}] {sentence} \n[{tgt_lang_code}]'

input_ids = tokenizer(prompt, return_tensors='pt').input_ids
output_ids = model.generate(input_ids, max_length=500, num_beams=5)

# Strip the prompt tokens and decode only the generated translation.
input_length = input_ids.shape[1]
generated_text = tokenizer.decode(output_ids[0, input_length:], skip_special_tokens=True).strip()
# Ahir se'n va anar, va agafar les seves coses i es va posar a navegar.
```
</details>

<br>


To run batch inference with Hugging Face's `AutoModel` classes, you can use the following code.

<details>
<summary>Show code</summary>

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = 'BSC-LT/salamandraTA-2b'

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, attn_implementation='eager')

# Decoder-only models should be left-padded for batch generation.
tokenizer.padding_side = 'left'
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# List of sentences to translate
sentences = [
  'Ayer se fue, tomó sus cosas y se puso a navegar.',
  'Se despidió y decidió batirse en duelo con el mar, y recorrer el mundo en su velero',
  'Su corazón buscó una forma diferente de vivir, pero las olas le gritaron: Vete con los demás',
  'Y se durmió y la noche le gritó: Dónde vas, y en sus sueños dibujó gaviotas, y pensó: Hoy debo regresar.'
]

src_lang_code = 'Spanish'
tgt_lang_code = 'Catalan'

prompt = lambda x: f'[{src_lang_code}] {x} \n[{tgt_lang_code}]'
prompts = [prompt(x) for x in sentences]

encodings = tokenizer(prompts, return_tensors='pt', padding=True, add_special_tokens=True)

input_ids = encodings['input_ids'].to(model.device)
attention_mask = encodings['attention_mask'].to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask,
                             num_beams=5, max_length=256, early_stopping=True)

# Strip the (padded) prompts and decode only the generated translations.
results_detokenized = []
for i, output in enumerate(outputs):
    input_length = input_ids[i].shape[0]
    generated_text = tokenizer.decode(output[input_length:], skip_special_tokens=True).strip()
    results_detokenized.append(generated_text)

print("Generated Translations:", results_detokenized)

# Generated Translations: ["Ahir se'n va anar, va agafar les seves coses i es va posar a navegar.", "Es va acomiadar i va decidir batre's en duel amb el mar, i recórrer el món en el seu veler", 'El seu cor va buscar una forma diferent de viure, però les onades li van cridar: Vés amb els altres', 'I es va adormir i la nit li va cridar: On vas, i en els seus somnis va dibuixar gavines, i va pensar: Avui he de tornar.']
```
</details>
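
Since the model is also trained for paragraph-level translation, the same prompt format works on multi-sentence input. Below is a minimal sketch reusing the `tokenizer` and `model` from the examples above (the paragraph and generation settings are illustrative, not prescribed):

```python
# Reuses `tokenizer` and `model` loaded in the previous examples.
paragraph = (
    'Ayer se fue, tomó sus cosas y se puso a navegar. '
    'Se despidió y decidió batirse en duelo con el mar, '
    'y recorrer el mundo en su velero.'
)

# Same '[<source language>] <text> \n[<target language>]' prompt, just with longer text.
prompt = f'[Spanish] {paragraph} \n[Catalan]'
input_ids = tokenizer(prompt, return_tensors='pt').input_ids

# Leave enough room for the full translated paragraph.
output_ids = model.generate(input_ids, max_length=1024, num_beams=5)
translation = tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True).strip()
print(translation)
```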

## Data




## Evaluation

Below are the evaluation results on Flores-200 dev and devtest, compared to NLLB-200 3.3B ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)), for the CA-XX and XX-CA directions.
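
For reference, the lexical metrics in the tables (BLEU, TER, ChrF) can be computed with [sacreBLEU](https://github.com/mjpost/sacrebleu); the neural metrics (COMET, COMET-Kiwi, BLEURT) require their own toolkits. A minimal sketch with placeholder hypothesis and reference lists (not the exact evaluation setup used here):

```python
import sacrebleu

# Placeholder segments: one hypothesis and one reference string per sentence.
hypotheses = ["Ahir se'n va anar, va agafar les seves coses i es va posar a navegar."]
references = ["Ahir se'n va anar, va agafar les seves coses i es va posar a navegar."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
ter  = sacrebleu.corpus_ter(hypotheses, [references])

print(f'BLEU {bleu.score:.2f}  ChrF {chrf.score:.2f}  TER {ter.score:.2f}')
```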


#### Flores200-dev

|              |   BLEU ↑ |   TER ↓ |   ChrF ↑ |   COMET ↑ |   COMET-Kiwi ↑ |   BLEURT ↑ |
|:-----------------------|-------:|------:|-------:|--------:|-------------:|---------:|
| **CA-XX** |  |  |  |  |  |  |
| SalamandraTA  |  **27.41** | **60.88** |  **56.27** |    0.86 |         0.82 |     0.76 |
| NLLB-200 3.3B           |  26.84 | 61.75 |  55.70 |    0.86 |         0.82 |     0.76 |
| **XX-CA** |  |  |  |  |  |  |
| SalamandraTA |  **30.75** | **57.66** |  **57.60** |    0.85 |         0.81 |     0.73 |
| NLLB-200 3.3B          |  29.76 | 58.25 |  56.75 |    0.85 |         **0.82** |     0.73 |



#### Flores200-devtest

|              |   BLEU ↑ |   TER ↓ |   ChrF ↑ |   COMET ↑ |   COMET-Kiwi ↑ |   BLEURT ↑ |
|:-----------------------|-------:|------:|-------:|--------:|-------------:|---------:|
| **CA-XX** |  |  |  |  |  |  |
| SalamandraTA  |  **27.09** | **61.06** |  **56.41** |    0.86 |         0.81 |     0.75 |
| NLLB-200 3.3B |  26.70 | 61.74 |  55.85 |    0.86 |         **0.82** |     **0.76** |
| **XX-CA** |  |  |  |  |  |  |
| SalamandraTA | **31.00** | **57.46** |  **57.96** |    0.85 |         0.81 |     0.73 |
| NLLB-200 3.3B |  30.31 | 58.26 |  57.12 |    0.85 |         **0.82** |     0.73 |



## Ethical Considerations and Limitations

---

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright(c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### Funding
This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).

This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_, funded by the European Union (NextGenerationEU), within the framework of the [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

### Acknowledgements


This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support. 

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At the national level, we are especially grateful to our ILENIA project partners, CENID, HiTZ and CiTIUS, for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano, and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, the Occiglot project (especially Malte Ostendorff), and the Common Crawl Foundation (especially Pedro Ortiz) for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipe Soares, and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.

Their valuable efforts have been instrumental in the development of this work.

### Disclaimer
Be aware that the model may contain biases or other unintended distortions. 
When third parties deploy systems or provide services based on this model, or use the model themselves, 
they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, 
including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)