Update README.md
README.md (CHANGED)
@@ -32,7 +32,7 @@ language:
 - sv
 - an
 - ast
--
+- oc
 base_model:
 - BSC-LT/salamandra-2b
 ---
@@ -42,11 +42,15 @@ base_model:
 # Salamandra Model Card
 
 
-SalamandraTA-2B is a machine translation model that has been continually pre-trained on Salamandra2B on 70 billion tokens of parallel data in 30 different languages:
+SalamandraTA-2B is a machine translation model that has been continually pre-trained from Salamandra 2B on 70 billion tokens of parallel data in 30 different languages:
+Catalan, Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish,
+Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
+SalamandraTA-2B is the first model in the **SalamandraTA** series and is trained to handle sentence- and paragraph-level machine translation.
 
 - **Developed by:** The Language Technologies Unit from Barcelona Supercomputing Center (BSC).
 - **Model type:** A 2B parameter model continually pre-trained on 70 billion tokens.
-- **Languages:** Catalan, Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish,
+- **Languages:** Catalan, Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish,
+  Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
 - **License:** Apache License, Version 2.0
 
 
@@ -54,11 +58,13 @@ SalamandraTA-2B is a machine translation model that has been continually pre-tra
 
 ### Description
 
-This machine translation model is built upon the foundation of Salamandra 2B. By leveraging the knowledge of the base Salamandra 2B model,
+This machine translation model is built upon the foundation of Salamandra 2B. By leveraging the knowledge of the base Salamandra 2B model,
+this model is able to perform high-quality translations in **almost 900 translation directions** (30 × 29 = 870 ordered language pairs).
 
 Key Features:
 
-* **Continual Pretraining:** The model is trained on 70 Billion tokens of parallel data. All data employed is open-sourced or generated from open-source
+* **Continual Pretraining:** The model is trained on 70 billion tokens of parallel data. All data employed is open-sourced or generated from open-source
+  data using the Machine Translation models at [BSC](https://huggingface.co/collections/projecte-aina/mt-models-655e154668c6dd132159081c).
 * **Large Language Model Foundation:** Built on Salamandra 2B, providing a strong language understanding and generation capability.
 * **Multilingual Support:** Capable of translating between 30 European languages, including low-resource languages.
 * **High-Quality Translations:** Delivers accurate and fluent translations, thanks to its continual pretraining and large-scale dataset.
@@ -112,7 +118,8 @@ Continual pre-training was conducted using [LLaMA-Factory framework](https://git
 
 ### Compute Infrastructure
 
-All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and
+All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and
+operated by Barcelona Supercomputing Center.
 
 The accelerated partition is composed of 1,120 nodes with the following specifications:
 - 4x Nvidia Hopper GPUs with 64GB HBM2 memory
@@ -133,7 +140,8 @@ To translate with the salamandraTA-2B model, first you need to create a prompt t
 
 You can translate between these languages by using their names directly:
 
-Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish,
+Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish,
+Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
 
 
 ### Inference
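For readers who want to see what this looks like in practice, below is a minimal sketch of a prompt-based translation call with transformers. The Hub id, the prompt template, and the decoding settings are assumptions used for illustration only; the Inference section of the card (not included in this diff) documents the exact format.

```python
# Minimal sketch, not the card's official snippet. The model id, prompt
# template, and decoding settings are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandraTA-2b"  # assumed Hugging Face Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Languages are referred to by their names, as the README states.
src_lang, tgt_lang = "Spanish", "Galician"
sentence = "Hoy es un buen día para aprender algo nuevo."
prompt = f"[{src_lang}] {sentence} \n[{tgt_lang}]"  # illustrative template, not the documented one

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=250,  # mirrors the 250-token limit described in the evaluation setup
    num_beams=5,         # beam search with beam size 5, also from the evaluation setup
    early_stopping=True,
)
translation = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print("Generated Translation:", translation)
```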
@@ -224,7 +232,9 @@ print("Generated Translations:", results_detokenized)
 
 ## Evaluation
 
-Below are the evaluation results on Flores-200 dev and devtest compared to NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) for CA-XX
+Below are the evaluation results on Flores-200 dev and devtest compared to NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) for CA-XX
+and XX-CA directions. The metrics have been computed excluding Asturian, Aranese, and Aragonese as we report them separately. The evaluation was conducted
+using [MT Lens](https://github.com/langtech-bsc/mt-evaluation) following the standard setting (beam search with beam size 5, limiting the translation length to 250 tokens). We report the following metrics:
 
 <details>
 <summary>Click to show metrics details</summary>
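The metrics themselves sit inside the collapsed details block, which this diff does not reproduce. As a small illustration of corpus-level scoring of this kind, here is a sketch using sacrebleu; the choice of BLEU and ChrF is an assumption, not a statement of which metrics the card reports.

```python
# Hedged sketch: corpus-level scoring with sacrebleu. BLEU and ChrF are chosen
# only for illustration; the metrics the card actually reports are listed in
# its collapsed "metrics details" section.
from sacrebleu.metrics import BLEU, CHRF

hypotheses = [                         # toy model outputs
    "El gat dorm al sofà.",
    "Demà plourà a Barcelona.",
]
references = [[                        # one reference stream, aligned with hypotheses
    "El gat dorm al sofà.",
    "Demà plourà a Barcelona.",
]]

print("BLEU:", BLEU().corpus_score(hypotheses, references).score)
print("ChrF:", CHRF().corpus_score(hypotheses, references).score)
```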
@@ -679,18 +689,6 @@ This work has been promoted and financed by the Government of Catalonia through
 This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
 within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
 
-### Acknowledgements
-
-
-This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support.
-
-In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
-
-At national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
-
-At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, specially to: Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.
-
-Their valuable efforts have been instrumental in the development of this work.
 
 ### Disclaimer
 Be aware that the model may contain biases or other unintended distortions.