fdelucaf committed · Commit aa207e8 · verified · 1 Parent(s): a83497c

Update README.md

Files changed (1)
  1. README.md +18 -20
README.md CHANGED
@@ -32,7 +32,7 @@ language:
  - sv
  - an
  - ast
- - arn
+ - oc
  base_model:
  - BSC-LT/salamandra-2b
  ---
@@ -42,11 +42,15 @@ base_model:
  # Salamandra Model Card
  
  
- SalamandraTA-2B is a machine translation model that has been continually pre-trained on Salamandra2B on 70 billion tokens of parallel data in 30 different languages: Catalan, Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian. SalamandraTA-2B is the first model in **SalamandraTA** series and is trained to handle sentence- and paragraph- level machine translation.
+ SalamandraTA-2B is a machine translation model that has been continually pre-trained from Salamandra 2B on 70 billion tokens of parallel data in 30 different languages:
+ Catalan, Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish,
+ Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
+ SalamandraTA-2B is the first model in the **SalamandraTA** series and is trained to handle sentence- and paragraph-level machine translation.
  
  - **Developed by:** The Language Technologies Unit from Barcelona Supercomputing Center (BSC).
  - **Model type:** A 2B parameter model continually pre-trained on 70 billion tokens.
- - **Languages:** Catalan, Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
+ - **Languages:** Catalan, Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish,
+   Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
  - **License:** Apache License, Version 2.0
  
  
@@ -54,11 +58,13 @@ SalamandraTA-2B is a machine translation model that has been continually pre-tra
  
  ### Description
  
- This machine translation model is built upon the foundation of Salamandra 2B. By leveraging the knowledge of the base Salamandra 2B model, this model is able to perform high quality translations between **almost 900 translation directions**.
+ This machine translation model is built upon the foundation of Salamandra 2B. By leveraging the knowledge of the base Salamandra 2B model,
+ this model is able to perform high-quality translations across **almost 900 translation directions**.
  
  Key Features:
  
- * **Continual Pretraining:** The model is trained on 70 Billion tokens of parallel data. All data employed is open-sourced or generated from open-source data using the Machine Translation models at [BSC](https://huggingface.co/collections/projecte-aina/mt-models-655e154668c6dd132159081c)
+ * **Continual Pretraining:** The model is trained on 70 billion tokens of parallel data. All data employed is open-source or generated from open-source
+   data using the Machine Translation models at [BSC](https://huggingface.co/collections/projecte-aina/mt-models-655e154668c6dd132159081c).
  * **Large Language Model Foundation:** Built on Salamandra 2B, providing a strong language understanding and generation capability.
  * **Multilingual Support:** Capable of translating between 30 European languages, including low-resource languages.
  * **High-Quality Translations:** Delivers accurate and fluent translations, thanks to its continual pretraining and large-scale dataset.
@@ -112,7 +118,8 @@ Continual pre-training was conducted using [LLaMA-Factory framework](https://git
  
  ### Compute Infrastructure
  
- All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.
+ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and
+ operated by Barcelona Supercomputing Center.
  
  The accelerated partition is composed of 1,120 nodes with the following specifications:
  - 4x Nvidia Hopper GPUs with 64 GB of HBM2 memory
@@ -133,7 +140,8 @@ To translate with the salamandraTA-2B model, first you need to create a prompt t
  
  You can translate between these languages by using their names directly:
  
- Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish, Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
+ Italian, Portuguese, German, English, Spanish, Euskera, Galician, French, Bulgarian, Czech, Lithuanian, Croatian, Dutch, Romanian, Danish, Greek, Finnish,
+ Hungarian, Slovak, Slovenian, Estonian, Polish, Latvian, Swedish, Maltese, Irish, Aranese, Aragonese, Asturian.
  
  
  ### Inference
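A minimal usage sketch of the language-name prompting referenced in this hunk, assuming the standard Hugging Face Transformers generation API. The checkpoint id and the prompt template below are illustrative placeholders rather than the card's verbatim recipe; the README's own Inference example defines the exact format.

```python
# Hypothetical sketch: repo id and prompt template are assumptions,
# not the card's verbatim example (see the README's Inference section).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandraTA-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

src_lang, tgt_lang = "Spanish", "Catalan"       # language names are used directly
sentence = "Hoy hace buen tiempo en Barcelona."

# Assumed template: source text tagged with the source language name, then the
# target language name as the cue for the model to continue with the translation.
prompt = f"[{src_lang}] {sentence} \n[{tgt_lang}]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=250,  # same length limit the card uses for evaluation
    num_beams=5,         # beam search with beam size 5, as in the evaluation setting
    early_stopping=True,
)
# Strip the prompt tokens and keep only the generated continuation.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```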
@@ -224,7 +232,9 @@ print("Generated Translations:", results_detokenized)
  
  ## Evaluation
  
- Below are the evaluation results on Flores-200 dev and devtest compared to NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) for CA-XX and XX-CA directions. The metrics have been computed excluding Asturian, Aranese, and Aragonese as we report them separately. The evaluation was conducted using [MT Lens](https://github.com/langtech-bsc/mt-evaluation) following the standard setting (beam search with beam size 5, limiting the translation length to 250 tokens). We report the following metrics:
+ Below are the evaluation results on Flores-200 dev and devtest compared to NLLB-3.3 ([Costa-jussà et al., 2022](https://arxiv.org/abs/2207.04672)) for CA-XX
+ and XX-CA directions. The metrics have been computed excluding Asturian, Aranese, and Aragonese as we report them separately. The evaluation was conducted
+ using [MT Lens](https://github.com/langtech-bsc/mt-evaluation) following the standard setting (beam search with beam size 5, limiting the translation length to 250 tokens). We report the following metrics:
  
  <details>
  <summary>Click to show metrics details</summary>
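For orientation on the scoring side, a small sketch of how surface metrics such as BLEU and chrF (assumed to be among the metrics listed in the collapsed details block) can be computed over system output and references with sacrebleu; the card's reported numbers come from MT Lens, not from this snippet.

```python
# Illustrative only: the card's evaluation is run through MT Lens, not this code.
# BLEU and chrF are assumed to be among the reported metrics.
import sacrebleu

hypotheses = ["El gat seu a l'estora."]              # system translations
references = [["El gat està assegut a l'estora."]]   # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}  chrF = {chrf.score:.2f}")
```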
@@ -679,18 +689,6 @@ This work has been promoted and financed by the Government of Catalonia through
  This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
  within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
  
- ### Acknowledgements
-
-
- This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support.
-
- In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
-
- At national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
-
- At the international level, we thank the Welsh government, DFKI, Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, specially to: Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.
-
- Their valuable efforts have been instrumental in the development of this work.
  
  ### Disclaimer
  Be aware that the model may contain biases or other unintended distortions.
 