joanllop committed on
Commit 0b01ef3 · 1 Parent(s): b85272d
Files changed (4)
  1. .gitattributes +3 -0
  2. README.md +42 -28
  3. config.json +2 -2
  4. model.safetensors +1 -1
.gitattributes CHANGED
@@ -34,3 +34,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  images/salamandra_header.png filter=lfs diff=lfs merge=lfs -text
 
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  images/salamandra_header.png filter=lfs diff=lfs merge=lfs -text
37
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
38
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
39
+ tokenizer.model filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -146,7 +146,7 @@ The accelerated partition is composed of 1,120 nodes with the following specific
146
  The instruction-following models use the commonly adopted ChatML template:
147
 
148
  ```jinja
149
- {%- if not date_string is defined %}{%- set date_string = "2024-09-30" %}{%- endif %}{%- set system_message = messages[0].content if messages[0].role == "system" else "system message. Today Date: "+ date_string -%}{%- if messages[0].role == "system" -%}{%- set messages = messages[1:] -%}{%- endif -%}{{ "<|im_start|>system\n" + system_message + "<|im_end|>\n" }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
150
  ```
151
  Where `system_message` is used to guide the model during generation and `date_string` can be set to allow the model to respond with the current date.
152
 
@@ -607,31 +607,34 @@ The dataset does not allow for external contributions.
607
 
608
  ### Finetuning Data
609
 
610
- This instruction-tuned variant has been trained with a mixture of 276k English, Spanish, and Catalan multi-turn instructions gathered from open datasets:
611
- | Dataset | ca | en | es |
612
- |-----------------------|:------:|:------:|:------:|
613
- | alpaca-cleaned | - | 50,000 | - |
614
- | aya-dataset | - | 3,944 | 3,854 |
615
- | CoQCat | 4,797 | - | - |
616
- | databricks-dolly-15k | - | 15,011 | - |
617
- | dolly-3k-ca | 3,232 | - | - |
618
- | flores-instr | 1,994 | 1,994 | 3,988 |
619
- | MentorCA | 7,122 | - | - |
620
- | MentorES | - | - | 7,122 |
621
- | no-robots | - | 9,499 | - |
622
- | oasst-ca | 2,518 | - | - |
623
- | oasst2 | 750 | 31,086 | 15,438 |
624
- | open-orca | - | 50,000 | - |
625
- | RagMultilingual | 16,043 | 14,997 | 11,263 |
626
- | tower-blocks | - | 19,895 | 2,000 |
627
- | **Total** | **36,456** | **196,426** | **43,665** |
 
628
 
629
  ---
630
 
 
631
  ## Evaluation
632
 
633
  ### Gold-standard benchmarks
634
-
 
635
  Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/spanish_bench), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/catalan_bench), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/basque_bench) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/galician_bench). These benchmarks include both new and existing tasks and datasets. Given that this is an instruction-tuned model, we add the LM Evaluation Harness's native `chat-template` feature to the setup. In the tables below, we include results on a selection of evaluation datasets that represent the model's performance across a variety of tasks within these benchmarks.
636
 
637
  We only use tasks that are human generated, human translated, or built with a strong human-in-the-loop process (i.e., machine translation followed by professional revision, or machine generation followed by human revision and annotation). This is the reason for the variation in the number of tasks reported across languages. As more tasks that fulfill these requirements are published, we will update the presented results. We also intend to expand the evaluation to other languages, as long as the datasets meet our quality standards.
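  As a rough, non-authoritative sketch of how such a run can be reproduced with the harness's Python API (assuming a recent `lm-eval` release in which `simple_evaluate` exposes `apply_chat_template`, and using the `catalan_bench`/`spanish_bench` task groups from the branches linked above), rather than the authors' exact configuration:

```python
# Hedged sketch of a 0-shot run with the LM Evaluation Harness; assumes a recent
# lm-evaluation-harness release where `simple_evaluate` accepts `apply_chat_template`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=BSC-LT/salamandra-2b-instruct,dtype=bfloat16",
    tasks=["catalan_bench", "spanish_bench"],  # task groups from the benchmarks linked above
    num_fewshot=0,
    apply_chat_template=True,  # assumption: available in recent harness versions
    batch_size="auto",
)
print(results["results"])
```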
@@ -866,6 +869,7 @@ All results reported below are on a 0-shot setting.
866
  </tr>
867
  </tbody>
868
  </table>
 
869
 
870
  ### LLM-as-a-judge
871
 
@@ -1087,13 +1091,23 @@ Further details on all tasks and criteria, a full list of results compared to ot
1087
 
1088
  ## Ethical Considerations and Limitations
1089
 
1090
- We examine the presence of undesired societal and cognitive biases present in this model using different benchmarks. For societal biases, we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019). We report that moderate accuracies (between 0.5 and 0.6 depending on the social groups) in disambiguated settings, the model performs very poorly in ambiguous setting. Taken together, these results suggest the pervasiveness of social biases that may have an effect on task performance
 
1091
 
1092
- Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant, but moderate weak primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers. We measure effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects, with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.
 
1093
 
1094
- We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.
1095
-
1096
- These results can be expected from a model that has undergone only a preliminary instruction tuning. These tests are performed in order to show the biases the model may contain. We urge developers to take them into account and perform safety testing and tuning tailored to their specific applications of the model.
 
1097
 
1098
  ---
1099
 
@@ -1120,7 +1134,7 @@ This project has benefited from the contributions of numerous teams and institut
1120
 
1121
  In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
1122
 
1123
- At national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
1124
 
1125
  At the international level, we thank the Welsh government, DFKI, the Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.
1126
 
@@ -1136,7 +1150,7 @@ The Barcelona Supercomputing Center, as the owner and creator of the model, shal
1136
 
1137
  ### Citation
1138
 
1139
- Technical report and paper coming soon.
1140
 
1141
  ### License
1142
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
@@ -1146,4 +1160,4 @@ Technical report and paper coming soon.
1146
  |:---:|:---:|:---:|
1147
  |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
1148
  |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
1149
- |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
 
146
  The instruction-following models use the commonly adopted ChatML template:
147
 
148
  ```jinja
149
+ {%- if messages[0]['role'] == 'system' %}{%- set system_message = messages[0]['content'] %}{%- set loop_messages = messages[1:] %}{%- else %}{%- set system_message = 'SYSTEM MESSAGE' %}{%- set loop_messages = messages %}{%- endif %}{%- if not date_string is defined %}{%- set date_string = '2024-09-30' %}{%- endif %}{{ "<|im_start|>system\n" + system_message + "<|im_end|>\n" }}{% for message in loop_messages %}{%- if (message['role'] != 'user') and (message['role'] != 'assistant')%}{{ raise_exception('Only user and assistant roles are supported after the initial optional system message.') }}{% endif %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
150
  ```
151
  Where `system_message` is used to guide the model during generation and `date_string` can be set to allow the model to respond with the current date.
152
 
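  As a non-official illustration, the template above can be applied through `transformers`. The snippet below assumes the repository's tokenizer ships this chat template and that the installed `transformers` version forwards extra keyword arguments such as `date_string` to the template:

```python
# Minimal sketch, not an official usage snippet: assumes the tokenizer of this repo
# carries the ChatML template shown above, and that `transformers` forwards extra
# kwargs (here `date_string`) to the chat template, as recent versions do.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Qui va escriure 'La plaça del Diamant'?"},
]

# Renders the <|im_start|>/<|im_end|> turns and appends the assistant header.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    date_string="2024-10-07",  # optional; the template defaults to 2024-09-30
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```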
 
607
 
608
  ### Finetuning Data
609
 
610
+ This instruction-tuned variant has been fine-tuned with a collection of 273k instructions, focusing on performance in Catalan, English and Spanish. However, instruction data for other closely related Iberian languages has also been included, since it yielded a positive impact on the languages of interest. That said, performance in these additional languages is not guaranteed due to the limited amount of available data and the lack of resources for thorough testing.
611
+
612
+ | **Dataset** | **ca** | **en** | **es** | **eu** | **gl** | **pt** | **Total** |
613
+ |----------------------|------------|-------------|------------|-----------|---------|------------|-------------|
614
+ | alpaca-cleaned | | 49,950 | | | | | **49,950** |
615
+ | aya-dataset | | 3,941 | 3,851 | 939 | | 8,995 | **17,726** |
616
+ | coqcat | 4,797 | | | | | | **4,797** |
617
+ | databricks-dolly-15k | | 15,011 | | | | | **15,011** |
618
+ | dolly-ca | 3,232 | | | | | | **3,232** |
619
+ | flores-dev | 986 | 1,037 | 1,964 | 493 | 505 | | **4,985** |
620
+ | mentor-ca | 7,119 | | | | | | **7,119** |
621
+ | mentor-es | | | 7,122 | | | | **7,122** |
622
+ | no-robots | | 9,485 | | | | | **9,485** |
623
+ | oasst-ca | 2,517 | | | | | | **2,517** |
624
+ | oasst2 | 750 | 31,086 | 15,438 | 190 | 197 | 1,203 | **48,864** |
625
+ | open-orca | | 49,996 | | | | | **49,996** |
626
+ | rag-multilingual | 16,043 | 14,997 | 11,263 | | | | **42,303** |
627
+ | tower-blocks | | 7,762 | 1,000 | | | 1,000 | **9,762** |
628
+ | **Total** | **35,444** | **183,265** | **40,638** | **1,622** | **702** | **11,198** | **272,869** |
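+ As a rough illustration of how a chat-style mixture like the one above can be assembled, the sketch below pulls two of the listed sources from the Hugging Face Hub and maps them onto a shared multi-turn schema. The Hub IDs and column names are assumptions (the table does not specify repository IDs), and none of the filtering or sampling actually applied here is reproduced:

```python
# Hedged illustration only: assumed Hub IDs for two of the listed sources, and no
# attempt to reproduce the authors' selection, deduplication, or language balancing.
from datasets import concatenate_datasets, load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")  # assumed Hub ID
alpaca = load_dataset("yahma/alpaca-cleaned", split="train")            # assumed Hub ID

def dolly_to_chat(ex):
    # databricks-dolly-15k columns: instruction, context, response
    prompt = ex["instruction"] + ("\n\n" + ex["context"] if ex["context"] else "")
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": ex["response"]}]}

def alpaca_to_chat(ex):
    # alpaca-cleaned columns: instruction, input, output
    prompt = ex["instruction"] + ("\n\n" + ex["input"] if ex["input"] else "")
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": ex["output"]}]}

# Normalise both sources to the same schema, then concatenate into one mixture.
mixture = concatenate_datasets([
    dolly.map(dolly_to_chat, remove_columns=dolly.column_names),
    alpaca.map(alpaca_to_chat, remove_columns=alpaca.column_names),
])
print(len(mixture))
```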
629
 
630
  ---
631
 
632
+
633
  ## Evaluation
634
 
635
  ### Gold-standard benchmarks
636
+ WiP
637
+ <!--
638
  Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/spanish_bench), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/catalan_bench), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/basque_bench) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/galician_bench). These benchmarks include both new and existing tasks and datasets. Given that this is an instruction-tuned model, we add the LM Evaluation Harness's native `chat-template` feature to the setup. In the tables below, we include results on a selection of evaluation datasets that represent the model's performance across a variety of tasks within these benchmarks.
639
 
640
  We only use tasks that are human generated, human translated, or built with a strong human-in-the-loop process (i.e., machine translation followed by professional revision, or machine generation followed by human revision and annotation). This is the reason for the variation in the number of tasks reported across languages. As more tasks that fulfill these requirements are published, we will update the presented results. We also intend to expand the evaluation to other languages, as long as the datasets meet our quality standards.
 
869
  </tr>
870
  </tbody>
871
  </table>
872
+ -->
873
 
874
  ### LLM-as-a-judge
875
 
 
1091
 
1092
  ## Ethical Considerations and Limitations
1093
 
1094
+ We examine the presence of undesired societal and cognitive biases in this model using different benchmarks. For societal biases,
1095
+ we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019).
1096
+ We report that while performance is high (accuracies around 0.8 depending on the social category) in disambiguated settings,
1097
+ the model performs very poorly in ambiguous settings, which indicates the presence of societal biases that need to be further addressed in post-training phases.
1098
 
1099
+ Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings.
1100
+ For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant,
1101
+ but relatively weak primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers.
1102
+ We measure majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects,
1103
+ with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.
1104
 
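+ As a heavily simplified illustration of the kind of positional-effect probe described above (not the protocol actually used, which relies on ARC and proper significance testing), the sketch below rotates the answer options of a single hand-written question and records which position the model prefers under each rotation; a strong primacy effect would show up as position 0 being chosen regardless of the rotation:

```python
# Illustrative sketch only: one made-up question, no dataset, no statistics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-2b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.eval()

def chosen_position(question: str, options: list[str]) -> int:
    """Return the index of the answer letter with the highest next-token logit."""
    letters = "ABCD"[: len(options)]
    body = question + "\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    messages = [{"role": "user", "content": body + "\nAnswer with a single letter."}]
    input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(input_ids).logits[0, -1]
    letter_ids = [tok.encode(l, add_special_tokens=False)[0] for l in letters]
    return max(range(len(letters)), key=lambda i: next_logits[letter_ids[i]].item())

question = "Which gas do plants primarily take in for photosynthesis?"
options = ["Carbon dioxide", "Oxygen", "Nitrogen", "Helium"]

picks = []
for shift in range(len(options)):          # rotate the answer list
    rotated = options[shift:] + options[:shift]
    picks.append(chosen_position(question, rotated))
print(picks)  # an unbiased model tracks the correct answer, not position 0
```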
1105
+ We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources
1106
+ in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.
1107
+
1108
+ These results can be expected from a model that has undergone only a preliminary instruction tuning.
1109
+ These tests are performed in order to show the biases the model may contain. We urge developers to take
1110
+ them into account and perform safety testing and tuning tailored to their specific applications of the model.
1111
 
1112
  ---
1113
 
 
1134
 
1135
  In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
1136
 
1137
+ At the national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
1138
 
1139
  At the international level, we thank the Welsh government, DFKI, the Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.
1140
 
 
1150
 
1151
  ### Citation
1152
 
1153
+ Technical report coming soon.
1154
 
1155
  ### License
1156
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
1160
  |:---:|:---:|:---:|
1161
  |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
1162
  |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
1163
+ |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "BSC-LT/salamandra-2b-instruct",
3
  "architectures": [
4
  "LlamaForCausalLM"
5
  ],
@@ -7,6 +7,7 @@
7
  "attention_dropout": 0.0,
8
  "bos_token_id": 1,
9
  "eos_token_id": 2,
 
10
  "hidden_act": "silu",
11
  "hidden_size": 2048,
12
  "initializer_range": 0.02,
@@ -17,7 +18,6 @@
17
  "num_attention_heads": 16,
18
  "num_hidden_layers": 24,
19
  "num_key_value_heads": 16,
20
- "num_layers": 24,
21
  "pretraining_tp": 1,
22
  "rms_norm_eps": 1e-05,
23
  "rope_scaling": null,
 
1
  {
2
+ "_name_or_path": "/gpfs/projects/bsc88/text/models/instruction-tuning/models/base_models_with_special_tokens/restart_mix1_all_fineweb_2b_new_data_hf",
3
  "architectures": [
4
  "LlamaForCausalLM"
5
  ],
 
7
  "attention_dropout": 0.0,
8
  "bos_token_id": 1,
9
  "eos_token_id": 2,
10
+ "head_dim": 128,
11
  "hidden_act": "silu",
12
  "hidden_size": 2048,
13
  "initializer_range": 0.02,
 
18
  "num_attention_heads": 16,
19
  "num_hidden_layers": 24,
20
  "num_key_value_heads": 16,
 
21
  "pretraining_tp": 1,
22
  "rms_norm_eps": 1e-05,
23
  "rope_scaling": null,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6d80ff3fb790ff53500c495369e135545fe55caa1aca593e6c013b0f90d8f154
3
  size 4507005744
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9aabb07d19e1cafe7b9f4bdff98bacc7c9f325a829505a2e376140568c227490
3
  size 4507005744
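
The configuration and weight changes in this commit can be sanity-checked locally. The following sketch is not part of the release; it simply re-derives the values shown in the diffs above, assuming a standard Hub download and a `transformers` version that exposes `head_dim` on Llama configs:

```python
# Hedged sketch: checks that the added `head_dim` is consistent with the rest of
# config.json, and that a downloaded model.safetensors matches the sha256 recorded
# in the Git LFS pointer shown in this commit.
import hashlib

from huggingface_hub import hf_hub_download
from transformers import AutoConfig

repo_id = "BSC-LT/salamandra-2b-instruct"

# 1) config.json: head_dim should equal hidden_size / num_attention_heads (2048 / 16 = 128).
config = AutoConfig.from_pretrained(repo_id)
assert config.head_dim == config.hidden_size // config.num_attention_heads == 128

# 2) model.safetensors: recompute the sha256 that the LFS pointer refers to.
weights_path = hf_hub_download(repo_id, filename="model.safetensors")
digest = hashlib.sha256()
with open(weights_path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)
# Expected per the pointer above:
# 9aabb07d19e1cafe7b9f4bdff98bacc7c9f325a829505a2e376140568c227490
print(digest.hexdigest())
```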