joanllop committed on
Commit 0b01ef3 · 1 Parent(s): b85272d
Files changed (4)
  1. .gitattributes +3 -0
  2. README.md +42 -28
  3. config.json +2 -2
  4. model.safetensors +1 -1
.gitattributes CHANGED
@@ -34,3 +34,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  images/salamandra_header.png filter=lfs diff=lfs merge=lfs -text
 
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  images/salamandra_header.png filter=lfs diff=lfs merge=lfs -text
37
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
38
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
39
+ tokenizer.model filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -146,7 +146,7 @@ The accelerated partition is composed of 1,120 nodes with the following specific
146
  The instruction-following models use the commonly adopted ChatML template:
147
 
148
  ```jinja
149
- {%- if not date_string is defined %}{%- set date_string = "2024-09-30" %}{%- endif %}{%- set system_message = messages[0].content if messages[0].role == "system" else "system message. Today Date: "+ date_string -%}{%- if messages[0].role == "system" -%}{%- set messages = messages[1:] -%}{%- endif -%}{{ "<|im_start|>system\n" + system_message + "<|im_end|>\n" }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
150
  ```
151
  Where `system_message` is used to guide the model during generation and `date_string` can be set to allow the model to respond with the current date.
152
 
@@ -607,31 +607,34 @@ The dataset does not allow for external contributions.
607
 
608
  ### Finetuning Data
609
 
610
- This instruction-tuned variant has been trained with a mixture of 276k English, Spanish, and Catalan multi-turn instructions gathered from open datasets:
611
- | Dataset | ca | en | es |
612
- |-----------------------|:------:|:------:|:------:|
613
- | alpaca-cleaned | - | 50,000 | - |
614
- | aya-dataset | - | 3,944 | 3,854 |
615
- | CoQCat | 4,797 | - | - |
616
- | databricks-dolly-15k | - | 15,011 | - |
617
- | dolly-3k-ca | 3,232 | - | - |
618
- | flores-instr | 1,994 | 1,994 | 3,988 |
619
- | MentorCA | 7,122 | - | - |
620
- | MentorES | - | - | 7,122 |
621
- | no-robots | - | 9,499 | - |
622
- | oasst-ca | 2,518 | - | - |
623
- | oasst2 | 750 | 31,086 | 15,438 |
624
- | open-orca | - | 50,000 | - |
625
- | RagMultilingual | 16,043 | 14,997 | 11,263 |
626
- | tower-blocks | - | 19,895 | 2,000 |
627
- | **Total** | **36,456** | **196,426** | **43,665** |
 
628
 
629
  ---
630
 
 
631
  ## Evaluation
632
 
633
  ### Gold-standard benchmarks
634
-
 
635
  Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/spanish_bench), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/catalan_bench), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/basque_bench) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/galician_bench). These benchmarks include both new and existing tasks and datasets. Given that this is an instruction-tuned model, we add the LM Evaluation Harness's native `chat-template` feature to the setup. In the tables below, we include results on a selection of evaluation datasets that represent the model's performance across a variety of tasks within these benchmarks.
636
 
637
  We only use tasks that are human generated, human translated, or built with a strong human-in-the-loop process (i.e., machine translation followed by professional revision, or machine generation followed by human revision and annotation). This is the reason for the variation in the number of tasks reported across languages. As more tasks that fulfill these requirements are published, we will update the presented results. We also intend to expand the evaluation to other languages, as long as the datasets meet our quality standards.
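  As a rough, non-authoritative sketch of how such a run can be reproduced with the harness's Python API (assuming a recent `lm-eval` release in which `simple_evaluate` exposes `apply_chat_template`, and using the `catalan_bench`/`spanish_bench` task groups from the branches linked above), rather than the authors' exact configuration:

```python
# Hedged sketch of a 0-shot run with the LM Evaluation Harness; assumes a recent
# lm-evaluation-harness release where `simple_evaluate` accepts `apply_chat_template`.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=BSC-LT/salamandra-2b-instruct,dtype=bfloat16",
    tasks=["catalan_bench", "spanish_bench"],  # task groups from the benchmarks linked above
    num_fewshot=0,
    apply_chat_template=True,  # assumption: available in recent harness versions
    batch_size="auto",
)
print(results["results"])
```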
@@ -866,6 +869,7 @@ All results reported below are on a 0-shot setting.
866
  </tr>
867
  </tbody>
868
  </table>
 
869
 
870
  ### LLM-as-a-judge
871
 
@@ -1087,13 +1091,23 @@ Further details on all tasks and criteria, a full list of results compared to ot
1087
 
1088
  ## Ethical Considerations and Limitations
1089
 
1090
- We examine the presence of undesired societal and cognitive biases present in this model using different benchmarks. For societal biases, we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019). We report that moderate accuracies (between 0.5 and 0.6 depending on the social groups) in disambiguated settings, the model performs very poorly in ambiguous setting. Taken together, these results suggest the pervasiveness of social biases that may have an effect on task performance
 
1091
 
1092
- Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant, but moderate weak primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers. We measure effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects, with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.
 
1093
 
1094
- We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.
1095
-
1096
- These results can be expected from a model that has undergone only a preliminary instruction tuning. These tests are performed in order to show the biases the model may contain. We urge developers to take them into account and perform safety testing and tuning tailored to their specific applications of the model.
 
1097
 
1098
  ---
1099
 
@@ -1120,7 +1134,7 @@ This project has benefited from the contributions of numerous teams and institut
1120
 
1121
  In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
1122
 
1123
- At national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
1124
 
1125
  At the international level, we thank the Welsh government, DFKI, the Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.
1126
 
@@ -1136,7 +1150,7 @@ The Barcelona Supercomputing Center, as the owner and creator of the model, shal
1136
 
1137
  ### Citation
1138
 
1139
- Technical report and paper coming soon.
1140
 
1141
  ### License
1142
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
@@ -1146,4 +1160,4 @@ Technical report and paper coming soon.
1146
  |:---:|:---:|:---:|
1147
  |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
1148
  |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
1149
- |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
 
146
  The instruction-following models use the commonly adopted ChatML template:
147
 
148
  ```jinja
149
+ {%- if messages[0]['role'] == 'system' %}{%- set system_message = messages[0]['content'] %}{%- set loop_messages = messages[1:] %}{%- else %}{%- set system_message = 'SYSTEM MESSAGE' %}{%- set loop_messages = messages %}{%- endif %}{%- if not date_string is defined %}{%- set date_string = '2024-09-30' %}{%- endif %}{{ "<|im_start|>system\n" + system_message + "<|im_end|>\n" }}{% for message in loop_messages %}{%- if (message['role'] != 'user') and (message['role'] != 'assistant')%}{{ raise_exception('Only user and assistant roles are supported after the initial optional system message.') }}{% endif %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
150
  ```
151
  Where `system_message` is used to guide the model during generation and `date_string` can be set to allow the model to respond with the current date.
152
 
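  As a non-official illustration, the template above can be applied through `transformers`. The snippet below assumes the repository's tokenizer ships this chat template and that the installed `transformers` version forwards extra keyword arguments such as `date_string` to the template:

```python
# Minimal sketch, not an official usage snippet: assumes the tokenizer of this repo
# carries the ChatML template shown above, and that `transformers` forwards extra
# kwargs (here `date_string`) to the chat template, as recent versions do.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Qui va escriure 'La plaça del Diamant'?"},
]

# Renders the <|im_start|>/<|im_end|> turns and appends the assistant header.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    date_string="2024-10-07",  # optional; the template defaults to 2024-09-30
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```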
 
607
 
608
  ### Finetuning Data
609
 
610
+ This instruction-tuned variant has been fine-tuned with a collection of 273k instructions, focusing on performance in Catalan, English and Spanish. However, instruction data for other closely related Iberian languages has also been included, since it yielded a positive impact on the languages of interest. That said, performance in these additional languages is not guaranteed due to the limited amount of available data and the lack of resources for thorough testing.
611
+
612
+ | **Dataset** | **ca** | **en** | **es** | **eu** | **gl** | **pt** | **Total** |
613
+ |----------------------|------------|-------------|------------|-----------|---------|------------|-------------|
614
+ | alpaca-cleaned | | 49,950 | | | | | **49,950** |
615
+ | aya-dataset | | 3,941 | 3,851 | 939 | | 8,995 | **17,726** |
616
+ | coqcat | 4,797 | | | | | | **4,797** |
617
+ | databricks-dolly-15k | | 15,011 | | | | | **15,011** |
618
+ | dolly-ca | 3,232 | | | | | | **3,232** |
619
+ | flores-dev | 986 | 1,037 | 1,964 | 493 | 505 | | **4,985** |
620
+ | mentor-ca | 7,119 | | | | | | **7,119** |
621
+ | mentor-es | | | 7,122 | | | | **7,122** |
622
+ | no-robots | | 9,485 | | | | | **9,485** |
623
+ | oasst-ca | 2,517 | | | | | | **2,517** |
624
+ | oasst2 | 750 | 31,086 | 15,438 | 190 | 197 | 1,203 | **48,864** |
625
+ | open-orca | | 49,996 | | | | | **49,996** |
626
+ | rag-multilingual | 16,043 | 14,997 | 11,263 | | | | **42,303** |
627
+ | tower-blocks | | 7,762 | 1,000 | | | 1,000 | **9,762** |
628
+ | **Total** | **35,444** | **183,265** | **40,638** | **1,622** | **702** | **11,198** | **272,869** |
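+ As a rough illustration of how a chat-style mixture like the one above can be assembled, the sketch below pulls two of the listed sources from the Hugging Face Hub and maps them onto a shared multi-turn schema. The Hub IDs and column names are assumptions (the table does not specify repository IDs), and none of the filtering or sampling actually applied here is reproduced:

```python
# Hedged illustration only: assumed Hub IDs for two of the listed sources, and no
# attempt to reproduce the authors' selection, deduplication, or language balancing.
from datasets import concatenate_datasets, load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")  # assumed Hub ID
alpaca = load_dataset("yahma/alpaca-cleaned", split="train")            # assumed Hub ID

def dolly_to_chat(ex):
    # databricks-dolly-15k columns: instruction, context, response
    prompt = ex["instruction"] + ("\n\n" + ex["context"] if ex["context"] else "")
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": ex["response"]}]}

def alpaca_to_chat(ex):
    # alpaca-cleaned columns: instruction, input, output
    prompt = ex["instruction"] + ("\n\n" + ex["input"] if ex["input"] else "")
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": ex["output"]}]}

# Normalise both sources to the same schema, then concatenate into one mixture.
mixture = concatenate_datasets([
    dolly.map(dolly_to_chat, remove_columns=dolly.column_names),
    alpaca.map(alpaca_to_chat, remove_columns=alpaca.column_names),
])
print(len(mixture))
```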
629
 
630
  ---
631
 
632
+
633
  ## Evaluation
634
 
635
  ### Gold-standard benchmarks
636
+ WiP
637
+ <!--
638
  Evaluation is done using the Language Model Evaluation Harness (Gao et al., 2024). We evaluate on a set of tasks taken from [SpanishBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/spanish_bench), [CatalanBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/catalan_bench), [BasqueBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/basque_bench) and [GalicianBench](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/galician_bench). These benchmarks include both new and existing tasks and datasets. Given that this is an instruction-tuned model, we add the LM Evaluation Harness's native `chat-template` feature to the setup. In the tables below, we include results on a selection of evaluation datasets that represent the model's performance across a variety of tasks within these benchmarks.
639
 
640
  We only use tasks that are human generated, human translated, or built with a strong human-in-the-loop process (i.e., machine translation followed by professional revision, or machine generation followed by human revision and annotation). This is the reason for the variation in the number of tasks reported across languages. As more tasks that fulfill these requirements are published, we will update the presented results. We also intend to expand the evaluation to other languages, as long as the datasets meet our quality standards.
 
869
  </tr>
870
  </tbody>
871
  </table>
872
+ -->
873
 
874
  ### LLM-as-a-judge
875
 
 
1091
 
1092
  ## Ethical Considerations and Limitations
1093
 
1094
+ We examine the presence of undesired societal and cognitive biases in this model using different benchmarks. For societal biases,
1095
+ we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019).
1096
+ We report that while performance is high (accuracies around 0.8 depending on the social category) in disambiguated settings,
1097
+ the model performs very poorly in ambiguous settings, which indicates the presence of societal biases that need to be further addressed in post-training phases.
1098
 
1099
+ Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings.
1100
+ For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant,
1101
+ but relatively weak primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers.
1102
+ We measure majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects,
1103
+ with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.
1104
 
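+ As a heavily simplified illustration of the kind of positional-effect probe described above (not the protocol actually used, which relies on ARC and proper significance testing), the sketch below rotates the answer options of a single hand-written question and records which position the model prefers under each rotation; a strong primacy effect would show up as position 0 being chosen regardless of the rotation:

```python
# Illustrative sketch only: one made-up question, no dataset, no statistics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-2b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.eval()

def chosen_position(question: str, options: list[str]) -> int:
    """Return the index of the answer letter with the highest next-token logit."""
    letters = "ABCD"[: len(options)]
    body = question + "\n" + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    messages = [{"role": "user", "content": body + "\nAnswer with a single letter."}]
    input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(input_ids).logits[0, -1]
    letter_ids = [tok.encode(l, add_special_tokens=False)[0] for l in letters]
    return max(range(len(letters)), key=lambda i: next_logits[letter_ids[i]].item())

question = "Which gas do plants primarily take in for photosynthesis?"
options = ["Carbon dioxide", "Oxygen", "Nitrogen", "Helium"]

picks = []
for shift in range(len(options)):          # rotate the answer list
    rotated = options[shift:] + options[:shift]
    picks.append(chosen_position(question, rotated))
print(picks)  # an unbiased model tracks the correct answer, not position 0
```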
1105
+ We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources
1106
+ in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.
1107
+
1108
+ These results can be expected from a model that has undergone only a preliminary instruction tuning.
1109
+ These tests are performed in order to show the biases the model may contain. We urge developers to take
1110
+ them into account and perform safety testing and tuning tailored to their specific applications of the model.
1111
 
1112
  ---
1113
 
 
1134
 
1135
  In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.
1136
 
1137
+ At the national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.
1138
 
1139
  At the international level, we thank the Welsh government, DFKI, the Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.
1140
 
 
1150
 
1151
  ### Citation
1152
 
1153
+ Technical report coming soon.
1154
 
1155
  ### License
1156
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
1160
  |:---:|:---:|:---:|
1161
  |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
1162
  |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
1163
+ |40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |
config.json CHANGED
@@ -1,5 +1,5 @@
1
  {
2
- "_name_or_path": "BSC-LT/salamandra-2b-instruct",
3
  "architectures": [
4
  "LlamaForCausalLM"
5
  ],
@@ -7,6 +7,7 @@
7
  "attention_dropout": 0.0,
8
  "bos_token_id": 1,
9
  "eos_token_id": 2,
 
10
  "hidden_act": "silu",
11
  "hidden_size": 2048,
12
  "initializer_range": 0.02,
@@ -17,7 +18,6 @@
17
  "num_attention_heads": 16,
18
  "num_hidden_layers": 24,
19
  "num_key_value_heads": 16,
20
- "num_layers": 24,
21
  "pretraining_tp": 1,
22
  "rms_norm_eps": 1e-05,
23
  "rope_scaling": null,
 
1
  {
2
+ "_name_or_path": "/gpfs/projects/bsc88/text/models/instruction-tuning/models/base_models_with_special_tokens/restart_mix1_all_fineweb_2b_new_data_hf",
3
  "architectures": [
4
  "LlamaForCausalLM"
5
  ],
 
7
  "attention_dropout": 0.0,
8
  "bos_token_id": 1,
9
  "eos_token_id": 2,
10
+ "head_dim": 128,
11
  "hidden_act": "silu",
12
  "hidden_size": 2048,
13
  "initializer_range": 0.02,
 
18
  "num_attention_heads": 16,
19
  "num_hidden_layers": 24,
20
  "num_key_value_heads": 16,
 
21
  "pretraining_tp": 1,
22
  "rms_norm_eps": 1e-05,
23
  "rope_scaling": null,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:6d80ff3fb790ff53500c495369e135545fe55caa1aca593e6c013b0f90d8f154
3
  size 4507005744
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9aabb07d19e1cafe7b9f4bdff98bacc7c9f325a829505a2e376140568c227490
3
  size 4507005744
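
The configuration and weight changes in this commit can be sanity-checked locally. The following sketch is not part of the release; it simply re-derives the values shown in the diffs above, assuming a standard Hub download and a `transformers` version that exposes `head_dim` on Llama configs:

```python
# Hedged sketch: checks that the added `head_dim` is consistent with the rest of
# config.json, and that a downloaded model.safetensors matches the sha256 recorded
# in the Git LFS pointer shown in this commit.
import hashlib

from huggingface_hub import hf_hub_download
from transformers import AutoConfig

repo_id = "BSC-LT/salamandra-2b-instruct"

# 1) config.json: head_dim should equal hidden_size / num_attention_heads (2048 / 16 = 128).
config = AutoConfig.from_pretrained(repo_id)
assert config.head_dim == config.hidden_size // config.num_attention_heads == 128

# 2) model.safetensors: recompute the sha256 that the LFS pointer refers to.
weights_path = hf_hub_download(repo_id, filename="model.safetensors")
digest = hashlib.sha256()
with open(weights_path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)
# Expected per the pointer above:
# 9aabb07d19e1cafe7b9f4bdff98bacc7c9f325a829505a2e376140568c227490
print(digest.hexdigest())
```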