dtamayo committed
Commit 228aa6a · verified · 1 Parent(s): 31bef15

Update README.md

Files changed (1):
  1. README.md (+5 -5)
README.md CHANGED
@@ -95,7 +95,7 @@ The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
 
- The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).
+ The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/blob/main/configs/bsc_7b.yaml).
 
 ### Architecture
 
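For readers who want to inspect those hyperparameters programmatically, the linked file is a plain YAML config. A minimal sketch, assuming a local copy of the linked `bsc_7b.yaml` has been downloaded; the key names inside the file are whatever the config actually defines and are not reproduced here:

```python
import yaml  # PyYAML

# Load a local copy of the hyperparameter config linked above
# (bsc_7b.yaml is assumed to have been downloaded from the repo).
with open("bsc_7b.yaml") as f:
    config = yaml.safe_load(f)

# Print the top-level entries for a quick overview of the training setup.
for key, value in config.items():
    print(f"{key}: {value}")
```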
@@ -149,7 +149,7 @@ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/mar
 operated by Barcelona Supercomputing Center.
 
 The accelerated partition is composed of 1,120 nodes with the following specifications:
- - 4x Nvidia Hopper GPUs with 64 HBM2 memory
+ - 4x Nvidia Hopper GPUs with 64GB HBM2 memory
 - 2x Intel Sapphire Rapids 8460Y+ at 2.3Ghz and 32c each (64 cores)
 - 4x NDR200 (BW per node 800Gb/s)
  - 512 GB of Main memory (DDR5)
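For scale, simple arithmetic on the per-node figures above gives the aggregate capacity of the accelerated partition (whether the whole partition was used for these training runs is not stated in this excerpt):

```python
# Aggregate figures for the accelerated partition, derived from the
# per-node specification quoted above (illustrative arithmetic only).
nodes = 1_120
gpus_per_node = 4
hbm_per_gpu_gb = 64

total_gpus = nodes * gpus_per_node          # 4,480 Hopper GPUs
total_hbm_gb = total_gpus * hbm_per_gpu_gb  # 286,720 GB of HBM (~287 TB)

print(f"{total_gpus} GPUs, {total_hbm_gb / 1000:.0f} TB HBM")
```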
@@ -662,7 +662,7 @@ We only use tasks that are either human generated, human translated, or with a s
 
 During the implementation of the evaluation we observed a series of issues worth considering when replicating and interpreting the results presented. These issues include ≈1.5% variances in performance in some tasks depending on the version of the `transformers` library used, and depending on the use (or lack of use) of tensor parallelism when loading a model. When implementing existing tasks, we carry out a comprehensive quality evaluation of the dataset, the Harness task itself, and what kind of input models see during evaluation. Our implementation (see links above) addresses multiple existing problems such as errors in datasets and prompts, and lack of pre-processing. All this means that results will vary if using other Harness implementations, and may slightly vary depending on the replication setup.
 
- It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the models capabilities and potential. We thus advise caution when reading and interpreting the results.
+ It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the model's capabilities and potential. We thus advise caution when reading and interpreting the results.
 
 A full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.
 
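Because the paragraph above attributes ≈1.5% score variance to the `transformers` version and to the use of tensor parallelism, it helps to log the evaluation environment next to any replicated scores. A minimal sketch of such a log; the specific fields recorded here are an assumption, not part of the official evaluation code:

```python
import json
import platform

import torch
import transformers

# Record the environment factors the README flags as sources of
# score variance when replicating the evaluation.
env = {
    "python": platform.python_version(),
    "transformers": transformers.__version__,
    "torch": torch.__version__,
    "cuda_available": torch.cuda.is_available(),
    "visible_gpus": torch.cuda.device_count(),  # >1 often implies sharded / tensor-parallel loading
}
print(json.dumps(env, indent=2))
```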
@@ -956,7 +956,7 @@ Score 1: The answer is mathematically correct, with accurate calculations and ap
 
 #### Multilingual results
 
- Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion and language. Criteria with a `(B)` after their name are binary criteria (i.e., numbers go from 0 to 1, where 1 is best). The rest of the criteria are measured using a 5-point Likert scale, where 5 is best. The first number of the pair of numbers separated by `/` shows the average score for the criterion (and language). The second number of each pair is the robustness score, where numbers closer to 0 mean that the model generates similar responses when comparing the three prompt varieties for a single instance.
+ Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion and language. Criteria with a `(B)` after their name are binary criteria (i.e., numbers go from 0 to 1, where 1 is best). The rest of the criteria are measured using a 5-point Likert scale, where 5 is best. The first number of the pair of numbers separated by `/` shows the average score for the criterion (and language). The second number of each pair is the robustness score, where numbers closer to 0 means that the model generates similar responses when comparing the three prompt varieties for a single instance.
 
 Further details on all tasks and criteria, a full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.
 
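As a concrete illustration of the `average / robustness` pairs described above, here is a sketch that derives one such pair from per-prompt-variety Likert scores. The spread-based robustness formula is an assumption for illustration; the exact definition used by the authors is not given in this excerpt:

```python
from statistics import mean, pstdev

# Hypothetical Likert scores (1-5) for one criterion in one language:
# each inner list holds the scores the model received for the three
# prompt varieties of a single instance.
scores_per_instance = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
]

# First number of the pair: average score over all judgements.
average = mean(s for instance in scores_per_instance for s in instance)

# Second number: robustness, here the mean spread across the three prompt
# varieties of each instance; values closer to 0 mean the three varieties
# received similar scores. (Assumed definition.)
robustness = mean(pstdev(instance) for instance in scores_per_instance)

print(f"{average:.2f} / {robustness:.2f}")
```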
@@ -1120,7 +1120,7 @@ the model performs very poorly in ambiguous settings, which indicates the presen
 Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings.
 For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant,
 but relatively weak primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers.
- We measure effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects,
+ We measure the effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects,
 with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.
 
  We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources
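One way to make the primacy-effect measurement above concrete is to tally how often the model selects each answer position on the multiple-choice items, regardless of correctness. A minimal sketch with placeholder predictions; the real analysis would use the model's actual ARC selections and a significance test:

```python
from collections import Counter

# Placeholder 0-shot predictions: for each multiple-choice item, the
# 0-based index of the option the model selected.
predicted_positions = [0, 0, 1, 0, 2, 0, 1, 3, 0, 1]
num_options = 4

counts = Counter(predicted_positions)
total = len(predicted_positions)

# With no positional bias, each position would be chosen ~1/num_options of
# the time; a surplus at position 0 indicates a primacy effect.
for pos in range(num_options):
    share = counts[pos] / total
    print(f"option {pos}: chosen {share:.0%} (uniform baseline {1 / num_options:.0%})")
```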