jsaizant committed
Commit 32e544d · verified · 1 Parent(s): 8a4eb20

Update README.md

Files changed (1): README.md (+43 -50)
README.md CHANGED
@@ -463,28 +463,26 @@ We provide an extensive Datasheet section following the best practices defined by
 
 **For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**
 
- The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of
- European languages (35) and code (including 92 different programming languages). In addition, we aim to represent especially the co-official
- languages of Spain: Spanish, Catalan, Galician, and Basque. This is the reason why we carry out an oversampling of these languages.
+ The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of European languages (35)
+ and programming languages (92). We also want to represent the co-official languages of Spain: Spanish, Catalan, Galician and Basque. For this reason, we oversample
+ these languages by a factor of 2.
 
- We detected that there is a great lack of massive multilingual data, especially in minority languages (Ostendorff & Rehm, 2023), so part of
- our efforts in the creation of this pre-training dataset have resulted in the contribution to large projects such as the Community OSCAR
- (Brack et al., 2024), which includes 151 languages and 40T words, or CATalog (Palomar-Giner et al., 2024), the largest open dataset in
- Catalan in the world.
+ There is a great lack of massive multilingual data, especially in minority languages (Ostendorff & Rehm, 2023), so part of our efforts in the creation of
+ this pre-training dataset has resulted in contributions to large projects such as the Community OSCAR (Brack et al., 2024), which includes 151 languages
+ and 40T words, or CATalog (Palomar-Giner et al., 2024), the largest open dataset in Catalan in the world.
 
 **Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?**
 
- The dataset has been created by the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de
- Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
- and the use of HPC. In particular, it was created by the unit's data team, the main contributors being Javier Saiz, Ferran Espuña, and
- Jorge Palomar.
+ The dataset has been created by the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS),
+ which aims to advance the field of natural language processing through cutting-edge research and development and the use of HPC. In particular, it was created by
+ the unit's data team, the main contributors being José Javier Saiz, Ferran Espuña and Jorge Palomar.
 
- However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
- and public institutions, which can be found in detail in the acknowledgements.
+ However, the creation of the dataset would not have been possible without the collaboration of a large number of partners and public institutions,
+ which are listed in detail in the acknowledgements.
 
 **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
 
- This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
+ This work has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
 
 This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
 within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
@@ -525,14 +523,14 @@ sources were sampled in proportion to their occurrence.
 
 **What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.**
 
- Each instance consists of a text document processed for deduplication, language identification, and source-specific filtering. Some
- documents required optical character recognition (OCR) to extract text from non-text formats such as PDFs.
+ Each instance consists of a text document processed for deduplication, language identification, and source-specific filtering. Some documents required
+ optical character recognition (OCR) to extract text from non-text formats such as PDFs.
 
 **Is there a label or target associated with each instance? If so, please provide a description.**
 
- Each instance is labeled with a unique identifier, the primary language of the content, and the URL for web-sourced instances. Additional
- labels were automatically assigned to detect specific types of content harmful or toxic content and to assign preliminary indicators of
- undesired qualities —very short documents, high density of symbols, etc.— which were used for filtering instances.
+ Each instance is labelled with a unique identifier, the primary language of the content, and the URL for web-sourced instances. Additional labels were
+ automatically assigned to detect specific types of content (harmful or toxic content) and to assign preliminary indicators of undesired qualities (very
+ short documents, high density of symbols, etc.), which were used for filtering instances.
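
As an illustration of the labelling scheme described above, a minimal sketch of what a single instance record might look like; the field names are hypothetical, not the dataset's actual schema:

```python
# Hypothetical instance record: a unique identifier, the primary language,
# the source URL (web-sourced instances only), and automatically assigned
# indicators used later for filtering.
instance = {
    "id": "doc-000001",                 # unique identifier (format assumed)
    "text": "Document content ...",     # the text document itself
    "lang": "gl",                       # primary language of the content
    "url": "https://example.org/page",  # only present for web-sourced instances
    "harmful": False,                   # harmful/toxic content indicator
    "quality_flags": ["very_short"],    # preliminary undesired-quality indicators
}
```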
 
 **Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.**
 
@@ -544,12 +542,12 @@ Instances are related through shared metadata, such as source and language identifiers.
 
 **Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.**
 
- The dataset is split randomly into training, validation, and test sets.
+ The dataset is randomly divided into training, validation and test sets, where the validation and test sets are each 1% of the total corpus.
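
A minimal sketch of the 98/1/1 random division described above; the seed and implementation details are assumptions, not the pipeline's actual code:

```python
import random

def split_corpus(doc_ids, val_frac=0.01, test_frac=0.01, seed=0):
    """Randomly divide documents into train/validation/test,
    with validation and test each taking ~1% of the corpus."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    return {
        "validation": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }

splits = split_corpus(range(100_000))
assert len(splits["validation"]) == len(splits["test"]) == 1_000
```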
 
 **Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.**
 
- Despite removing duplicated instances within each source, redundancy remains at the paragraph and sentence levels, particularly in
- web-sourced instances where SEO techniques and templates contribute to repeated textual patterns. Some instances may also be duplicated
+ Despite removing duplicated instances within each source, redundancy remains at the paragraph and sentence levels, particularly in web-sourced
+ instances where search engine optimization techniques and templates contribute to repeated textual patterns. Some instances may also be duplicated
 across sources due to format variations.
 
 **Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.**
@@ -573,10 +571,10 @@ The dataset does not explicitly identify any subpopulations.
 
 **Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.**
 
- Web-sourced instances in the dataset may contain personally identifiable information (PII) that is publicly available on the Web, such as
- names, IP addresses, email addresses, and phone numbers. While it would be possible to indirectly identify individuals through the
- combination of multiple data points, the nature and scale of web data makes it difficult to parse such information. In any case, efforts are
- made to filter or anonymize sensitive data during pre-processing, but some identifiable information may remain in the dataset.
+ Web-sourced instances in the dataset may contain personally identifiable information (PII) that is publicly available on the Web, such as names,
+ IP addresses, email addresses, and phone numbers. While it would be possible to indirectly identify individuals through the combination of multiple
+ data points, the nature and scale of web data make it difficult to parse such information. In any case, efforts are made to filter or anonymize
+ sensitive data (Mina et al., 2024), but some identifiable information may remain in the dataset.
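
A rough sketch of the kind of rule-based scrubbing such pre-processing might apply; the patterns and placeholder tokens below are illustrative assumptions, not the method of Mina et al. (2024):

```python
import re

# Illustrative patterns only; real PII detection is considerably more involved.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize(text: str) -> str:
    """Replace e-mail addresses and IPv4 addresses with placeholder tokens."""
    return IPV4.sub("[IP]", EMAIL.sub("[EMAIL]", text))

print(anonymize("Contact admin@example.org from 192.168.0.1"))
# -> Contact [EMAIL] from [IP]
```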
 
 **Does the dataset contain data that might be considered sensitive in any way? If so, please provide a description.**
 
@@ -590,28 +588,27 @@ especially if the content originates from less-regulated sources or user-generated platforms.
 
 This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
 - Web-sourced datasets with some preprocessing available under permissive license.
- - Domain-specific or language-specific raw crawls, always respecting robots.txt.
- - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
- (p.e. CATalog).
+ - Domain-specific or language-specific raw crawls.
+ - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects (e.g. CATalog).
 
 **What mechanisms or procedures were used to collect the data? How were these mechanisms or procedures validated?**
 
- According to the three groups previously defined, these are the mechanisms used in each of them:
- - Open direct download. Validation: data integrity tests.
- - Ad-hoc scrapers or crawlers. Validation: software unit and data integrity tests.
- - Direct download via FTP, SFTP, API or S3. Validation: data integrity tests.
+ The data collection process was carried out using three different mechanisms, each corresponding to one of the groups defined in the previous answer. The specific methods used and their respective validation procedures are outlined below:
+ - Open direct download: Data were obtained directly from publicly accessible sources, such as websites or repositories that provide open data downloads. We validate the data with a data integrity check, which ensures that the downloaded files are complete, uncorrupted, and in the expected format and structure.
+ - Ad hoc scrapers or crawlers: Custom web scraping scripts or crawlers were used to extract data from various online sources where direct downloads were not available. These scripts navigate web pages, extract relevant data and store it in a structured format. We validate this method with software unit tests, which evaluate the functionality of individual components of the scraping programs and check for errors or unexpected behaviour. In addition, data integrity tests were performed to verify that the collected data remained complete throughout the extraction and storage process.
+ - Direct download via FTP, SFTP, API or S3: Some datasets were acquired using secure transfer protocols such as FTP (File Transfer Protocol), SFTP (Secure File Transfer Protocol), or API (Application Programming Interface) requests from cloud storage services such as Amazon S3. As with the open direct download method, data integrity tests were used to validate the completeness of the files and to ensure that the files were not altered or corrupted during the transfer process.
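
For illustration, the data integrity tests mentioned above could look roughly like the following checksum-based sketch; the function names and hash choice are assumptions, not the team's actual tooling:

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path: Path, expected_sha256: str) -> bool:
    """Minimal integrity test: the file exists, is non-empty, and matches its checksum."""
    return path.is_file() and path.stat().st_size > 0 and sha256sum(path) == expected_sha256
```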
 
 **If the dataset is a sample from a larger set, what was the sampling strategy?**
 
- The sampling strategy was to use the whole dataset resulting from the filtering explained in the preprocessing/cleaning/labelling section,
- with the particularity that an upsampling of 2 (i.e. twice the probability of sampling a document) was performed for the co-official
- languages of Spain (Spanish, Catalan, Galician, Basque), and a downsampling of 1/2 was applied for code (half the probability of sampling a
- code document, evenly distributed among all programming languages).
+ The sampling strategy was to use the whole dataset resulting from the filtering explained in the 'preprocessing/cleaning/labelling' section,
+ with the particularity that an upsampling of 2 (i.e. twice the probability of sampling a document) was performed for the co-official languages
+ of Spain (Spanish, Catalan, Galician, Basque), and a downsampling of 1/2 was applied for code (half the probability of sampling a code document,
+ evenly distributed among all programming languages).
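
In sampling-weight terms, the strategy above amounts to multiplying a document's base sampling probability by 2 for the four co-official languages and by 1/2 for code. A minimal sketch under those assumptions (the function and language codes are hypothetical):

```python
# Hypothetical weighting sketch: every document starts from a base weight of 1.0,
# the co-official languages of Spain are upsampled x2, and code is downsampled x1/2,
# with the code share spread evenly across all programming languages.
UPSAMPLED_LANGS = {"es", "ca", "gl", "eu"}  # Spanish, Catalan, Galician, Basque

def sampling_weight(lang: str, is_code: bool) -> float:
    if is_code:
        return 0.5  # half the probability of sampling a code document
    if lang in UPSAMPLED_LANGS:
        return 2.0  # twice the probability for co-official languages of Spain
    return 1.0      # all other languages keep their natural proportion

assert sampling_weight("ca", is_code=False) == 2.0
assert sampling_weight("en", is_code=False) == 1.0
assert sampling_weight("python", is_code=True) == 0.5
```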
 
 **Who was involved in the data collection process and how were they compensated?**
 
- This data is generally extracted, filtered and sampled by automated processes. The code required to run these processes has been developed
- entirely by members of the LangTech data team, or otherwise obtained from open-source software. Furthermore, there has been no monetary
+ This data is generally extracted, filtered and sampled by automated processes. The code required to run these processes has been developed entirely
+ by members of the Language Technologies data team, or otherwise obtained from open-source software. Furthermore, there has been no monetary
 consideration for acquiring data from suppliers.
 
 **Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances? If not, please describe the timeframe in which the data associated with the instances was created.**
@@ -630,12 +627,9 @@ ethical and legal point of view, respectively.
 
 **Was any preprocessing/cleaning/labeling of the data done? If so, please provide a description. If not, you may skip the remaining questions in this section.**
 
- Instances of text documents were not altered, but web-sourced documents were filtered based on specific criteria along two dimensions:
- - Quality: documents with a score lower than 0.8, based on undesired qualities, such as documents with low number of lines, very short
- sentences, presence of long footers and headers, and high percentage of punctuation, obtained through CURATE (Palomar-Giner et al., 2024)
- were filtered out.
- - Harmful or adult content: documents originating from Colossal OSCAR were filtered using LLM-Datasets (Ostendorff et al., 2024) based on
- the perplexity from a language model (‘harmful_pp’ field) provided by the Ungoliant pipeline (Abadji et al., 2021).
+ No changes were made to the content of individual text document instances. However, the web-sourced documents underwent a filtering process based on specific criteria along two key dimensions:
+ - Quality filtering: The text processing pipeline CURATE (Palomar-Giner et al., 2024) calculates a quality score for each document based on a set of filtering criteria that identify undesirable textual characteristics. Any document with a score below the 0.8 threshold was excluded from the dataset.
+ - Harmful or adult content filtering: To reduce the amount of harmful or inappropriate material in the dataset, documents from Colossal OSCAR were filtered using the Ungoliant pipeline (Abadji et al., 2021), which provides the 'harmful\_pp' field, a perplexity-based score generated by a language model.
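
Put together as code, the two filtering dimensions described above might look like the following sketch; the record fields, the source tag and the perplexity cut-off are assumptions (the datasheet only fixes the 0.8 quality threshold):

```python
QUALITY_THRESHOLD = 0.8  # documents scoring below this CURATE score are dropped

def keep_document(doc: dict, harmful_pp_cutoff: float = 1000.0) -> bool:
    """Apply the two filtering dimensions: quality and harmful/adult content."""
    # Quality filtering: CURATE-style score, assumed stored as 'quality_score'.
    if doc["quality_score"] < QUALITY_THRESHOLD:
        return False
    # Harmful-content filtering applies only to Colossal OSCAR documents; the
    # 'harmful_pp' field is a perplexity-based score from the Ungoliant pipeline,
    # and the cut-off value and comparison direction here are purely illustrative.
    if doc.get("source") == "colossal-oscar" and doc.get("harmful_pp", float("inf")) < harmful_pp_cutoff:
        return False
    return True

assert keep_document({"quality_score": 0.9})
assert not keep_document({"quality_score": 0.5})
```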
 
 **Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? If so, please provide a link or other access point to the “raw” data.**
 
@@ -694,11 +688,10 @@ The dataset will not be updated.
 
 **If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances? If so, please describe these limits and explain how they will be enforced.**
 
- The dataset does not keep sensitive data that could allow direct identification of individuals, apart from the data that is publicly
- available in web-sourced content. Due to the sheer volume and diversity of web data, it is not feasible to notify individuals or manage data
- retention on an individual basis. However, efforts are made to mitigate the risks associated with sensitive information through
- pre-processing and filtering to remove identifiable or harmful content. Despite these measures, vigilance is maintained to address potential
- privacy and ethical issues.
+ The dataset does not keep sensitive data that could allow direct identification of individuals, apart from the data that is publicly available in
+ web-sourced content. Due to the sheer volume and diversity of web data, it is not feasible to notify individuals or manage data retention on an
+ individual basis. However, efforts are made to mitigate the risks associated with sensitive information through pre-processing and filtering to
+ remove identifiable or harmful content. Despite these measures, vigilance is maintained to address potential privacy and ethical issues.
 
 **Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.**
 