jsaizant committed
Commit 32e544d · verified · 1 Parent(s): 8a4eb20

Update README.md

Files changed (1): README.md (+43 -50)
README.md CHANGED
@@ -463,28 +463,26 @@ We provide an extensive Datasheet section following the best practices defined by
 
 **For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**
 
- The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of
- European languages (35) and code (including 92 different programming languages). In addition, we aim to represent especially the co-official
- languages of Spain: Spanish, Catalan, Galician, and Basque. This is the reason why we carry out an oversampling of these languages.
+ The purpose of creating this dataset is to pre-train the Salamandra family of multilingual models with high performance in a large number of European languages (35)
+ and programming languages (92). We also want to represent the co-official languages of Spain: Spanish, Catalan, Galician and Basque. For this reason, we oversample
+ these languages by a factor of 2.
 
- We detected that there is a great lack of massive multilingual data, especially in minority languages (Ostendorff & Rehm, 2023), so part of
- our efforts in the creation of this pre-training dataset have resulted in the contribution to large projects such as the Community OSCAR
- (Brack et al., 2024), which includes 151 languages and 40T words, or CATalog (Palomar-Giner et al., 2024), the largest open dataset in
- Catalan in the world.
+ There is a great lack of massive multilingual data, especially in minority languages (Ostendorff & Rehm, 2023), so part of our efforts in the creation of
+ this pre-training dataset has resulted in contributions to large projects such as the Community OSCAR (Brack et al., 2024), which includes 151 languages
+ and 40T words, or CATalog (Palomar-Giner et al., 2024), the largest open dataset in Catalan in the world.
 
 **Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?**
 
- The dataset has been created by the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de
- Supercomputación (BSC-CNS), which aims to advance the field of natural language processing through cutting-edge research and development
- and the use of HPC. In particular, it was created by the unit's data team, the main contributors being Javier Saiz, Ferran Espuña, and
- Jorge Palomar.
+ The dataset has been created by the Language Technologies unit (LangTech) of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS),
+ which aims to advance the field of natural language processing through cutting-edge research and development and the use of HPC. In particular, it was created by
+ the unit's data team, the main contributors being José Javier Saiz, Ferran Espuña and Jorge Palomar.
 
- However, the creation of the dataset would not have been possible without the collaboration of a large number of collaborators, partners,
- and public institutions, which can be found in detail in the acknowledgements.
+ However, the creation of the dataset would not have been possible without the collaboration of a large number of partners and public institutions,
+ which are listed in detail in the acknowledgements.
 
 **Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.**
 
- This work/research has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
+ This work has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/).
 
 This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU
 within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.
@@ -525,14 +523,14 @@ sources were sampled in proportion to their occurrence.
 
 **What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.**
 
- Each instance consists of a text document processed for deduplication, language identification, and source-specific filtering. Some
- documents required optical character recognition (OCR) to extract text from non-text formats such as PDFs.
+ Each instance consists of a text document processed for deduplication, language identification, and source-specific filtering. Some documents required
+ optical character recognition (OCR) to extract text from non-text formats such as PDFs.
 
 **Is there a label or target associated with each instance? If so, please provide a description.**
 
- Each instance is labeled with a unique identifier, the primary language of the content, and the URL for web-sourced instances. Additional
- labels were automatically assigned to detect specific types of content harmful or toxic content and to assign preliminary indicators of
- undesired qualities —very short documents, high density of symbols, etc.— which were used for filtering instances.
+ Each instance is labelled with a unique identifier, the primary language of the content, and the URL for web-sourced instances. Additional labels were
+ automatically assigned to detect specific types of content (harmful or toxic content) and to assign preliminary indicators of undesired qualities (very
+ short documents, high density of symbols, etc.), which were used for filtering instances.
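
As an illustration of the labelling scheme described above, a minimal sketch of what a single instance record might look like; the field names are hypothetical, not the dataset's actual schema:

```python
# Hypothetical instance record: a unique identifier, the primary language,
# the source URL (web-sourced instances only), and automatically assigned
# indicators used later for filtering.
instance = {
    "id": "doc-000001",                 # unique identifier (format assumed)
    "text": "Document content ...",     # the text document itself
    "lang": "gl",                       # primary language of the content
    "url": "https://example.org/page",  # only present for web-sourced instances
    "harmful": False,                   # harmful/toxic content indicator
    "quality_flags": ["very_short"],    # preliminary undesired-quality indicators
}
```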
 
 **Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.**
 
@@ -544,12 +542,12 @@ Instances are related through shared metadata, such as source and language identifiers.
 
 **Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.**
 
- The dataset is split randomly into training, validation, and test sets.
+ The dataset is randomly divided into training, validation and test sets, where the validation and test sets are each 1% of the total corpus.
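
A minimal sketch of the 98/1/1 random division described above; the seed and implementation details are assumptions, not the pipeline's actual code:

```python
import random

def split_corpus(doc_ids, val_frac=0.01, test_frac=0.01, seed=0):
    """Randomly divide documents into train/validation/test,
    with validation and test each taking ~1% of the corpus."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    return {
        "validation": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }

splits = split_corpus(range(100_000))
assert len(splits["validation"]) == len(splits["test"]) == 1_000
```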
 
 **Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.**
 
- Despite removing duplicated instances within each source, redundancy remains at the paragraph and sentence levels, particularly in
- web-sourced instances where SEO techniques and templates contribute to repeated textual patterns. Some instances may also be duplicated
+ Despite removing duplicated instances within each source, redundancy remains at the paragraph and sentence levels, particularly in web-sourced
+ instances where search engine optimization techniques and templates contribute to repeated textual patterns. Some instances may also be duplicated
 across sources due to format variations.
 
 **Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.**
@@ -573,10 +571,10 @@ The dataset does not explicitly identify any subpopulations.
 
 **Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.**
 
- Web-sourced instances in the dataset may contain personally identifiable information (PII) that is publicly available on the Web, such as
- names, IP addresses, email addresses, and phone numbers. While it would be possible to indirectly identify individuals through the
- combination of multiple data points, the nature and scale of web data makes it difficult to parse such information. In any case, efforts are
- made to filter or anonymize sensitive data during pre-processing, but some identifiable information may remain in the dataset.
+ Web-sourced instances in the dataset may contain personally identifiable information (PII) that is publicly available on the Web, such as names,
+ IP addresses, email addresses, and phone numbers. While it would be possible to indirectly identify individuals through the combination of multiple
+ data points, the nature and scale of web data make it difficult to parse such information. In any case, efforts are made to filter or anonymize
+ sensitive data (Mina et al., 2024), but some identifiable information may remain in the dataset.
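
A rough sketch of the kind of rule-based scrubbing such pre-processing might apply; the patterns and placeholder tokens below are illustrative assumptions, not the method of Mina et al. (2024):

```python
import re

# Illustrative patterns only; real PII detection is considerably more involved.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def anonymize(text: str) -> str:
    """Replace e-mail addresses and IPv4 addresses with placeholder tokens."""
    return IPV4.sub("[IP]", EMAIL.sub("[EMAIL]", text))

print(anonymize("Contact admin@example.org from 192.168.0.1"))
# -> Contact [EMAIL] from [IP]
```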
 
 **Does the dataset contain data that might be considered sensitive in any way? If so, please provide a description.**
 
@@ -590,28 +588,27 @@ especially if the content originates from less-regulated sources or user-generated platforms.
 
 This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
 - Web-sourced datasets with some preprocessing available under permissive license.
- - Domain-specific or language-specific raw crawls, always respecting robots.txt.
- - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
- (p.e. CATalog).
+ - Domain-specific or language-specific raw crawls.
+ - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects (e.g. CATalog).
 
 **What mechanisms or procedures were used to collect the data? How were these mechanisms or procedures validated?**
 
- According to the three groups previously defined, these are the mechanisms used in each of them:
- - Open direct download. Validation: data integrity tests.
- - Ad-hoc scrapers or crawlers. Validation: software unit and data integrity tests.
- - Direct download via FTP, SFTP, API or S3. Validation: data integrity tests.
+ The data collection process was carried out using three different mechanisms, each corresponding to one of the groups defined in the previous answer. The specific methods used and their respective validation procedures are outlined below:
+ - Open direct download: Data were obtained directly from publicly accessible sources, such as websites or repositories that provide open data downloads. We validate the data with a data integrity check, which ensures that the downloaded files are complete, uncorrupted, and in the expected format and structure.
+ - Ad hoc scrapers or crawlers: Custom web scraping scripts or crawlers were used to extract data from various online sources where direct downloads were not available. These scripts navigate web pages, extract relevant data and store it in a structured format. We validate this method with software unit tests, which evaluate the functionality of individual components of the scraping programs and check for errors or unexpected behaviour. In addition, data integrity tests were performed to verify that the collected data remained complete throughout the extraction and storage process.
+ - Direct download via FTP, SFTP, API or S3: Some datasets were acquired using secure transfer protocols such as FTP (File Transfer Protocol), SFTP (Secure File Transfer Protocol), or API (Application Programming Interface) requests from cloud storage services such as Amazon S3. As with the open direct download method, data integrity tests were used to validate the completeness of the files and to ensure that the files were not altered or corrupted during the transfer process.
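
For illustration, the data integrity tests mentioned above could look roughly like the following checksum-based sketch; the function names and hash choice are assumptions, not the team's actual tooling:

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_download(path: Path, expected_sha256: str) -> bool:
    """Minimal integrity test: the file exists, is non-empty, and matches its checksum."""
    return path.is_file() and path.stat().st_size > 0 and sha256sum(path) == expected_sha256
```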
 
 **If the dataset is a sample from a larger set, what was the sampling strategy?**
 
- The sampling strategy was to use the whole dataset resulting from the filtering explained in the preprocessing/cleaning/labelling section,
- with the particularity that an upsampling of 2 (i.e. twice the probability of sampling a document) was performed for the co-official
- languages of Spain (Spanish, Catalan, Galician, Basque), and a downsampling of 1/2 was applied for code (half the probability of sampling a
- code document, evenly distributed among all programming languages).
+ The sampling strategy was to use the whole dataset resulting from the filtering explained in the 'preprocessing/cleaning/labelling' section,
+ with the particularity that an upsampling of 2 (i.e. twice the probability of sampling a document) was performed for the co-official languages
+ of Spain (Spanish, Catalan, Galician, Basque), and a downsampling of 1/2 was applied for code (half the probability of sampling a code document,
+ evenly distributed among all programming languages).
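
In sampling-weight terms, the strategy above amounts to multiplying a document's base sampling probability by 2 for the four co-official languages and by 1/2 for code. A minimal sketch under those assumptions (the function and language codes are hypothetical):

```python
# Hypothetical weighting sketch: every document starts from a base weight of 1.0,
# the co-official languages of Spain are upsampled x2, and code is downsampled x1/2,
# with the code share spread evenly across all programming languages.
UPSAMPLED_LANGS = {"es", "ca", "gl", "eu"}  # Spanish, Catalan, Galician, Basque

def sampling_weight(lang: str, is_code: bool) -> float:
    if is_code:
        return 0.5  # half the probability of sampling a code document
    if lang in UPSAMPLED_LANGS:
        return 2.0  # twice the probability for co-official languages of Spain
    return 1.0      # all other languages keep their natural proportion

assert sampling_weight("ca", is_code=False) == 2.0
assert sampling_weight("en", is_code=False) == 1.0
assert sampling_weight("python", is_code=True) == 0.5
```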
 
 **Who was involved in the data collection process and how were they compensated?**
 
- This data is generally extracted, filtered and sampled by automated processes. The code required to run these processes has been developed
- entirely by members of the LangTech data team, or otherwise obtained from open-source software. Furthermore, there has been no monetary
+ This data is generally extracted, filtered and sampled by automated processes. The code required to run these processes has been developed entirely
+ by members of the Language Technologies data team, or otherwise obtained from open-source software. Furthermore, there has been no monetary
 consideration for acquiring data from suppliers.
 
 **Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances? If not, please describe the timeframe in which the data associated with the instances was created.**
@@ -630,12 +627,9 @@ ethical and legal point of view, respectively.
 
 **Was any preprocessing/cleaning/labeling of the data done? If so, please provide a description. If not, you may skip the remaining questions in this section.**
 
- Instances of text documents were not altered, but web-sourced documents were filtered based on specific criteria along two dimensions:
- - Quality: documents with a score lower than 0.8, based on undesired qualities, such as documents with low number of lines, very short
- sentences, presence of long footers and headers, and high percentage of punctuation, obtained through CURATE (Palomar-Giner et al., 2024)
- were filtered out.
- - Harmful or adult content: documents originating from Colossal OSCAR were filtered using LLM-Datasets (Ostendorff et al., 2024) based on
- the perplexity from a language model (‘harmful_pp’ field) provided by the Ungoliant pipeline (Abadji et al., 2021).
+ No changes were made to the content of individual text document instances. However, the web-sourced documents underwent a filtering process based on specific criteria along two key dimensions:
+ - Quality filtering: The text processing pipeline CURATE (Palomar-Giner et al., 2024) calculates a quality score for each document based on a set of filtering criteria that identify undesirable textual characteristics. Any document with a score below the 0.8 threshold was excluded from the dataset.
+ - Harmful or adult content filtering: To reduce the amount of harmful or inappropriate material in the dataset, documents from Colossal OSCAR were filtered using the Ungoliant pipeline (Abadji et al., 2021), which provides the 'harmful\_pp' field, a perplexity-based score generated by a language model.
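
Put together as code, the two filtering dimensions described above might look like the following sketch; the record fields, the source tag and the perplexity cut-off are assumptions (the datasheet only fixes the 0.8 quality threshold):

```python
QUALITY_THRESHOLD = 0.8  # documents scoring below this CURATE score are dropped

def keep_document(doc: dict, harmful_pp_cutoff: float = 1000.0) -> bool:
    """Apply the two filtering dimensions: quality and harmful/adult content."""
    # Quality filtering: CURATE-style score, assumed stored as 'quality_score'.
    if doc["quality_score"] < QUALITY_THRESHOLD:
        return False
    # Harmful-content filtering applies only to Colossal OSCAR documents; the
    # 'harmful_pp' field is a perplexity-based score from the Ungoliant pipeline,
    # and the cut-off value and comparison direction here are purely illustrative.
    if doc.get("source") == "colossal-oscar" and doc.get("harmful_pp", float("inf")) < harmful_pp_cutoff:
        return False
    return True

assert keep_document({"quality_score": 0.9})
assert not keep_document({"quality_score": 0.5})
```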
 
 **Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data? If so, please provide a link or other access point to the “raw” data.**
 
@@ -694,11 +688,10 @@ The dataset will not be updated.
 
 **If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances? If so, please describe these limits and explain how they will be enforced.**
 
- The dataset does not keep sensitive data that could allow direct identification of individuals, apart from the data that is publicly
- available in web-sourced content. Due to the sheer volume and diversity of web data, it is not feasible to notify individuals or manage data
- retention on an individual basis. However, efforts are made to mitigate the risks associated with sensitive information through
- pre-processing and filtering to remove identifiable or harmful content. Despite these measures, vigilance is maintained to address potential
- privacy and ethical issues.
+ The dataset does not keep sensitive data that could allow direct identification of individuals, apart from the data that is publicly available in
+ web-sourced content. Due to the sheer volume and diversity of web data, it is not feasible to notify individuals or manage data retention on an
+ individual basis. However, efforts are made to mitigate the risks associated with sensitive information through pre-processing and filtering to
+ remove identifiable or harmful content. Despite these measures, vigilance is maintained to address potential privacy and ethical issues.
 
 **Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.**
 