Update README.md
README.md
CHANGED
@@ -505,8 +505,8 @@ especially if the content originates from less-regulated sources or user-generat
 **How was the data collected?**
 
 This dataset is constituted by combining several sources, whose acquisition methods can be classified into three groups:
-- Web-sourced datasets with some preprocessing available under permissive license
-- Domain-specific or language-specific raw crawls, always respecting robots.txt
+- Web-sourced datasets with some preprocessing available under permissive license.
+- Domain-specific or language-specific raw crawls, always respecting robots.txt.
 - Manually curated data obtained through collaborators, data providers (by means of legal assignment agreements) or open source projects
 (p.e. CATalog).
 
@@ -559,7 +559,7 @@ The original raw data was not kept.
 
 **Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.**
 
-Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for
+Yes, the preprocessing and filtering software is open-sourced. The [CURATE](https://github.com/langtech-bsc/CURATE) pipeline was used for CATalog and other curated sources,
 and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was used for the OSCAR project.
 
 #### Uses
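
For readers unfamiliar with what "respecting robots.txt" in the first hunk involves in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not taken from CURATE, Ungoliant, or the dataset's actual crawlers; the helper name `allowed_by_robots`, the `example-crawler` user agent, and the sample rules are all hypothetical and only illustrate the kind of check a crawler might run before fetching a URL.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration; a real crawler would download
    robots.txt from the target site instead of receiving it as a string.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: every crawler is barred from /private/ and allowed elsewhere.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in (
        "https://example.org/public/page.html",
        "https://example.org/private/page.html",
    ):
        verdict = allowed_by_robots(EXAMPLE_ROBOTS, "example-crawler", url)
        print(url, "allowed" if verdict else "disallowed")
```

A crawler following this pattern would skip any URL for which the check returns `False`.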