jsaizant committed on
Commit 6b61acb · verified · 1 Parent(s): c8df254

Update README.md

Files changed (1)
  1. README.md +2 -3
README.md CHANGED
@@ -203,7 +203,7 @@ and the rest of the languages were kept as is, resulting in the following distri
  This highly multilingual corpus is predominantly composed of data from Colossal OSCAR,
  which contributes a significant 66.06% of the total tokens.
  Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%.
- The next largest sources are French FR at 3.12% and Proof Pile at 1.98%.
+ The next largest sources are French PD at 3.12% and Proof Pile at 1.98%.
  Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing around 1.5% to 1.3%.
  These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
  The remaining 10% comes from smaller sources in various languages.
@@ -217,7 +217,6 @@ Feel free to click the expand button below to see the full list of sources.
  |-----------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
  | Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
  | Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
- | Crawl of Bulgarian news websites | bg | [Link](http://old.dcl.bas.bg/dataset/Bulgarian_news.7z) |
  | Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
  | Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
  | OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
@@ -247,7 +246,7 @@ Feel free to click the expand button below to see the full list of sources.
  | proof-pile | en | [Link](https://huggingface.co/datasets/hoskinson-center/proof-pile) |
  | RedPajama-Data T1 (StackExchange subset) | en | Computer, 2023 |
  | The Pile (PhilPapers subset) | en | Gao et al., 2021 |
- | Biomedical | es | Internally generated scientific dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
+ | Biomedical | es | Internally generated biomedical dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
  | HPLTDatasets v1 - Spanish | es | de Gibert et al., 2024 |
  | Legal | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
  | Scientific | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |