jsaizant committed on
Commit e383f52 · verified · 1 Parent(s): 42e3c75

Update README.md

Files changed (1)
  1. README.md +71 -69
README.md CHANGED
@@ -213,75 +213,77 @@ Feel free to click the expand button below to see the full list of sources.
213
  <details>
214
  <summary>Data Sources</summary>
215
 
216
- | Dataset | Language | Source |
217
- |-----------------------------------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
218
- | Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
219
- | Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
220
- | Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
221
- | Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
222
- | OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
223
- | MaCoCu web corpus | bg, ca, el, hr, mt, sl, sr, uk | Bañón et al., 2022 |
224
- | EurLEX-Resources | bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelniklaus/eurlex_resources) |
225
- | MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4) |
226
- | CURLICAT Corpus | bg, hr, hu, pl, ro, sk, sl | Váradi et al., 2022 |
227
- | CATalog | ca | Palomar-Giner et al., 2024 |
228
- | Spanish Crawling | ca, es, eu, gl | Relevant Spanish websites crawling |
229
- | Starcoder | code | Li et al., 2023 |
230
- | SYN v9: large corpus of written Czech | cs | Křen et al., 2021 |
231
- | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
232
- | DaNewsroom | da | Varab & Schluter, 2020 |
233
- | Danish GigaWord | da | Strømberg-Derczynski et al., 2021 |
234
- | DK-CLARIN Reference Corpus of General Danish | da | [Link](https://korpus.dsl.dk/clarin/) |
235
- | The Danish Parliament Corpus 2009 - 2017, v1 | da | Hansen, 2018 |
236
- | DeWaC | de | [Link](https://docs.sslmit.unibo.it/doku.php?id=corpora:dewac) |
237
- | Open Legal Data - German court decisions and laws | de | Ostendorff et al., 2020 |
238
- | Greek Legal Code | el | Papaloukas et al., 2021 |
239
- | Greek Web Corpus | el | Outsios et al., 2018 |
240
- | Auxiliary Mathematics Problems and Solutions (AMPS) dataset | en | Hendrycks et al., 2021 |
241
- | BIGPATENT | en | Sharma et al., 2019 |
242
- | FineWeb-Edu (350BT subset) | en | Penedo et al., 2024 |
243
- | peS2o | en | Soldaini & Lo, 2023 |
244
- | PG-19 | en | Rae et al., 2019 |
245
- | Pile of Law (selected subsets) | en | Henderson* et al., 2022 |
246
- | proof-pile | en | [Link](https://huggingface.co/datasets/hoskinson-center/proof-pile) |
247
- | RedPajama-Data T1 (StackExchange subset) | en | Computer, 2023 |
248
- | The Pile (PhilPapers subset) | en | Gao et al., 2021 |
249
- | Biomedical | es | Internally generated biomedical dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
250
- | HPLTDatasets v1 - Spanish | es | de Gibert et al., 2024 |
251
- | Legal | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
252
- | Scientific | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |
253
- | Spanish Legal Domain Corpora | es | Gutiérrez-Fandiño et al., 2021 |
254
- | Estonian National Corpus 2021 | et | Koppel & Kallas, 2022 |
255
- | Estonian Reference Corpus | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
256
- | EusCrawl (w/o Wikipedia or NC-licenses) | eu | Artetxe et al., 2022 |
257
- | Latxa Corpus v1.1 | eu | Etxaniz et al., 2024 [Link](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1) |
258
- | Aya Dataset (w/o Evaluation Suite) | eu, hr, nl, fi, ka, hu, lt, nn, ro, sk, lv, cy, bg, cs, en, fr, de, ga, mt, pl, ru, sl, sv, ca, da, et, gl, el, it, no, pt, sr, es, uk | Singh et al., 2024 |
259
- | Yle Finnish News Archive | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
260
- | CaBeRnet: a New French Balanced Reference Corpus | fr | Popa-Fabre et al., 2020 |
261
- | French Public Domain Books | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Books) |
262
- | French Public Domain Newspapers | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
263
- | Irish Universal Dependencies | ga | [Link](https://universaldependencies.org/ga/index.html) |
264
- | The Gaois bilingual corpus of English-Irish legislation (Irish legislation) | ga | [Link](https://portulanclarin.net/repository/browse/the-gaois-bilingual-corpus-of-english-irish-legislation-processed/daeac17c9e3511ea9b7f02420a000407b83de243dc0b469aab41084386c5b80f/) |
265
- | CorpusNÓS | gl | de-Dios-Flores et al., 2024 |
266
- | Croatian web corpus hrWaC 2.1 | hr | Ljubešić & Klubička, 2014 |
267
- | ITWaC | it | [Link](https://docs.sslmit.unibo.it/doku.php?id=corpora:itwac) |
268
- | Corpus of State-related content from the Latvian Web (Processed) | lv | [Link](https://catalog.elra.info/en-us/repository/browse/ELRA-W0169/) |
269
- | Korpus Malti | mt | Micallef et al., 2022 |
270
- | SoNaR Corpus NC 1.2 | nl | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/) |
271
- | Norwegian Colossal Corpus | nn, no | Kummervold et al., 2021 |
272
- | Occitan Corpus | oc | Provided by [IEA](https://www.institutestudisaranesi.cat/) |
273
- | NKJP-PodkorpusMilionowy-1.2 (National Corpus of Polish) | pl | Lewandowska-Tomaszczyk et al., 2013 |
274
- | Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego | pl | Ogrodniczuk, 2018 |
275
- | Brazilian Portuguese Web as Corpus | pt | Wagner Filho et al., 2018 |
276
- | ParlamentoPT | pt | Rodrigues et al., 2023 |
277
- | MARCELL Romanian legislative subcorpus v2 | ro | [Link](https://elrc-share.eu/reposMARCELL%20Romanian%20legislative%20subcorpus%20v2itory/browse/marcell-romanian-legislative-subcorpus-v2/2da548428b9d11eb9c1a00155d026706ce94a6b59ffc4b0e9fb5cd9cebe6889e/) |
278
- | Korpus slovenských právnych predpisov v1.9 | sk | [Link](https://www.juls.savba.sk/data/marcell/legal-sk-20220322-1.9.ver.xz) |
279
- | od-justice 2.0 | sk | [Link](https://www.juls.savba.sk/data/od-justice/od-justice-2.0.ver.xz) |
280
- | Corpus of academic Slovene KAS 2.0 | sl | Žagar et al., 2022 |
281
- | slWaC web corpus | sl | Erjavec et al., 2015 |
282
- | SrpKorSubset (news, legal, academic, conversation, literary) | sr | [Link](http://www.korpus.matf.bg.ac.rs/) |
283
- | The Swedish Culturomics Gigaword Corpus | sv | Rødven-Eide, 2016 |
284
- | Corpus of laws and legal acts of Ukraine | uk | [Link](https://lang.org.ua/en/corpora/#anchor7) |
 
 
285
 
286
  To consult the data summary document with the respective licences, please send an e-mail to ipr@bsc.es.
287
 
 
213
  <details>
214
  <summary>Data Sources</summary>
215
 
216
+
217
+ | Dataset | Language | Source |
218
+ |---|---|---|
219
+ | Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
220
+ | Aya Dataset (w/o Evaluation Suite) | eu, hr, nl, fi, ka, hu, lt, nn, ro, sk, lv, cy, bg, cs, en, fr, de, ga, mt, pl, ru, sl, sv, ca, da, et, gl, el, it, no, pt, sr, es, uk | Singh et al., 2024 |
221
+ | Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
222
+ | OpenSubtitles v2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
223
+ | EurLEX-Resources | bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelniklaus/eurlex_resources) |
224
+ | MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4) |
225
+ | Parlamint | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
226
+ | MaCoCu | bg, ca, el, hr, mt, sl, sr, uk | Bañón et al., 2022 |
227
+ | CURLICAT | bg, hr, hu, pl, ro, sk, sl | Váradi et al., 2022 |
228
+ | Norwegian Colossal Corpus (NCC) | nn, no | Kummervold et al., 2021 |
229
+ | Academic Slovene KAS 2.0 | sl | Žagar et al., 2022 |
230
+ | BIGPATENT | en | Sharma et al., 2019 |
231
+ | Biomedical-ES | es | Internally generated biomedical dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
232
+ | Brazilian Portuguese Web as Corpus (BrWaC) | pt | Wagner Filho et al., 2018 |
233
+ | Bulgarian National Corpus (BulNC) | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
234
+ | CaBeRnet | fr | Popa-Fabre et al., 2020 |
235
+ | CATalog 1.0 | ca | Palomar-Giner et al., 2024 |
236
+ | CorpusNÓS | gl | de-Dios-Flores et al., 2024 |
237
+ | Croatian Web as Corpus 2.1 (hrWaC) | hr | Ljubešić & Klubička, 2014 |
238
+ | DaNewsroom | da | Varab & Schluter, 2020 |
239
+ | Danish GigaWord | da | Strømberg-Derczynski et al., 2021 |
240
+ | DK-CLARIN Reference Corpus of General Danish | da | [Link](https://korpus.dsl.dk/clarin/) |
241
+ | Estonian National Corpus 2021 (ENC) | et | Koppel & Kallas, 2022 |
242
+ | Estonian Reference Corpus (ERC) | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
243
+ | EusCrawl (w/o Wikipedia or NC-licenses) | eu | Artetxe et al., 2022 |
244
+ | FineWeb-Edu (350BT subset) | en | Penedo et al., 2024 |
245
+ | French Public Domain Books (French-PD) | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Books) |
246
+ | French Public Domain Newspapers (French-PD) | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
247
+ | German Web as Corpus (DeWaC) | de | [Link](https://docs.sslmit.unibo.it/doku.php?id=corpora:dewac) |
248
+ | Greek Legal Code (GLC) | el | Papaloukas et al., 2021 |
249
+ | Greek Web Corpus (GWC) | el | Outsios et al., 2018 |
250
+ | HPLT v1 - Spanish | es | de Gibert et al., 2024 |
251
+ | HPLT v1.1 - Spanish | es | de Gibert et al., 2024 |
252
+ | Irish Universal Dependencies (Ga-UD) | ga | [Link](https://universaldependencies.org/ga/index.html) |
253
+ | Italian Web as Corpus (ItWaC) | it | [Link](https://docs.sslmit.unibo.it/doku.php?id=corpora:itwac) |
254
+ | Korpus Malti | mt | Micallef et al., 2022 |
255
+ | Korpus slovenských právnych predpisov v1.9 (SK-Laws) | sk | [Link](https://www.juls.savba.sk/data/marcell/legal-sk-20220322-1.9.ver.xz) |
256
+ | Latxa Corpus v1.1 (GAITU) | eu | Etxaniz et al., 2024 [Link](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1) |
257
+ | Laws and legal acts of Ukraine (UK-Laws) | uk | [Link](https://lang.org.ua/en/corpora/#anchor7) |
258
+ | Legal-ES | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
259
+ | MARCELL Romanian legislative subcorpus v2 | ro | [Link](https://elrc-share.eu/reposMARCELL%20Romanian%20legislative%20subcorpus%20v2itory/browse/marcell-romanian-legislative-subcorpus-v2/2da548428b9d11eb9c1a00155d026706ce94a6b59ffc4b0e9fb5cd9cebe6889e/) |
260
+ | Math AMPS | en | Hendrycks et al., 2021 |
261
+ | NKJP National Corpus of Polish v1.2 (NKJP) | pl | Lewandowska-Tomaszczyk et al., 2013 |
262
+ | Occitan Corpus (IEA-AALO) | oc | Provided by [IEA](https://www.institutestudisaranesi.cat/) |
263
+ | Open Legal Data - German court decisions and laws | de | Ostendorff et al., 2020 |
264
+ | ParlamentoPT | pt | Rodrigues et al., 2023 |
265
+ | peS2o | en | Soldaini & Lo, 2023 |
266
+ | PG-19 | en | Rae et al., 2019 |
267
+ | Pile of Law (selected subsets) | en | Henderson* et al., 2022 |
268
+ | Polish Parliamentary Corpus (PPC) | pl | Ogrodniczuk, 2018 |
269
+ | Proof Pile | en | [Link](https://huggingface.co/datasets/hoskinson-center/proof-pile) |
270
+ | RedPajama-Data T1 (StackExchange subset) | en | Computer, 2023 |
271
+ | Scientific-ES | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |
272
+ | SK Court Decisions v2.0 (OD-Justice) | sk | [Link](https://www.juls.savba.sk/data/od-justice/od-justice-2.0.ver.xz) |
273
+ | Slovene Web as Corpus (slWaC) | sl | Erjavec et al., 2015 |
274
+ | SoNaR Corpus NC 1.2 | nl | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/) |
275
+ | Spanish Legal Domain Corpora (Spanish-Legal) | es | Gutiérrez-Fandiño et al., 2021 |
276
+ | SrpKorSubset: news, legal, academic, conversation, literary (SrpKor) | sr | [Link](http://www.korpus.matf.bg.ac.rs/) |
277
+ | Starcoder | code | Li et al., 2023 |
278
+ | State-related content from the Latvian Web (State-Latvian-Web) | lv | [Link](https://catalog.elra.info/en-us/repository/browse/ELRA-W0169/) |
279
+ | SYN v9: large corpus of written Czech | cs | Křen et al., 2021 |
280
+ | Tagesschau Archive Article | de | [Link](https://huggingface.co/datasets/bjoernp/tagesschau-2018-2023) |
281
+ | The Danish Parliament Corpus 2009 - 2017, v1 | da | Hansen, 2018 |
282
+ | The Gaois bilingual corpus of English-Irish legislation (Ga-Legislation) | ga | [Link](https://portulanclarin.net/repository/browse/the-gaois-bilingual-corpus-of-english-irish-legislation-processed/daeac17c9e3511ea9b7f02420a000407b83de243dc0b469aab41084386c5b80f/) |
283
+ | The Pile (PhilPapers) | en | Gao et al., 2021 |
284
+ | The Swedish Culturomics Gigaword Corpus (Swedish-Gigaword) | sv | Rødven-Eide, 2016 |
285
+ | Welsh-GOV | cy | Crawling from [Link](https://www.llyw.cymru) |
286
+ | Yle Finnish News Archive (Yle-News) | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
287
 
288
  To consult the data summary document with the respective licences, please send an e-mail to ipr@bsc.es.
289