Update README.md
README.md
CHANGED
@@ -211,74 +211,76 @@ Feel free to click the expand button below to see the full list of sources.
|
|
211 |
<details>
|
212 |
<summary>Data Sources</summary>
|
213 |
|
214 |
-
| Dataset
|
215 |
-
|
216 |
-
|
|
217 |
-
|
|
218 |
-
|
|
219 |
-
|
|
220 |
-
|
|
221 |
-
|
|
222 |
-
|
|
223 |
-
|
|
224 |
-
|
|
225 |
-
|
|
226 |
-
|
|
227 |
-
|
|
228 |
-
|
|
229 |
-
|
|
230 |
-
|
|
231 |
-
|
|
232 |
-
|
|
233 |
-
| Danish GigaWord
|
234 |
-
|
|
235 |
-
|
|
236 |
-
|
|
237 |
-
|
|
238 |
-
|
|
239 |
-
|
|
240 |
-
|
|
241 |
-
|
|
242 |
-
|
|
243 |
-
|
|
244 |
-
|
|
245 |
-
|
|
246 |
-
|
|
247 |
-
|
|
248 |
-
|
|
249 |
-
|
|
250 |
-
|
|
251 |
-
|
|
252 |
-
|
|
253 |
-
|
|
254 |
-
|
|
255 |
-
|
|
256 |
-
|
|
257 |
-
|
|
258 |
-
|
|
259 |
-
|
|
260 |
-
|
|
261 |
-
|
|
262 |
-
|
|
263 |
-
|
|
264 |
-
|
|
265 |
-
|
|
266 |
-
| Korpus
|
267 |
-
|
|
268 |
-
|
|
269 |
-
|
|
270 |
-
|
|
271 |
-
|
|
272 |
-
|
|
273 |
-
|
|
274 |
-
|
|
275 |
-
|
|
276 |
-
|
|
277 |
-
| Corpus
|
278 |
-
|
|
279 |
-
|
|
280 |
-
| The Swedish Culturomics Gigaword Corpus
|
281 |
-
|
|
|
|
|
|
282 |
|
283 |
<details>
|
284 |
<summary>References</summary>
|
@@ -330,6 +332,8 @@ Feel free to click the expand button below to see the full list of sources.
|
|
330 |
- Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The Woman Worked as a Babysitter: On Biases in Language Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China. Association for Computational Linguistics.
|
331 |
- Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., & Tafjord, O. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1.
|
332 |
- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
|
|
|
|
|
333 |
|
334 |
</details>
|
335 |
|
|
|
211 |
<details>
|
212 |
<summary>Data Sources</summary>
|
213 |
|
214 |
+
| Dataset | Language | Source |
|
215 |
+
|-----------------------------------------------------|------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
|
216 |
+
| Auxiliary Mathematics Problems and Solutions (AMPS) dataset | en | Hendrycks et al., 2021 |
|
217 |
+
| Aya Dataset (w/o Evaluation Suite) | eu, hr, nl, fi, ka, hu, lt, nn, ro, sk, lv, cy, bg, cs, en, fr, de, ga, mt, pl, ru, sl, sv, ca, da, et, gl, el, it, no, pt, sr, es, uk | Singh et al., 2024 |
|
218 |
+
| BIGPATENT | en | Sharma et al., 2019 |
|
219 |
+
| Biomedical | es | Internally generated scientific dataset: Dialnet, Scielo, CSIC, TDX, BSC, UCM |
|
220 |
+
| Brazilian Portuguese Web as Corpus | pt | Wagner Filho et al., 2018 |
|
221 |
+
| Bulgarian National Corpus | bg | [Link](http://old.dcl.bas.bg/dataset/BulNC.7z) |
|
222 |
+
| CaBeRnet: a New French Balanced Reference Corpus | fr | Popa-Fabre et al., 2020 |
|
223 |
+
| CATalog | ca | Palomar-Giner et al., 2024 |
|
224 |
+
| Colossal OSCAR 1.0 | bg, ca, cs, cy, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, oc, pl, pt, ro, ru, sh, sk, sl, sr, sv, uk | Brack et al., 2024 |
|
225 |
+
| Corpus of academic Slovene KAS 2.0 | sl | Žagar et al., 2022 |
|
226 |
+
| Corpus of laws and legal acts of Ukraine | uk | [Link](https://lang.org.ua/en/corpora/#anchor7) |
|
227 |
+
| Corpus of State-related content from the Latvian Web (Processed) | lv | [Link](https://catalog.elra.info/en-us/repository/browse/ELRA-W0169/) |
|
228 |
+
| CorpusNÓS | gl | de-Dios-Flores et al., 2024 |
|
229 |
+
| Crawl of Bulgarian news websites | bg | [Link](http://old.dcl.bas.bg/dataset/Bulgarian_news.7z) |
|
230 |
+
| Croatian web corpus hrWaC 2.1 | hr | Ljubešić & Klubička, 2014 |
|
231 |
+
| CURLICAT Corpus | bg, hr, hu, pl, ro, sk, sl | Váradi et al., 2022 |
|
232 |
+
| DaNewsroom | da | Varab & Schluter, 2020 |
|
233 |
+
| Danish GigaWord | da | Strømberg-Derczynski et al., 2021 |
|
234 |
+
| DeWaC | de | [Link](https://docs.sslmit.unibo.it/doku.php?id=corpora:dewac) |
|
235 |
+
| DK-CLARIN Reference Corpus of General Danish | da | [Link](https://korpus.dsl.dk/clarin/) |
|
236 |
+
| Estonian National Corpus 2021 | et | Koppel & Kallas, 2022 |
|
237 |
+
| Estonian Reference Corpus | et | [Link](https://www.cl.ut.ee/korpused/segakorpus/) |
|
238 |
+
| EurLEX-Resources | bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelniklaus/eurlex_resources) |
|
239 |
+
| EusCrawl (w/o Wikipedia or NC-licenses) | eu | Artetxe et al., 2022 |
|
240 |
+
| FineWeb-Edu (350BT subset) | en | Penedo et al., 2024 |
|
241 |
+
| French Public Domain Books | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Books) |
|
242 |
+
| French Public Domain Newspapers | fr | [Link](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers) |
|
243 |
+
| Greek Legal Code | el | Papaloukas et al., 2021 |
|
244 |
+
| Greek Web Corpus | el | Outsios et al., 2018 |
|
245 |
+
| HPLTDatasets v1 - Spanish | es | de Gibert et al., 2024 |
|
246 |
+
| Irish Universal Dependencies | ga | [Link](https://universaldependencies.org/ga/index.html) |
|
247 |
+
| ITWaC | it | [Link](https://docs.sslmit.unibo.it/doku.php?id=corpora:itwac) |
|
248 |
+
| Korpus Malti | mt | Micallef et al., 2022 |
|
249 |
+
| Korpus slovenských právnych predpisov v1.9 | sk | [Link](https://www.juls.savba.sk/data/marcell/legal-sk-20220322-1.9.ver.xz) |
|
250 |
+
| Latxa Corpus v1.1 | eu | Etxaniz et al., 2024 [Link](https://huggingface.co/datasets/HiTZ/latxa-corpus-v1.1) |
|
251 |
+
| Legal | es | Internally generated legal dataset: BOE, BORME, Senado, Congreso, Spanish court orders, DOGC |
|
252 |
+
| MaCoCu web corpus | bg, ca, el, hr, mt, sl, sr, uk | Bañón et al., 2022 |
|
253 |
+
| MARCELL Romanian legislative subcorpus v2 | ro | [Link](https://elrc-share.eu/repository/browse/marcell-romanian-legislative-subcorpus-v2/2da548428b9d11eb9c1a00155d026706ce94a6b59ffc4b0e9fb5cd9cebe6889e/) |
|
254 |
+
| MC4-Legal | bg, cs, da, de, el, en, es, et, fi, fr, ga, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv | [Link](https://huggingface.co/datasets/joelito/legal-mc4) |
|
255 |
+
| NKJP-PodkorpusMilionowy-1.2 (National Corpus of Polish) | pl | Lewandowska-Tomaszczyk et al., 2013 |
|
256 |
+
| Norwegian Colossal Corpus | nn, no | Kummervold et al., 2021 |
|
257 |
+
| Occitan Corpus | oc | Provided by [IEA](https://www.institutestudisaranesi.cat/) |
|
258 |
+
| od-justice 2.0 | sk | [Link](https://www.juls.savba.sk/data/od-justice/od-justice-2.0.ver.xz) |
|
259 |
+
| Open Legal Data - German court decisions and laws | de | Ostendorff et al., 2020 |
|
260 |
+
| OpenSubtitlesv2016 | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, gl, hr, it, lt, lv, nl, no, pl, pt, ro, sk, sl, sr, sv, uk | Lison & Tiedemann, 2016 |
|
261 |
+
| ParlamentoPT | pt | Rodrigues et al., 2023 |
|
262 |
+
| Parlamint corpus | at, bg, cz, dk, ee, es, es-ga, fi, fr, gb, gr, hr, hu, it, lv, nl, no, pl, pt, rs, se, si | Erjavec et al., 2021 |
|
263 |
+
| peS2o | en | Soldaini & Lo, 2023 |
|
264 |
+
| PG-19 | en | Rae et al., 2019 |
|
265 |
+
| Pile of Law (selected subsets) | en | Henderson et al., 2022 |
|
266 |
+
| Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego | pl | Ogrodniczuk, 2018 |
|
267 |
+
| proof-pile | en | [Link](https://huggingface.co/datasets/hoskinson-center/proof-pile) |
|
268 |
+
| RedPajama-Data T1 (StackExchange subset) | en | Together Computer, 2023 |
|
269 |
+
| Scientific | es | Internally generated scientific dataset: Wikipedia LS, Pubmed, MeSpEn, patents, clinical cases, medical crawler |
|
270 |
+
| slWaC web corpus | sl | Erjavec et al., 2015 |
|
271 |
+
| SoNaR Corpus NC 1.2 | nl | [Link](https://taalmaterialen.ivdnt.org/download/tstc-sonar-corpus/) |
|
272 |
+
| Spanish Crawling | ca, es, eu, gl | Crawl of relevant Spanish websites |
|
273 |
+
| Spanish Legal Domain Corpora | es | Gutiérrez-Fandiño et al., 2021 |
|
274 |
+
| SrpKorSubset (news, legal, academic, conversation, literary) | sr | [Link](http://www.korpus.matf.bg.ac.rs/) |
|
275 |
+
| Starcoder | code | Li et al., 2023 |
|
276 |
+
| SYN v9: large corpus of written Czech | cs | Křen et al., 2021 |
|
277 |
+
| The Danish Parliament Corpus 2009 - 2017, v1 | da | Hansen, 2018 |
|
278 |
+
| The Gaois bilingual corpus of English-Irish legislation (Irish legislation) | ga | [Link](https://portulanclarin.net/repository/browse/the-gaois-bilingual-corpus-of-english-irish-legislation-processed/daeac17c9e3511ea9b7f02420a000407b83de243dc0b469aab41084386c5b80f/) |
|
279 |
+
| The Pile (PhilPapers subset) | en | Gao et al., 2021 |
|
280 |
+
| The Swedish Culturomics Gigaword Corpus | sv | Rødven-Eide, 2016 |
|
281 |
+
| Welsh-GOV | cy | Crawled from [Link](https://www.llyw.cymru) |
|
282 |
+
| Wikimedia dumps | bg, ca, cs, da, de, el, en, es, et, eu, fi, fr, ga, gl, hr, hu, it, lt, lv, mt, nl, nn, no, pl, pt, ro, sh, sk, sl, sr, uk | [Link](https://dumps.wikimedia.org/) |
|
283 |
+
| Yle Finnish News Archive | fi | [Link](http://urn.fi/urn:nbn:fi:lb-2021050401) |
|
284 |
|
285 |
<details>
|
286 |
<summary>References</summary>
|
|
|
332 |
- Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The Woman Worked as a Babysitter: On Biases in Language Generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3407–3412, Hong Kong, China. Association for Computational Linguistics.
|
333 |
- Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., & Tafjord, O. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1.
|
334 |
- Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.
|
335 |
+
- Penedo, G., Kydlíček, H., Allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., & Wolf, T. (2024). The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale (arXiv:2406.17557). arXiv. http://arxiv.org/abs/2406.17557
|
336 |
+
- Singh, S., Vargus, F., Dsouza, D., Karlsson, B. F., Mahendiran, A., Ko, W.-Y., Shandilya, H., Patel, J., Mataciunas, D., OMahony, L., Zhang, M., Hettiarachchi, R., Wilson, J., Machado, M., Moura, L. S., Krzemiński, D., Fadaei, H., Ergün, I., Okoh, I., … Hooker, S. (2024). Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning (arXiv:2402.06619). arXiv. http://arxiv.org/abs/2402.06619
|
337 |
|
338 |
</details>
|
339 |
|