guipenedo committed decf500 (1 parent: f3b0151)

add missing citation

dist/bibliography.bib CHANGED
@@ -323,4 +323,12 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   eprint={2009.03300},
   archivePrefix={arXiv},
   primaryClass={cs.CY}
+}
+@misc{mitchell2023measuring,
+  title={Measuring Data},
+  author={Margaret Mitchell and Alexandra Sasha Luccioni and Nathan Lambert and Marissa Gerchick and Angelina McMillan-Major and Ezinwanne Ozoani and Nazneen Rajani and Tristan Thrush and Yacine Jernite and Douwe Kiela},
+  year={2023},
+  eprint={2212.05129},
+  archivePrefix={arXiv},
+  primaryClass={cs.AI}
 }
dist/index.html CHANGED
@@ -126,7 +126,7 @@
 
   <h3>What is good data?</h3>
   <p>This is probably the main question to keep in mind when
-  creating a dataset. In most contexts and, in particular, in the context of large language model pretraining <d-footnote>Note that this report is focused on the special field of web-scale datasets ("web-scale" typically meaning >100 billion tokens obtained from the web) used to pretrain a Large Language Model (by pretraining we mean the very first step in the training of a model, starting from random weights). We don't pretend to cover any other field of dataset creation nor that the lessons or hypothesis we develop in this document can extend to any field besides this specific field.</d-footnote>, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
+  creating a dataset. In most contexts and, in particular, in the context of large language model pretraining <d-footnote>Note that this report is focused on the special field of web-scale datasets ("web-scale" typically meaning >100 billion tokens obtained from the web) used to pretrain a Large Language Model (by pretraining we mean the very first step in the training of a model, starting from random weights). We don't pretend to cover any other field of dataset creation nor that the lessons or hypothesis we develop in this document can extend to any field besides this specific field.</d-footnote>, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite><d-cite bibtex-key="mitchell2023measuring"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
   <p>It is still common to train a model on a given corpus considered "clean"
   (typically wikipedia<d-footnote>Even though as we mentioned above the notion of "clean" is so ill-defined that it should probably not been seen as equivalent to wikipedia-type of text</d-footnote>) and use it to check the perplexity on the dataset
   that we were trying to curate<d-cite bibtex-key="wenzek2019ccnet"></d-cite>. Unfortunately this does not always correlate with improved performance on a set of downstream
src/bibliography.bib CHANGED
@@ -323,4 +323,12 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   eprint={2009.03300},
   archivePrefix={arXiv},
   primaryClass={cs.CY}
+}
+@misc{mitchell2023measuring,
+  title={Measuring Data},
+  author={Margaret Mitchell and Alexandra Sasha Luccioni and Nathan Lambert and Marissa Gerchick and Angelina McMillan-Major and Ezinwanne Ozoani and Nazneen Rajani and Tristan Thrush and Yacine Jernite and Douwe Kiela},
+  year={2023},
+  eprint={2212.05129},
+  archivePrefix={arXiv},
+  primaryClass={cs.AI}
 }
src/index.html CHANGED
@@ -126,7 +126,7 @@
 
   <h3>What is good data?</h3>
   <p>This is probably the main question to keep in mind when
-  creating a dataset. In most contexts and, in particular, in the context of large language model pretraining <d-footnote>Note that this report is focused on the special field of web-scale datasets ("web-scale" typically meaning >100 billion tokens obtained from the web) used to pretrain a Large Language Model (by pretraining we mean the very first step in the training of a model, starting from random weights). We don't pretend to cover any other field of dataset creation nor that the lessons or hypothesis we develop in this document can extend to any field besides this specific field.</d-footnote>, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
+  creating a dataset. In most contexts and, in particular, in the context of large language model pretraining <d-footnote>Note that this report is focused on the special field of web-scale datasets ("web-scale" typically meaning >100 billion tokens obtained from the web) used to pretrain a Large Language Model (by pretraining we mean the very first step in the training of a model, starting from random weights). We don't pretend to cover any other field of dataset creation nor that the lessons or hypothesis we develop in this document can extend to any field besides this specific field.</d-footnote>, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite><d-cite bibtex-key="mitchell2023measuring"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
   <p>It is still common to train a model on a given corpus considered "clean"
   (typically wikipedia<d-footnote>Even though as we mentioned above the notion of "clean" is so ill-defined that it should probably not been seen as equivalent to wikipedia-type of text</d-footnote>) and use it to check the perplexity on the dataset
   that we were trying to curate<d-cite bibtex-key="wenzek2019ccnet"></d-cite>. Unfortunately this does not always correlate with improved performance on a set of downstream
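
Note: the paragraph touched by this commit mentions the common practice (as in CCNet, the wenzek2019ccnet citation) of scoring candidate documents with a language model trained on a "clean" corpus such as Wikipedia and using perplexity as a quality proxy. The following is a minimal sketch of that idea using the KenLM Python bindings, not code from this commit or the FineWeb pipeline; the model path and threshold are hypothetical placeholders.

# Minimal sketch of a CCNet-style perplexity check: score documents with an
# n-gram LM trained on a "clean" corpus (e.g. Wikipedia). The model path and
# the threshold are hypothetical placeholders, not part of this repository.
import kenlm  # pip install kenlm

# Hypothetical pretrained KenLM model built from Wikipedia text.
model = kenlm.Model("wikipedia.5gram.binary")

def wiki_perplexity(text: str) -> float:
    """Per-word perplexity of `text` under the Wikipedia n-gram model."""
    return model.perplexity(text)

def passes_quality_check(text: str, threshold: float = 1000.0) -> bool:
    """Keep documents whose perplexity stays below an arbitrary threshold."""
    return wiki_perplexity(text) < threshold

docs = [
    "The mitochondrion is an organelle found in most eukaryotic cells.",
    "CLICK here !!! free $$$ winner winner claim your prize now now now",
]
for doc in docs:
    print(f"ppl={wiki_perplexity(doc):8.1f} keep={passes_quality_check(doc)} :: {doc[:40]}")

As the edited paragraph itself cautions, a low Wikipedia perplexity under such a check does not reliably translate into better downstream performance.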