guipenedo committed decf500 (1 parent: f3b0151)

add missing citation

dist/bibliography.bib CHANGED
@@ -323,4 +323,12 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   eprint={2009.03300},
   archivePrefix={arXiv},
   primaryClass={cs.CY}
+}
+@misc{mitchell2023measuring,
+  title={Measuring Data},
+  author={Margaret Mitchell and Alexandra Sasha Luccioni and Nathan Lambert and Marissa Gerchick and Angelina McMillan-Major and Ezinwanne Ozoani and Nazneen Rajani and Tristan Thrush and Yacine Jernite and Douwe Kiela},
+  year={2023},
+  eprint={2212.05129},
+  archivePrefix={arXiv},
+  primaryClass={cs.AI}
 }
dist/index.html CHANGED
@@ -126,7 +126,7 @@
 
   <h3>What is good data?</h3>
   <p>This is probably the main question to keep in mind when
-  creating a dataset. In most contexts and, in particular, in the context of large language model pretraining <d-footnote>Note that this report is focused on the special field of web-scale datasets ("web-scale" typically meaning >100 billion tokens obtained from the web) used to pretrain a Large Language Model (by pretraining we mean the very first step in the training of a model, starting from random weights). We don't pretend to cover any other field of dataset creation nor that the lessons or hypothesis we develop in this document can extend to any field besides this specific field.</d-footnote>, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
+  creating a dataset. In most contexts and, in particular, in the context of large language model pretraining <d-footnote>Note that this report is focused on the special field of web-scale datasets ("web-scale" typically meaning >100 billion tokens obtained from the web) used to pretrain a Large Language Model (by pretraining we mean the very first step in the training of a model, starting from random weights). We don't pretend to cover any other field of dataset creation nor that the lessons or hypothesis we develop in this document can extend to any field besides this specific field.</d-footnote>, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite><d-cite bibtex-key="mitchell2023measuring"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
   <p>It is still common to train a model on a given corpus considered "clean"
   (typically wikipedia<d-footnote>Even though as we mentioned above the notion of "clean" is so ill-defined that it should probably not been seen as equivalent to wikipedia-type of text</d-footnote>) and use it to check the perplexity on the dataset
   that we were trying to curate<d-cite bibtex-key="wenzek2019ccnet"></d-cite>. Unfortunately this does not always correlate with improved performance on a set of downstream
src/bibliography.bib CHANGED
@@ -323,4 +323,12 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   eprint={2009.03300},
   archivePrefix={arXiv},
   primaryClass={cs.CY}
+}
+@misc{mitchell2023measuring,
+  title={Measuring Data},
+  author={Margaret Mitchell and Alexandra Sasha Luccioni and Nathan Lambert and Marissa Gerchick and Angelina McMillan-Major and Ezinwanne Ozoani and Nazneen Rajani and Tristan Thrush and Yacine Jernite and Douwe Kiela},
+  year={2023},
+  eprint={2212.05129},
+  archivePrefix={arXiv},
+  primaryClass={cs.AI}
 }
src/index.html CHANGED
@@ -126,7 +126,7 @@
 
   <h3>What is good data?</h3>
   <p>This is probably the main question to keep in mind when
-  creating a dataset. In most contexts and, in particular, in the context of large language model pretraining <d-footnote>Note that this report is focused on the special field of web-scale datasets ("web-scale" typically meaning >100 billion tokens obtained from the web) used to pretrain a Large Language Model (by pretraining we mean the very first step in the training of a model, starting from random weights). We don't pretend to cover any other field of dataset creation nor that the lessons or hypothesis we develop in this document can extend to any field besides this specific field.</d-footnote>, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
+  creating a dataset. In most contexts and, in particular, in the context of large language model pretraining <d-footnote>Note that this report is focused on the special field of web-scale datasets ("web-scale" typically meaning >100 billion tokens obtained from the web) used to pretrain a Large Language Model (by pretraining we mean the very first step in the training of a model, starting from random weights). We don't pretend to cover any other field of dataset creation nor that the lessons or hypothesis we develop in this document can extend to any field besides this specific field.</d-footnote>, "high quality" is not a very well defined term<d-cite bibtex-key="albalak2024survey"></d-cite><d-cite bibtex-key="mitchell2023measuring"></d-cite>, and not even a property of documents that can always be clearly perceived through direct human observation alone.<d-cite bibtex-key="longpre2023pretrainers"></d-cite></p>
   <p>It is still common to train a model on a given corpus considered "clean"
   (typically wikipedia<d-footnote>Even though as we mentioned above the notion of "clean" is so ill-defined that it should probably not been seen as equivalent to wikipedia-type of text</d-footnote>) and use it to check the perplexity on the dataset
   that we were trying to curate<d-cite bibtex-key="wenzek2019ccnet"></d-cite>. Unfortunately this does not always correlate with improved performance on a set of downstream
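
Note: the paragraph touched by this commit mentions the common practice (as in CCNet, the wenzek2019ccnet citation) of scoring candidate documents with a language model trained on a "clean" corpus such as Wikipedia and using perplexity as a quality proxy. The following is a minimal sketch of that idea using the KenLM Python bindings, not code from this commit or the FineWeb pipeline; the model path and threshold are hypothetical placeholders.

# Minimal sketch of a CCNet-style perplexity check: score documents with an
# n-gram LM trained on a "clean" corpus (e.g. Wikipedia). The model path and
# the threshold are hypothetical placeholders, not part of this repository.
import kenlm  # pip install kenlm

# Hypothetical pretrained KenLM model built from Wikipedia text.
model = kenlm.Model("wikipedia.5gram.binary")

def wiki_perplexity(text: str) -> float:
    """Per-word perplexity of `text` under the Wikipedia n-gram model."""
    return model.perplexity(text)

def passes_quality_check(text: str, threshold: float = 1000.0) -> bool:
    """Keep documents whose perplexity stays below an arbitrary threshold."""
    return wiki_perplexity(text) < threshold

docs = [
    "The mitochondrion is an organelle found in most eukaryotic cells.",
    "CLICK here !!! free $$$ winner winner claim your prize now now now",
]
for doc in docs:
    print(f"ppl={wiki_perplexity(doc):8.1f} keep={passes_quality_check(doc)} :: {doc[:40]}")

As the edited paragraph itself cautions, a low Wikipedia perplexity under such a check does not reliably translate into better downstream performance.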