guipenedo (HF staff) committed
Commit 9a51a2c · unverified · 1 parent: acabe52

attempt to fix latex

Files changed (2)
  1. dist/index.html +14 -13
  2. src/index.html +14 -13
dist/index.html CHANGED
@@ -1,5 +1,5 @@
-<!doctype html>
-
+<!DOCTYPE html>
+<html>
 <head>
   <script src="distill.bundle.js" type="module" fetchpriority="high" blocking></script>
   <script src="main.bundle.js" type="module" fetchpriority="low" defer></script>
@@ -76,11 +76,11 @@
 <aside>Reading time: 45 min. For the best reading experience, we recommend not using a mobile phone.</aside>
 
 <p>Recently, we released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, a new, large-scale
-(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we decided to carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset it-self, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.
+(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we carefully documented and ablated all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset itself, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.
 
 <aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team (Christopher Olah, Shan Carter, Ludwig Schubert in particular) for creating the template on which we based this blog post. Thanks also for inspiring us with exquisitely crafted articles and blog posts.</aside>
 
-<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality education annotations and which outperform all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
+<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality synthetic annotations for educational value, and which outperforms all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
 <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is available in two sizes/filtering-level: <strong>1.3 trillion (very high educational content) and 5.4 trillion (high educational content) tokens</strong> (all tokens are measured with GPT2 tokenizer <d-cite bibtex-key="radford2019language"></d-cite>). 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
 <p>Both datasets are released under the permissive <a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a></p>
@@ -88,8 +88,8 @@
 <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
 recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.</p>
 
-<h2>What's web data</h2>
-<h3>Finding the data</h3>
+<h2>Web data</h2>
+<h3>Finding the raw data</h3>
 <p>A common question often asked regarding web datasets used
 to train LLMs is “where do they even get all that data?”. There are generally two options:</p>
 <ul>
@@ -161,7 +161,7 @@
 <ul>
   <li>small variance between runs trained on different samplings of the same
   dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
-  resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter ablations.
+  resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter's effect
   </li>
 </ul>
 <ul>
@@ -222,9 +222,9 @@
 <div id="plot-wet_comparison"></div>
 </div>
 
-<h3>First steps of filtering</h3>
+<h3>Base filtering</h3>
 <p>Filtering is an important part of the curation process. It consists in
-removing part of the data (which can consists in removing words, lines, or even full documents) that lower the performances of the model and is thus
+removing part of the data (be it words, lines, or even full documents) that lowers the performance of the model and is thus
 deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
 <p>As a basis for our filtering we used part of the setup
 from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
@@ -252,17 +252,17 @@
 just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
 can even be introduced by the crawler itself, when different links point to the same page. </p>
 <p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
-allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to an increased training
-efficiency: by removing duplicated content, a model can reach the same performance level with less training iteration – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
+allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training
+efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
 <p>There are different ways to identify and even define
 duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
 efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
 similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
-documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or periphrase. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>
+documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or paraphrasing. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>
 
 <h4>Our deduplication parameters</h4>
 <p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
-fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the sequences considered (by controlling the n-gram size). We chose to work on 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
+fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
 112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
 75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
 <p>This would mean that for two documents with a similarity ($$s$$)
@@ -765,3 +765,4 @@
 }
 </script>
 </body>
+</html>
src/index.html CHANGED
@@ -1,5 +1,5 @@
-<!doctype html>
-
+<!DOCTYPE html>
+<html>
 <head>
   <script src="distill.bundle.js" type="module" fetchpriority="high" blocking></script>
   <script src="main.bundle.js" type="module" fetchpriority="low" defer></script>
@@ -76,11 +76,11 @@
 <aside>Reading time: 45 min. For the best reading experience, we recommend not using a mobile phone.</aside>
 
 <p>Recently, we released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, a new, large-scale
-(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we decided to carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset it-self, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.
+(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we carefully documented and ablated all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset itself, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.
 
 <aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team (Christopher Olah, Shan Carter, Ludwig Schubert in particular) for creating the template on which we based this blog post. Thanks also for inspiring us with exquisitely crafted articles and blog posts.</aside>
 
-<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality education annotations and which outperform all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
+<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality synthetic annotations for educational value, and which outperforms all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
 <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is available in two sizes/filtering-level: <strong>1.3 trillion (very high educational content) and 5.4 trillion (high educational content) tokens</strong> (all tokens are measured with GPT2 tokenizer <d-cite bibtex-key="radford2019language"></d-cite>). 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
 <p>Both datasets are released under the permissive <a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a></p>
@@ -88,8 +88,8 @@
 <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
 recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.</p>
 
-<h2>What's web data</h2>
-<h3>Finding the data</h3>
+<h2>Web data</h2>
+<h3>Finding the raw data</h3>
 <p>A common question often asked regarding web datasets used
 to train LLMs is “where do they even get all that data?”. There are generally two options:</p>
 <ul>
@@ -161,7 +161,7 @@
 <ul>
   <li>small variance between runs trained on different samplings of the same
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
-  resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter ablations.
+  resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter's effect
  </li>
 </ul>
 <ul>
@@ -222,9 +222,9 @@
 <div id="plot-wet_comparison"></div>
 </div>
 
-<h3>First steps of filtering</h3>
+<h3>Base filtering</h3>
 <p>Filtering is an important part of the curation process. It consists in
-removing part of the data (which can consists in removing words, lines, or even full documents) that lower the performances of the model and is thus
+removing part of the data (be it words, lines, or even full documents) that lowers the performance of the model and is thus
 deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
 <p>As a basis for our filtering we used part of the setup
 from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
@@ -252,17 +252,17 @@
 just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
 can even be introduced by the crawler itself, when different links point to the same page. </p>
 <p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
-allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to an increased training
-efficiency: by removing duplicated content, a model can reach the same performance level with less training iteration – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
+allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training
+efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
 <p>There are different ways to identify and even define
 duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
 efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
 similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
-documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or periphrase. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>
+documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or paraphrasing. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>
 
 <h4>Our deduplication parameters</h4>
 <p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
-fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the sequences considered (by controlling the n-gram size). We chose to work on 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
+fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
 112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
 75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
 <p>This would mean that for two documents with a similarity ($$s$$)
@@ -765,3 +765,4 @@
 }
 </script>
 </body>
+</html>
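
The deduplication paragraphs in the diff above specify MinHash over word-level 5-grams with 112 hash functions split into 14 buckets of 8, flagging documents that share all 8 minhashes in any bucket. The following is a minimal, illustrative Python sketch of that bucketing scheme, not the datatrove implementation referenced in the footnote: the regex tokenizer, the blake2b-based hash family, and the pairwise bucket comparison are simplifying assumptions made only for this example.

# Minimal illustrative sketch of the MinHash bucketing scheme described above
# (word-level 5-grams, 112 hash functions split into 14 buckets of 8 hashes).
# NOT the datatrove implementation: tokenizer and hash family are stand-ins.
import hashlib
import re

N_BUCKETS = 14          # "bands" in LSH terminology
HASHES_PER_BUCKET = 8   # "rows" per band
NGRAM_SIZE = 5          # word 5-grams
NUM_HASHES = N_BUCKETS * HASHES_PER_BUCKET  # 112 hash functions in total


def word_ngrams(text: str, n: int = NGRAM_SIZE) -> set[str]:
    # Simple whitespace/word-character split, standing in for a
    # language-specific word tokenizer.
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(shingles: set[str]) -> list[int]:
    # One minimum per seeded hash function, 112 values in total.
    signature = []
    for seed in range(NUM_HASHES):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles
        ))
    return signature


def bucket_keys(signature: list[int]) -> list[tuple[int, ...]]:
    # Split the 112 minhashes into 14 buckets of 8; each bucket yields one key.
    return [
        tuple(signature[i * HASHES_PER_BUCKET:(i + 1) * HASHES_PER_BUCKET])
        for i in range(N_BUCKETS)
    ]


def are_fuzzy_duplicates(doc_a: str, doc_b: str) -> bool:
    # Documents sharing all 8 minhashes in any one bucket are flagged as duplicates.
    keys_a = bucket_keys(minhash_signature(word_ngrams(doc_a)))
    keys_b = bucket_keys(minhash_signature(word_ngrams(doc_b)))
    return any(a == b for a, b in zip(keys_a, keys_b))


def match_probability(s: float) -> float:
    # Probability that two documents with 5-gram similarity s are flagged:
    # 1 - (1 - s^8)^14
    return 1 - (1 - s ** HASHES_PER_BUCKET) ** N_BUCKETS


if __name__ == "__main__":
    print(f"P(flagged) at s=0.75: {match_probability(0.75):.2f}")  # roughly 0.77
    print(f"P(flagged) at s=0.90: {match_probability(0.90):.2f}")  # roughly 1.00
    a = "the quick brown fox jumps over the lazy dog near the old stone bridge today"
    b = "the quick brown fox jumps over the lazy dog near the old stone bridge tonight"
    print("near-duplicates flagged:", are_fuzzy_duplicates(a, b))

With these parameters a pair of documents at exactly 75% 5-gram similarity is flagged roughly 77% of the time, and the probability rises steeply toward 1 above that threshold, which is the sense in which the setup "targets documents that are at least 75% similar".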