guipenedo (HF staff) committed
Commit 9a51a2c · unverified · 1 parent: acabe52

attempt to fix latex

Files changed (2)
  1. dist/index.html +14 -13
  2. src/index.html +14 -13
dist/index.html CHANGED
@@ -1,5 +1,5 @@
-<!doctype html>
-
+<!DOCTYPE html>
+<html>
 <head>
   <script src="distill.bundle.js" type="module" fetchpriority="high" blocking></script>
   <script src="main.bundle.js" type="module" fetchpriority="low" defer></script>
@@ -76,11 +76,11 @@
 <aside>Reading time: 45 min. For the best reading experience, we recommend not using a mobile phone.</aside>
 
 <p>Recently, we released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, a new, large-scale
-(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we decided to carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset it-self, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.
+(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we carefully documented and ablated all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset itself, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.
 
 <aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team (Christopher Olah, Shan Carter, Ludwig Schubert in particular) for creating the template on which we based this blog post. Thanks also for inspiring us with exquisitely crafted articles and blog posts.</aside>
 
-<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality education annotations and which outperform all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
+<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality synthetic annotations for educational value, and which outperforms all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
 <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is available in two sizes/filtering-level: <strong>1.3 trillion (very high educational content) and 5.4 trillion (high educational content) tokens</strong> (all tokens are measured with GPT2 tokenizer <d-cite bibtex-key="radford2019language"></d-cite>). 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
 <p>Both datasets are released under the permissive <a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a></p>
@@ -88,8 +88,8 @@
 <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
 recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.</p>
 
-<h2>What's web data</h2>
-<h3>Finding the data</h3>
+<h2>Web data</h2>
+<h3>Finding the raw data</h3>
 <p>A common question often asked regarding web datasets used
 to train LLMs is “where do they even get all that data?”. There are generally two options:</p>
 <ul>
@@ -161,7 +161,7 @@
 <ul>
   <li>small variance between runs trained on different samplings of the same
   dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
-  resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter ablations.
+  resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter's effect
   </li>
 </ul>
 <ul>
@@ -222,9 +222,9 @@
 <div id="plot-wet_comparison"></div>
 </div>
 
-<h3>First steps of filtering</h3>
+<h3>Base filtering</h3>
 <p>Filtering is an important part of the curation process. It consists in
-removing part of the data (which can consists in removing words, lines, or even full documents) that lower the performances of the model and is thus
+removing part of the data (be it words, lines, or even full documents) that lowers the performance of the model and is thus
 deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
 <p>As a basis for our filtering we used part of the setup
 from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
@@ -252,17 +252,17 @@
 just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
 can even be introduced by the crawler itself, when different links point to the same page. </p>
 <p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
-allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to an increased training
-efficiency: by removing duplicated content, a model can reach the same performance level with less training iteration – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
+allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training
+efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
 <p>There are different ways to identify and even define
 duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
 efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
 similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
-documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or periphrase. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>
+documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or paraphrasing. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>
 
 <h4>Our deduplication parameters</h4>
 <p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
-fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the sequences considered (by controlling the n-gram size). We chose to work on 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
+fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
 112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
 75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
 <p>This would mean that for two documents with a similarity ($$s$$)
@@ -765,3 +765,4 @@
 }
 </script>
 </body>
+</html>
src/index.html CHANGED
@@ -1,5 +1,5 @@
-<!doctype html>
-
+<!DOCTYPE html>
+<html>
 <head>
   <script src="distill.bundle.js" type="module" fetchpriority="high" blocking></script>
   <script src="main.bundle.js" type="module" fetchpriority="low" defer></script>
@@ -76,11 +76,11 @@
 <aside>Reading time: 45 min. For the best reading experience, we recommend not using a mobile phone.</aside>
 
 <p>Recently, we released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, a new, large-scale
-(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we decided to carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset it-self, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.
+(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we carefully documented and ablated all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset itself, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.
 
 <aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team (Christopher Olah, Shan Carter, Ludwig Schubert in particular) for creating the template on which we based this blog post. Thanks also for inspiring us with exquisitely crafted articles and blog posts.</aside>
 
-<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality education annotations and which outperform all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
+<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality synthetic annotations for educational value, and which outperforms all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
 <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is available in two sizes/filtering-level: <strong>1.3 trillion (very high educational content) and 5.4 trillion (high educational content) tokens</strong> (all tokens are measured with GPT2 tokenizer <d-cite bibtex-key="radford2019language"></d-cite>). 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
 <p>Both datasets are released under the permissive <a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a></p>
@@ -88,8 +88,8 @@
 <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
 recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.</p>
 
-<h2>What's web data</h2>
-<h3>Finding the data</h3>
+<h2>Web data</h2>
+<h3>Finding the raw data</h3>
 <p>A common question often asked regarding web datasets used
 to train LLMs is “where do they even get all that data?”. There are generally two options:</p>
 <ul>
@@ -161,7 +161,7 @@
 <ul>
   <li>small variance between runs trained on different samplings of the same
  dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
-  resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter ablations.
+  resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter's effect
  </li>
 </ul>
 <ul>
@@ -222,9 +222,9 @@
 <div id="plot-wet_comparison"></div>
 </div>
 
-<h3>First steps of filtering</h3>
+<h3>Base filtering</h3>
 <p>Filtering is an important part of the curation process. It consists in
-removing part of the data (which can consists in removing words, lines, or even full documents) that lower the performances of the model and is thus
+removing part of the data (be it words, lines, or even full documents) that lowers the performance of the model and is thus
 deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
 <p>As a basis for our filtering we used part of the setup
 from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
@@ -252,17 +252,17 @@
 just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
 can even be introduced by the crawler itself, when different links point to the same page. </p>
 <p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
-allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to an increased training
-efficiency: by removing duplicated content, a model can reach the same performance level with less training iteration – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
+allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training
+efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
 <p>There are different ways to identify and even define
 duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
 efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
 similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
-documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or periphrase. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>
+documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or paraphrasing. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>
 
 <h4>Our deduplication parameters</h4>
 <p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
-fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the sequences considered (by controlling the n-gram size). We chose to work on 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
+fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
 112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
 75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
 <p>This would mean that for two documents with a similarity ($$s$$)
@@ -765,3 +765,4 @@
 }
 </script>
 </body>
+</html>
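
The deduplication paragraphs in the diff above specify MinHash over word-level 5-grams with 112 hash functions split into 14 buckets of 8, flagging documents that share all 8 minhashes in any bucket. The following is a minimal, illustrative Python sketch of that bucketing scheme, not the datatrove implementation referenced in the footnote: the regex tokenizer, the blake2b-based hash family, and the pairwise bucket comparison are simplifying assumptions made only for this example.

# Minimal illustrative sketch of the MinHash bucketing scheme described above
# (word-level 5-grams, 112 hash functions split into 14 buckets of 8 hashes).
# NOT the datatrove implementation: tokenizer and hash family are stand-ins.
import hashlib
import re

N_BUCKETS = 14          # "bands" in LSH terminology
HASHES_PER_BUCKET = 8   # "rows" per band
NGRAM_SIZE = 5          # word 5-grams
NUM_HASHES = N_BUCKETS * HASHES_PER_BUCKET  # 112 hash functions in total


def word_ngrams(text: str, n: int = NGRAM_SIZE) -> set[str]:
    # Simple whitespace/word-character split, standing in for a
    # language-specific word tokenizer.
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(shingles: set[str]) -> list[int]:
    # One minimum per seeded hash function, 112 values in total.
    signature = []
    for seed in range(NUM_HASHES):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles
        ))
    return signature


def bucket_keys(signature: list[int]) -> list[tuple[int, ...]]:
    # Split the 112 minhashes into 14 buckets of 8; each bucket yields one key.
    return [
        tuple(signature[i * HASHES_PER_BUCKET:(i + 1) * HASHES_PER_BUCKET])
        for i in range(N_BUCKETS)
    ]


def are_fuzzy_duplicates(doc_a: str, doc_b: str) -> bool:
    # Documents sharing all 8 minhashes in any one bucket are flagged as duplicates.
    keys_a = bucket_keys(minhash_signature(word_ngrams(doc_a)))
    keys_b = bucket_keys(minhash_signature(word_ngrams(doc_b)))
    return any(a == b for a, b in zip(keys_a, keys_b))


def match_probability(s: float) -> float:
    # Probability that two documents with 5-gram similarity s are flagged:
    # 1 - (1 - s^8)^14
    return 1 - (1 - s ** HASHES_PER_BUCKET) ** N_BUCKETS


if __name__ == "__main__":
    print(f"P(flagged) at s=0.75: {match_probability(0.75):.2f}")  # roughly 0.77
    print(f"P(flagged) at s=0.90: {match_probability(0.90):.2f}")  # roughly 1.00
    a = "the quick brown fox jumps over the lazy dog near the old stone bridge today"
    b = "the quick brown fox jumps over the lazy dog near the old stone bridge tonight"
    print("near-duplicates flagged:", are_fuzzy_duplicates(a, b))

With these parameters a pair of documents at exactly 75% 5-gram similarity is flagged roughly 77% of the time, and the probability rises steeply toward 1 above that threshold, which is the sense in which the setup "targets documents that are at least 75% similar".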