attempt to fix latex
- dist/index.html +14 -13
- src/index.html +14 -13
dist/index.html CHANGED

@@ -1,5 +1,5 @@
-<!
-
+<!DOCTYPE html>
+<html>
 <head>
 <script src="distill.bundle.js" type="module" fetchpriority="high" blocking></script>
 <script src="main.bundle.js" type="module" fetchpriority="low" defer></script>
@@ -76,11 +76,11 @@
 <aside>Reading time: 45 min. For the best reading experience, we recommend not using a mobile phone.</aside>

 <p>Recently, we released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, a new, large-scale
-(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we
+(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we carefully documented and ablated all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset itself, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.

 <aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team (Christopher Olah, Shan Carter, Ludwig Schubert in particular) for creating the template on which we based this blog post. Thanks also for inspiring us with exquisitely crafted articles and blog posts.</aside>

-<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality
+<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality synthetic annotations for educational value, and which outperforms all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
 <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is available in two sizes/filtering-level: <strong>1.3 trillion (very high educational content) and 5.4 trillion (high educational content) tokens</strong> (all tokens are measured with GPT2 tokenizer <d-cite bibtex-key="radford2019language"></d-cite>). 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
 <p>Both datasets are released under the permissive <a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a></p>
@@ -88,8 +88,8 @@
 <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
 recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.</p>

-<h2>
-<h3>Finding the data</h3>
+<h2>Web data</h2>
+<h3>Finding the raw data</h3>
 <p>A common question often asked regarding web datasets used
 to train LLMs is “where do they even get all that data?”. There are generally two options:</p>
 <ul>
@@ -161,7 +161,7 @@
 <ul>
 <li>small variance between runs trained on different samplings of the same
 dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
-resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter
+resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter's effect
 </li>
 </ul>
 <ul>
@@ -222,9 +222,9 @@
 <div id="plot-wet_comparison"></div>
 </div>

-<h3>
+<h3>Base filtering</h3>
 <p>Filtering is an important part of the curation process. It consists in
-removing part of the data (
+removing part of the data (be it words, lines, or even full documents) that lowers the performance of the model and is thus
 deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
 <p>As a basis for our filtering we used part of the setup
 from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
@@ -252,17 +252,17 @@
 just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
 can even be introduced by the crawler itself, when different links point to the same page. </p>
 <p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
-allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to
-efficiency: by removing duplicated content, a model can reach the same performance level with
+allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training
+efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
 <p>There are different ways to identify and even define
 duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
 efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
 similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
-documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or
+documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or paraphrasing. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>

 <h4>Our deduplication parameters</h4>
 <p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
-fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the
+fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
 112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
 75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
 <p>This would mean that for two documents with a similarity ($$s$$)
@@ -765,3 +765,4 @@
 }
 </script>
 </body>
+</html>
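As a side note on the deduplication hunk above: the distinction it draws between “exact” and “fuzzy” methods can be made concrete with a small sketch. The snippet below is not the datatrove pipeline used for FineWeb; it is a minimal, self-contained Python illustration (with hypothetical helper names) of exact hash-based deduplication, where only byte-identical documents (after trivial normalization) are collapsed, so near-duplicates survive and need a fuzzy method such as MinHash.

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical document.

    Minimal illustration of "exact" deduplication: two documents count as
    duplicates only if their normalized text hashes to the same digest.
    """
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha1(doc.strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

docs = [
    "CommonCrawl has been crawling the web since 2007.",
    "CommonCrawl has been crawling the web since 2007.",    # exact duplicate: dropped
    "CommonCrawl has been crawling the  web since 2007!",   # near-duplicate: kept by exact matching
]
print(exact_dedup(docs))  # the near-duplicate survives, which is what fuzzy methods like MinHash address
```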
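The MinHash configuration quoted in the same hunk (word 5-grams, 112 hash functions split into 14 buckets of 8, targeting roughly 75% similarity) can likewise be checked with a back-of-the-envelope sketch. This is not the distributed datatrove implementation: the helper names and the choice of blake2b as the seeded hash are assumptions for illustration, and the detection probability uses the standard MinHash-LSH banding formula 1 - (1 - s^8)^14 for two documents with 5-gram Jaccard similarity s, which is what the truncated “($$s$$)” sentence in the diff is setting up.

```python
import hashlib

# Parameters quoted in the diff: word 5-grams, 112 hashes = 14 buckets x 8 hashes per bucket.
N_BUCKETS, HASHES_PER_BUCKET, NGRAM = 14, 8, 5
N_HASHES = N_BUCKETS * HASHES_PER_BUCKET  # 112

def shingles(text, n=NGRAM):
    """Set of word n-grams ("shingles") for a document, split on whitespace."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text):
    """112 minhashes: for each seeded hash function, the minimum hash over all shingles."""
    grams = shingles(text)
    return [
        min(int.from_bytes(hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big")
            for g in grams)
        for seed in range(N_HASHES)
    ]

def buckets(signature):
    """Split a signature into 14 buckets of 8 hashes; one fully matching bucket flags a duplicate pair."""
    return [tuple(signature[i:i + HASHES_PER_BUCKET]) for i in range(0, N_HASHES, HASHES_PER_BUCKET)]

def match_probability(s, r=N_BUCKETS, b=HASHES_PER_BUCKET):
    """Probability that two documents with shingle Jaccard similarity s share at least one full bucket."""
    return 1 - (1 - s ** b) ** r

sig = minhash_signature("CommonCrawl snapshots contain large amounts of duplicated boilerplate text across many pages")
print(len(buckets(sig)), "buckets of", HASHES_PER_BUCKET, "hashes")  # 14 buckets of 8

for s in (0.50, 0.75, 0.90):
    print(f"similarity {s:.2f} -> flagged as duplicates with probability {match_probability(s):.3f}")
```

With these parameters, a pair at the 75% similarity target is flagged with probability of roughly 0.77, while pairs at 90% similarity are flagged almost surely and pairs at 50% similarity only about 5% of the time.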
src/index.html CHANGED

@@ -1,5 +1,5 @@
-<!
-
+<!DOCTYPE html>
+<html>
 <head>
 <script src="distill.bundle.js" type="module" fetchpriority="high" blocking></script>
 <script src="main.bundle.js" type="module" fetchpriority="low" defer></script>
@@ -76,11 +76,11 @@
 <aside>Reading time: 45 min. For the best reading experience, we recommend not using a mobile phone.</aside>

 <p>Recently, we released <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb"><strong>🍷 FineWeb</strong></a>, a new, large-scale
-(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we
+(<strong>15-trillion tokens, 44TB disk space</strong>) dataset for LLM pretraining. FineWeb is derived from 96 <a href="https://commoncrawl.org/">CommonCrawl</a> snapshots and produces <strong>better-performing LLMs than other open pretraining datasets</strong>. To bring more clarity in machine learning and advance the open understanding of how to train good quality large language models, we carefully documented and ablated all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. The present long form report is a deep dive in how to create a large and high-quality web-scale dataset for LLM pretraining. The dataset itself, 🍷 FineWeb, is available <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">here</a>.

 <aside>We are extremely thankful to the whole <a href="https://distill.pub/">distill.pub</a> team (Christopher Olah, Shan Carter, Ludwig Schubert in particular) for creating the template on which we based this blog post. Thanks also for inspiring us with exquisitely crafted articles and blog posts.</aside>

-<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality
+<p>In this report we also introduce <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu"><strong>📚 FineWeb-Edu</strong></a>, a subset of FineWeb constructed using scalable high-quality synthetic annotations for educational value, and which outperforms all openly accessible web-datasets on a number of educational benchmarks such as MMLU, ARC, and OpenBookQA.
 <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">📚 FineWeb-Edu</a> is available in two sizes/filtering-level: <strong>1.3 trillion (very high educational content) and 5.4 trillion (high educational content) tokens</strong> (all tokens are measured with GPT2 tokenizer <d-cite bibtex-key="radford2019language"></d-cite>). 📚 FineWeb-Edu outperforms all existing public web datasets, with models pretrained on it showing notable improvements on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA. You can
 download it <a href="https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu">here</a>.</p>
 <p>Both datasets are released under the permissive <a href="https://opendatacommons.org/licenses/by/1-0/">ODC-By 1.0 license</a></p>
@@ -88,8 +88,8 @@
 <p><strong>TLDR:</strong> This blog covers a discussion on processing and evaluating data quality at scale, the 🍷 FineWeb
 recipe (listing and explaining all of our design choices), and the process followed to create its 📚 FineWeb-Edu subset.</p>

-<h2>
-<h3>Finding the data</h3>
+<h2>Web data</h2>
+<h3>Finding the raw data</h3>
 <p>A common question often asked regarding web datasets used
 to train LLMs is “where do they even get all that data?”. There are generally two options:</p>
 <ul>
@@ -161,7 +161,7 @@
 <ul>
 <li>small variance between runs trained on different samplings of the same
 dataset: we want our runs on a subset of the data to be representative of the whole dataset, and the
-resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter
+resulting scores to be, in the limit of what is possible, less sensitive to exact data point choices than to our filter's effect
 </li>
 </ul>
 <ul>
@@ -222,9 +222,9 @@
 <div id="plot-wet_comparison"></div>
 </div>

-<h3>
+<h3>Base filtering</h3>
 <p>Filtering is an important part of the curation process. It consists in
-removing part of the data (
+removing part of the data (be it words, lines, or even full documents) that lowers the performance of the model and is thus
 deemed to be “lower quality” in our eval-driven process of dataset crafting.</p>
 <p>As a basis for our filtering we used part of the setup
 from RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>. Namely, we:</p>
@@ -252,17 +252,17 @@
 just otherwise repeated content spread over different domains and webpages. Sometimes, these duplicated pages
 can even be introduced by the crawler itself, when different links point to the same page. </p>
 <p>Removing these duplicates (deduplicating) has been correlated with improvements in model performance<d-cite bibtex-key="lee2022deduplicating"></d-cite> and a reduction in memorization of pretraining data<d-cite bibtex-key="carlini2023quantifying"></d-cite>, which might
-allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to
-efficiency: by removing duplicated content, a model can reach the same performance level with
+allow for better generalization. Additionally, the performance uplift obtained through deduplication can be equated to increased training
+efficiency: by removing duplicated content, a model can reach the same performance level with fewer training iterations – or equivalently, for a given number of training tokens, a model will have seen more diverse data.<d-cite bibtex-key="muennighoff2023scaling"></d-cite><d-cite bibtex-key="hernandez2022scaling"></d-cite></p>
 <p>There are different ways to identify and even define
 duplicated data. Common approaches rely on hashing techniques to speed up the process, or on building
 efficient data structures to index the data (like suffix arrays). Methods can also be “fuzzy”, by using some
 similarity metric to mark documents as duplicates, or “exact” by checking for exact matches between two
-documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or
+documents (or lines, paragraphs, or whatever other granularity level being used)<d-footnote>Note that here, even when we discuss "fuzzy" deduplication, we are only employing methods that operate on character/word matches, aka surface-level text. A more complex concept of deduplication is concerned with "semantic" deduplication: comparing/removing texts which are relative to the same concepts and use for instance synonyms or paraphrasing. We don't discuss these topics here but note that they can be important in the field of large-scale synthetic data generation for instance (see our <a href="https://huggingface.co/blog/cosmopedia">Cosmopedia release</a> on this topic)</d-footnote>.</p>

 <h4>Our deduplication parameters</h4>
 <p>Following RefinedWeb<d-cite bibtex-key="penedo2023refinedweb"></d-cite>, we decided to apply MinHash, a
-fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the
+fuzzy hash based deduplication technique that scales efficiently to many CPU-nodes and allows us to tune similarity thresholds (by controlling the number and size of buckets) as well as the length of the subsequences considered (by controlling the n-gram size). We chose to collect each document's 5-grams<d-footnote>Our units are "words", computed in the <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/pipeline/dedup/minhash.py#L196">MinHash processing function</a> with a <a href="https://github.com/huggingface/datatrove/blob/e9963f69f1fbab1a61339bd1b497f6e138b9f47f/src/datatrove/utils/word_tokenizers.py#L323">language-specific word tokenizer</a>.</d-footnote> and compute minhashes using
 112 hash functions in total, split into 14 buckets of 8 hashes each — targeting documents that are at least
 75% similar. Documents with the same 8 minhashes in any bucket are considered a duplicate of each other.</p>
 <p>This would mean that for two documents with a similarity ($$s$$)
@@ -765,3 +765,4 @@
 }
 </script>
 </body>
+</html>