omkarenator committed
Commit • 7bde7c0
Parent(s): 149e56a

more fixes

- curated.py +1 -1
- main.py +2 -2
- web.py +2 -2
curated.py CHANGED

@@ -609,7 +609,7 @@ data_preprocessing_div = Div(
 B("Unigram Log Probability Filter"),
 " calculates the log probability of each unigram to measure the significance of individual words. This step quantifies the importance of individual words but may not capture the semantic meaning of words. To calculate the average log word probability, we use word frequencies extracted from the ",
 A("1T Web-gram corpus", href="https://catalog.ldc.upenn.edu/LDC2006T13"),
-". Specifically, we use the list
+". Specifically, we use the available list created by ",
 A(
     "Rachel Tatman",
     href="https://www.kaggle.com/datasets/rtatman/english-word-frequency",
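The paragraph edited above describes averaging per-word log probabilities taken from a unigram frequency table. Below is a minimal Python sketch of that idea, assuming a "word,count" CSV in the layout of Rachel Tatman's Kaggle list; the file format, function names, and threshold are illustrative assumptions, not the TxT360 implementation.

import math

def load_unigram_log_probs(freq_csv_path):
    # Build log10 unigram probabilities from a "word,count" CSV
    # (layout assumed to match the Kaggle english-word-frequency list).
    counts = {}
    with open(freq_csv_path, encoding="utf-8") as f:
        for line in f:
            word, _, count = line.strip().partition(",")
            if count.isdigit():
                counts[word.lower()] = int(count)
    total = sum(counts.values())
    return {w: math.log10(c / total) for w, c in counts.items()}

def mean_unigram_log_prob(text, log_probs):
    # Average log probability over the words found in the frequency table.
    words = [w.lower() for w in text.split()]
    known = [log_probs[w] for w in words if w in log_probs]
    return sum(known) / len(known) if known else float("-inf")

def passes_unigram_filter(text, log_probs, threshold=-4.5):
    # The threshold is an illustrative value, not the one used in TxT360.
    return mean_unigram_log_prob(text, log_probs) >= threshold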
main.py CHANGED

@@ -864,7 +864,7 @@ def intro():
 A(B("TxT360 (Trillion eXtracted Text),"), href="https://huggingface.co/datasets/LLM360/TxT360"),
 " the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw, PG-19, etc.). The large-scale deduplication process and rich metadata stored enables precise control over data distribution. We demonstrate a simple but effective upsampling recipe that creates a 15+ trillion-token corpus, outperforming FineWeb 15T on several key metrics. With the information, TxT360 empowers pre-trainers to explore more advanced weighting techniques, a feature not commonly available in previous pre-training datasets. Our findings highlight the importance of both high-quality data sources and appropriate weighting for optimal blending in LLM training."
 ),
-P("In line with our 360° open source spirit, we document all detailed steps, reasons of our decisions, detailed statistics, our code (stay tuned!), analysis results and more, in
+P("In line with our 360° open source spirit, we document all detailed steps, reasons of our decisions, detailed statistics, our code (stay tuned!), analysis results and more, in addition to the dataset itself. We hope this can serve as a useful resource for future developers."
 ),
 plotly2fasthtml(all_eval_res_figs["MMLU"]),
 P(

@@ -899,7 +899,7 @@ def intro():
 "In LLM pretraining, it is common to combine all possible text sources due to the Scaling Law. Crawled web pages are included to provide a vast quantity of data which can cover long tail and diverse information, while curated datasets such as Wikipedia are also used, which often provide the 'deep-dive' domain information. By integrating the reach of web data with the quality of curated sources, TxT360 meets and surpasses the rigorous standards required for state-of-the-art LLM pre-training."
 ),
 P(
-"** TxT360 does not include very specific domains such as code and math. This decision was made due to the perceived low duplication code with other sources, and the different logic
+"** TxT360 does not include very specific domains such as code and math. This decision was made due to the perceived low duplication code with other sources, and the different logic required to build those datasets. We leave that to future work and recommend users refer to existing projects such as Stack V2",
 D_cite(bibtex_key="lozhkov2024starcoder2stackv2"),
 ".",
 ),
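The intro text above mentions an upsampling recipe that reweights data sources when blending them into the training mix. The sketch below shows source-level weighted sampling in its simplest form; the source names and weights are made-up placeholders, not the actual TxT360 recipe.

import random

# Hypothetical source weights; the real TxT360 recipe and values differ.
SOURCE_WEIGHTS = {
    "common_crawl": 0.85,
    "freelaw": 0.05,
    "pg19": 0.10,
}

def sample_source(weights, rng=random):
    # Draw one source name with probability proportional to its weight.
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

def build_batch(documents_by_source, batch_size, weights=SOURCE_WEIGHTS):
    # documents_by_source: dict mapping source name -> list of documents.
    # Upsampling happens implicitly: high-weight sources are drawn more often,
    # so their documents can repeat across batches.
    batch = []
    for _ in range(batch_size):
        source = sample_source(weights)
        batch.append(random.choice(documents_by_source[source]))
    return batch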
web.py CHANGED

@@ -657,7 +657,7 @@ def web_data():
 """,
 ),
 P(B("Toxic Lines: "), """
-When
+When manually inspecting the data, we found that there are some adult ads in the beginning or end of the
 document (with a sample shown below), which are hard to remove via document-level filtering strategies. Inspired
 by this, we develop line-level detoxification using a bad word list from LDNOOBW (+ rule: word length < 10 + the
 line is in the first 3 lines or in the last 3 lines) to remove toxic lines. Specifically, we do not only consider
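The "Toxic Lines" text added above describes dropping lines that contain LDNOOBW bad words (shorter than 10 characters) when they fall within the first or last three lines of a document. A rough Python sketch of that rule follows; the word-list loading, matching details, and function names are assumptions for illustration, not the pipeline's code.

import re

def load_bad_words(path):
    # LDNOOBW list: one word per line; keep only short words per the rule above.
    with open(path, encoding="utf-8") as f:
        return {w.strip().lower() for w in f if 0 < len(w.strip()) < 10}

def remove_toxic_lines(text, bad_words, window=3):
    # Drop a line if it sits in the first/last `window` lines of the document
    # and contains any bad word as a whole token.
    lines = text.split("\n")
    kept = []
    for i, line in enumerate(lines):
        in_window = i < window or i >= len(lines) - window
        tokens = set(re.findall(r"[a-z']+", line.lower()))
        if in_window and tokens & bad_words:
            continue
        kept.append(line)
    return "\n".join(kept)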
@@ -1455,7 +1455,7 @@ def web_data():
 ),
 P("""
 Both Dolma and RedPajama V2 split texts into words using white spaces and newline symbols. However,
-DataTrove employs a tokenizer to split texts into words and ignore
+DataTrove employs a tokenizer to split texts into words and ignore punctuation, resulting in a higher
 word count compared to simple `text.split()`.
 We decided to use simple `len(text.split())` to compute the word count.
 """),
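The hunk above contrasts DataTrove's tokenizer-based word counting with plain whitespace splitting. The snippet below illustrates the whitespace-based count the text says was chosen; the example sentence is arbitrary.

def word_count(text):
    # Simple whitespace splitting as described above; punctuation stays
    # attached to neighbouring words rather than being tokenized separately.
    return len(text.split())

print(word_count("Hello, world! This is a test."))  # -> 6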