Commit 860a948 by victormiller
Parent(s): 0f2f2a7

Update curated.py

curated.py (+3 -3)
@@ -33,7 +33,7 @@ curated_sources_intro = Div(
     P(
         "Curated sources comprise high-quality datasets that contain domain-specificity.",
         B(
-            " TxT360 was strongly influenced by The Pile regarding both inclusion of the dataset and filtering techniques."
+            " TxT360 was strongly influenced by The Pile", D_cite(bibtex_key="thepile"), " regarding both inclusion of the dataset and filtering techniques."
         ),
         " These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide valuable data that is excluded from the web dataset mentioned above. Analyzing and processing non-web data can yield insights and opportunities for various applications. Details about each of the sources are provided below. ",
     ),
@@ -685,7 +685,7 @@ filtering_process = Div(
     ),
     P(
         B("Download and Extraction: "),
-        "All the data was downloaded in original latex format from
+        "All the data was downloaded in original latex format from ArXiv official S3 repo: ",
         A("s3://arxic/src", href="s3://arxic/src"),
         ". We try to encode the downloaded data into utf-8 or guess encoding using chardet library. After that pandoc was used to extract information from the latex files and saved as markdown format",
         D_code(
@@ -703,7 +703,7 @@ filtering_process = Div(
     ),
     P(
         B(" Filters Applied: "),
-        "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset
+        "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset", D_cite(bibtex_key="peS2o"),
     ),
     Ul(
         Li(
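
Note on the "Download and Extraction" text in the second hunk: the string says the data is decoded as utf-8, with chardet used to guess the encoding otherwise. The diff does not show the project's actual code for this, so the following is only a minimal sketch of how such a step might look. The function name `decode_source`, the latin-1 fallback, and the `errors="replace"` policy are assumptions made for illustration, not taken from curated.py.

```python
# Hypothetical sketch: decode raw LaTeX bytes as utf-8, else guess with chardet.
# chardet is a third-party package; the import is optional so the sketch still
# runs (with a plain latin-1 fallback) when it is not installed.
try:
    import chardet
except ImportError:  # assumption: degrade gracefully instead of failing
    chardet = None


def decode_source(raw: bytes) -> str:
    """Return text for `raw`, preferring utf-8 and guessing otherwise."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    guessed = None
    if chardet is not None:
        # chardet.detect returns a dict with an "encoding" key (may be None).
        guessed = chardet.detect(raw).get("encoding")

    try:
        # errors="replace" keeps the pipeline moving on undecodable bytes.
        return raw.decode(guessed or "latin-1", errors="replace")
    except LookupError:
        # The guessed codec name is unknown to Python; fall back to latin-1.
        return raw.decode("latin-1", errors="replace")
```

The decoded text would then be handed to pandoc for the LaTeX-to-markdown conversion; that command lives in the `D_code(` block that this hunk cuts off, so it is not reproduced here.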