TxT360


victormiller committed on Oct 4, 2024

Commit 860a948 (verified)

1 Parent(s): 0f2f2a7

Update curated.py

Files changed (1)

curated.py CHANGED

@@ -33,7 +33,7 @@ curated_sources_intro = Div(
     P(
         "Curated sources comprise high-quality datasets that contain domain-specificity.",
         B(
-            " TxT360 was strongly influenced by The Pile regarding both inclusion of the dataset and filtering techniques."
+            " TxT360 was strongly influenced by The Pile",  D_cite(bibtex_key="thepile"), " regarding both inclusion of the dataset and filtering techniques."
         ),
         " These sources, such as Arxiv, Wikipedia, and Stack Exchange, provide valuable data that is excluded from the web dataset mentioned above. Analyzing and processing non-web data can yield insights and opportunities for various applications. Details about each of the sources are provided below. ",
     ),
@@ -685,7 +685,7 @@ filtering_process = Div(
             ),
             P(
                 B("Download and Extraction: "),
-                "All the data was downloaded in original latex format from Arxiv official S3 dump ",
+                "All the data was downloaded in original latex format from ArXiv official S3 repo: ",
                 A("s3://arxic/src", href="s3://arxic/src"),
                 ". We try to encode the downloaded data into utf-8 or guess encoding using chardet library. After that pandoc was used to extract information from the latex files and saved as markdown format",
                 D_code(
@@ -703,7 +703,7 @@ filtering_process = Div(
             ),
             P(
                 B(" Filters Applied: "),
-                "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset (citation needed)",
+                "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset",  D_cite(bibtex_key="peS2o"),
             ),
             Ul(
                 Li(
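
The "Download and Extraction" text touched by this commit describes decoding the downloaded LaTeX as UTF-8 and otherwise guessing the encoding with chardet. A minimal sketch of that fallback is shown here, assuming the input is the raw bytes of one file. The function name `decode_latex_bytes` and the choice to return `None` when decoding fails are illustrative assumptions, not taken from the TxT360 code.

```python
from typing import Optional

try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    # Assumption for this sketch: stay importable without chardet installed.
    chardet = None


def decode_latex_bytes(raw: bytes) -> Optional[str]:
    """Hypothetical helper: decode as UTF-8, else try a chardet-guessed encoding.

    Returns None when no usable encoding is found.
    """
    # First attempt: strict UTF-8, as the description states.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Fallback: ask chardet for its best guess.
    if chardet is None:
        return None
    guess = chardet.detect(raw)["encoding"]
    if guess is None:
        return None
    try:
        return raw.decode(guess)
    except (UnicodeDecodeError, LookupError):
        # The guess was wrong, or names a codec Python does not know.
        return None
```

Returning `None` rather than raising lets a caller skip undecodable files. Whether the actual pipeline skips, logs, or replaces bad bytes is not stated in the diff.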