victormiller committed · Commit 6a336ca · Parent: 66a1161
Update curated.py

curated.py CHANGED (+13 −13)
@@ -539,7 +539,7 @@ data_preprocessing_div = Div(
     P(
         "The ",
         B("Minimum Word Count Filter"),
-        " sets a threshold for required words within a document. This step filters out low-quality or incomplete documents. However, this step may remove documents that contain valuable information so a proper analysis is important for each
+        " sets a threshold for required words within a document. This step filters out low-quality or incomplete documents. However, this step may remove documents that contain valuable information, so a proper analysis is important for each data source.",
     ),
     P(
         "The ",
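Note: as a concrete illustration of the filter described in this hunk, a minimal word-count check might look like the sketch below. The function name and the example threshold are hypothetical, not taken from curated.py; as the text notes, the threshold would be tuned per data source.

def passes_min_word_count(text: str, threshold: int) -> bool:
    # Split on whitespace and require at least `threshold` words.
    return len(text.split()) >= threshold

docs = ["too short", "this document easily contains more than ten words in total, so it is kept"]
kept = [d for d in docs if passes_min_word_count(d, threshold=10)]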
@@ -570,7 +570,7 @@ data_preprocessing_div = Div(
     P(
         "The ",
         B("Paragraph Count Filter"),
-        " counts the number of paragraphs in each document. This step helps to analyze the structure and length of documents which can be a useful
+        " counts the number of paragraphs in each document. This step helps to analyze the structure and length of documents, which can be a useful heuristic for document complexity.",
     ),
     P(
         "The ",
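Note: a paragraph count heuristic like the one named here is commonly implemented by splitting on blank lines. The sketch below is an assumption about the approach, not the project's actual code.

import re

def paragraph_count(text: str) -> int:
    # Treat runs of blank lines as paragraph separators.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return len(paragraphs)

assert paragraph_count("First paragraph.\n\nSecond paragraph.") == 2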
@@ -659,7 +659,7 @@ filtering_process = Div(
     ),
     P(
         B("Filtering: "),
-        "Manual inspection of the dataset
+        "Manual inspection of the dataset demonstrated high-quality content. Only one filter was used, to remove articles with few words. Based on normal sentence constructs, an article was kept if it contained 10 or more words; any article with fewer than 10 words was removed.",
     ),
     table_div_wikipedia,
     Details(
@@ -694,10 +694,10 @@ filtering_process = Div(
     ),
     ". All markdowns were combined to create jsonl files.",
     ),
-    P(B("Unique Data
+    P(B("Unique Data Preparation Challenges: ")),
     Ul(
         Li(
-            "Due to large amounts of meaningful data being contained in table formats,
+            "Due to large amounts of meaningful data being contained in table formats, special consideration was taken to extract the data and proper metadata.",
             style="margin-bottom: -3px",
         ),
     ),
@@ -715,7 +715,7 @@ filtering_process = Div(
             style="margin-bottom: -3px",
         ),
         Li(
-            "Unigram Log
+            "Unigram Log Probability Filter Threshold: -20",
             style="margin-bottom: -3px",
         ),
         Li(
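Note: the -20 threshold restored in this hunk presumably applies to a per-document unigram log probability score. A common formulation (used in The Pile-style filters) averages per-token log probabilities from a unigram frequency table; the sketch below assumes that formulation, and the probability table shown is a toy placeholder.

import math

# Toy unigram probability table; a real filter would load corpus-wide counts.
UNIGRAM_PROB = {"the": 0.05, "of": 0.03, "data": 0.001}
FLOOR = 1e-11  # probability assigned to out-of-vocabulary tokens

def mean_unigram_logprob(text: str) -> float:
    words = text.lower().split()
    if not words:
        return float("-inf")
    return sum(math.log(UNIGRAM_PROB.get(w, FLOOR)) for w in words) / len(words)

def passes_unigram_filter(text: str, threshold: float = -20.0) -> bool:
    # Documents whose average token is rarer than exp(threshold) are dropped.
    return mean_unigram_logprob(text) >= threshold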
@@ -859,7 +859,7 @@ filtering_process = Div(
     D_code("pandoc -f jats {nxml} -o {pmcid}.md", language="bash"),
     ". The markdown files were combined to create jsonl files. PubMed Abstract (PMA) files were downloaded in xml. The BeautifulSoup library was used to extract the abstract, title, and PMID. All files were stored in jsonl format.",
     ),
-    P(B("Unique Data
+    P(B("Unique Data Preparation Challenges: ")),
     Ul(
         Li(
             "Due to large amounts of meaningful data being contained in table formats, special consideration was taken to extract the data and proper metadata.",
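Note: for the PMA step described in this hunk, the BeautifulSoup extraction might look roughly like the following. Tag names follow the public PubMed XML schema, file paths are placeholders, and the "xml" parser assumes lxml is installed; the actual script may differ.

import json
from bs4 import BeautifulSoup

# Parse a downloaded PubMed XML batch (path is a placeholder).
with open("pubmed_batch.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "xml")

with open("pubmed_abstracts.jsonl", "w", encoding="utf-8") as out:
    for article in soup.find_all("PubmedArticle"):
        pmid = article.find("PMID")
        title = article.find("ArticleTitle")
        abstract = article.find("AbstractText")
        record = {
            "pmid": pmid.get_text() if pmid else "",
            "title": title.get_text() if title else "",
            "abstract": abstract.get_text() if abstract else "",
        }
        out.write(json.dumps(record) + "\n")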
@@ -1112,7 +1112,7 @@ filtering_process = Div(
     P(
         "The HackerNews dataset contains a vast number of stories and is known for lively discussions. Due to the number of replies a story may contain, only the longest comment thread for each story was sampled past level 3. All stories included the title (1st level) and all direct replies (2nd level)."
     ),
-    P(B("Unique Data
+    P(B("Unique Data Preparation Challenges: ")),
     Ul(
         Li(
             "As discussed above, the comment hierarchies required a thoughtful approach to extracting meaningful data. ",
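Note: one reading of the sampling rule in the context line above (keep the title and all direct replies, then follow only the longest thread past level 3) could be implemented along these lines. The nested-dict comment representation is an assumption, not HackerNews' API shape.

# Hypothetical comment node: {"text": str, "children": [node, ...]}.
def thread_depth(node) -> int:
    return 1 + max((thread_depth(c) for c in node.get("children", [])), default=0)

def sample_story(node, level: int = 1) -> list:
    texts = [node["text"]]
    children = node.get("children", [])
    if not children:
        return texts
    if level < 3:
        # Levels 1-2: keep the title and every direct reply branch.
        for child in children:
            texts.extend(sample_story(child, level + 1))
    else:
        # Past level 3: follow only the longest comment thread.
        texts.extend(sample_story(max(children, key=thread_depth), level + 1))
    return texts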
@@ -1190,7 +1190,7 @@ filtering_process = Div(
     P(
         "All content was downloaded, leading to a high number of documents being filtered during local deduplication. Following The Pile, priority was given to plain_text first, followed by the columns in the table in reverse order."
     ),
-    P(B("Unique Data
+    P(B("Unique Data Preparation Challenges: ")),
     Ul(
         Li(
             "Consecutive whitespaces and tabs were found. These were reduced to one single whitespace.",
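Note: the whitespace fix in the list item above amounts to a one-line regex; this sketch shows one plausible implementation.

import re

# Collapse runs of spaces and tabs into a single space, per the item above.
def collapse_whitespace(text: str) -> str:
    return re.sub(r"[ \t]+", " ", text)

assert collapse_whitespace("a \t  b") == "a b"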
@@ -1261,7 +1261,7 @@ filtering_process = Div(
     block="block",
     language="python",
     ),
-    P(B("Unique Data
+    P(B("Unique Data Preparation Challenges: ")),
     Ul(
         Li(
             "Handling code blocks required finding the specific blocks and extracting the details into one snippet.",
@@ -1328,7 +1328,7 @@ filtering_process = Div(
     block="block",
     language="python",
     ),
-    P(B("Unique Data
+    P(B("Unique Data Preparation Challenges: ")),
     Ul(
         Li(
             "Similar to the HackerNews challenges, we had to map comments and sub-comments to the original question.",
@@ -1366,7 +1366,7 @@ filtering_process = Div(
     block="block",
     language="python",
     ),
-    P(B("Unique Data
+    P(B("Unique Data Preparation Challenges: ")),
     Ul(
         Li(
             "A byte string was included at the beginning of new lines.",
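Note: if the byte-string artifact mentioned above is a literal b'...' marker left by stringifying bytes (an assumption; the hunk does not say), stripping it at line starts could look like this sketch, which removes only the leading marker.

import re

# Remove a leading b' or b" marker at the start of each line (assumed artifact).
def strip_byte_markers(text: str) -> str:
    return re.sub(r"(?m)^b['\"]", "", text)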
@@ -1409,7 +1409,7 @@ filtering_process = Div(
     ),
     ".",
     ),
-    P(B("Unique Data
+    P(B("Unique Data Preparation Challenges: ")),
     Ul(
         Li(
             "Consecutive whitespaces were found spanning 10+ whitespace entries. These whitespaces were reduced to one single whitespace.",