Commit 34cb4c4 by BlankCheng
Parent(s): 0017b35
Ensure format

curated.py CHANGED (+12 -9)
@@ -720,16 +720,18 @@ filtering_process = Div(
 ". Finally, all markdowns were combined to create jsonl files.",
 ),
 P(B("Unique Data Preparation Challenges: ")),
-P(
 Ul(
 Li(
 B("Tables: "),
-"The process for handling tables follows three main approaches. First, tables compatible with Pandoc’s built-in formats are directly converted into
 style="margin-bottom: -3px",
 ),
 Li(
 B("Mathematical Expressions: "),
-"Inline mathematical expressions are rendered in Markdown
 style="margin-bottom: -3px",
 ),
 Li(
@@ -739,17 +741,15 @@ filtering_process = Div(
 ),
 Li(
 B("Section Headers: "),
-"Section headers are converted into markdown format, using leading
 style="margin-bottom: -3px",
 ),
 Li(
 B("References: "),
 "References are removed. Although they may be informative, references often introduce formatting inconsistencies or add little value compared to the core content of the paper.",
 style="margin-bottom: -3px",
-)
-)
-
-
 P(
 B(" Filters Applied: "),
 "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset",
@@ -906,7 +906,10 @@ filtering_process = Div(
 href="ttps://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/",
 ),
 ". PubMed Central (PMC) files are downloaded in an xml.tar format. The tar files are opened and converted to markdown format using pandoc",
-D_code(
 ". The markdown files are combined to create jsonl files. PubMed Abstract (PMA) files were downloaded in xml. The BeautifulSoup library was used to extract the abstract, title, and PMID. All files were stored in jsonl format.",
 ),
 P(B("Unique Data Preparation Challenges: ")),

@@ -720,16 +720,18 @@ filtering_process = Div(
 ". Finally, all markdowns were combined to create jsonl files.",
 ),
 P(B("Unique Data Preparation Challenges: ")),
+P(
+"When converting LaTeX files into Markdown using Pandoc, it is crucial to account for different data formats to minimize information loss while also filtering out noisy content in LaTeX. Below, we outline our considerations and methods for handling various data types during this conversion process:"
+),
 Ul(
 Li(
 B("Tables: "),
+"The process for handling tables follows three main approaches. First, tables compatible with Pandoc’s built-in formats are directly converted into standard Markdown tables. Notably, LaTeX’s '\\multicolumn' and '\\multirow' commands can be successfully translated into valid Markdown tables. Second, tables unsupported by Pandoc’s native functionality, such as deluxetable or other complex LaTeX types, are preserved in their original LaTeX format to maintain the integrity of complex structures. Third, only a few remaining tables have been converted to HTML web tables.",
 style="margin-bottom: -3px",
 ),
 Li(
 B("Mathematical Expressions: "),
+"Inline mathematical expressions are rendered in Markdown. More complex equations remain unchanged, e.g., presented as '\\begin{aligned}' blocks, to ensure accuracy and readability.",
 style="margin-bottom: -3px",
 ),
 Li(
@@ -739,17 +741,15 @@ filtering_process = Div(
 ),
 Li(
 B("Section Headers: "),
+"Section headers are converted into markdown format, using leading '#' symbols to represent the heading levels.",
 style="margin-bottom: -3px",
 ),
 Li(
 B("References: "),
 "References are removed. Although they may be informative, references often introduce formatting inconsistencies or add little value compared to the core content of the paper.",
 style="margin-bottom: -3px",
+),
+),
 P(
 B(" Filters Applied: "),
 "multiple filters are used here after manually verifying output of all the filters as suggested by peS2o dataset",
@@ -906,7 +906,10 @@ filtering_process = Div(
 href="ttps://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_package/",
 ),
 ". PubMed Central (PMC) files are downloaded in an xml.tar format. The tar files are opened and converted to markdown format using pandoc",
+D_code(
+"pandoc <raw_xml_path> -s -o <output_markdown_path> -f jats -t markdown_mmd [--lua-filter <lua_filter_path>]",
+language="bash",
+),
 ". The markdown files are combined to create jsonl files. PubMed Abstract (PMA) files were downloaded in xml. The BeautifulSoup library was used to extract the abstract, title, and PMID. All files were stored in jsonl format.",
 ),
 P(B("Unique Data Preparation Challenges: ")),
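
The PMC text in this diff describes two steps: converting each JATS XML file to Markdown with the quoted pandoc command, then combining the Markdown files into jsonl. The following is a minimal sketch of those two steps, not the project's actual pipeline. The function names, the `*.md` glob, and the `id`/`text` record fields are assumptions made for illustration; only the pandoc flags are taken from the `D_code` line in the diff.

```python
import json
from pathlib import Path


def build_pandoc_command(raw_xml_path, output_markdown_path, lua_filter_path=None):
    """Return the argument list for the pandoc call quoted in the diff."""
    cmd = [
        "pandoc", str(raw_xml_path),
        "-s",
        "-o", str(output_markdown_path),
        "-f", "jats",           # input format: JATS XML (PMC articles)
        "-t", "markdown_mmd",   # output format: MultiMarkdown
    ]
    if lua_filter_path is not None:
        # The diff shows the Lua filter in brackets, i.e. as optional.
        cmd += ["--lua-filter", str(lua_filter_path)]
    return cmd


def combine_markdown_to_jsonl(markdown_dir, jsonl_path):
    """Write one JSON object per Markdown file and return the record count.

    The record layout ({"id": ..., "text": ...}) is hypothetical.
    """
    count = 0
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for md_file in sorted(Path(markdown_dir).glob("*.md")):
            record = {
                "id": md_file.stem,
                "text": md_file.read_text(encoding="utf-8"),
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

To run the conversion, the list returned by `build_pandoc_command` can be passed to `subprocess.run(cmd, check=True)`, provided pandoc is installed. Building the command as a list rather than a shell string avoids quoting problems with unusual file names.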