Spaces: KalbeDigitalLab/nutrigenme-paper-extractor

fadliaulawi committed 27 days ago

Commit cda22ff • 1 Parent(s): 7af6232

Tidy up prompts

Files changed (3)

process.py +1 -0
prompt.py +2 -237
prompt_old.py +235 -0

process.py CHANGED

@@ -9,6 +9,7 @@ from langchain_google_genai import ChatGoogleGenerativeAI
 from langchain_openai import ChatOpenAI
 from pdf2image import convert_from_path
 from prompt import *
+from prompt_old import *
 from table_detector import detection_transform, device, model, ocr, outputs_to_objects
 import io
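
Reviewer note: the order of the two wildcard imports matters. Because `from prompt_old import *` runs after `from prompt import *`, any public name defined in both modules would resolve to the `prompt_old` version. The hunks in this commit suggest the two files define disjoint names, so nothing should collide today. The sketch below is one hedged way to check for overlap. The module names `demo_prompt` and `demo_prompt_old` and the helper `shared_public_names` are hypothetical stand-ins built in memory, not files or functions from this repository.

```python
import sys
import types


def make_module(name, **attrs):
    """Register a throwaway in-memory module (a stand-in for a real .py file)."""
    module = types.ModuleType(name)
    module.__dict__.update(attrs)
    sys.modules[name] = module
    return module


def shared_public_names(first, second):
    """Return the public names that both modules define, sorted."""
    def public(module):
        return {name for name in vars(module) if not name.startswith("_")}
    return sorted(public(first) & public(second))


# Hypothetical contents: both modules define prompt_table on purpose.
new = make_module("demo_prompt", prompt_table="current table prompt")
old = make_module(
    "demo_prompt_old",
    prompt_table="stale table prompt",
    prompt_entities_chunk="old chunk prompt",
)

# Replay the import order from process.py in a scratch namespace.
namespace = {}
exec("from demo_prompt import *\nfrom demo_prompt_old import *", namespace)

overlap = shared_public_names(new, old)
print(overlap)                    # names defined in both modules
print(namespace["prompt_table"])  # the later wildcard import wins
```

Importing the needed prompt names explicitly instead of using `*` would make any such collision visible at review time.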

prompt.py CHANGED

@@ -1,239 +1,3 @@
-prompt_entity_gsd_chunk = """
-# CONTEXT #
-In my capacity as a genomics specialist, I have recently completed the review of a scholarly publication. I am interested in extracting specific genomic information, or entities, from the body of the paper.
-To facilitate this process, I have constructed a predefined schema that outlines the desired entities and their corresponding description.
-This is the schema provided:
-{{
-    "Genes" : {{
-        "type" : "list of strings",
-        "description" : "All relevant genes mentioned in the text. Gene names can only contain uppercase letters and digits."
-    }},
-    "SNPs" : {{
-        "type" : "list of strings",
-        "description" : "Unique identifier associated with each value in Genes schema. These identifiers typically begin with 'rs' and appear near the gene name in the text."
-    }},
-    "Diseases" : {{
-        "type" : "list of strings",
-        "description" : "Type of diseases that related to each value in Genes, typically appear near the gene name in the text."
-    }}
-}}
-Note that the values within each list of Genes, SNPs, and Diseases columns correspond directly to each other. Consequently, the lengths of these lists must be identical.
-This is a passage from the paper: {docs}
-# OBJECTIVE #
-Given a predefined schema outlining relevant entities, meticulously extract these entities from the provided text passage. Exercise caution when handling the reference section of the document, abstain from extracting information from this section.
-IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility. If an entity is entirely absent from the passage, just leave the corresponding field blank with an empty string ('').
-# RESPONSE #
-The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' '). List data will be enclosed within square brackets ([]).
-This is the example of the respose:
-{{
-    "Genes": ["A", "B", "C"],
-    "SNPs": ["rs1", "rs2", "rs3"],
-    "Diseases": ["X", "Y", "Z"]
-}}
-If there is no specific extracted entities, just leave the corresponding field blank with an empty lists ([]).
-"""
-prompt_entity_gsd_combine = """
-# CONTEXT #
-In my role as a genomics specialist, I have extracted specific entities from a scholarly publication. These entities were identified and retrieved from various sections throughout the document. My current objective is to consolidate this extracted information into a concise summary, facilitating a comprehensive understanding of the publication's key findings.
-To achieve this, I have constructed a predefined schema that outlines the desired entities and the methods for their summarization.
-{{
-    "Genes" : {{
-        "type" : "list of strings",
-        "description" : "Identify the most relevant gene from the compiled gene list across all sections."
-    }},
-    "SNPs" : {{
-        "type" : "list of strings",
-        "description" : "Upon completion of the gene combination process, associate each resulting value with its unique identifier."
-    }},
-    "Diseases" : {{
-        "type" : "list of strings",
-        "description" : "Upon completion of the gene combination process, associate each resulting value with its diseases."
-    }}
-}}
-This is a set of summaries: {doc_summaries}
-If there is no extracted entities, just leave the corresponding result with an empty list ([]) later.
-# OBJECTIVE #
-In the context of a predefined schema that specifies entities and their corresponding operations, construct a comprehensive synopsis that incorporates all critical details gleaned from each section.
-IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility.
-# RESPONSE #
-The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' '). List data will be enclosed within square brackets ([]).
-This is the example of the respose:
-{{
-    "Genes": ["A", "B", "C"],
-    "SNPs": ["rs1", "rs2", "rs3"],
-    "Diseases": ["X", "Y", "Z"]
-}}
-"""
-prompt_entity_summ_chunk = """
-# CONTEXT #
-In my capacity as a genomics specialist, I have recently completed the review of a scholarly publication. I am interested in extracting the summary from the body of the paper.
-This is a passage from the paper: {docs}
-# OBJECTIVE #
-Extract the summary or the conclusion from the provided text passage. Exercise caution when handling the reference section of the document, abstain from extracting information from this section.
-IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility.
-# RESPONSE #
-Provide the information in a concise way, using four concise paragraphs. The text should be presented in a continuous format, omitting introductory elements like numbers or titles within each paragraph.
-1. Overview. Explanation of the provided documents and their exploration of the genetic underpinnings of the disease, and understanding of genetic factors in disease pathology.
-2. Main Themes. Identification of genetic variants and mutations contributing to disease susceptibility. Role of specific genes and genetic pathways in disease development and progression.
-3. Key Genetic Factors and Their Implications. Highlighting specific genes or genetic variants associated with the disease. Discussion of how these genetic factors may influence disease susceptibility, severity, or treatment response.
-4. Conclusion. Recap of the key findings regarding genetic factors and disease mechanisms. Suggestions for future research directions or clinical applications based on the insights gained from genetic analysis.
-"""
-prompt_entity_summ_combine = """
-# CONTEXT #
-In my role as a genomics specialist, I have extracted some summaries from a scholarly publication. These summaries were identified and retrieved from various sections throughout the document.
-My current objective is to consolidate this extracted information into a concise summary, facilitating a comprehensive understanding of the publication's key findings.
-This is a set of summaries: {doc_summaries}
-# OBJECTIVE #
-Construct a comprehensive synopsis that incorporates all critical details gleaned from the summaries of each section.
-IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility.
-# RESPONSE #
-Provide the information in a concise way, using four concise paragraphs. The text should be presented in a continuous format, omitting introductory elements like numbers or titles within each paragraph..
-1. Overview. Explanation of the provided documents and their exploration of the genetic underpinnings of the disease, and understanding of genetic factors in disease pathology.
-2. Main Themes. Identification of genetic variants and mutations contributing to disease susceptibility. Role of specific genes and genetic pathways in disease development and progression.
-3. Key Genetic Factors and Their Implications. Highlighting specific genes or genetic variants associated with the disease. Discussion of how these genetic factors may influence disease susceptibility, severity, or treatment response.
-4. Conclusion. Recap of the key findings regarding genetic factors and disease mechanisms. Suggestions for future research directions or clinical applications based on the insights gained from genetic analysis.
-"""
-prompt_entities_chunk = """
-# CONTEXT #
-In my capacity as a genomics specialist, I have recently completed the review of a scholarly publication. I am interested in extracting specific genomic information, or entities, from the body of the paper.
-To facilitate this process, I have constructed a predefined schema that outlines the desired entities and their corresponding description.
-This is the schema provided:
-{{
-    "Population" : {{
-        "type" : "string",
-        "description" : "Population / race used by the author in the given text."
-    }},
-    "Sample Size" : {{
-        "type" : "string",
-        "description" : "Sample size of the population used in the research that mentioned in the paper."
-    }},
-    "Study Methodology" : {{
-        "type" : "string",
-        "description" : "Study methodology mentioned in the text."
-    }},
-    "Study Level" : {{
-        "type" : "string",
-        "description" : "Study level mentioned in the text."
-    }}
-}}
-This is a passage from the paper: {docs}
-# OBJECTIVE #
-Given a predefined schema outlining relevant entities, meticulously extract these entities from the provided text passage. Exercise caution when handling the reference section of the document, abstain from extracting information from this section.
-IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility. If an entity is entirely absent from the passage, just leave the corresponding field blank with an empty string ('').
-# RESPONSE #
-The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' ').
-This is the example of the respose:
-{{
-    "Population": "South Asian",
-    "Sample Size": "403 Relatively Small",
-    "Study Methodology": "Double-Blind Randomized Controlled Trial",
-    "Study Level": "Postdoctoral"
-}}
-If there is no specific extracted entities, just leave the corresponding field blank with an empty string ("").
-"""
-prompt_entities_combine = """
-# CONTEXT #
-In my role as a genomics specialist, I have extracted some summaries from a scholarly publication. These summaries were identified and retrieved from various sections throughout the document. My current objective is to consolidate this extracted information into a concise summary, facilitating a comprehensive understanding of the publication's key findings.
-This is a set of summaries: {doc_summaries}
-If there is no extracted entities, just leave the corresponding result with an empty string ("") later.
-# OBJECTIVE #
-Construct a comprehensive synopsis that incorporates all critical details gleaned from the summaries of each section.
-IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility.
-# RESPONSE #
-The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' ').
-This is the example of the respose:
-{{
-    "Population": "South Asian",
-    "Sample Size": "403 Relatively Small",
-    "Study Methodology": "Double-Blind Randomized Controlled Trial",
-    "Study Level": "Postdoctoral"
-}}
-"""
-prompt_entity_one_chunk = """
-# CONTEXT #
-In my capacity as a genomics specialist, I have recently completed the review of a scholarly publication. I am interested in extracting specific genomic information, or entities, from the body of the paper. To facilitate this process, I have constructed a predefined schema that outlines the desired entities and their corresponding description.
-This is the schema provided:
-{{
-    "Title" : {{
-        "type" : "string",
-        "description" : "Title of the given text."
-    }},
-    "Authors" : {{
-        "type" : "string",
-        "description" : "Authors / writers of the given text. To maintain readability, consider only the first 10 author names, ensuring the other key informations and response format are clear."
-    }},
-    "Publisher Name" : {{
-        "type" : "string",
-        "description" : "Publisher name of the given text."
-    }},
-    "Publication Year" : {{
-        "type" : "string",
-        "description" : "The year when the given text published."
-    }},
-}}
-This is a passage from the paper: {}
-# OBJECTIVE #
-Given a predefined schema outlining relevant entities, meticulously extract these entities from the provided text passage.
-IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility. If an entity is entirely absent from the passage, just leave the corresponding field blank with an empty string ('').
-# RESPONSE #
-The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' ').
-This is the example of the respose:
-{{
-    "Title": "Lorem Ipsum",
-    "Authors": "John Doe, Jane Doe, Alias",
-    "Publisher Name": "Journal of Internal Medicine",
-    "Publication Year": "2024"
-}}
-If there is no specific extracted entities, just leave the corresponding field blank with an empty string ("").
-"""
 prompt_table = """
 # CONTEXT #
 In my capacity as a genomics specialist, I have table data obtained from a published research paper in the field of genomics. The data is provided in a list of JSONs format, with each JSON object representing a single row in a tabular structure. The first JSON element in the list represents the header row of the table, containing the names of each column.
@@ -297,7 +61,8 @@ The output must be only a string containing a list of JSON objects, adhering to
         "OR Value": 1.25,
         "Beta Value": 0.02,
         "P Value": 0.51,
-        "Traits": "A disease"
+        "Traits": "A disease",
+        "Source": "Trust me bro"
     }}
 ]
 """

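Reviewer note: the second hunk adds a `"Source"` key to the example row, plus the comma after `"A disease"` that keeps the example valid JSON. A reply that follows the example could be checked roughly as sketched below. Only the five keys visible in the hunk are used. The rest of the row schema is elided in the diff, so `VISIBLE_KEYS` is an assumption and probably incomplete, and `parse_rows` is a hypothetical helper, not code from this repository.

```python
import json

# Keys visible in the hunk; the full row schema is not shown in the diff,
# so treat this set as an assumption rather than the complete schema.
VISIBLE_KEYS = {"OR Value", "Beta Value", "P Value", "Traits", "Source"}


def parse_rows(raw):
    """Parse a model reply and check that every row carries the visible keys."""
    rows = json.loads(raw)
    if not isinstance(rows, list):
        raise ValueError("expected a JSON list of row objects")
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            raise ValueError(f"row {index} is not a JSON object")
        missing = VISIBLE_KEYS - set(row)
        if missing:
            raise ValueError(f"row {index} is missing keys: {sorted(missing)}")
    return rows


# Placeholder reply modelled on the example in the prompt; "Table 1" is made up.
reply = """
[
    {
        "OR Value": 1.25,
        "Beta Value": 0.02,
        "P Value": 0.51,
        "Traits": "A disease",
        "Source": "Table 1"
    }
]
"""
rows = parse_rows(reply)
print(len(rows), rows[0]["Source"])
```

In the template itself the row is wrapped in `{{` and `}}` because the string is passed through `str.format`; the rendered prompt contains single braces.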
prompt_old.py ADDED

@@ -0,0 +1,235 @@
+prompt_entity_gsd_chunk = """
+# CONTEXT #
+In my capacity as a genomics specialist, I have recently completed the review of a scholarly publication. I am interested in extracting specific genomic information, or entities, from the body of the paper.
+To facilitate this process, I have constructed a predefined schema that outlines the desired entities and their corresponding description.
+This is the schema provided:
+{{
+    "Genes" : {{
+        "type" : "list of strings",
+        "description" : "All relevant genes mentioned in the text. Gene names can only contain uppercase letters and digits."
+    }},
+    "SNPs" : {{
+        "type" : "list of strings",
+        "description" : "Unique identifier associated with each value in Genes schema. These identifiers typically begin with 'rs' and appear near the gene name in the text."
+    }},
+    "Diseases" : {{
+        "type" : "list of strings",
+        "description" : "Type of diseases that related to each value in Genes, typically appear near the gene name in the text."
+    }}
+}}
+Note that the values within each list of Genes, SNPs, and Diseases columns correspond directly to each other. Consequently, the lengths of these lists must be identical.
+This is a passage from the paper: {docs}
+# OBJECTIVE #
+Given a predefined schema outlining relevant entities, meticulously extract these entities from the provided text passage. Exercise caution when handling the reference section of the document, abstain from extracting information from this section.
+IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility. If an entity is entirely absent from the passage, just leave the corresponding field blank with an empty string ('').
+# RESPONSE #
+The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' '). List data will be enclosed within square brackets ([]).
+This is an example of the response:
+{{
+    "Genes": ["A", "B", "C"],
+    "SNPs": ["rs1", "rs2", "rs3"],
+    "Diseases": ["X", "Y", "Z"]
+}}
+If no entities are extracted for a field, leave that field as an empty list ([]).
+"""
+prompt_entity_gsd_combine = """
+# CONTEXT #
+In my role as a genomics specialist, I have extracted specific entities from a scholarly publication. These entities were identified and retrieved from various sections throughout the document. My current objective is to consolidate this extracted information into a concise summary, facilitating a comprehensive understanding of the publication's key findings.
+To achieve this, I have constructed a predefined schema that outlines the desired entities and the methods for their summarization.
+{{
+    "Genes" : {{
+        "type" : "list of strings",
+        "description" : "Identify the most relevant gene from the compiled gene list across all sections."
+    }},
+    "SNPs" : {{
+        "type" : "list of strings",
+        "description" : "Upon completion of the gene combination process, associate each resulting value with its unique identifier."
+    }},
+    "Diseases" : {{
+        "type" : "list of strings",
+        "description" : "Upon completion of the gene combination process, associate each resulting value with its diseases."
+    }}
+}}
+This is a set of summaries: {doc_summaries}
+If no entities were extracted, return an empty list ([]) for the corresponding field.
+# OBJECTIVE #
+In the context of a predefined schema that specifies entities and their corresponding operations, construct a comprehensive synopsis that incorporates all critical details gleaned from each section.
+IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility.
+# RESPONSE #
+The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' '). List data will be enclosed within square brackets ([]).
+This is an example of the response:
+{{
+    "Genes": ["A", "B", "C"],
+    "SNPs": ["rs1", "rs2", "rs3"],
+    "Diseases": ["X", "Y", "Z"]
+}}
+"""
+prompt_entity_summ_chunk = """
+# CONTEXT #
+In my capacity as a genomics specialist, I have recently completed the review of a scholarly publication. I am interested in extracting the summary from the body of the paper.
+This is a passage from the paper: {docs}
+# OBJECTIVE #
+Extract the summary or the conclusion from the provided text passage. Exercise caution when handling the reference section of the document, abstain from extracting information from this section.
+IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility.
+# RESPONSE #
+Provide the information in a concise way, using four concise paragraphs. The text should be presented in a continuous format, omitting introductory elements like numbers or titles within each paragraph.
+1. Overview. Explanation of the provided documents and their exploration of the genetic underpinnings of the disease, and understanding of genetic factors in disease pathology.
+2. Main Themes. Identification of genetic variants and mutations contributing to disease susceptibility. Role of specific genes and genetic pathways in disease development and progression.
+3. Key Genetic Factors and Their Implications. Highlighting specific genes or genetic variants associated with the disease. Discussion of how these genetic factors may influence disease susceptibility, severity, or treatment response.
+4. Conclusion. Recap of the key findings regarding genetic factors and disease mechanisms. Suggestions for future research directions or clinical applications based on the insights gained from genetic analysis.
+"""
+prompt_entity_summ_combine = """
+# CONTEXT #
+In my role as a genomics specialist, I have extracted some summaries from a scholarly publication. These summaries were identified and retrieved from various sections throughout the document.
+My current objective is to consolidate this extracted information into a concise summary, facilitating a comprehensive understanding of the publication's key findings.
+This is a set of summaries: {doc_summaries}
+# OBJECTIVE #
+Construct a comprehensive synopsis that incorporates all critical details gleaned from the summaries of each section.
+IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility.
+# RESPONSE #
+Provide the information in a concise way, using four concise paragraphs. The text should be presented in a continuous format, omitting introductory elements like numbers or titles within each paragraph.
+1. Overview. Explanation of the provided documents and their exploration of the genetic underpinnings of the disease, and understanding of genetic factors in disease pathology.
+2. Main Themes. Identification of genetic variants and mutations contributing to disease susceptibility. Role of specific genes and genetic pathways in disease development and progression.
+3. Key Genetic Factors and Their Implications. Highlighting specific genes or genetic variants associated with the disease. Discussion of how these genetic factors may influence disease susceptibility, severity, or treatment response.
+4. Conclusion. Recap of the key findings regarding genetic factors and disease mechanisms. Suggestions for future research directions or clinical applications based on the insights gained from genetic analysis.
+"""
+prompt_entities_chunk = """
+# CONTEXT #
+In my capacity as a genomics specialist, I have recently completed the review of a scholarly publication. I am interested in extracting specific genomic information, or entities, from the body of the paper.
+To facilitate this process, I have constructed a predefined schema that outlines the desired entities and their corresponding description.
+This is the schema provided:
+{{
+    "Population" : {{
+        "type" : "string",
+        "description" : "Population / race used by the author in the given text."
+    }},
+    "Sample Size" : {{
+        "type" : "string",
+        "description" : "Sample size of the population used in the research that mentioned in the paper."
+    }},
+    "Study Methodology" : {{
+        "type" : "string",
+        "description" : "Study methodology mentioned in the text."
+    }},
+    "Study Level" : {{
+        "type" : "string",
+        "description" : "Study level mentioned in the text."
+    }}
+}}
+This is a passage from the paper: {docs}
+# OBJECTIVE #
+Given a predefined schema outlining relevant entities, meticulously extract these entities from the provided text passage. Exercise caution when handling the reference section of the document, abstain from extracting information from this section.
+IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility. If an entity is entirely absent from the passage, just leave the corresponding field blank with an empty string ('').
+# RESPONSE #
+The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' ').
+This is an example of the response:
+{{
+    "Population": "South Asian",
+    "Sample Size": "403 Relatively Small",
+    "Study Methodology": "Double-Blind Randomized Controlled Trial",
+    "Study Level": "Postdoctoral"
+}}
+If no entity is extracted for a field, leave that field as an empty string ("").
+"""
+prompt_entities_combine = """
+# CONTEXT #
+In my role as a genomics specialist, I have extracted some summaries from a scholarly publication. These summaries were identified and retrieved from various sections throughout the document. My current objective is to consolidate this extracted information into a concise summary, facilitating a comprehensive understanding of the publication's key findings.
+This is a set of summaries: {doc_summaries}
+If no entities were extracted, return an empty string ("") for the corresponding field.
+# OBJECTIVE #
+Construct a comprehensive synopsis that incorporates all critical details gleaned from the summaries of each section.
+IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility.
+# RESPONSE #
+The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' ').
+This is an example of the response:
+{{
+    "Population": "South Asian",
+    "Sample Size": "403 Relatively Small",
+    "Study Methodology": "Double-Blind Randomized Controlled Trial",
+    "Study Level": "Postdoctoral"
+}}
+"""
+prompt_entity_one_chunk = """
+# CONTEXT #
+In my capacity as a genomics specialist, I have recently completed the review of a scholarly publication. I am interested in extracting specific genomic information, or entities, from the body of the paper. To facilitate this process, I have constructed a predefined schema that outlines the desired entities and their corresponding description.
+This is the schema provided:
+{{
+    "Title" : {{
+        "type" : "string",
+        "description" : "Title of the given text."
+    }},
+    "Authors" : {{
+        "type" : "string",
+        "description" : "Authors / writers of the given text. To maintain readability, consider only the first 10 author names, ensuring the other key information and the response format remain clear."
+    }},
+    "Publisher Name" : {{
+        "type" : "string",
+        "description" : "Publisher name of the given text."
+    }},
+    "Publication Year" : {{
+        "type" : "string",
+        "description" : "The year when the given text was published."
+    }}
+}}
+This is a passage from the paper: {}
+# OBJECTIVE #
+Given a predefined schema outlining relevant entities, meticulously extract these entities from the provided text passage.
+IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility. If an entity is entirely absent from the passage, just leave the corresponding field blank with an empty string ('').
+# RESPONSE #
+The extracted information will be utilizing the JSON format. Information within strings will be demarcated by double quotes (" "), while quotes contained within strings will be denoted by single quotes (' ').
+This is an example of the response:
+{{
+    "Title": "Lorem Ipsum",
+    "Authors": "John Doe, Jane Doe, Alias",
+    "Publisher Name": "Journal of Internal Medicine",
+    "Publication Year": "2024"
+}}
+If no entity is extracted for a field, leave that field as an empty string ("").
+"""
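
Reviewer note: the prompts moved into prompt_old.py are `str.format` templates. Doubled braces (`{{`, `}}`) render as literal JSON braces, `{docs}` and `{doc_summaries}` are named slots, and `prompt_entity_one_chunk` uses a positional `{}` slot, so it has to be filled with a positional argument. The sketch below uses a shortened stand-in template, not the real prompt text, to show the rendering and one hedged way to enforce the equal-length rule that `prompt_entity_gsd_chunk` states for Genes, SNPs, and Diseases. The names `TEMPLATE`, `render`, and `check_gsd` are hypothetical.

```python
import json

# Shortened stand-in for prompt_entity_gsd_chunk; the real template is much longer.
TEMPLATE = """
This is the schema provided:
{{
    "Genes" : {{"type" : "list of strings"}}
}}
This is a passage from the paper: {docs}
"""


def render(template, passage):
    """Fill the named {docs} slot; doubled braces collapse to single braces."""
    return template.format(docs=passage)


def check_gsd(raw):
    """Parse a model reply and enforce the equal-length rule from the prompt."""
    data = json.loads(raw)
    lengths = {key: len(data[key]) for key in ("Genes", "SNPs", "Diseases")}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"list lengths differ: {lengths}")
    return data


# Placeholder passage and reply, reusing the dummy values from the prompt example.
rendered = render(TEMPLATE, "Gene A (rs1) was linked to disease X.")
reply = '{"Genes": ["A"], "SNPs": ["rs1"], "Diseases": ["X"]}'
result = check_gsd(reply)

print(rendered)
print(result)
```

Braces inside the passage itself are safe, because `str.format` does not re-scan substituted values.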