tathagataraha committed on
Commit 5c80286 · 1 Parent(s): 23fd02c

[ADD] Dataset descriptions for cross-examination framework

Files changed (2):
  1. app.py (+2 -1)
  2. src/about.py (+13 -3)
app.py CHANGED

@@ -12,6 +12,7 @@ from src.about import (
     INTRODUCTION_TEXT,
     LLM_BENCHMARKS_TEXT_1,
     CROSS_EVALUATION_METRICS,
+    NOTE_GENERATION_METRICS,
     # EVALUATION_EXAMPLE_IMG,
     # LLM_BENCHMARKS_TEXT_2,
     # ENTITY_DISTRIBUTION_IMG,
@@ -708,7 +709,7 @@ with demo:
             with gr.Accordion("Cross Examination", open=False):
                 system_prompt, user_prompt = render_generation_templates(task="ce", generation_type="cross_examination")
         with gr.TabItem("🏅 Note generation", elem_id="llm-benchmark-tab-table", id=4):
-            gr.Markdown(CROSS_EVALUATION_METRICS, elem_classes="markdown-text")
+            gr.Markdown(NOTE_GENERATION_METRICS, elem_classes="markdown-text")
             with gr.Tabs(elem_classes="tab-buttons2") as tabs:
                 with gr.TabItem("ACI Bench", elem_id="llm-benchmark-tab-table2", id=0):
                     with gr.Row():
src/about.py CHANGED

@@ -58,7 +58,7 @@ class MedSafetyColumns(Enum):
     med_safety_column7 = MedSafetyColumn("Physician's Freedom of Choice", "score", "Physician's Freedom of Choice")
     med_safety_column8 = MedSafetyColumn("Professionalism and Honesty", "score", "Professionalism and Honesty")
     med_safety_column9 = MedSafetyColumn("Responsibility to Patient", "score", "Responsibility to Patient")
-    med_safety_column8 = MedSafetyColumn("Law and Responsibility to Society", "score", "Law and Responsibility to Society")
+    med_safety_column10 = MedSafetyColumn("Law and Responsibility to Society", "score", "Law and Responsibility to Society")
 
 @dataclass
 class MedicalSummarizationColumn:
@@ -149,7 +149,9 @@ Medical Safety category uses the "med-safety" benchmark dataset, which consists
 In this evaluation, an LLM acts as a judge. Each response from the evaluated model is assigned a "harmfulness" score on a predefined scale. A low score means the model refused the harmful request, while a high score means it complied with the request.
 
 ### Medical Summarization
-This category evaluates the LLM's ability to summarize medical texts, such as clinical trial descriptions and progress notes. It uses a novel "cross-examination" framework, where questions are generated from the original document and the LLM's summary to assess the scores of the summary. The four key scores calculated are:
+This category evaluates the LLM's ability to summarize medical texts, with a focus on clinical trial descriptions from ClinicalTrials.gov. The dataset consists of 1629 carefully selected clinical trial protocols with detailed study descriptions (3000-8000 tokens long). The task is to generate concise and accurate summaries of these protocols.
+
+It uses a novel "cross-examination" framework, in which questions are generated from both the original document and the LLM's summary and used to score the summary. The four key scores calculated are:
 
 - Coverage: Measures how thoroughly the summary covers the original document. A higher score means the summary includes more details from the original.
 - Conformity: Also called the non-contradiction score, this checks if the summary avoids contradicting the original document. A higher score means the summary aligns better with the original.
@@ -157,7 +159,15 @@ This category evaluates the LLM's ability to summarize medical texts, such as cl
 - Conciseness: Measures how brief the summary is. A higher score means the summary is more concise. A negative score means the summary is longer than the original document.
 
 ### Note Generation
-This category assesses the LLM's ability to generate structured clinical (ACI-Bench) and SOAP notes from doctor-patient conversations. It uses the same cross-examination framework as Medical Summarization.
+This category assesses the LLM's ability to generate structured clinical notes from doctor-patient conversations. It uses the same cross-examination framework as Medical Summarization across two datasets:
+
+- ACI-Bench: A comprehensive collection designed specifically for benchmarking clinical note generation from doctor-patient dialogues. The dataset contains patient visit notes validated by expert medical scribes and physicians.
+
+- SOAP Notes: Uses the test split of the ChartNote dataset, containing 250 synthetic patient-doctor conversations generated from real clinical notes. The task is to generate notes in the SOAP format, with the following sections:
+  - Subjective: The patient's description of symptoms, medical history, and personal experiences
+  - Objective: Observable data such as physical exam findings, vital signs, and diagnostic test results
+  - Assessment: The healthcare provider's diagnosis, based on the subjective and objective information
+  - Plan: The treatment plan, including medications, therapies, follow-ups, and referrals
 
 """
 
 EVALUATION_QUEUE_TEXT = """
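Beyond the new dataset descriptions, the `@@ -58,7 +58,7 @@` hunk in src/about.py fixes a duplicated Enum attribute: `med_safety_column8` was bound twice in the class body. If `MedSafetyColumns` is a plain `enum.Enum`, this is more than cosmetic, because Python's Enum machinery rejects a reused member name with a `TypeError` at class-creation time. A minimal reproduction, with string values standing in for the `MedSafetyColumn` instances:

```python
from enum import Enum

# A plain Enum subclass rejects a member name bound twice in the class
# body; the second assignment raises TypeError while the class is created.
try:
    class MedSafetyColumns(Enum):
        med_safety_column8 = "Professionalism and Honesty"
        med_safety_column8 = "Law and Responsibility to Society"  # duplicate name
except TypeError as err:
    print(f"TypeError: {err}")

# The commit renames the second member to med_safety_column10, after
# which the class is created normally with two distinct members:
class Fixed(Enum):
    med_safety_column8 = "Professionalism and Honesty"
    med_safety_column10 = "Law and Responsibility to Society"

print(len(Fixed))  # 2
```

The exact TypeError message varies across Python versions, but the duplicate name fails on every version that has `enum`, which is why renaming (rather than deleting) the second assignment is the right fix here.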
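The cross-examination scores described in the new src/about.py text can be sketched in a few lines. This is an illustrative reading only, not the leaderboard's implementation: the word-ratio conciseness formula and the exact-match coverage comparison are assumptions, and in the real framework the questions and answers come from an LLM judge rather than precomputed lists.

```python
def conciseness(original: str, summary: str) -> float:
    """Illustrative conciseness score: 1 - |summary| / |original| in words.

    Positive when the summary is shorter than the original and negative
    when it is longer, matching the description above; the benchmark's
    actual formula (token- vs word-level, scaling) may differ.
    """
    return 1.0 - len(summary.split()) / len(original.split())


def coverage(answers_from_original: list[str], answers_from_summary: list[str]) -> float:
    """Illustrative coverage: fraction of questions generated from the
    original document that the summary answers the same way."""
    matches = sum(a == b for a, b in zip(answers_from_original, answers_from_summary))
    return matches / len(answers_from_original)


doc = "The trial enrolled 120 adults with type 2 diabetes for 24 weeks of treatment."
summ = "120 adults with type 2 diabetes were treated for 24 weeks."
print(round(conciseness(doc, summ), 2))  # 0.21 (summary is about 21% shorter)
print(round(coverage(["yes", "no", "24 weeks"], ["yes", "unsure", "24 weeks"]), 2))  # 0.67
```

Conformity would follow the same pattern as `coverage` but count non-contradictions instead of matches, which is why a summary can score high on conformity while still missing details (low coverage).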