ι™ˆδΏŠζ° commited on
Commit
9348641
Β·
1 Parent(s): 53464fb
Files changed (1) hide show
  1. app.py +12 -10
app.py CHANGED
@@ -54,7 +54,11 @@ st.markdown("""
  if page == "Introduction":
  st.header("Introduction")
  st.markdown("""
- <p class='main-text'>The Automatic Evaluation of LLMs (AEOLLM) task is a new core task in <a href="http://research.nii.ac.jp/ntcir/ntcir-18">NTCIR-18</a> to support in-depth research on large language models (LLMs) evaluation. As LLMs grow popular in both fields of academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations including the task format (the majority belong to multiple-choice questions) and evaluation criteria (occupied by reference-based metrics). To advance the innovation of automatic evaluation, we proposed the Automatic Evaluation of LLMs (AEOLLM) task which focuses on generative tasks and encourages reference-free methods. Besides, we set up diverse subtasks such as summary generation, non-factoid question answering, text expansion, and dialogue generation to comprehensively test different methods. We believe that the AEOLLM task will facilitate the development of the LLMs community.</p>
+ <p class='main-text'>The Automatic Evaluation of LLMs (AEOLLM) task is a new core task in <a href="http://research.nii.ac.jp/ntcir/ntcir-18">NTCIR-18</a> to support in-depth research on the evaluation of large language models (LLMs).
+ <br />🔍 As LLMs grow popular in both academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue.
+ <br />⚖️ Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations in task format (most tasks are multiple-choice questions) and evaluation criteria (dominated by reference-based metrics).
+ <br />💡 To advance the innovation of automatic evaluation, we propose the Automatic Evaluation of LLMs (AEOLLM) task, which focuses on generative tasks and encourages reference-free methods. In addition, we set up diverse subtasks such as summary generation, non-factoid question answering, text expansion, and dialogue generation to comprehensively test different methods.
+ <br />🚀 We believe that the AEOLLM task will facilitate the development of the LLM community.</p>
  """, unsafe_allow_html=True)
 
  elif page == "Methodology":
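The hunk above renders raw HTML through `st.markdown(..., unsafe_allow_html=True)` and a `main-text` CSS class. As context, a minimal sketch of the usual Streamlit pattern this relies on (the `<style>` block and its values are assumptions for illustration, not part of this commit):

```python
import streamlit as st

# Inject a small stylesheet once; the class name mirrors the
# `main-text` class referenced in the diff above, but the CSS
# values here are illustrative assumptions.
st.markdown(
    """
    <style>
    .main-text { font-size: 1.05rem; line-height: 1.6; }
    </style>
    """,
    unsafe_allow_html=True,
)

# Render HTML that uses the class; without unsafe_allow_html=True,
# Streamlit would escape the tags instead of rendering them.
st.markdown(
    "<p class='main-text'>Example paragraph styled via the CSS above.</p>",
    unsafe_allow_html=True,
)
```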
@@ -111,9 +115,9 @@ For each instance (question-answer pair), we employed human annotators to provid
  st.markdown("""
  We divided the 2,800 instances into three parts:
 
- - train set: 20% of the data (covering all four datasets) was designated as the training set (including human annotations) for participants to reference when designing their methods.
- - test set: Another 20% of the data was set aside as the test set (excluding human annotations), used to evaluate the performance of participants' methods and to generate the **leaderboard**.
- - reserved set: The remaining 60% of the data was reserved for **the final evaluation**.
+ 1️⃣ train set: 20% of the data (covering all four datasets) was designated as the training set (including human annotations) for participants to reference when designing their methods.
+ 2️⃣ test set: Another 20% of the data was set aside as the test set (excluding human annotations), used to evaluate the performance of participants' methods and to generate the **leaderboard**.
+ 3️⃣ reserved set: The remaining 60% of the data was reserved for **the final evaluation**.
 
  Both the training set and the test set can be downloaded from the provided link: [https://huggingface.co/datasets/THUIR/AEOLLM](https://huggingface.co/datasets/THUIR/AEOLLM).
  """)
@@ -202,9 +206,7 @@ elif page == "Data and File format":
  elif page == "Submit":
  st.header("Submit")
  st.markdown("""
- We will be following a similar format as the ones used by most **TREC submissions**, which is repeated below.
-
- White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly five columns per line with at least one space between the columns.
+ We will follow a format similar to the one used by most **TREC submissions**: white space is used to separate columns. The width of the columns is not important, but it is important to have exactly five columns per line, with at least one space between the columns.
 
  **taskId questionId answerId score rank**
 
@@ -214,11 +216,11 @@ White space is used to separate columns. The width of the columns in the format
  - the fourth column is score (the score the participants' method assigns to the answer)
  - the fifth column is rank (the rank of the answer among all answers to the same question)
 
- Please organize the answers in a **txt** file, name the file as **teamId_methods.txt** and submit it through the link below: [https://forms.gle/ctJD5tvZkYcCw7Kz9](https://forms.gle/ctJD5tvZkYcCw7Kz9)
+ 📄 Please organize the answers in a **txt** file, name the file **teamId_methods.txt**, and submit it through the link below: [https://forms.gle/ctJD5tvZkYcCw7Kz9](https://forms.gle/ctJD5tvZkYcCw7Kz9)
 
- Each team can submit up to 5 times per day, and only the latest submission will be considered.
+ ⏱️ Each team can submit up to 5 times per day, and only the latest submission will be considered.
 
- An example of the submission file content is [here](https://huggingface.co/spaces/THUIR/AEOLLM/blob/main/baseline_example/output/baseline1_chatglm3_6B.txt).
+ 🔗 An example of the submission file content is [here](https://huggingface.co/spaces/THUIR/AEOLLM/blob/main/baseline_example/output/baseline1_chatglm3_6B.txt).
  """)
  elif page == "LeaderBoard":
  st.header("LeaderBoard")
 