ι™ˆδΏŠζ° commited on
Commit
9348641
Β·
1 Parent(s): 53464fb
Files changed (1) hide show
  1. app.py +12 -10
app.py CHANGED
@@ -54,7 +54,11 @@ st.markdown("""
  if page == "Introduction":
  st.header("Introduction")
  st.markdown("""
- <p class='main-text'>The Automatic Evaluation of LLMs (AEOLLM) task is a new core task in <a href="http://research.nii.ac.jp/ntcir/ntcir-18">NTCIR-18</a> to support in-depth research on large language models (LLMs) evaluation. As LLMs grow popular in both fields of academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue. Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations including the task format (the majority belong to multiple-choice questions) and evaluation criteria (occupied by reference-based metrics). To advance the innovation of automatic evaluation, we proposed the Automatic Evaluation of LLMs (AEOLLM) task which focuses on generative tasks and encourages reference-free methods. Besides, we set up diverse subtasks such as summary generation, non-factoid question answering, text expansion, and dialogue generation to comprehensively test different methods. We believe that the AEOLLM task will facilitate the development of the LLMs community.</p>
+ <p class='main-text'>The Automatic Evaluation of LLMs (AEOLLM) task is a new core task in <a href="http://research.nii.ac.jp/ntcir/ntcir-18">NTCIR-18</a> to support in-depth research on the evaluation of large language models (LLMs).
+ <br />🔍 As LLMs grow popular in both academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue.
+ <br />⚖️ Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations in task format (most tasks are multiple-choice questions) and evaluation criteria (dominated by reference-based metrics).
+ <br />💡 To advance the innovation of automatic evaluation, we propose the Automatic Evaluation of LLMs (AEOLLM) task, which focuses on generative tasks and encourages reference-free methods. In addition, we set up diverse subtasks such as summary generation, non-factoid question answering, text expansion, and dialogue generation to comprehensively test different methods.
+ <br />🚀 We believe that the AEOLLM task will facilitate the development of the LLM community.</p>
  """, unsafe_allow_html=True)
 
  elif page == "Methodology":
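The hunk above renders raw HTML through `st.markdown(..., unsafe_allow_html=True)` and a `main-text` CSS class. As context, a minimal sketch of the usual Streamlit pattern this relies on (the `<style>` block and its values are assumptions for illustration, not part of this commit):

```python
import streamlit as st

# Inject a small stylesheet once; the class name mirrors the
# `main-text` class referenced in the diff above, but the CSS
# values here are illustrative assumptions.
st.markdown(
    """
    <style>
    .main-text { font-size: 1.05rem; line-height: 1.6; }
    </style>
    """,
    unsafe_allow_html=True,
)

# Render HTML that uses the class; without unsafe_allow_html=True,
# Streamlit would escape the tags instead of rendering them.
st.markdown(
    "<p class='main-text'>Example paragraph styled via the CSS above.</p>",
    unsafe_allow_html=True,
)
```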
@@ -111,9 +115,9 @@ For each instance (question-answer pair), we employed human annotators to provid
  st.markdown("""
  We divided the 2,800 instances into three parts:
 
- - train set: 20% of the data (covering all four datasets) was designated as the training set (including human annotations) for participants to reference when designing their methods.
- - test set: Another 20% of the data was set aside as the test set (excluding human annotations), used to evaluate the performance of participants' methods and to generate the **leaderboard**.
- - reserved set: The remaining 60% of the data was reserved for **the final evaluation**.
+ 1️⃣ train set: 20% of the data (covering all four datasets) was designated as the training set (including human annotations) for participants to reference when designing their methods.
+ 2️⃣ test set: Another 20% of the data was set aside as the test set (excluding human annotations), used to evaluate the performance of participants' methods and to generate the **leaderboard**.
+ 3️⃣ reserved set: The remaining 60% of the data was reserved for **the final evaluation**.
 
  Both the training set and the test set can be downloaded from the provided link: [https://huggingface.co/datasets/THUIR/AEOLLM](https://huggingface.co/datasets/THUIR/AEOLLM).
  """)
@@ -202,9 +206,7 @@ elif page == "Data and File format":
  elif page == "Submit":
  st.header("Submit")
  st.markdown("""
- We will be following a similar format as the ones used by most **TREC submissions**, which is repeated below.
-
- White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly five columns per line with at least one space between the columns.
+ We will follow a format similar to the one used by most **TREC submissions**: white space is used to separate columns. The width of the columns is not important, but it is important to have exactly five columns per line, with at least one space between the columns.
 
  **taskId questionId answerId score rank**
 
@@ -214,11 +216,11 @@ White space is used to separate columns. The width of the columns in the format
  - the fourth column is score (the score the participants' method assigns to the answer)
  - the fifth column is rank (the rank of the answer among all answers to the same question)
 
- Please organize the answers in a **txt** file, name the file as **teamId_methods.txt** and submit it through the link below: [https://forms.gle/ctJD5tvZkYcCw7Kz9](https://forms.gle/ctJD5tvZkYcCw7Kz9)
+ 📄 Please organize the answers in a **txt** file, name the file **teamId_methods.txt**, and submit it through the link below: [https://forms.gle/ctJD5tvZkYcCw7Kz9](https://forms.gle/ctJD5tvZkYcCw7Kz9)
 
- Each team can submit up to 5 times per day, and only the latest submission will be considered.
+ ⏱️ Each team can submit up to 5 times per day, and only the latest submission will be considered.
 
- An example of the submission file content is [here](https://huggingface.co/spaces/THUIR/AEOLLM/blob/main/baseline_example/output/baseline1_chatglm3_6B.txt).
+ 🔗 An example of the submission file content is [here](https://huggingface.co/spaces/THUIR/AEOLLM/blob/main/baseline_example/output/baseline1_chatglm3_6B.txt).
  """)
  elif page == "LeaderBoard":
  st.header("LeaderBoard")
 