ιδΏζ°
committed on
Commit · 9348641
Parent(s):
53464fb
emoji
app.py
CHANGED
@@ -54,7 +54,11 @@ st.markdown("""
 if page == "Introduction":
     st.header("Introduction")
     st.markdown("""
-    <p class='main-text'>The Automatic Evaluation of LLMs (AEOLLM) task is a new core task in <a href="http://research.nii.ac.jp/ntcir/ntcir-18">NTCIR-18</a> to support in-depth research on large language models (LLMs) evaluation.
+    <p class='main-text'>The Automatic Evaluation of LLMs (AEOLLM) task is a new core task in <a href="http://research.nii.ac.jp/ntcir/ntcir-18">NTCIR-18</a> to support in-depth research on large language models (LLMs) evaluation.
+    <br />🚀 As LLMs grow popular in both fields of academia and industry, how to effectively evaluate the capacity of LLMs becomes an increasingly critical but still challenging issue.
+    <br />⚖️ Existing methods can be divided into two types: manual evaluation, which is expensive, and automatic evaluation, which faces many limitations including the task format (the majority belong to multiple-choice questions) and evaluation criteria (occupied by reference-based metrics).
+    <br />💡 To advance the innovation of automatic evaluation, we proposed the Automatic Evaluation of LLMs (AEOLLM) task which focuses on generative tasks and encourages reference-free methods. Besides, we set up diverse subtasks such as summary generation, non-factoid question answering, text expansion, and dialogue generation to comprehensively test different methods.
+    <br />🌟 We believe that the AEOLLM task will facilitate the development of the LLMs community.</p>
     """, unsafe_allow_html=True)

 elif page == "Methodology":

@@ -111,9 +115,9 @@ For each instance (question-answer pair), we employed human annotators to provid
     st.markdown("""
     We divided the 2,800 instances into three parts:

-
-
-
+    1️⃣ train set: 20% of the data (covering all four datasets) was designated as the training set (including human annotations) for participants to reference when designing their methods.
+    2️⃣ test set: Another 20% of the data was set aside as the test set (excluding human annotations), used to evaluate the performance of participants' methods and to generate the **leaderboard**.
+    3️⃣ reserved set: The remaining 60% of the data was reserved for **the final evaluation**.

     Both the training set and the test set can be downloaded from the provided link: [https://huggingface.co/datasets/THUIR/AEOLLM](https://huggingface.co/datasets/THUIR/AEOLLM).
     """)

@@ -202,9 +206,7 @@ elif page == "Data and File format":
 elif page == "Submit":
     st.header("Submit")
     st.markdown("""
-    We will be following a similar format as the ones used by most **TREC submissions**
-
-    White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly five columns per line with at least one space between the columns.
+    We will be following a similar format as the ones used by most **TREC submissions**: White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly five columns per line with at least one space between the columns.

     **taskId questionId answerId score rank**

@@ -214,11 +216,11 @@ White space is used to separate columns. The width of the columns in the format
     - the fourth column is score (index the score to the answer given by participants)
     - the fifth column is rank (index the rank of the answer within all answers to the same question)

-    Please organize the answers in a **txt** file, name the file as **teamId_methods.txt** and submit it through the link below: [https://forms.gle/ctJD5tvZkYcCw7Kz9](https://forms.gle/ctJD5tvZkYcCw7Kz9)
+    📄 Please organize the answers in a **txt** file, name the file as **teamId_methods.txt** and submit it through the link below: [https://forms.gle/ctJD5tvZkYcCw7Kz9](https://forms.gle/ctJD5tvZkYcCw7Kz9)

-    Each team can submit up to 5 times per day, and only the latest submission will be considered.
+    ⏱️ Each team can submit up to 5 times per day, and only the latest submission will be considered.

-    An example of the submission file content is [here](https://huggingface.co/spaces/THUIR/AEOLLM/blob/main/baseline_example/output/baseline1_chatglm3_6B.txt).
+    👉 An example of the submission file content is [here](https://huggingface.co/spaces/THUIR/AEOLLM/blob/main/baseline_example/output/baseline1_chatglm3_6B.txt).
     """)
 elif page == "LeaderBoard":
     st.header("LeaderBoard")
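
The Submit page above requires exactly five whitespace-separated columns per line (taskId, questionId, answerId, score, rank). A submission file can be sanity-checked before uploading with a short helper; this is an illustrative sketch, and the function name and sample IDs below are invented, not part of the task data:

```python
def validate_submission_line(line: str) -> bool:
    """Return True if a line has exactly five whitespace-separated columns,
    with score parsing as a number and rank as an integer."""
    cols = line.split()
    if len(cols) != 5:
        return False
    task_id, question_id, answer_id, score, rank = cols
    try:
        float(score)  # score given by participants
        int(rank)     # rank of the answer among all answers to the question
    except ValueError:
        return False
    return True

# Hypothetical example lines in the stated format
print(validate_submission_line("1 Q001 A03 0.87 2"))   # five columns, valid
print(validate_submission_line("1 Q001 A03 0.87"))     # only four columns
```

Column width does not matter, only the count, so `split()` (which collapses any run of whitespace) mirrors the stated rule directly.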
|