陈俊杰 committed · Commit 193e99f · Parent(s): fe1125a
dataset
app.py CHANGED
@@ -2,102 +2,6 @@ import streamlit as st
 from streamlit_option_menu import option_menu
 import pandas as pd
 
-# CSS styles
-# st.markdown("""
-# <style>
-# h1 {
-#     font-size: 2.5em; /* title font size */
-# }
-# .stDataFrame {
-#     font-family: Helvetica;
-# }
-# .dataframe th, .dataframe td {
-#     width: auto;
-#     min-width: 500px;
-# }
-# </style>
-# """, unsafe_allow_html=True)
-
-# # Title
-# st.title('🏆AEOLLM Leaderboard')
-
-# # Description
-# st.markdown("""
-# This leaderboard is used to show the performance of the **automatic evaluation methods of LLMs** submitted by the **AEOLLM team** on four tasks:
-# - Dialogue Generation (DG)
-# - Text Expansion (TE)
-# - Summary Generation (SG)
-# - Non-Factoid QA (NFQA)
-
-# Details of AEOLLM can be found at the link: [https://aeollm.github.io/](https://aeollm.github.io/)
-# """, unsafe_allow_html=True)
-# # Create sample data
-
-# # teamId: unique identifier
-# DG = {
-#     "teamId": ["baseline1", "baseline2", "baseline3", "baseline4"],
-#     "methods": ["chatglm3-6b", "baichuan2-13b", "chatglm-pro", "gpt-4o-mini"],
-#     "accuracy": [0.5806, 0.5483, 0.6001, 0.6472],
-#     "kendall's tau": [0.3243, 0.1739, 0.3042, 0.4167],
-#     "spearman": [0.3505, 0.1857, 0.3264, 0.4512]
-# }
-
-# df1 = pd.DataFrame(DG)
-# for col in df1.select_dtypes(include=['float64', 'int64']).columns:
-#     df1[col] = df1[col].apply(lambda x: f"{x:.4f}")
-
-# TE = {
-#     "teamId": ["baseline1", "baseline2", "baseline3", "baseline4"],
-#     "methods": ["chatglm3-6b", "baichuan2-13b", "chatglm-pro", "gpt-4o-mini"],
-#     "accuracy": [0.5107, 0.5050, 0.5461, 0.5581],
-#     "kendall's tau": [0.1281, 0.0635, 0.2716, 0.3864],
-#     "spearman": [0.1352, 0.0667, 0.2867, 0.4157]
-# }
-# df2 = pd.DataFrame(TE)
-# for col in df2.select_dtypes(include=['float64', 'int64']).columns:
-#     df2[col] = df2[col].apply(lambda x: f"{x:.4f}")
-
-# SG = {
-#     "teamId": ["baseline1", "baseline2", "baseline3", "baseline4"],
-#     "methods": ["chatglm3-6b", "baichuan2-13b", "chatglm-pro", "gpt-4o-mini"],
-#     "accuracy": [0.6504, 0.6014, 0.7162, 0.7441],
-#     "kendall's tau": [0.3957, 0.2688, 0.5092, 0.5001],
-#     "spearman": [0.4188, 0.2817, 0.5403, 0.5405],
-# }
-# df3 = pd.DataFrame(SG)
-# for col in df3.select_dtypes(include=['float64', 'int64']).columns:
-#     df3[col] = df3[col].apply(lambda x: f"{x:.4f}")
-
-# NFQA = {
-#     "teamId": ["baseline1", "baseline2", "baseline3", "baseline4"],
-#     "methods": ["chatglm3-6b", "baichuan2-13b", "chatglm-pro", "gpt-4o-mini"],
-#     "accuracy": [0.5935, 0.5817, 0.7000, 0.7203],
-#     "kendall's tau": [0.2332, 0.2389, 0.4440, 0.4235],
-#     "spearman": [0.2443, 0.2492, 0.4630, 0.4511]
-# }
-# df4 = pd.DataFrame(NFQA)
-# for col in df4.select_dtypes(include=['float64', 'int64']).columns:
-#     df4[col] = df4[col].apply(lambda x: f"{x:.4f}")
-
-# # Create tabs
-# tab1, tab2, tab3, tab4 = st.tabs(["DG", "TE", "SG", "NFQA"])
-
-# with tab1:
-#     st.markdown("""Task: Dialogue Generation; Dataset: DailyDialog""", unsafe_allow_html=True)
-#     st.dataframe(df1, use_container_width=True)
-
-# with tab2:
-#     st.markdown("""Task: Text Expansion; Dataset: WritingPrompts""", unsafe_allow_html=True)
-#     st.dataframe(df2, use_container_width=True)
-
-# with tab3:
-#     st.markdown("""Task: Summary Generation; Dataset: Xsum""", unsafe_allow_html=True)
-#     st.dataframe(df3, use_container_width=True)
-
-# with tab4:
-#     st.markdown("""Task: Non-Factoid QA; Dataset: NF_CATS""", unsafe_allow_html=True)
-#     st.dataframe(df4, use_container_width=True)
-
 # Set the page title and main heading
 st.set_page_config(page_title="AEOLLM", page_icon="👋")
 st.title("NTCIR-18 Automatic Evaluation of LLMs (AEOLLM) Task")
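The deleted tables report three metrics per baseline: accuracy, Kendall's tau, and Spearman correlation. For orientation, a minimal sketch of how such numbers could be computed with scipy; the rankings are made up, and reading "accuracy" as pairwise agreement with the human ranking is an assumption here, not the task's official definition:

```python
# Sketch: computing the three leaderboard metrics for one method.
# `human` and `predicted` are illustrative per-question rankings of the
# 7 answers, not data from this repo.
from itertools import combinations

from scipy.stats import kendalltau, spearmanr

human = [1, 2, 3, 4, 5, 6, 7]      # human-annotated ranks
predicted = [2, 1, 3, 4, 6, 5, 7]  # ranks induced by an automatic method

tau, _ = kendalltau(human, predicted)
rho, _ = spearmanr(human, predicted)

# Assumed reading of "accuracy": the fraction of answer pairs that the
# method orders the same way as the human ranking.
pairs = list(combinations(range(len(human)), 2))
agree = sum((human[i] < human[j]) == (predicted[i] < predicted[j])
            for i, j in pairs)
print(f"accuracy={agree / len(pairs):.4f}, tau={tau:.4f}, spearman={rho:.4f}")
```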
@@ -172,7 +76,7 @@ elif page == "Methodology":
 <tr>
     <td style="text-align: left">Text Expansion (TE)</td>
     <td style="text-align: left">given a theme, participants need to generate stories related to the theme</td>
-    <td style="text-align: left">WritingPrompts: 303k story
+    <td style="text-align: left">WritingPrompts: 303k story themes</td>
 </tr>
 <tr>
     <td style="text-align: left">Dialogue Generation (DG)</td>
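Both hunks here branch on a `page` value (`elif page == "Methodology":`, `elif page == "Datasets":`). That value is presumably produced by the `option_menu` import kept at the top of app.py; a minimal sketch of the usual pattern, where the menu title, the first entry, and the sidebar placement are assumptions rather than this repo's actual code:

```python
# Sketch of the streamlit_option_menu navigation that presumably sets
# `page`. The entries mirror the branches visible in this diff; the
# "Introduction" entry, the title, and the sidebar are illustrative.
import streamlit as st
from streamlit_option_menu import option_menu

with st.sidebar:
    page = option_menu(
        "AEOLLM",  # hypothetical menu title
        ["Introduction", "Methodology", "Datasets", "Important Dates"],
        default_index=0,
    )
```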
@@ -187,15 +91,29 @@ elif page == "Methodology":
 """,unsafe_allow_html=True)
 
 elif page == "Datasets":
-    st.header("Datasets")
+    st.header("Introduction to Task Datasets")
     st.markdown("""
 <p class='main-text'>A brief description of each dataset we used, along with the original download link, is provided below:</p>
 <p class='main-text'>1. <strong>Summary Generation (SG): <a href="https://huggingface.co/datasets/EdinburghNLP/xsum">Xsum</a></strong>: A real-world single-document news summarization dataset collected from online articles by the British Broadcasting Corporation (BBC); it contains over 220 thousand news documents.</p>
 <p class='main-text'>2. <strong>Non-Factoid QA (NFQA): <a href="https://github.com/Lurunchik/NF-CATS">NF_CATS</a></strong>: A dataset containing 12k natural questions divided into eight categories.</p>
 <p class='main-text'>3. <strong>Text Expansion (TE): <a href="https://huggingface.co/datasets/euclaise/writingprompts">WritingPrompts</a></strong>: A large dataset of 300K human-written stories paired with writing prompts from an online forum.</p>
 <p class='main-text'>4. <strong>Dialogue Generation (DG): <a href="https://huggingface.co/datasets/daily_dialog">DailyDialog</a></strong>: A high-quality dataset of 13k multi-turn dialogues whose language is human-written and less noisy.</p>
-<p class='main-text'>For your convenience, we have released <strong>the training set</strong> (with human-annotated results) and <strong>the test set</strong> (without human-annotated results) on <a href="https://huggingface.co/datasets/THUIR/AEOLLM">https://huggingface.co/datasets/THUIR/AEOLLM</a>, which you can easily download.</p>
 """,unsafe_allow_html=True)
+    st.header("Answer Generation and Human Annotation")
+    st.markdown("""
+We randomly sampled **100 instances** from **each** dataset as the question set and selected **7 different LLMs** to generate answers, forming the answer set. As a result, each dataset produced 700 instances, totaling **2,800 instances across the four datasets**.
+
+For each instance (question-answer pair), we employed human annotators to assign a score from 1 to 5 and took the median of these scores as the final score. Based on this score, we ranked the 7 answers for each question. If scores were identical, the answers were assigned the same rank, with the lowest rank in the tied group being used.
+""")
+    st.header("Data Acquisition and Usage")
+    st.markdown("""
+We divided the 2,800 instances into three parts:
+
+- 20% of the data (covering all four datasets) was designated as the training set (including human annotations) for participants to reference when designing their methods.
+- Another 20% was set aside as the test set (excluding human annotations), used to evaluate the performance of participants' methods and to generate the **leaderboard**.
+- The remaining 60% was reserved for **the final evaluation**.
+
+Both the training set and the test set can be downloaded from [https://huggingface.co/datasets/THUIR/AEOLLM](https://huggingface.co/datasets/THUIR/AEOLLM).
+""")
 
 elif page == "Important Dates":
     st.header("Important Dates")
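The tie rule added in this hunk (identical scores share a rank, and the lowest rank in the tied group is used) corresponds to what pandas calls the "min" ranking method. A small self-contained sketch with made-up annotator scores, assuming a higher median score means a better answer:

```python
# Sketch: median aggregation and tie-aware ranking as described above.
# The scores are illustrative, not taken from the AEOLLM data.
import pandas as pd

# Three annotators score each of the 7 answers on a 1-5 scale.
annotations = pd.DataFrame({
    "answer_1": [5, 4, 5], "answer_2": [4, 4, 3], "answer_3": [4, 4, 4],
    "answer_4": [3, 2, 3], "answer_5": [2, 2, 1], "answer_6": [2, 3, 2],
    "answer_7": [1, 1, 2],
})
median_scores = annotations.median()  # per-answer median score

# Ties share the same rank and take the lowest rank in the tied group,
# i.e. pandas' "min" method; rank 1 goes to the highest median.
ranks = median_scores.rank(method="min", ascending=False).astype(int)
print(ranks)
```

With these scores, answer_2 and answer_3 (median 4.0) both get rank 2 and the next answer gets rank 4, matching the described scheme.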
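Since the training and test sets are hosted as a standard Hugging Face dataset repo, they can presumably be fetched with the `datasets` library. A minimal sketch, with split names and features left to inspection because this commit does not specify them:

```python
# Sketch: downloading the released AEOLLM data. Whether the repo loads
# with the default configuration, and what its splits are called, are
# assumptions; check https://huggingface.co/datasets/THUIR/AEOLLM.
from datasets import load_dataset

aeollm = load_dataset("THUIR/AEOLLM")  # downloads all available splits
print(aeollm)                          # inspect split names and features
```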