xuanricheng committed
Commit 831e1f2 · 1 Parent(s): 175efb2

update about

Files changed (1)
  1. src/display/about.py +18 -11
src/display/about.py CHANGED
@@ -3,17 +3,25 @@ from src.display.utils import ModelType
3
  TITLE = """<h1 align="center" id="space-title">Open Chinese LLM Leaderboard</h1>"""
4
 
5
  INTRODUCTION_TEXT = """
6
- Open Chinese LLM Leaderboard 旨在跟踪、排名和评估开放式中文大语言模型(LLM)。本排行榜由FlagEval平台提供相应算力和运行环境。
7
  评估数据集全部为中文数据集,用于评估中文能力。如需查看详细信息,请查阅‘关于’页面。
8
- 如需对模型进行更全面的评测,可以登录FlagEval平台,体验更加完善的模型评测功能。
9
 
10
- The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). This leaderboard is powered by the [FlagEval](https://flageval.baai.ac.cn/) platform, which provides the computational resources and runtime environment.
11
  The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency. For more detailed information, please refer to the 'About' page.
12
  For a more comprehensive evaluation of your model, you can log in to the [FlagEval](https://flageval.baai.ac.cn/) platform and try its more complete model evaluation features.
13
 
14
  """
15
 
16
  LLM_BENCHMARKS_TEXT = f"""
17
  # Context
18
  Open Chinese LLM Leaderboard是中文大语言排行榜,我们希望能够推动更加开放的生态,让中文大语言模型开发者参与进来,为推动中文的大语言模型进步做出相应的贡献。
19
  为了实现公平性的目标,所有模型都在 FlagEval 平台上使用标准化 GPU 和统一环境进行评估,以确保公平性。
@@ -23,7 +31,7 @@ In pursuit of fairness, all models undergo evaluation on the FlagEval platform u
23
 
24
  ## How it works
25
 
26
- 📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
27
 
28
  - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> ARC Challenge </a> (25-shot) - a set of grade-school science questions.
29
  - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
@@ -38,16 +46,15 @@ We chose these benchmarks as they test a variety of reasoning and general knowle
38
 
39
  ## Details and logs
40
  You can find:
41
- - detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/results
42
- - details on the input/outputs for the models in the `details` of each model, that you can access by clicking the 📄 emoji after the model name
43
- - community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/requests
44
 
45
  ## Reproducibility
46
  To reproduce our results, here are the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
47
  `python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
48
  ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`
49
 
50
- The total batch size we get for models which fit on one A100 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit.
51
  *You can expect results to vary slightly for different batch sizes because of padding.*
52
 
53
  The tasks and few-shot parameters are:
@@ -136,9 +143,9 @@ I have an issue about accessing the leaderboard through the Gradio API
136
 
137
 
138
  EVALUATION_QUEUE_TEXT = """
139
- # Evaluation Queue for the 🤗 Open LLM Leaderboard
140
 
141
- Models added here will be automatically evaluated on the 🤗 cluster.
142
 
143
  ## First steps before submitting a model
144
 
@@ -158,7 +165,7 @@ Note: if your model needs `use_remote_code=True`, we do not support this option
158
  It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
159
 
160
  ### 3) Make sure your model has an open license!
161
- This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗
162
 
163
  ### 4) Fill up your model card
164
  When we add extra information about models to the leaderboard, it will be automatically taken from the model card
 
3
  TITLE = """<h1 align="center" id="space-title">Open Chinese LLM Leaderboard</h1>"""
4
 
5
  INTRODUCTION_TEXT = """
6
+ Open Chinese LLM Leaderboard 旨在跟踪、排名和评估开放式中文大语言模型(LLM)。本排行榜由 FlagEval 平台提供相应算力和运行环境。
7
  评估数据集全部为中文数据集,用于评估中文能力。如需查看详细信息,请查阅‘关于’页面。
8
+ 如需对模型进行更全面的评测,可以登录 [FlagEval](https://flageval.baai.ac.cn/)平台,体验更加完善的模型评测功能。
9
 
10
+ The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). This leaderboard is powered by the FlagEval platform, which provides the computational resources and runtime environment.
11
  The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency. For more detailed information, please refer to the 'About' page.
12
  For a more comprehensive evaluation of your model, you can log in to the [FlagEval](https://flageval.baai.ac.cn/) platform and try its more complete model evaluation features.
13
 
14
  """
15
 
16
  LLM_BENCHMARKS_TEXT = f"""
17
+
18
+ # The Goal of Open CN-LLM Leaderboard
19
+
20
+ 感谢您积极的参与评测,在未来,我们会持续推动 Open Chinese Leaderboard 更加完善,维护生态开放,欢迎开发者参与评测方法、工具和数据集的探讨,让我们一起建设更加科学和公正的榜单。
21
+
22
+ Thank you for actively participating in the evaluation. We will continue to improve the Open Chinese Leaderboard and keep its ecosystem open.
23
+ We welcome developers to join discussions on evaluation methods, tools, and datasets, so that together we can build a more rigorous and fair leaderboard.
24
+
25
  # Context
26
  Open Chinese LLM Leaderboard是中文大语言排行榜,我们希望能够推动更加开放的生态,让中文大语言模型开发者参与进来,为推动中文的大语言模型进步做出相应的贡献。
27
  为了实现公平性的目标,所有模型都在 FlagEval 平台上使用标准化 GPU 和统一环境进行评估,以确保公平性。
 
31
 
32
  ## How it works
33
 
34
+ We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
35
 
36
  - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> ARC Challenge </a> (25-shot) - a set of grade-school science questions.
37
  - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
 
46
 
47
  ## Details and logs
48
  You can find:
49
+ - detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/results
50
+ - community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/requests
 
51
 
52
  ## Reproducibility
53
  To reproduce our results, here are the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
54
  `python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
55
  ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`
56
 
57
+ The total batch size we get for models which fit on one A800 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit.
58
  *You can expect results to vary slightly for different batch sizes because of padding.*
59
 
60
  The tasks and few-shot parameters are:
 
143
 
144
 
145
  EVALUATION_QUEUE_TEXT = """
146
+ # Evaluation Queue for the Open Chinese LLM Leaderboard
147
 
148
+ Models added here will be automatically evaluated on the FlagEval cluster.
149
 
150
  ## First steps before submitting a model
151
 
 
165
  It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
166
 
167
  ### 3) Make sure your model has an open license!
168
+ This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model.
169
 
170
  ### 4) Fill up your model card
171
  When we add extra information about models to the leaderboard, it will be automatically taken from the model card
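
Reviewer note on the Reproducibility text in this diff: the command template and the "8 GPUs * 1" batch-size arithmetic can be sketched in Python. This is a minimal illustration, not the leaderboard's own tooling. The helper names `build_harness_command` and `total_batch_size`, the model id `my-org/my-model`, the output path, and the task name `arc_challenge` are assumptions made for the example; only the flags themselves come from the command quoted in `about.py`.

```python
import shlex


def build_harness_command(model, revision, tasks, num_fewshot, output_path, batch_size=1):
    """Assemble the argv list for the harness command quoted in the Reproducibility section.

    Hypothetical helper: it only fills in the placeholders of that template.
    """
    model_args = f"pretrained={model},use_accelerate=True,revision={revision}"
    return [
        "python",
        "main.py",
        "--model=hf-causal-experimental",
        f"--model_args={model_args}",
        f"--tasks={','.join(tasks)}",  # tasks are passed as one comma-separated value
        f"--num_fewshot={num_fewshot}",
        f"--batch_size={batch_size}",
        f"--output_path={output_path}",
    ]


def total_batch_size(num_gpus, per_device_batch_size):
    """Total batch size as described in the text: number of GPUs times per-device batch size."""
    return num_gpus * per_device_batch_size


if __name__ == "__main__":
    cmd = build_harness_command(
        model="my-org/my-model",  # hypothetical model id
        revision="main",
        tasks=["arc_challenge"],  # assumed task name for the 25-shot ARC Challenge entry
        num_fewshot=25,
        output_path="results/arc_challenge.json",  # hypothetical path
    )
    print(shlex.join(cmd))  # shell-quoted form of the command
    print(total_batch_size(num_gpus=8, per_device_batch_size=1))
```

Running on a single GPU would mean raising `batch_size` toward 8 to approximate the node-level setting, with the small padding-related differences the text already mentions.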