xuanricheng committed · Commit 831e1f2 · 1 Parent(s): 175efb2
update about
src/display/about.py CHANGED +18 -11
@@ -3,17 +3,25 @@ from src.display.utils import ModelType
 TITLE = """<h1 align="center" id="space-title">Open Chinese LLM Leaderboard</h1>"""

 INTRODUCTION_TEXT = """
-Open Chinese LLM Leaderboard 旨在跟踪、排名和评估开放式中文大语言模型(LLM)。本排行榜由FlagEval
 评估数据集是全部都是中文数据集以评估中文能力如需查看详情信息,请查阅‘关于’页面。
-如需对模型进行更全面的评测,可以登录FlagEval平台,体验更加完善的模型评测功能。

-The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). This leaderboard is powered by the
 The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency. For more detailed information, please refer to the 'About' page.
 For a more comprehensive evaluation of the model, you can log in to the [FlagEval](https://flageval.baai.ac.cn/) to experience more refined model evaluation functionalities

 """

 LLM_BENCHMARKS_TEXT = f"""
 # Context
 Open Chinese LLM Leaderboard是中文大语言排行榜,我们希望能够推动更加开放的生态,让中文大语言模型开发者参与进来,为推动中文的大语言模型进步做出相应的贡献。
 为了实现公平性的目标,所有模型都在 FlagEval 平台上使用标准化 GPU 和统一环境进行评估,以确保公平性。
@@ -23,7 +31,7 @@ In pursuit of fairness, all models undergo evaluation on the FlagEval platform u
 ## How it works

-

 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> ARC Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
@@ -38,16 +46,15 @@ We chose these benchmarks as they test a variety of reasoning and general knowle
 ## Details and logs
 You can find:
-- detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/results
--
-- community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-llm-leaderboard/requests

 ## Reproducibility
 To reproduce our results, here is the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
 `python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
 ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`

-The total batch size we get for models which fit on one
 *You can expect results to vary slightly for different batch sizes because of padding.*

 The tasks and few shots parameters are:
@@ -136,9 +143,9 @@ I have an issue about accessing the leaderboard through the Gradio API
 EVALUATION_QUEUE_TEXT = """
-# Evaluation Queue for

-Models added here will be automatically evaluated on the

 ## First steps before submitting a model

@@ -158,7 +165,7 @@ Note: if your model needs `use_remote_code=True`, we do not support this option
 It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!

 ### 3) Make sure your model has an open license!
-This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model

 ### 4) Fill up your model card
 When we add extra information about models to the leaderboard, it will be automatically taken from the model card
@@ -3,17 +3,25 @@ from src.display.utils import ModelType
 TITLE = """<h1 align="center" id="space-title">Open Chinese LLM Leaderboard</h1>"""

 INTRODUCTION_TEXT = """
+Open Chinese LLM Leaderboard 旨在跟踪、排名和评估开放式中文大语言模型(LLM)。本排行榜由FlagEval台提供相应算力和运行环境。
 评估数据集是全部都是中文数据集以评估中文能力如需查看详情信息,请查阅‘关于’页面。
+如需对模型进行更全面的评测,可以登录 [FlagEval](https://flageval.baai.ac.cn/)平台,体验更加完善的模型评测功能。

+The Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese large language models (LLMs). This leaderboard is powered by the FlagEval platform, providing corresponding computational resources and runtime environment.
 The evaluation dataset consists entirely of Chinese data to assess Chinese language proficiency. For more detailed information, please refer to the 'About' page.
 For a more comprehensive evaluation of the model, you can log in to the [FlagEval](https://flageval.baai.ac.cn/) to experience more refined model evaluation functionalities

 """

 LLM_BENCHMARKS_TEXT = f"""
+
+# The Goal of Open CN-LLM Leaderboard
+
+感谢您积极的参与评测,在未来,我们会持续推动 Open Chinese Leaderboard 更加完善,维护生态开放,欢迎开发者参与评测方法、工具和数据集的探讨,让我们一起建设更加科学和公正的榜单。
+
+Thank you for actively participating in the evaluation. In the future, we will continue to enhance the Open Chinese Leaderboard, maintaining an open ecosystem.
+We welcome developers to engage in discussions regarding evaluation methods, tools, and datasets, aiming to collectively build a more scientific and fair leaderboard.
+
 # Context
 Open Chinese LLM Leaderboard是中文大语言排行榜,我们希望能够推动更加开放的生态,让中文大语言模型开发者参与进来,为推动中文的大语言模型进步做出相应的贡献。
 为了实现公平性的目标,所有模型都在 FlagEval 平台上使用标准化 GPU 和统一环境进行评估,以确保公平性。
@@ -23,7 +31,7 @@ In pursuit of fairness, all models undergo evaluation on the FlagEval platform u

 ## How it works

+We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.

 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> ARC Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
@@ -38,16 +46,15 @@ We chose these benchmarks as they test a variety of reasoning and general knowle

 ## Details and logs
 You can find:
+- detailed numerical results in the `results` Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/results
+- community queries and running status in the `requests` Hugging Face dataset: https://huggingface.co/datasets/open-cn-llm-leaderboard/requests

 ## Reproducibility
 To reproduce our results, here is the commands you can run, using [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463) of the Eleuther AI Harness:
 `python main.py --model=hf-causal-experimental --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>"`
 ` --tasks=<task_list> --num_fewshot=<n_few_shot> --batch_size=1 --output_path=<output_path>`

+The total batch size we get for models which fit on one A800 node is 8 (8 GPUs * 1). If you don't use parallelism, adapt your batch size to fit.
 *You can expect results to vary slightly for different batch sizes because of padding.*

 The tasks and few shots parameters are:
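Reviewer note (not part of the commit): the command quoted in the Reproducibility text of this hunk can be assembled programmatically, which may help when looping over several benchmarks. The following is a minimal sketch under stated assumptions. The helper names `build_harness_command` and `total_batch_size`, the model name, revision, task identifier, and output path are hypothetical placeholders, and it assumes `--tasks` accepts a comma-separated list. The flags themselves are copied from the command quoted in the diff.

```python
import shlex


def build_harness_command(model, revision, tasks, num_fewshot, output_path, batch_size=1):
    """Return the argument list for the harness's main.py, mirroring the quoted command."""
    model_args = f"pretrained={model},use_accelerate=True,revision={revision}"
    return [
        "python",
        "main.py",
        "--model=hf-causal-experimental",
        f"--model_args={model_args}",
        # Assumption: several tasks are passed as one comma-separated value.
        f"--tasks={','.join(tasks)}",
        f"--num_fewshot={num_fewshot}",
        f"--batch_size={batch_size}",
        f"--output_path={output_path}",
    ]


def total_batch_size(num_gpus, per_gpu_batch_size):
    """Effective batch size when every GPU processes its own batch in parallel."""
    return num_gpus * per_gpu_batch_size


if __name__ == "__main__":
    # Hypothetical values; only the 25-shot setting for ARC comes from the text.
    cmd = build_harness_command(
        model="my-org/my-model",
        revision="main",
        tasks=["arc_challenge"],
        num_fewshot=25,
        output_path="results/arc.json",
    )
    print(shlex.join(cmd))          # shell-safe rendering of the command
    print(total_batch_size(8, 1))   # 8 GPUs * 1, as stated for one A800 node
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting. `shlex.join` is used only to print a copy-pasteable form.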
@@ -136,9 +143,9 @@ I have an issue about accessing the leaderboard through the Gradio API

 EVALUATION_QUEUE_TEXT = """
+# Evaluation Queue for theOpen Chinese LLM Leaderboard

+Models added here will be automatically evaluated on the FlagEval cluster.

 ## First steps before submitting a model

@@ -158,7 +165,7 @@ Note: if your model needs `use_remote_code=True`, we do not support this option
 It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!

 ### 3) Make sure your model has an open license!
+This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model

 ### 4) Fill up your model card
 When we add extra information about models to the leaderboard, it will be automatically taken from the model card