Commit 5f4968c
Parent(s): 5891795
Update README.md

README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 title: LeaderboardFinder
-emoji:
+emoji: 🔎
 colorFrom: pink
 colorTo: gray
 sdk: gradio
@@ -9,50 +9,4 @@ app_file: app.py
 pinned: false
 ---
 
-If you want your leaderboard to appear, feel free to add relevant information in its metadata, and it will be displayed here.
-
-# Categories
-
-## Submission type
-Arenas are not concerned by this category.
-
-- `submission:automatic`: users can submit their models as such to the leaderboard, and evaluation is run automatically without human intervention
-- `submission:semiautomatic`: the leaderboard requires the model owner to run evaluations on their side and submit the results
-- `submission:manual`: the leaderboard requires the leaderboard owner to run evaluations for new submissions
-- `submission:closed`: the leaderboard does not accept submissions at the moment
-
-## Test set status
-Arenas are not concerned by this category.
-
-- `test:public`: all the test sets used are public, so the evaluations are completely reproducible
-- `test:mix`: some test sets are public and some are private
-- `test:private`: all the test sets used are private, so the evaluations are hard to game
-- `test:rolling`: the test sets change regularly over time and evaluation scores are refreshed
-
-## Judges
-- `judge:auto`: evaluations are run automatically, using an evaluation suite such as `lm_eval` or `lighteval`
-- `judge:model`: evaluations are run using a model-as-a-judge approach to rate answers
-- `judge:humans`: evaluations are done by humans rating answers - this is an arena
-- `judge:vibe_check`: evaluations are done manually by a single human
-
-## Modalities
-Can be any (or several) of the following:
-- `modality:text`
-- `modality:image`
-- `modality:video`
-- `modality:audio`
-A bit outside of the usual modalities:
-- `modality:tools`: requires added tool usage - mostly for assistant models
-- `modality:artefacts`: the leaderboard concerns itself with machine learning artefacts themselves, for example, quality evaluation of text embeddings.
-
-## Evaluation categories
-Can be any (or several) of the following:
-- `eval:generation`: the evaluation looks specifically at generation capabilities (can be image generation, text generation, ...)
-- `eval:math`
-- `eval:code`
-- `eval:performance`: model performance (speed, energy consumption, ...)
-- `eval:safety`: safety, toxicity, and bias evaluations
-
-## Language
-You can indicate the languages covered by your benchmark like so: `language:mylanguage`.
-At the moment, we do not support language codes; please use the language name in English.
+If you want your leaderboard to appear, feel free to add relevant information in its metadata, and it will be displayed here (see the About tab).
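For leaderboard owners, the category labels removed above correspond to tags that go in a Space's own README metadata. The following is a minimal sketch of what such front matter might look like; the Space name, emoji, and the assumption that LeaderboardFinder picks the categories up from the standard `tags:` field (including a `leaderboard` tag) are illustrative assumptions here, not confirmed by this commit.

```yaml
# Hypothetical front matter for a leaderboard Space (illustrative only).
# Assumes the category labels are declared as entries in the `tags:` field.
---
title: MyLeaderboard
emoji: 🏆
colorFrom: blue
colorTo: green
sdk: gradio
app_file: app.py
pinned: false
tags:
  - leaderboard            # assumed marker tag so the Space is discoverable
  - submission:automatic   # users submit models, evaluation runs automatically
  - test:private           # all test sets are private
  - judge:auto             # scored with an automatic evaluation suite
  - modality:text
  - eval:generation
  - language:english       # language name in English, not a language code
---
```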