Update constants.py
constants.py  CHANGED  (+23, -88)
@@ -8,108 +8,43 @@ EVAL_REQUESTS_PATH = Path("eval_requests")
  # Text definitions #
  ##########################

- banner_url = "https://huggingface.co/datasets/
  BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>'

-
-
-
-
-
-
-
-
-
-
-
- year = 2023,
- publisher = {Hugging Face},
- howpublished = "\\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}"
  }
  """

  METRICS_TAB_TEXT = """
-

  ## Metrics

-
- is used to assess the accuracy of a system, and the RTFx the inference speed. Models are ranked in the leaderboard based
- on their WER, lowest to highest.
-
- Crucially, the WER and RTFx values are computed for the same inference run using a single script. The implication of this is two-fold:
- 1. The WER and RTFx values are coupled: for a given WER, one can expect to achieve the corresponding RTFx. This allows the proposer to trade-off lower WER for higher RTFx should they wish.
- 2. The WER and RTFx values are averaged over all audios in the benchmark (in the order of thousands of audios).
-
- For details on reproducing the benchmark numbers, refer to the [Open ASR GitHub repository](https://github.com/huggingface/open_asr_leaderboard#evaluate-a-model).
-
- ### Word Error Rate (WER)
-
- Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage
- of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**.
-
- Take the following example:
-
- | Reference:  | the | cat | sat     | on  | the | mat |
- |-------------|-----|-----|---------|-----|-----|-----|
- | Prediction: | the | cat | **sit** | on  | the |     |
- | Label:      | ✅  | ✅  | S       | ✅  | ✅  | D   |
-
- Here, we have:
- * 1 substitution ("sit" instead of "sat")
- * 0 insertions
- * 1 deletion ("mat" is missing)

-
-

- ```
- WER = (S + I + D) / N = (1 + 0 + 1) / 6 = 0.333
- ```

-

-

-
- model to process a given amount of speech. It is defined as:
- ```
- RTFx = (number of seconds of audio inferred) / (compute time in seconds)
- ```

-
- Thus, **a higher RTFx value indicates lower latency**.

-
-
- The ASR Leaderboard will be a continued effort to benchmark open source/access speech recognition models where possible.
- Along with the Leaderboard we're open-sourcing the codebase used for running these evaluations.
- For more details head over to our repo at: https://github.com/huggingface/open_asr_leaderboard
-
- P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️
-
- ## Benchmark datasets
-
- Evaluating Speech Recognition systems is a hard problem. We use the multi-dataset benchmarking strategy proposed in the
- [ESB paper](https://arxiv.org/abs/2210.13352) to obtain robust evaluation scores for each model.
-
- ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad
- set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains,
- acoustic conditions, speaker styles, and transcription requirements. As such, it gives a better indication of how
- a model is likely to perform on downstream ASR compared to evaluating it on one dataset alone.
-
- The ESB score is calculated as a macro-average of the WER scores across the ESB datasets. The models in the leaderboard
- are ranked based on their average WER scores, from lowest to highest.
-
- | Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
- |---------|--------|----------------|-----------|---------|----------|----------------|---------|
- | [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | Audiobook | Narrated | 960 | 11 | 11 | Normalised | CC-BY-4.0 |
- | [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | European Parliament | Oratory | 523 | 5 | 5 | Punctuated | CC0 |
- | [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | TED talks | Oratory | 454 | 2 | 3 | Normalised | CC-BY-NC-ND 3.0 |
- | [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | apache-2.0 |
- | [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech) | Financial meetings | Oratory, spontaneous | 4900 | 100 | 100 | Punctuated & Cased | User Agreement |
- | [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22) | Financial meetings | Oratory, spontaneous | 105 | 5 | 5 | Punctuated & Cased | CC-BY-SA-4.0 |
- | [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 |
-
- For more details on the individual datasets and how models are evaluated to give the ESB score, refer to the [ESB paper](https://arxiv.org/abs/2210.13352).
  """
  # Text definitions #
  ##########################

+ banner_url = "https://huggingface.co/datasets/vargha/persian_asr_leaderboard/resolve/main/banner.png"
  BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>'

+ INTRODUCTION_TEXT = "📐 The 🤗 Persian ASR Leaderboard ranks and evaluates speech recognition models \
+ on the Hugging Face Hub using the Persian Common Voice dataset. \
+ \nWe report the [WER](https://huggingface.co/spaces/evaluate-metric/wer) and [CER](https://huggingface.co/spaces/evaluate-metric/cer) metrics (⬇️ lower the better). Models are ranked based on their WER, from lowest to highest. Check the 📈 Metrics tab to understand how the models are evaluated. \
+ \nIf you want results for a model that is not listed here, you can submit a request for it to be included ✉️✨."
+
+ CITATION_TEXT = """@misc{persian-asr-leaderboard,
+ title = {Persian Automatic Speech Recognition Leaderboard},
+ author = {Your Name},
+ year = 2024,
+ publisher = {Hugging Face},
+ howpublished = "\\url{https://huggingface.co/spaces/your-username/persian_asr_leaderboard}"
  }
  """

  METRICS_TAB_TEXT = """
+ # Metrics and Dataset

  ## Metrics

+ We evaluate models using the Word Error Rate (WER) and Character Error Rate (CER) metrics. Both metrics are used to measure the accuracy of automatic speech recognition systems.

+ - **Word Error Rate (WER)**: Calculates the percentage of words that were incorrectly predicted. A lower WER indicates better performance.
+ - **Character Error Rate (CER)**: Similar to WER but operates at the character level, which can be more informative for languages with rich morphology like Persian.

+ ## Dataset

+ We use the [Persian Common Voice](https://huggingface.co/datasets/vargha/common_voice_fa) dataset for evaluation. The dataset consists of diverse speech recordings from various speakers, making it a good benchmark for Persian ASR models.

+ ## How to Submit Your Model

+ To submit your model for evaluation, go to the "✉️✨ Request a model here!" tab and enter your model's name in the format `username/model_name`. Your model should be hosted on the Hugging Face Hub.

+ ## Reproducing Results

+ To reproduce the results or to see how the evaluation is conducted, you can visit our [GitHub repository](https://github.com/your-username/persian_asr_leaderboard).

  """
+
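
The WER and CER links added in INTRODUCTION_TEXT point to the Hugging Face `evaluate` metric spaces. As a minimal sketch of how those two metrics could be computed, reusing the "the cat sat on the mat" worked example from the removed metrics text (the strings and the calls below are illustrative, not the leaderboard's actual evaluation script):

```python
# Minimal sketch: computing WER and CER with the Hugging Face `evaluate` library.
# The reference/prediction pair mirrors the worked example in the metrics text;
# the leaderboard's own evaluation pipeline may normalise text differently.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["the cat sat on the mat"]
predictions = ["the cat sit on the"]  # 1 substitution ("sit"), 1 deletion ("mat")

wer = wer_metric.compute(references=references, predictions=predictions)
cer = cer_metric.compute(references=references, predictions=predictions)

print(f"WER: {wer:.3f}")  # (S + I + D) / N = (1 + 0 + 1) / 6 ≈ 0.333
print(f"CER: {cer:.3f}")  # same idea, counted over characters instead of words
```

Lower is better for both metrics; per the new text, models are ranked by WER from lowest to highest, with CER reported alongside it.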
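
The new submission text asks for a model id in the `username/model_name` format, hosted on the Hub. Assuming the evaluation loads submissions through the standard `transformers` ASR pipeline (an assumption; the loading code is not part of this diff), a submitted id would be expected to work roughly like this:

```python
# Hypothetical smoke test for a submission. "username/model_name" is the placeholder
# from the docs, not a real checkpoint, and the actual leaderboard loading code is
# not shown in this diff; this only illustrates the standard ASR pipeline usage.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="username/model_name")
print(asr("sample.wav"))  # path to a local audio file; returns {"text": "..."}
```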
|